1. 首页 > Oracle教程 > 正文

Oracle教程FG252-Oracle GoldenGate监控与告警实战

本文档风哥主要介绍Oracle GoldenGate监控与告警的实战操作,包括Oracle GoldenGate监控的概念、监控项、告警机制、监控脚本、告警实现等内容,由风哥教程参考Oracle官方文档GoldenGate文档,适合DBA人员在学习和测试中使用,如果要应用于生产环境则需要自行确认。

Part01-基础概念与理论知识

1.1 Oracle GoldenGate监控概念

Oracle GoldenGate监控是指通过收集和分析GoldenGate运行状态、性能指标、日志信息等数据,实时了解GoldenGate运行状况的过程。监控是GoldenGate运维管理的基础,能够及时发现异常并预防故障。更多视频教程www.fgedu.net.cn

Oracle GoldenGate监控的主要目标:

  • 实时了解进程运行状态
  • 监控数据复制延迟
  • 发现潜在的性能问题
  • 及时发现故障和异常
  • 为容量规划提供数据支持

1.2 Oracle GoldenGate监控项

Oracle GoldenGate主要监控项:

  • 进程状态:Manager、Extract、Replicat进程的运行状态
  • 复制延迟:数据从源端到目标端的延迟时间
  • 检查点信息:进程的读取和应用位置
  • 统计信息:处理的事务数、记录数等
  • Trail文件:Trail文件的大小、数量、增长情况
  • 系统资源:CPU、内存、磁盘、网络等资源使用情况
  • 错误日志:进程报告和日志文件中的错误信息

1.3 Oracle GoldenGate告警概念

Oracle GoldenGate告警是指当监控指标超过预设阈值时,通过邮件、短信、钉钉等方式通知相关人员的过程。告警是监控的延伸,能够确保问题得到及时处理。

风哥提示:告警设置要遵循”合理阈值、及时通知、分级处理”的原则。阈值设置过高会导致漏报,设置过低会导致误报。学习交流加群风哥微信: itpux-com

Part02-生产环境规划与建议

2.1 Oracle GoldenGate监控规划

Oracle GoldenGate监控规划要点:

# 监控频率规划
实时监控:进程状态、复制延迟
– 监控频率:每5分钟
– 监控工具:自动化脚本

定时监控:统计信息、资源使用
– 监控频率:每小时
– 监控工具:定时任务

日常监控:趋势分析、容量评估
– 监控频率:每天
– 监控工具:报表系统

# 监控范围规划
源端监控:
– Extract进程状态
– 数据库日志生成情况
– 系统资源使用

目标端监控:
– Replicat进程状态
– 数据库写入性能
– 系统资源使用

网络监控:
– 网络带宽使用
– 网络延迟
– 连接状态

2.2 Oracle GoldenGate告警规划

Oracle GoldenGate告警阈值规划:

# 告警级别规划
严重告警(Critical):
– 进程ABENDED或STOPPED
– 延迟超过30分钟
– 数据库连接失败
– 通知方式:短信+邮件+电话

重要告警(Warning):
– 延迟超过10分钟
– Trail文件空间不足
– 系统资源使用超过80%
– 通知方式:邮件+钉钉

一般告警(Info):
– 延迟超过5分钟
– 统计信息异常
– 通知方式:邮件

# 告警阈值设置
LAG_CRITICAL=1800 # 30分钟
LAG_WARNING=600 # 10分钟
LAG_INFO=300 # 5分钟

DISK_CRITICAL=90 # 磁盘使用率90%
DISK_WARNING=80 # 磁盘使用率80%

CPU_CRITICAL=90 # CPU使用率90%
CPU_WARNING=80 # CPU使用率80%

2.3 Oracle GoldenGate监控最佳实践

Oracle GoldenGate监控最佳实践:

  • 全面覆盖:监控所有关键指标,不留死角
  • 合理阈值:根据实际情况设置合理的告警阈值
  • 及时响应:建立快速响应机制,及时处理告警
  • 定期回顾:定期回顾监控数据,优化监控策略
  • 文档记录:记录监控配置和处理流程
  • 自动化:尽量使用自动化脚本,减少人工操作
生产环境建议:建立完善的监控体系,包括实时监控、定时监控和趋势分析。同时建立分级告警机制,确保重要问题能够及时得到处理。学习交流加群风哥QQ113257174

Part03-生产环境项目实施方案

3.1 Oracle GoldenGate GGSCI监控

3.1.1 进程状态监控

# 1. 查看所有进程状态
GGSCI (fgedu.net.cn) 1> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
EXTRACT RUNNING EXT01 00:00:00 00:00:03
EXTRACT RUNNING DP01 00:00:00 00:00:05

# 2. 查看指定进程详细信息
GGSCI (fgedu.net.cn) 2> info extract ext01, detail

EXTRACT EXT01 Last Started 2026-04-03 08:00:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)
Log Read Checkpoint Oracle Integrated Redo Logs
2026-04-03 10:00:00 SCN 12345678
Log position: 12345678

# 3. 查看进程检查点
GGSCI (fgedu.net.cn) 3> info extract ext01, showch

EXTRACT EXT01 Last Started 2026-04-03 08:00:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)

Read Checkpoint #1
Oracle Integrated Redo Logs
Startup Checkpoint (starting position in the data source):
Sequence #: 100
RBA: 12345678
Timestamp: 2026-04-03 08:00:00.000000
SCN: 12345678
Redo File: /oradata/fgedudb/redo01.log

Current Checkpoint (position of last record read in the data source):
Sequence #: 100
RBA: 12345678
Timestamp: 2026-04-03 10:00:00.000000
SCN: 12345678
Redo File: /oradata/fgedudb/redo01.log

3.1.2 延迟监控

# 1. 查看Extract延迟
GGSCI (fgedu.net.cn) 1> info extract ext01

EXTRACT EXT01 Last Started 2026-04-03 08:00:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)

# 2. 查看Replicat延迟
GGSCI (fgedu-target.net.cn) 1> info replicat rep01

REPLICAT REP01 Last Started 2026-04-03 08:00:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:02 ago)

# 3. 查看详细延迟信息
GGSCI (fgedu-target.net.cn) 2> lag replicat rep01

Sending LAG request to REPLICAT REP01 …

Last record lag: 2 seconds.
At EOF, no more records to process.

# 4. 查看统计信息
GGSCI (fgedu-target.net.cn) 3> stats replicat rep01

Sending STATS request to REPLICAT REP01 …

Start of Statistics at 2026-04-03 10:30:00.

Replicating from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS:
*** Total statistics since 2026-04-03 08:00:00 ***
Total inserts 10000.00
Total updates 5000.00
Total deletes 1000.00
Total operations 16000.00

*** Daily statistics since 2026-04-03 00:00:00 ***
Total inserts 50000.00
Total updates 25000.00
Total deletes 5000.00
Total operations 80000.00

*** Hourly statistics since 2026-04-03 10:00:00 ***
Total inserts 2000.00
Total updates 1000.00
Total deletes 200.00
Total operations 3200.00

3.2 Oracle GoldenGate脚本监控

3.2.1 进程状态监控脚本

#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_status_monitor.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

# 配置
GG_HOME=”/oracle/goldengate”
LOG_FILE=”$GG_HOME/monitor/status_$(date +%Y%m%d).log”
ALERT_FILE=”$GG_HOME/monitor/alert.log”

# 检查Manager状态
check_manager() {
STATUS=$(echo “info mgr” | $GG_HOME/ggsci | grep “Manager” | awk ‘{print $2}’)
if [ “$STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Manager is $STATUS” >> $ALERT_FILE
return 1
fi
return 0
}

# 检查Extract状态
check_extract() {
EXTRACT_LIST=”ext01 ext02″
for EXTRACT in $EXTRACT_LIST; do
STATUS=$(echo “info extract $EXTRACT” | $GG_HOME/ggsci | grep “EXTRACT” | awk ‘{print $2}’)
if [ “$STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Extract $EXTRACT is $STATUS” >> $ALERT_FILE
return 1
fi
done
return 0
}

# 检查Replicat状态
check_replicat() {
REPLICAT_LIST=”rep01 rep02″
for REPLICAT in $REPLICAT_LIST; do
STATUS=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci | grep “REPLICAT” | awk ‘{print $2}’)
if [ “$STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Replicat $REPLICAT is $STATUS” >> $ALERT_FILE
return 1
fi
done
return 0
}

# 主程序
main() {
echo “$(date): Starting GoldenGate status check” >> $LOG_FILE

check_manager
check_extract
check_replicat

echo “$(date): GoldenGate status check completed” >> $LOG_FILE
}

main

3.2.2 延迟监控脚本

#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_lag_monitor.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

# 配置
GG_HOME=”/oracle/goldengate”
LOG_FILE=”$GG_HOME/monitor/lag_$(date +%Y%m%d).log”
ALERT_FILE=”$GG_HOME/monitor/lag_alert.log”

# 阈值(秒)
LAG_CRITICAL=1800
LAG_WARNING=600

# 检查Extract延迟
check_extract_lag() {
EXTRACT=$1
LAG_INFO=$(echo “info extract $EXTRACT” | $GG_HOME/ggsci | grep “Checkpoint Lag” | awk ‘{print $3}’)

# 转换延迟为秒
LAG_SECONDS=$(echo $LAG_INFO | awk -F: ‘{print $1*3600 + $2*60 + $3}’)

echo “$(date): Extract $EXTRACT lag is $LAG_INFO ($LAG_SECONDS seconds)” >> $LOG_FILE

if [ $LAG_SECONDS -gt $LAG_CRITICAL ]; then
echo “$(date): CRITICAL – Extract $EXTRACT lag $LAG_INFO exceeds threshold” >> $ALERT_FILE
return 2
elif [ $LAG_SECONDS -gt $LAG_WARNING ]; then
echo “$(date): WARNING – Extract $EXTRACT lag $LAG_INFO exceeds threshold” >> $ALERT_FILE
return 1
fi
return 0
}

# 检查Replicat延迟
check_replicat_lag() {
REPLICAT=$1
LAG_INFO=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci | grep “Checkpoint Lag” | awk ‘{print $3}’)

# 转换延迟为秒
LAG_SECONDS=$(echo $LAG_INFO | awk -F: ‘{print $1*3600 + $2*60 + $3}’)

echo “$(date): Replicat $REPLICAT lag is $LAG_INFO ($LAG_SECONDS seconds)” >> $LOG_FILE

if [ $LAG_SECONDS -gt $LAG_CRITICAL ]; then
echo “$(date): CRITICAL – Replicat $REPLICAT lag $LAG_INFO exceeds threshold” >> $ALERT_FILE
return 2
elif [ $LAG_SECONDS -gt $LAG_WARNING ]; then
echo “$(date): WARNING – Replicat $REPLICAT lag $LAG_INFO exceeds threshold” >> $ALERT_FILE
return 1
fi
return 0
}

# 主程序
main() {
echo “$(date): Starting GoldenGate lag check” >> $LOG_FILE

check_extract_lag “ext01”
check_extract_lag “ext02”
check_replicat_lag “rep01”
check_replicat_lag “rep02”

echo “$(date): GoldenGate lag check completed” >> $LOG_FILE
}

main

3.3 Oracle GoldenGate告警实现

3.3.1 邮件告警

#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_email_alert.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

# 配置
ALERT_FILE=”/oracle/goldengate/monitor/alert.log”
EMAIL_LIST=”dba@fgedu.net.cn,support@fgedu.net.cn”
SMTP_SERVER=”smtp.fgedu.net.cn”
SMTP_USER=”alert@fgedu.net.cn”
SMTP_PASS=”password”

# 发送邮件告警
send_email_alert() {
SUBJECT=”GoldenGate Alert – $(date +%Y%m%d%H%M%S)”
BODY=$(cat $ALERT_FILE)

# 使用sendmail发送邮件
{
echo “To: $EMAIL_LIST”
echo “Subject: $SUBJECT”
echo “Content-Type: text/plain; charset=UTF-8”
echo “”
echo “$BODY”
} | /usr/sbin/sendmail -t

echo “$(date): Email alert sent” >> /oracle/goldengate/monitor/alert.log
}

# 检查是否有告警
if [ -s $ALERT_FILE ]; then
send_email_alert
# 清空告警文件
> $ALERT_FILE
fi

3.3.2 钉钉告警

#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_dingtalk_alert.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

# 配置
ALERT_FILE=”/oracle/goldengate/monitor/alert.log”
DINGTALK_WEBHOOK=”https://oapi.dingtalk.com/robot/send?access_token=your_token”

# 发送钉钉告警
send_dingtalk_alert() {
MESSAGE=$(cat $ALERT_FILE)

# 构造JSON消息
JSON_DATA=”{
\”msgtype\”: \”text\”,
\”text\”: {
\”content\”: \”GoldenGate Alert\\n$MESSAGE\”
},
\”at\”: {
\”isAtAll\”: false
}
}”

# 发送HTTP请求
curl -s -X POST \
-H “Content-Type: application/json” \
-d “$JSON_DATA” \
$DINGTALK_WEBHOOK

echo “$(date): DingTalk alert sent” >> /oracle/goldengate/monitor/alert.log
}

# 检查是否有告警
if [ -s $ALERT_FILE ]; then
send_dingtalk_alert
# 清空告警文件
> $ALERT_FILE
fi

风哥提示:告警通知要及时但不过度。建议设置告警抑制机制,避免相同问题重复告警。同时要建立告警升级机制,确保严重问题能够得到及时处理。更多学习教程公众号风哥教程itpux_com

Part04-生产案例与实战讲解

4.1 Oracle GoldenGate进程状态监控案例

以下是一个完整的进程状态监控案例:

#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_process_monitor.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

# 配置
GG_HOME=”/oracle/goldengate”
LOG_DIR=”$GG_HOME/monitor”
LOG_FILE=”$LOG_DIR/process_$(date +%Y%m%d).log”
ALERT_FILE=”$LOG_DIR/process_alert.log”
PID_FILE=”$LOG_DIR/gg_monitor.pid”

# 进程列表
MANAGER_PID=””
EXTRACT_LIST=”ext01 ext02″
REPLICAT_LIST=”rep01 rep02″

# 检查Manager
check_manager() {
echo “$(date): Checking Manager status” >> $LOG_FILE

# 获取Manager状态
MANAGER_STATUS=$(echo “info mgr” | $GG_HOME/ggsci 2>/dev/null | grep “Manager” | awk ‘{print $2}’)

if [ “$MANAGER_STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Manager is $MANAGER_STATUS” >> $ALERT_FILE
echo “$(date): Attempting to start Manager” >> $LOG_FILE
echo “start mgr” | $GG_HOME/ggsci
sleep 5

# 再次检查
MANAGER_STATUS=$(echo “info mgr” | $GG_HOME/ggsci 2>/dev/null | grep “Manager” | awk ‘{print $2}’)
if [ “$MANAGER_STATUS” != “RUNNING” ]; then
echo “$(date): CRITICAL – Failed to start Manager” >> $ALERT_FILE
return 2
fi
fi

echo “$(date): Manager is RUNNING” >> $LOG_FILE
return 0
}

# 检查Extract
check_extract() {
EXTRACT=$1
echo “$(date): Checking Extract $EXTRACT status” >> $LOG_FILE

EXTRACT_STATUS=$(echo “info extract $EXTRACT” | $GG_HOME/ggsci 2>/dev/null | grep “EXTRACT” | head -1 | awk ‘{print $2}’)

if [ “$EXTRACT_STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Extract $EXTRACT is $EXTRACT_STATUS” >> $ALERT_FILE

if [ “$EXTRACT_STATUS” == “ABENDED” ]; then
echo “$(date): Attempting to restart Extract $EXTRACT” >> $LOG_FILE
echo “start extract $EXTRACT” | $GG_HOME/ggsci
sleep 5

# 再次检查
EXTRACT_STATUS=$(echo “info extract $EXTRACT” | $GG_HOME/ggsci 2>/dev/null | grep “EXTRACT” | head -1 | awk ‘{print $2}’)
if [ “$EXTRACT_STATUS” != “RUNNING” ]; then
echo “$(date): CRITICAL – Failed to restart Extract $EXTRACT” >> $ALERT_FILE
return 2
fi
fi
fi

echo “$(date): Extract $EXTRACT is RUNNING” >> $LOG_FILE
return 0
}

# 检查Replicat
check_replicat() {
REPLICAT=$1
echo “$(date): Checking Replicat $REPLICAT status” >> $LOG_FILE

REPLICAT_STATUS=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci 2>/dev/null | grep “REPLICAT” | head -1 | awk ‘{print $2}’)

if [ “$REPLICAT_STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Replicat $REPLICAT is $REPLICAT_STATUS” >> $ALERT_FILE

if [ “$REPLICAT_STATUS” == “ABENDED” ]; then
echo “$(date): Attempting to restart Replicat $REPLICAT” >> $LOG_FILE
echo “start replicat $REPLICAT” | $GG_HOME/ggsci
sleep 5

# 再次检查
REPLICAT_STATUS=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci 2>/dev/null | grep “REPLICAT” | head -1 | awk ‘{print $2}’)
if [ “$REPLICAT_STATUS” != “RUNNING” ]; then
echo “$(date): CRITICAL – Failed to restart Replicat $REPLICAT” >> $ALERT_FILE
return 2
fi
fi
fi

echo “$(date): Replicat $REPLICAT is RUNNING” >> $LOG_FILE
return 0
}

# 主程序
main() {
# 创建日志目录
mkdir -p $LOG_DIR

# 检查是否已在运行
if [ -f $PID_FILE ]; then
OLD_PID=$(cat $PID_FILE)
if ps -p $OLD_PID > /dev/null 2>&1; then
echo “Monitor is already running with PID $OLD_PID”
exit 1
fi
fi

# 记录PID
echo $$ > $PID_FILE

echo “$(date): Starting GoldenGate process monitor” >> $LOG_FILE

# 检查Manager
check_manager

# 检查Extract
for EXTRACT in $EXTRACT_LIST; do
check_extract $EXTRACT
done

# 检查Replicat
for REPLICAT in $REPLICAT_LIST; do
check_replicat $REPLICAT
done

echo “$(date): GoldenGate process monitor completed” >> $LOG_FILE

# 发送告警
if [ -s $ALERT_FILE ]; then
/oracle/goldengate/monitor/gg_email_alert.sh
/oracle/goldengate/monitor/gg_dingtalk_alert.sh
fi

# 清理PID文件
rm -f $PID_FILE
}

main

4.2 Oracle GoldenGate延迟监控案例

以下是一个完整的延迟监控案例:

#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_lag_monitor_advanced.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

# 配置
GG_HOME=”/oracle/goldengate”
LOG_DIR=”$GG_HOME/monitor”
LOG_FILE=”$LOG_DIR/lag_$(date +%Y%m%d).log”
ALERT_FILE=”$LOG_DIR/lag_alert.log”
HISTORY_FILE=”$LOG_DIR/lag_history.csv”

# 阈值(秒)
LAG_CRITICAL=1800
LAG_WARNING=600
LAG_INFO=300

# 转换为秒的函数
time_to_seconds() {
TIME_STR=$1
HOURS=$(echo $TIME_STR | cut -d: -f1)
MINUTES=$(echo $TIME_STR | cut -d: -f2)
SECONDS=$(echo $TIME_STR | cut -d: -f3)
echo $((HOURS*3600 + MINUTES*60 + SECONDS))
}

# 监控Extract延迟
monitor_extract_lag() {
EXTRACT=$1
TIMESTAMP=$(date +%Y%m%d%H%M%S)

# 获取延迟信息
LAG_INFO=$(echo “info extract $EXTRACT” | $GG_HOME/ggsci 2>/dev/null | grep “Checkpoint Lag” | awk ‘{print $3}’)

if [ -z “$LAG_INFO” ]; then
echo “$TIMESTAMP,$EXTRACT,EXTRACT,UNKNOWN,-1,ERROR” >> $HISTORY_FILE
echo “$(date): ERROR – Cannot get lag info for Extract $EXTRACT” >> $LOG_FILE
return 1
fi

# 转换为秒
LAG_SECONDS=$(time_to_seconds $LAG_INFO)

# 确定告警级别
ALERT_LEVEL=”NORMAL”
if [ $LAG_SECONDS -gt $LAG_CRITICAL ]; then
ALERT_LEVEL=”CRITICAL”
echo “$(date): CRITICAL – Extract $EXTRACT lag $LAG_INFO exceeds critical threshold” >> $ALERT_FILE
elif [ $LAG_SECONDS -gt $LAG_WARNING ]; then
ALERT_LEVEL=”WARNING”
echo “$(date): WARNING – Extract $EXTRACT lag $LAG_INFO exceeds warning threshold” >> $ALERT_FILE
elif [ $LAG_SECONDS -gt $LAG_INFO ]; then
ALERT_LEVEL=”INFO”
fi

# 记录历史数据
echo “$TIMESTAMP,$EXTRACT,EXTRACT,$LAG_INFO,$LAG_SECONDS,$ALERT_LEVEL” >> $HISTORY_FILE
echo “$(date): Extract $EXTRACT lag is $LAG_INFO ($LAG_SECONDS seconds) – $ALERT_LEVEL” >> $LOG_FILE

return 0
}

# 监控Replicat延迟
monitor_replicat_lag() {
REPLICAT=$1
TIMESTAMP=$(date +%Y%m%d%H%M%S)

# 获取延迟信息
LAG_INFO=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci 2>/dev/null | grep “Checkpoint Lag” | awk ‘{print $3}’)

if [ -z “$LAG_INFO” ]; then
echo “$TIMESTAMP,$REPLICAT,REPLICAT,UNKNOWN,-1,ERROR” >> $HISTORY_FILE
echo “$(date): ERROR – Cannot get lag info for Replicat $REPLICAT” >> $LOG_FILE
return 1
fi

# 转换为秒
LAG_SECONDS=$(time_to_seconds $LAG_INFO)

# 确定告警级别
ALERT_LEVEL=”NORMAL”
if [ $LAG_SECONDS -gt $LAG_CRITICAL ]; then
ALERT_LEVEL=”CRITICAL”
echo “$(date): CRITICAL – Replicat $REPLICAT lag $LAG_INFO exceeds critical threshold” >> $ALERT_FILE
elif [ $LAG_SECONDS -gt $LAG_WARNING ]; then
ALERT_LEVEL=”WARNING”
echo “$(date): WARNING – Replicat $REPLICAT lag $LAG_INFO exceeds warning threshold” >> $ALERT_FILE
elif [ $LAG_SECONDS -gt $LAG_INFO ]; then
ALERT_LEVEL=”INFO”
fi

# 记录历史数据
echo “$TIMESTAMP,$REPLICAT,REPLICAT,$LAG_INFO,$LAG_SECONDS,$ALERT_LEVEL” >> $HISTORY_FILE
echo “$(date): Replicat $REPLICAT lag is $LAG_INFO ($LAG_SECONDS seconds) – $ALERT_LEVEL” >> $LOG_FILE

return 0
}

# 生成延迟趋势报告
generate_lag_report() {
REPORT_FILE=”$LOG_DIR/lag_report_$(date +%Y%m%d).html”

cat > $REPORT_FILE << EOF
GoldenGate Lag Report – $(date +%Y%m%d)

GoldenGate Lag Report – $(date +%Y%m%d)

EOF

tail -100 $HISTORY_FILE | while IFS=, read TIMESTAMP PROCESS TYPE LAG SECONDS LEVEL; do
echo “

” >> $REPORT_FILE
done

echo “

Time Process Type Lag Seconds Level
$TIMESTAMP $PROCESS $TYPE $LAG $SECONDS $LEVEL

” >> $REPORT_FILE

echo “$(date): Lag report generated: $REPORT_FILE” >> $LOG_FILE
}

# 主程序
main() {
mkdir -p $LOG_DIR

echo “$(date): Starting GoldenGate lag monitor” >> $LOG_FILE

# 监控Extract
monitor_extract_lag “ext01”
monitor_extract_lag “ext02”

# 监控Replicat
monitor_replicat_lag “rep01”
monitor_replicat_lag “rep02”

# 生成报告
generate_lag_report

echo “$(date): GoldenGate lag monitor completed” >> $LOG_FILE

# 发送告警
if [ -s $ALERT_FILE ]; then
/oracle/goldengate/monitor/gg_email_alert.sh
fi
}

main

4.3 Oracle GoldenGate综合监控平台案例

以下是一个综合监控平台的完整案例:

#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_monitor_platform.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

# 综合监控平台主控脚本
# 整合进程监控、延迟监控、资源监控、告警通知等功能

# 配置
GG_HOME=”/oracle/goldengate”
LOG_DIR=”$GG_HOME/monitor”
CONFIG_FILE=”$GG_HOME/monitor/monitor.conf”

# 加载配置
source $CONFIG_FILE

# 初始化日志目录
init_log_dir() {
mkdir -p $LOG_DIR/{status,lag,resource,alert,report}

# 创建历史数据目录
mkdir -p $LOG_DIR/history/{daily,hourly}
}

# 进程状态监控模块
module_process() {
echo “$(date): Running process monitoring module”

LOG_FILE=”$LOG_DIR/status/process_$(date +%Y%m%d%H).log”

# 获取所有进程状态
echo “info all” | $GG_HOME/ggsci > $LOG_DIR/status/current_status.txt

# 分析状态
while read LINE; do
if echo “$LINE” | grep -q “EXTRACT\|REPLICAT”; then
PROCESS=$(echo $LINE | awk ‘{print $1}’)
STATUS=$(echo $LINE | awk ‘{print $2}’)
GROUP=$(echo $LINE | awk ‘{print $3}’)
LAG=$(echo $LINE | awk ‘{print $4}’)

TIMESTAMP=$(date +%Y%m%d%H%M%S)
echo “$TIMESTAMP,$PROCESS,$GROUP,$STATUS,$LAG” >> $LOG_DIR/history/process_history.csv

# 检查异常状态
if [ “$STATUS” != “RUNNING” ]; then
echo “$TIMESTAMP,$PROCESS,$GROUP,$STATUS,ALERT” >> $LOG_DIR/alert/process_alert.csv
fi
fi
done < $LOG_DIR/status/current_status.txt } # 延迟监控模块 module_lag() { echo "$(date): Running lag monitoring module" # 获取所有Extract和Replicat列表 EXTRACT_LIST=$(echo "info all" | $GG_HOME/ggsci | grep "EXTRACT" | awk '{print $3}') REPLICAT_LIST=$(echo "info all" | $GG_HOME/ggsci | grep "REPLICAT" | awk '{print $3}') # 监控Extract延迟 for EXTRACT in $EXTRACT_LIST; do LAG_INFO=$(echo "info extract $EXTRACT" | $GG_HOME/ggsci | grep "Checkpoint Lag" | awk '{print $3}') TIMESTAMP=$(date +%Y%m%d%H%M%S) echo "$TIMESTAMP,$EXTRACT,EXTRACT,$LAG_INFO" >> $LOG_DIR/history/lag_history.csv
done

# 监控Replicat延迟
for REPLICAT in $REPLICAT_LIST; do
LAG_INFO=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci | grep “Checkpoint Lag” | awk ‘{print $3}’)
TIMESTAMP=$(date +%Y%m%d%H%M%S)
echo “$TIMESTAMP,$REPLICAT,REPLICAT,$LAG_INFO” >> $LOG_DIR/history/lag_history.csv
done
}

# 资源监控模块
module_resource() {
echo “$(date): Running resource monitoring module”

TIMESTAMP=$(date +%Y%m%d%H%M%S)

# CPU使用率
CPU_USAGE=$(top -bn1 | grep “Cpu(s)” | awk ‘{print $2}’ | cut -d’%’ -f1)

# 内存使用率
MEM_USAGE=$(free | grep Mem | awk ‘{printf “%.2f”, $3/$2 * 100.0}’)

# 磁盘使用率
DISK_USAGE=$(df -h $GG_HOME | tail -1 | awk ‘{print $5}’ | cut -d’%’ -f1)

# GoldenGate进程资源使用
GG_CPU=$(ps -eo pid,comm,pcpu | grep “extract\|replicat” | awk ‘{sum+=$3} END {print sum}’)
GG_MEM=$(ps -eo pid,comm,pmem | grep “extract\|replicat” | awk ‘{sum+=$3} END {print sum}’)

# 记录数据
echo “$TIMESTAMP,$CPU_USAGE,$MEM_USAGE,$DISK_USAGE,$GG_CPU,$GG_MEM” >> $LOG_DIR/history/resource_history.csv

# 检查阈值
if [ ${DISK_USAGE%.*} -gt 80 ]; then
echo “$TIMESTAMP,DISK,$DISK_USAGE%,WARNING” >> $LOG_DIR/alert/resource_alert.csv
fi
}

# 告警处理模块
module_alert() {
echo “$(date): Running alert processing module”

ALERT_FILE=”$LOG_DIR/alert/all_alerts_$(date +%Y%m%d).log”

# 合并所有告警
cat $LOG_DIR/alert/*.csv > $ALERT_FILE 2>/dev/null

# 发送告警通知
if [ -s $ALERT_FILE ]; then
# 邮件告警
if [ “$EMAIL_ALERT” == “Y” ]; then
/oracle/goldengate/monitor/gg_email_alert.sh
fi

# 钉钉告警
if [ “$DINGTALK_ALERT” == “Y” ]; then
/oracle/goldengate/monitor/gg_dingtalk_alert.sh
fi

# 短信告警(严重告警)
if [ “$SMS_ALERT” == “Y” ] && grep -q “CRITICAL” $ALERT_FILE; then
/oracle/goldengate/monitor/gg_sms_alert.sh
fi
fi
}

# 报告生成模块
module_report() {
echo “$(date): Running report generation module”

REPORT_FILE=”$LOG_DIR/report/daily_report_$(date +%Y%m%d).html”

# 生成HTML报告
cat > $REPORT_FILE << 'EOF'

GoldenGate Daily Report – $(date +%Y-%m-%d)

Process Status

EOF

# 添加进程状态数据
tail -20 $LOG_DIR/history/process_history.csv | while IFS=, read TIMESTAMP PROCESS GROUP STATUS LAG; do
CSS_CLASS=”normal”
if [ “$STATUS” != “RUNNING” ]; then
CSS_CLASS=”alert”
fi
echo “

” >> $REPORT_FILE
done

cat >> $REPORT_FILE << 'EOF'

Process Group Status Lag
$PROCESS $GROUP $STATUS $LAG

Performance Metrics

EOF

# 添加性能数据
tail -10 $LOG_DIR/history/resource_history.csv | while IFS=, read TIMESTAMP CPU MEM DISK GGCPU GGMEM; do
echo “

” >> $REPORT_FILE
done

echo “

Time CPU% Memory% Disk% GG CPU% GG Mem%
$TIMESTAMP $CPU $MEM $DISK $GGCPU $GGMEM

” >> $REPORT_FILE

echo “$(date): Report generated: $REPORT_FILE”
}

# 主程序
main() {
echo “$(date): Starting GoldenGate Monitor Platform”

# 初始化
init_log_dir

# 运行各监控模块
module_process
module_lag
module_resource

# 处理告警
module_alert

# 生成报告(每小时执行一次)
if [ $(date +%M) == “00” ]; then
module_report
fi

echo “$(date): GoldenGate Monitor Platform completed”
}

main

生产环境建议:综合监控平台要根据实际需求定制,建议从简单的监控开始,逐步完善功能。同时要定期回顾监控数据,优化监控策略。from oracle:www.itpux.com

Part05-风哥经验总结与分享

5.1 Oracle GoldenGate监控总结

Oracle GoldenGate监控的关键要点:

  • 全面覆盖:监控所有关键指标,包括进程状态、延迟、资源使用等
  • 及时告警:设置合理的告警阈值,确保问题能够及时发现
  • 自动化:使用自动化脚本,减少人工操作
  • 历史记录:保留历史监控数据,便于趋势分析
  • 定期优化:定期回顾监控策略,持续优化

5.2 Oracle GoldenGate监控检查清单

Oracle GoldenGate监控检查清单:

# 日常监控检查清单
□ Manager进程状态
□ Extract进程状态
□ Replicat进程状态
□ 复制延迟情况
□ Trail文件空间
□ 系统资源使用
□ 错误日志检查

# 告警配置检查清单
□ 告警阈值设置合理
□ 告警通知方式配置正确
□ 告警升级机制完善
□ 告警抑制机制有效

# 监控脚本检查清单
□ 脚本执行权限正确
□ 日志目录存在且可写
□ crontab配置正确
□ 脚本输出正常

5.3 Oracle GoldenGate监控工具推荐

Oracle GoldenGate监控推荐工具:

  • GGSCI:GoldenGate命令行工具,用于查看进程状态
  • GoldenGate Monitor:Oracle官方图形化监控工具
  • Oracle Enterprise Manager:综合管理平台
  • Prometheus + Grafana:开源监控方案
  • 自定义脚本:根据需求定制的监控脚本
风哥提示:监控是GoldenGate运维的基础,要建立完善的监控体系,确保能够及时发现和处理问题。同时要定期回顾监控数据,持续优化监控策略。更多视频教程www.fgedu.net.cn

本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html

联系我们

在线咨询:点击这里给我发消息

微信号:itpux-com

工作日:9:30-18:30,节假日休息