本文档风哥主要介绍Oracle GoldenGate监控与告警的实战操作,包括Oracle GoldenGate监控的概念、监控项、告警机制、监控脚本、告警实现等内容,由风哥教程参考Oracle官方文档GoldenGate文档,适合DBA人员在学习和测试中使用,如果要应用于生产环境则需要自行确认。
Part01-基础概念与理论知识
1.1 Oracle GoldenGate监控概念
Oracle GoldenGate监控是指通过收集和分析GoldenGate运行状态、性能指标、日志信息等数据,实时了解GoldenGate运行状况的过程。监控是GoldenGate运维管理的基础,能够及时发现异常并预防故障。更多视频教程www.fgedu.net.cn
- 实时了解进程运行状态
- 监控数据复制延迟
- 发现潜在的性能问题
- 及时发现故障和异常
- 为容量规划提供数据支持
1.2 Oracle GoldenGate监控项
Oracle GoldenGate主要监控项:
- 进程状态:Manager、Extract、Replicat进程的运行状态
- 复制延迟:数据从源端到目标端的延迟时间
- 检查点信息:进程的读取和应用位置
- 统计信息:处理的事务数、记录数等
- Trail文件:Trail文件的大小、数量、增长情况
- 系统资源:CPU、内存、磁盘、网络等资源使用情况
- 错误日志:进程报告和日志文件中的错误信息
1.3 Oracle GoldenGate告警概念
Oracle GoldenGate告警是指当监控指标超过预设阈值时,通过邮件、短信、钉钉等方式通知相关人员的过程。告警是监控的延伸,能够确保问题得到及时处理。
Part02-生产环境规划与建议
2.1 Oracle GoldenGate监控规划
Oracle GoldenGate监控规划要点:
实时监控:进程状态、复制延迟
– 监控频率:每5分钟
– 监控工具:自动化脚本
定时监控:统计信息、资源使用
– 监控频率:每小时
– 监控工具:定时任务
日常监控:趋势分析、容量评估
– 监控频率:每天
– 监控工具:报表系统
# 监控范围规划
源端监控:
– Extract进程状态
– 数据库日志生成情况
– 系统资源使用
目标端监控:
– Replicat进程状态
– 数据库写入性能
– 系统资源使用
网络监控:
– 网络带宽使用
– 网络延迟
– 连接状态
2.2 Oracle GoldenGate告警规划
Oracle GoldenGate告警阈值规划:
严重告警(Critical):
– 进程ABENDED或STOPPED
– 延迟超过30分钟
– 数据库连接失败
– 通知方式:短信+邮件+电话
重要告警(Warning):
– 延迟超过10分钟
– Trail文件空间不足
– 系统资源使用超过80%
– 通知方式:邮件+钉钉
一般告警(Info):
– 延迟超过5分钟
– 统计信息异常
– 通知方式:邮件
# 告警阈值设置
LAG_CRITICAL=1800 # 30分钟
LAG_WARNING=600 # 10分钟
LAG_INFO=300 # 5分钟
DISK_CRITICAL=90 # 磁盘使用率90%
DISK_WARNING=80 # 磁盘使用率80%
CPU_CRITICAL=90 # CPU使用率90%
CPU_WARNING=80 # CPU使用率80%
2.3 Oracle GoldenGate监控最佳实践
Oracle GoldenGate监控最佳实践:
- 全面覆盖:监控所有关键指标,不留死角
- 合理阈值:根据实际情况设置合理的告警阈值
- 及时响应:建立快速响应机制,及时处理告警
- 定期回顾:定期回顾监控数据,优化监控策略
- 文档记录:记录监控配置和处理流程
- 自动化:尽量使用自动化脚本,减少人工操作
Part03-生产环境项目实施方案
3.1 Oracle GoldenGate GGSCI监控
3.1.1 进程状态监控
GGSCI (fgedu.net.cn) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EXT01 00:00:00 00:00:03
EXTRACT RUNNING DP01 00:00:00 00:00:05
# 2. 查看指定进程详细信息
GGSCI (fgedu.net.cn) 2> info extract ext01, detail
EXTRACT EXT01 Last Started 2026-04-03 08:00:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)
Log Read Checkpoint Oracle Integrated Redo Logs
2026-04-03 10:00:00 SCN 12345678
Log position: 12345678
# 3. 查看进程检查点
GGSCI (fgedu.net.cn) 3> info extract ext01, showch
EXTRACT EXT01 Last Started 2026-04-03 08:00:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)
Read Checkpoint #1
Oracle Integrated Redo Logs
Startup Checkpoint (starting position in the data source):
Sequence #: 100
RBA: 12345678
Timestamp: 2026-04-03 08:00:00.000000
SCN: 12345678
Redo File: /oradata/fgedudb/redo01.log
Current Checkpoint (position of last record read in the data source):
Sequence #: 100
RBA: 12345678
Timestamp: 2026-04-03 10:00:00.000000
SCN: 12345678
Redo File: /oradata/fgedudb/redo01.log
3.1.2 延迟监控
GGSCI (fgedu.net.cn) 1> info extract ext01
EXTRACT EXT01 Last Started 2026-04-03 08:00:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)
# 2. 查看Replicat延迟
GGSCI (fgedu-target.net.cn) 1> info replicat rep01
REPLICAT REP01 Last Started 2026-04-03 08:00:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:02 ago)
# 3. 查看详细延迟信息
GGSCI (fgedu-target.net.cn) 2> lag replicat rep01
Sending LAG request to REPLICAT REP01 …
Last record lag: 2 seconds.
At EOF, no more records to process.
# 4. 查看统计信息
GGSCI (fgedu-target.net.cn) 3> stats replicat rep01
Sending STATS request to REPLICAT REP01 …
Start of Statistics at 2026-04-03 10:30:00.
Replicating from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS:
*** Total statistics since 2026-04-03 08:00:00 ***
Total inserts 10000.00
Total updates 5000.00
Total deletes 1000.00
Total operations 16000.00
*** Daily statistics since 2026-04-03 00:00:00 ***
Total inserts 50000.00
Total updates 25000.00
Total deletes 5000.00
Total operations 80000.00
*** Hourly statistics since 2026-04-03 10:00:00 ***
Total inserts 2000.00
Total updates 1000.00
Total deletes 200.00
Total operations 3200.00
3.2 Oracle GoldenGate脚本监控
3.2.1 进程状态监控脚本
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_status_monitor.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# 配置
GG_HOME=”/oracle/goldengate”
LOG_FILE=”$GG_HOME/monitor/status_$(date +%Y%m%d).log”
ALERT_FILE=”$GG_HOME/monitor/alert.log”
# 检查Manager状态
check_manager() {
STATUS=$(echo “info mgr” | $GG_HOME/ggsci | grep “Manager” | awk ‘{print $2}’)
if [ “$STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Manager is $STATUS” >> $ALERT_FILE
return 1
fi
return 0
}
# 检查Extract状态
check_extract() {
EXTRACT_LIST=”ext01 ext02″
for EXTRACT in $EXTRACT_LIST; do
STATUS=$(echo “info extract $EXTRACT” | $GG_HOME/ggsci | grep “EXTRACT” | awk ‘{print $2}’)
if [ “$STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Extract $EXTRACT is $STATUS” >> $ALERT_FILE
return 1
fi
done
return 0
}
# 检查Replicat状态
check_replicat() {
REPLICAT_LIST=”rep01 rep02″
for REPLICAT in $REPLICAT_LIST; do
STATUS=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci | grep “REPLICAT” | awk ‘{print $2}’)
if [ “$STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Replicat $REPLICAT is $STATUS” >> $ALERT_FILE
return 1
fi
done
return 0
}
# 主程序
main() {
echo “$(date): Starting GoldenGate status check” >> $LOG_FILE
check_manager
check_extract
check_replicat
echo “$(date): GoldenGate status check completed” >> $LOG_FILE
}
main
3.2.2 延迟监控脚本
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_lag_monitor.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# 配置
GG_HOME=”/oracle/goldengate”
LOG_FILE=”$GG_HOME/monitor/lag_$(date +%Y%m%d).log”
ALERT_FILE=”$GG_HOME/monitor/lag_alert.log”
# 阈值(秒)
LAG_CRITICAL=1800
LAG_WARNING=600
# 检查Extract延迟
check_extract_lag() {
EXTRACT=$1
LAG_INFO=$(echo “info extract $EXTRACT” | $GG_HOME/ggsci | grep “Checkpoint Lag” | awk ‘{print $3}’)
# 转换延迟为秒
LAG_SECONDS=$(echo $LAG_INFO | awk -F: ‘{print $1*3600 + $2*60 + $3}’)
echo “$(date): Extract $EXTRACT lag is $LAG_INFO ($LAG_SECONDS seconds)” >> $LOG_FILE
if [ $LAG_SECONDS -gt $LAG_CRITICAL ]; then
echo “$(date): CRITICAL – Extract $EXTRACT lag $LAG_INFO exceeds threshold” >> $ALERT_FILE
return 2
elif [ $LAG_SECONDS -gt $LAG_WARNING ]; then
echo “$(date): WARNING – Extract $EXTRACT lag $LAG_INFO exceeds threshold” >> $ALERT_FILE
return 1
fi
return 0
}
# 检查Replicat延迟
check_replicat_lag() {
REPLICAT=$1
LAG_INFO=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci | grep “Checkpoint Lag” | awk ‘{print $3}’)
# 转换延迟为秒
LAG_SECONDS=$(echo $LAG_INFO | awk -F: ‘{print $1*3600 + $2*60 + $3}’)
echo “$(date): Replicat $REPLICAT lag is $LAG_INFO ($LAG_SECONDS seconds)” >> $LOG_FILE
if [ $LAG_SECONDS -gt $LAG_CRITICAL ]; then
echo “$(date): CRITICAL – Replicat $REPLICAT lag $LAG_INFO exceeds threshold” >> $ALERT_FILE
return 2
elif [ $LAG_SECONDS -gt $LAG_WARNING ]; then
echo “$(date): WARNING – Replicat $REPLICAT lag $LAG_INFO exceeds threshold” >> $ALERT_FILE
return 1
fi
return 0
}
# 主程序
main() {
echo “$(date): Starting GoldenGate lag check” >> $LOG_FILE
check_extract_lag “ext01”
check_extract_lag “ext02”
check_replicat_lag “rep01”
check_replicat_lag “rep02”
echo “$(date): GoldenGate lag check completed” >> $LOG_FILE
}
main
3.3 Oracle GoldenGate告警实现
3.3.1 邮件告警
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_email_alert.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# 配置
ALERT_FILE=”/oracle/goldengate/monitor/alert.log”
EMAIL_LIST=”dba@fgedu.net.cn,support@fgedu.net.cn”
SMTP_SERVER=”smtp.fgedu.net.cn”
SMTP_USER=”alert@fgedu.net.cn”
SMTP_PASS=”password”
# 发送邮件告警
send_email_alert() {
SUBJECT=”GoldenGate Alert – $(date +%Y%m%d%H%M%S)”
BODY=$(cat $ALERT_FILE)
# 使用sendmail发送邮件
{
echo “To: $EMAIL_LIST”
echo “Subject: $SUBJECT”
echo “Content-Type: text/plain; charset=UTF-8”
echo “”
echo “$BODY”
} | /usr/sbin/sendmail -t
echo “$(date): Email alert sent” >> /oracle/goldengate/monitor/alert.log
}
# 检查是否有告警
if [ -s $ALERT_FILE ]; then
send_email_alert
# 清空告警文件
> $ALERT_FILE
fi
3.3.2 钉钉告警
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_dingtalk_alert.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# 配置
ALERT_FILE=”/oracle/goldengate/monitor/alert.log”
DINGTALK_WEBHOOK=”https://oapi.dingtalk.com/robot/send?access_token=your_token”
# 发送钉钉告警
send_dingtalk_alert() {
MESSAGE=$(cat $ALERT_FILE)
# 构造JSON消息
JSON_DATA=”{
\”msgtype\”: \”text\”,
\”text\”: {
\”content\”: \”GoldenGate Alert\\n$MESSAGE\”
},
\”at\”: {
\”isAtAll\”: false
}
}”
# 发送HTTP请求
curl -s -X POST \
-H “Content-Type: application/json” \
-d “$JSON_DATA” \
$DINGTALK_WEBHOOK
echo “$(date): DingTalk alert sent” >> /oracle/goldengate/monitor/alert.log
}
# 检查是否有告警
if [ -s $ALERT_FILE ]; then
send_dingtalk_alert
# 清空告警文件
> $ALERT_FILE
fi
Part04-生产案例与实战讲解
4.1 Oracle GoldenGate进程状态监控案例
以下是一个完整的进程状态监控案例:
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_process_monitor.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# 配置
GG_HOME=”/oracle/goldengate”
LOG_DIR=”$GG_HOME/monitor”
LOG_FILE=”$LOG_DIR/process_$(date +%Y%m%d).log”
ALERT_FILE=”$LOG_DIR/process_alert.log”
PID_FILE=”$LOG_DIR/gg_monitor.pid”
# 进程列表
MANAGER_PID=””
EXTRACT_LIST=”ext01 ext02″
REPLICAT_LIST=”rep01 rep02″
# 检查Manager
check_manager() {
echo “$(date): Checking Manager status” >> $LOG_FILE
# 获取Manager状态
MANAGER_STATUS=$(echo “info mgr” | $GG_HOME/ggsci 2>/dev/null | grep “Manager” | awk ‘{print $2}’)
if [ “$MANAGER_STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Manager is $MANAGER_STATUS” >> $ALERT_FILE
echo “$(date): Attempting to start Manager” >> $LOG_FILE
echo “start mgr” | $GG_HOME/ggsci
sleep 5
# 再次检查
MANAGER_STATUS=$(echo “info mgr” | $GG_HOME/ggsci 2>/dev/null | grep “Manager” | awk ‘{print $2}’)
if [ “$MANAGER_STATUS” != “RUNNING” ]; then
echo “$(date): CRITICAL – Failed to start Manager” >> $ALERT_FILE
return 2
fi
fi
echo “$(date): Manager is RUNNING” >> $LOG_FILE
return 0
}
# 检查Extract
check_extract() {
EXTRACT=$1
echo “$(date): Checking Extract $EXTRACT status” >> $LOG_FILE
EXTRACT_STATUS=$(echo “info extract $EXTRACT” | $GG_HOME/ggsci 2>/dev/null | grep “EXTRACT” | head -1 | awk ‘{print $2}’)
if [ “$EXTRACT_STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Extract $EXTRACT is $EXTRACT_STATUS” >> $ALERT_FILE
if [ “$EXTRACT_STATUS” == “ABENDED” ]; then
echo “$(date): Attempting to restart Extract $EXTRACT” >> $LOG_FILE
echo “start extract $EXTRACT” | $GG_HOME/ggsci
sleep 5
# 再次检查
EXTRACT_STATUS=$(echo “info extract $EXTRACT” | $GG_HOME/ggsci 2>/dev/null | grep “EXTRACT” | head -1 | awk ‘{print $2}’)
if [ “$EXTRACT_STATUS” != “RUNNING” ]; then
echo “$(date): CRITICAL – Failed to restart Extract $EXTRACT” >> $ALERT_FILE
return 2
fi
fi
fi
echo “$(date): Extract $EXTRACT is RUNNING” >> $LOG_FILE
return 0
}
# 检查Replicat
check_replicat() {
REPLICAT=$1
echo “$(date): Checking Replicat $REPLICAT status” >> $LOG_FILE
REPLICAT_STATUS=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci 2>/dev/null | grep “REPLICAT” | head -1 | awk ‘{print $2}’)
if [ “$REPLICAT_STATUS” != “RUNNING” ]; then
echo “$(date): ALERT – Replicat $REPLICAT is $REPLICAT_STATUS” >> $ALERT_FILE
if [ “$REPLICAT_STATUS” == “ABENDED” ]; then
echo “$(date): Attempting to restart Replicat $REPLICAT” >> $LOG_FILE
echo “start replicat $REPLICAT” | $GG_HOME/ggsci
sleep 5
# 再次检查
REPLICAT_STATUS=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci 2>/dev/null | grep “REPLICAT” | head -1 | awk ‘{print $2}’)
if [ “$REPLICAT_STATUS” != “RUNNING” ]; then
echo “$(date): CRITICAL – Failed to restart Replicat $REPLICAT” >> $ALERT_FILE
return 2
fi
fi
fi
echo “$(date): Replicat $REPLICAT is RUNNING” >> $LOG_FILE
return 0
}
# 主程序
main() {
# 创建日志目录
mkdir -p $LOG_DIR
# 检查是否已在运行
if [ -f $PID_FILE ]; then
OLD_PID=$(cat $PID_FILE)
if ps -p $OLD_PID > /dev/null 2>&1; then
echo “Monitor is already running with PID $OLD_PID”
exit 1
fi
fi
# 记录PID
echo $$ > $PID_FILE
echo “$(date): Starting GoldenGate process monitor” >> $LOG_FILE
# 检查Manager
check_manager
# 检查Extract
for EXTRACT in $EXTRACT_LIST; do
check_extract $EXTRACT
done
# 检查Replicat
for REPLICAT in $REPLICAT_LIST; do
check_replicat $REPLICAT
done
echo “$(date): GoldenGate process monitor completed” >> $LOG_FILE
# 发送告警
if [ -s $ALERT_FILE ]; then
/oracle/goldengate/monitor/gg_email_alert.sh
/oracle/goldengate/monitor/gg_dingtalk_alert.sh
fi
# 清理PID文件
rm -f $PID_FILE
}
main
4.2 Oracle GoldenGate延迟监控案例
以下是一个完整的延迟监控案例:
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_lag_monitor_advanced.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# 配置
GG_HOME=”/oracle/goldengate”
LOG_DIR=”$GG_HOME/monitor”
LOG_FILE=”$LOG_DIR/lag_$(date +%Y%m%d).log”
ALERT_FILE=”$LOG_DIR/lag_alert.log”
HISTORY_FILE=”$LOG_DIR/lag_history.csv”
# 阈值(秒)
LAG_CRITICAL=1800
LAG_WARNING=600
LAG_INFO=300
# 转换为秒的函数
time_to_seconds() {
TIME_STR=$1
HOURS=$(echo $TIME_STR | cut -d: -f1)
MINUTES=$(echo $TIME_STR | cut -d: -f2)
SECONDS=$(echo $TIME_STR | cut -d: -f3)
echo $((HOURS*3600 + MINUTES*60 + SECONDS))
}
# 监控Extract延迟
monitor_extract_lag() {
EXTRACT=$1
TIMESTAMP=$(date +%Y%m%d%H%M%S)
# 获取延迟信息
LAG_INFO=$(echo “info extract $EXTRACT” | $GG_HOME/ggsci 2>/dev/null | grep “Checkpoint Lag” | awk ‘{print $3}’)
if [ -z “$LAG_INFO” ]; then
echo “$TIMESTAMP,$EXTRACT,EXTRACT,UNKNOWN,-1,ERROR” >> $HISTORY_FILE
echo “$(date): ERROR – Cannot get lag info for Extract $EXTRACT” >> $LOG_FILE
return 1
fi
# 转换为秒
LAG_SECONDS=$(time_to_seconds $LAG_INFO)
# 确定告警级别
ALERT_LEVEL=”NORMAL”
if [ $LAG_SECONDS -gt $LAG_CRITICAL ]; then
ALERT_LEVEL=”CRITICAL”
echo “$(date): CRITICAL – Extract $EXTRACT lag $LAG_INFO exceeds critical threshold” >> $ALERT_FILE
elif [ $LAG_SECONDS -gt $LAG_WARNING ]; then
ALERT_LEVEL=”WARNING”
echo “$(date): WARNING – Extract $EXTRACT lag $LAG_INFO exceeds warning threshold” >> $ALERT_FILE
elif [ $LAG_SECONDS -gt $LAG_INFO ]; then
ALERT_LEVEL=”INFO”
fi
# 记录历史数据
echo “$TIMESTAMP,$EXTRACT,EXTRACT,$LAG_INFO,$LAG_SECONDS,$ALERT_LEVEL” >> $HISTORY_FILE
echo “$(date): Extract $EXTRACT lag is $LAG_INFO ($LAG_SECONDS seconds) – $ALERT_LEVEL” >> $LOG_FILE
return 0
}
# 监控Replicat延迟
monitor_replicat_lag() {
REPLICAT=$1
TIMESTAMP=$(date +%Y%m%d%H%M%S)
# 获取延迟信息
LAG_INFO=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci 2>/dev/null | grep “Checkpoint Lag” | awk ‘{print $3}’)
if [ -z “$LAG_INFO” ]; then
echo “$TIMESTAMP,$REPLICAT,REPLICAT,UNKNOWN,-1,ERROR” >> $HISTORY_FILE
echo “$(date): ERROR – Cannot get lag info for Replicat $REPLICAT” >> $LOG_FILE
return 1
fi
# 转换为秒
LAG_SECONDS=$(time_to_seconds $LAG_INFO)
# 确定告警级别
ALERT_LEVEL=”NORMAL”
if [ $LAG_SECONDS -gt $LAG_CRITICAL ]; then
ALERT_LEVEL=”CRITICAL”
echo “$(date): CRITICAL – Replicat $REPLICAT lag $LAG_INFO exceeds critical threshold” >> $ALERT_FILE
elif [ $LAG_SECONDS -gt $LAG_WARNING ]; then
ALERT_LEVEL=”WARNING”
echo “$(date): WARNING – Replicat $REPLICAT lag $LAG_INFO exceeds warning threshold” >> $ALERT_FILE
elif [ $LAG_SECONDS -gt $LAG_INFO ]; then
ALERT_LEVEL=”INFO”
fi
# 记录历史数据
echo “$TIMESTAMP,$REPLICAT,REPLICAT,$LAG_INFO,$LAG_SECONDS,$ALERT_LEVEL” >> $HISTORY_FILE
echo “$(date): Replicat $REPLICAT lag is $LAG_INFO ($LAG_SECONDS seconds) – $ALERT_LEVEL” >> $LOG_FILE
return 0
}
# 生成延迟趋势报告
generate_lag_report() {
REPORT_FILE=”$LOG_DIR/lag_report_$(date +%Y%m%d).html”
cat > $REPORT_FILE << EOF
GoldenGate Lag Report – $(date +%Y%m%d)
| Time | Process | Type | Lag | Seconds | Level |
|---|---|---|---|---|---|
| $TIMESTAMP | $PROCESS | $TYPE | $LAG | $SECONDS | $LEVEL |
” >> $REPORT_FILE
echo “$(date): Lag report generated: $REPORT_FILE” >> $LOG_FILE
}
# 主程序
main() {
mkdir -p $LOG_DIR
echo “$(date): Starting GoldenGate lag monitor” >> $LOG_FILE
# 监控Extract
monitor_extract_lag “ext01”
monitor_extract_lag “ext02”
# 监控Replicat
monitor_replicat_lag “rep01”
monitor_replicat_lag “rep02”
# 生成报告
generate_lag_report
echo “$(date): GoldenGate lag monitor completed” >> $LOG_FILE
# 发送告警
if [ -s $ALERT_FILE ]; then
/oracle/goldengate/monitor/gg_email_alert.sh
fi
}
main
4.3 Oracle GoldenGate综合监控平台案例
以下是一个综合监控平台的完整案例:
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_monitor_platform.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# 综合监控平台主控脚本
# 整合进程监控、延迟监控、资源监控、告警通知等功能
# 配置
GG_HOME=”/oracle/goldengate”
LOG_DIR=”$GG_HOME/monitor”
CONFIG_FILE=”$GG_HOME/monitor/monitor.conf”
# 加载配置
source $CONFIG_FILE
# 初始化日志目录
init_log_dir() {
mkdir -p $LOG_DIR/{status,lag,resource,alert,report}
# 创建历史数据目录
mkdir -p $LOG_DIR/history/{daily,hourly}
}
# 进程状态监控模块
module_process() {
echo “$(date): Running process monitoring module”
LOG_FILE=”$LOG_DIR/status/process_$(date +%Y%m%d%H).log”
# 获取所有进程状态
echo “info all” | $GG_HOME/ggsci > $LOG_DIR/status/current_status.txt
# 分析状态
while read LINE; do
if echo “$LINE” | grep -q “EXTRACT\|REPLICAT”; then
PROCESS=$(echo $LINE | awk ‘{print $1}’)
STATUS=$(echo $LINE | awk ‘{print $2}’)
GROUP=$(echo $LINE | awk ‘{print $3}’)
LAG=$(echo $LINE | awk ‘{print $4}’)
TIMESTAMP=$(date +%Y%m%d%H%M%S)
echo “$TIMESTAMP,$PROCESS,$GROUP,$STATUS,$LAG” >> $LOG_DIR/history/process_history.csv
# 检查异常状态
if [ “$STATUS” != “RUNNING” ]; then
echo “$TIMESTAMP,$PROCESS,$GROUP,$STATUS,ALERT” >> $LOG_DIR/alert/process_alert.csv
fi
fi
done < $LOG_DIR/status/current_status.txt
}
# 延迟监控模块
module_lag() {
echo "$(date): Running lag monitoring module"
# 获取所有Extract和Replicat列表
EXTRACT_LIST=$(echo "info all" | $GG_HOME/ggsci | grep "EXTRACT" | awk '{print $3}')
REPLICAT_LIST=$(echo "info all" | $GG_HOME/ggsci | grep "REPLICAT" | awk '{print $3}')
# 监控Extract延迟
for EXTRACT in $EXTRACT_LIST; do
LAG_INFO=$(echo "info extract $EXTRACT" | $GG_HOME/ggsci | grep "Checkpoint Lag" | awk '{print $3}')
TIMESTAMP=$(date +%Y%m%d%H%M%S)
echo "$TIMESTAMP,$EXTRACT,EXTRACT,$LAG_INFO" >> $LOG_DIR/history/lag_history.csv
done
# 监控Replicat延迟
for REPLICAT in $REPLICAT_LIST; do
LAG_INFO=$(echo “info replicat $REPLICAT” | $GG_HOME/ggsci | grep “Checkpoint Lag” | awk ‘{print $3}’)
TIMESTAMP=$(date +%Y%m%d%H%M%S)
echo “$TIMESTAMP,$REPLICAT,REPLICAT,$LAG_INFO” >> $LOG_DIR/history/lag_history.csv
done
}
# 资源监控模块
module_resource() {
echo “$(date): Running resource monitoring module”
TIMESTAMP=$(date +%Y%m%d%H%M%S)
# CPU使用率
CPU_USAGE=$(top -bn1 | grep “Cpu(s)” | awk ‘{print $2}’ | cut -d’%’ -f1)
# 内存使用率
MEM_USAGE=$(free | grep Mem | awk ‘{printf “%.2f”, $3/$2 * 100.0}’)
# 磁盘使用率
DISK_USAGE=$(df -h $GG_HOME | tail -1 | awk ‘{print $5}’ | cut -d’%’ -f1)
# GoldenGate进程资源使用
GG_CPU=$(ps -eo pid,comm,pcpu | grep “extract\|replicat” | awk ‘{sum+=$3} END {print sum}’)
GG_MEM=$(ps -eo pid,comm,pmem | grep “extract\|replicat” | awk ‘{sum+=$3} END {print sum}’)
# 记录数据
echo “$TIMESTAMP,$CPU_USAGE,$MEM_USAGE,$DISK_USAGE,$GG_CPU,$GG_MEM” >> $LOG_DIR/history/resource_history.csv
# 检查阈值
if [ ${DISK_USAGE%.*} -gt 80 ]; then
echo “$TIMESTAMP,DISK,$DISK_USAGE%,WARNING” >> $LOG_DIR/alert/resource_alert.csv
fi
}
# 告警处理模块
module_alert() {
echo “$(date): Running alert processing module”
ALERT_FILE=”$LOG_DIR/alert/all_alerts_$(date +%Y%m%d).log”
# 合并所有告警
cat $LOG_DIR/alert/*.csv > $ALERT_FILE 2>/dev/null
# 发送告警通知
if [ -s $ALERT_FILE ]; then
# 邮件告警
if [ “$EMAIL_ALERT” == “Y” ]; then
/oracle/goldengate/monitor/gg_email_alert.sh
fi
# 钉钉告警
if [ “$DINGTALK_ALERT” == “Y” ]; then
/oracle/goldengate/monitor/gg_dingtalk_alert.sh
fi
# 短信告警(严重告警)
if [ “$SMS_ALERT” == “Y” ] && grep -q “CRITICAL” $ALERT_FILE; then
/oracle/goldengate/monitor/gg_sms_alert.sh
fi
fi
}
# 报告生成模块
module_report() {
echo “$(date): Running report generation module”
REPORT_FILE=”$LOG_DIR/report/daily_report_$(date +%Y%m%d).html”
# 生成HTML报告
cat > $REPORT_FILE << 'EOF'
GoldenGate Daily Report – $(date +%Y-%m-%d)
Process Status
| Process | Group | Status | Lag |
|---|---|---|---|
| $PROCESS | $GROUP | $STATUS | $LAG |
Performance Metrics
| Time | CPU% | Memory% | Disk% | GG CPU% | GG Mem% |
|---|---|---|---|---|---|
| $TIMESTAMP | $CPU | $MEM | $DISK | $GGCPU | $GGMEM |
” >> $REPORT_FILE
echo “$(date): Report generated: $REPORT_FILE”
}
# 主程序
main() {
echo “$(date): Starting GoldenGate Monitor Platform”
# 初始化
init_log_dir
# 运行各监控模块
module_process
module_lag
module_resource
# 处理告警
module_alert
# 生成报告(每小时执行一次)
if [ $(date +%M) == “00” ]; then
module_report
fi
echo “$(date): GoldenGate Monitor Platform completed”
}
main
Part05-风哥经验总结与分享
5.1 Oracle GoldenGate监控总结
Oracle GoldenGate监控的关键要点:
- 全面覆盖:监控所有关键指标,包括进程状态、延迟、资源使用等
- 及时告警:设置合理的告警阈值,确保问题能够及时发现
- 自动化:使用自动化脚本,减少人工操作
- 历史记录:保留历史监控数据,便于趋势分析
- 定期优化:定期回顾监控策略,持续优化
5.2 Oracle GoldenGate监控检查清单
Oracle GoldenGate监控检查清单:
□ Manager进程状态
□ Extract进程状态
□ Replicat进程状态
□ 复制延迟情况
□ Trail文件空间
□ 系统资源使用
□ 错误日志检查
# 告警配置检查清单
□ 告警阈值设置合理
□ 告警通知方式配置正确
□ 告警升级机制完善
□ 告警抑制机制有效
# 监控脚本检查清单
□ 脚本执行权限正确
□ 日志目录存在且可写
□ crontab配置正确
□ 脚本输出正常
5.3 Oracle GoldenGate监控工具推荐
Oracle GoldenGate监控推荐工具:
- GGSCI:GoldenGate命令行工具,用于查看进程状态
- GoldenGate Monitor:Oracle官方图形化监控工具
- Oracle Enterprise Manager:综合管理平台
- Prometheus + Grafana:开源监控方案
- 自定义脚本:根据需求定制的监控脚本
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
