本文档风哥主要介绍Oracle GoldenGate高可用架构的实战操作,包括Oracle GoldenGate高可用的概念、架构、场景、实施方案、故障切换等内容,由风哥教程参考Oracle官方文档GoldenGate文档,适合DBA人员在学习和测试中使用,如果要应用于生产环境则需要自行确认。
Part01-基础概念与理论知识
1.1 Oracle GoldenGate高可用概念
Oracle GoldenGate高可用是指通过架构设计和配置优化,确保GoldenGate数据复制服务在硬件故障、软件故障、网络故障等情况下能够持续提供服务的能力。高可用架构能够最大程度地减少服务中断时间,保证业务连续性。更多视频教程www.fgedu.net.cn
- 消除单点故障
- 实现快速故障切换
- 保证数据复制连续性
- 最小化服务中断时间
- 支持灾难恢复
1.2 Oracle GoldenGate高可用架构
Oracle GoldenGate高可用架构主要包括:
- 主备架构:主节点运行GoldenGate进程,备节点待命
- 双活架构:两个节点同时运行GoldenGate进程,互为备份
- 集群架构:多个节点组成集群,共享存储
- 分布式架构:多个独立GoldenGate实例,负载均衡
1.3 Oracle GoldenGate高可用场景
Oracle GoldenGate高可用主要应用场景:
- 生产环境容灾:确保生产环境GoldenGate服务不中断
- 大数据量复制:通过负载均衡提高复制性能
- 关键业务系统:保证关键业务数据复制的连续性
- 跨地域部署:在不同地域部署GoldenGate实例
Part02-生产环境规划与建议
2.1 Oracle GoldenGate高可用规划
Oracle GoldenGate高可用规划要点:
业务重要性评估:
– 关键业务系统:需要99.99%可用性
– 重要业务系统:需要99.9%可用性
– 一般业务系统:需要99%可用性
RTO/RPO要求:
– RTO(恢复时间目标):故障发生后恢复服务的时间
– RPO(恢复点目标):允许丢失的数据量
# 高可用架构选择
单节点+冷备:
– 适用场景:一般业务系统
– 成本:低
– 可用性:99%
主备双机:
– 适用场景:重要业务系统
– 成本:中
– 可用性:99.9%
双活架构:
– 适用场景:关键业务系统
– 成本:高
– 可用性:99.99%
集群架构:
– 适用场景:大规模部署
– 成本:很高
– 可用性:99.999%
2.2 Oracle GoldenGate高可用设计
Oracle GoldenGate高可用设计要点:
主节点(Primary):
– 运行所有GoldenGate进程
– 处理所有数据复制任务
– 实时监控状态
备节点(Standby):
– 安装GoldenGate软件
– 同步配置文件
– 定期健康检查
– 故障时接管服务
# 双活架构设计
节点A:
– 运行Extract进程
– 处理部分表的数据复制
– 监控节点B状态
节点B:
– 运行Extract进程
– 处理部分表的数据复制
– 监控节点A状态
负载均衡:
– 按表分组分配
– 按业务模块分配
– 动态负载调整
# 故障检测机制
心跳检测:
– 每30秒检测一次
– 连续3次失败判定故障
– 自动触发切换
状态同步:
– 实时同步配置
– 定期同步检查点
– 确保数据一致性
2.3 Oracle GoldenGate高可用最佳实践
Oracle GoldenGate高可用最佳实践:
- 消除单点故障:所有组件都要有备份
- 自动故障检测:实现自动化的故障检测机制
- 快速故障切换:切换时间要控制在分钟级
- 数据一致性:确保切换后数据一致性
- 定期演练:定期进行故障切换演练
- 监控告警:建立完善的监控告警体系
Part03-生产环境项目实施方案
3.1 Oracle GoldenGate主备架构实施
3.1.1 主节点配置
# 1. 安装GoldenGate软件
# 参考Oracle GoldenGate安装文档
# 2. 配置Manager
GGSCI (fgedu-primary.net.cn) 1> edit params mgr
PORT 7809
DYNAMICPORTLIST 7810-7820
AUTOSTART EXTRACT *
AUTOSTART REPLICAT *
AUTORESTART EXTRACT *, RETRIES 3, WAITMINUTES 5
AUTORESTART REPLICAT *, RETRIES 3, WAITMINUTES 5
LAGREPORTMINUTES 10
LAGINFOMINUTES 5
LAGCRITICALMINUTES 10
PURGEOLDEXTRACTS /oracle/goldengate/dirdat/*, USECHECKPOINTS, MINKEEPDAYS 7
# 3. 配置Extract
GGSCI (fgedu-primary.net.cn) 2> edit params ext01
EXTRACT ext01
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
EXTTRAIL /oracle/goldengate/dirdat/et
TRANLOGOPTIONS INTEGRATEDPARAMS (parallelism 4)
TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_customers;
# 4. 配置数据泵
GGSCI (fgedu-primary.net.cn) 3> edit params dp01
EXTRACT dp01
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
PASSTHRU
RMTHOST 192.168.1.100, MGRPORT 7809
RMTTRAIL /oracle/goldengate/dirdat/rt
TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_customers;
# 5. 启动进程
GGSCI (fgedu-primary.net.cn) 4> start mgr
GGSCI (fgedu-primary.net.cn) 5> start extract ext01
GGSCI (fgedu-primary.net.cn) 6> start extract dp01
# 6. 验证状态
GGSCI (fgedu-primary.net.cn) 7> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EXT01 00:00:00 00:00:03
EXTRACT RUNNING DP01 00:00:00 00:00:05
3.1.2 备节点配置
# 1. 安装GoldenGate软件
# 与主节点相同版本
# 2. 配置Manager
GGSCI (fgedu-standby.net.cn) 1> edit params mgr
PORT 7809
DYNAMICPORTLIST 7810-7820
LAGREPORTMINUTES 10
LAGINFOMINUTES 5
LAGCRITICALMINUTES 10
# 3. 同步配置文件
# 从主节点同步参数文件
$ rsync -avz fgedu-primary.net.cn:/oracle/goldengate/dirprm/ /oracle/goldengate/dirprm/
$ rsync -avz fgedu-primary.net.cn:/oracle/goldengate/dirdef/ /oracle/goldengate/dirdef/
$ rsync -avz fgedu-primary.net.cn:/oracle/goldengate/dircrd/ /oracle/goldengate/dircrd/
$ rsync -avz fgedu-primary.net.cn:/oracle/goldengate/GLOBALS /oracle/goldengate/
# 4. 修改参数文件(适配备节点)
# 修改Extract参数,使用备节点数据库连接
GGSCI (fgedu-standby.net.cn) 2> edit params ext01
EXTRACT ext01
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin_standby # 使用备节点数据库凭据
EXTTRAIL /oracle/goldengate/dirdat/et
TRANLOGOPTIONS INTEGRATEDPARAMS (parallelism 4)
TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_customers;
# 5. 配置健康检查脚本
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# ha_health_check.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
PRIMARY_HOST=”fgedu-primary.net.cn”
STANDBY_HOST=”fgedu-standby.net.cn”
GG_HOME=”/oracle/goldengate”
# 检查主节点状态
check_primary() {
# 检查主节点Manager
PRIMARY_MGR=$(ssh $PRIMARY_HOST “echo ‘info mgr’ | $GG_HOME/ggsci | grep ‘Manager’ | awk ‘{print \$2}'”)
if [ “$PRIMARY_MGR” != “RUNNING” ]; then
echo “Primary Manager is not running”
return 1
fi
# 检查主节点Extract
PRIMARY_EXT=$(ssh $PRIMARY_HOST “echo ‘info extract ext01’ | $GG_HOME/ggsci | grep ‘EXTRACT’ | awk ‘{print \$2}'”)
if [ “$PRIMARY_EXT” != “RUNNING” ]; then
echo “Primary Extract is not running”
return 1
fi
return 0
}
# 主程序
if ! check_primary; then
echo “Primary node check failed, standby node should take over”
# 触发切换逻辑
fi
3.2 Oracle GoldenGate故障切换实施
3.2.1 自动故障切换脚本
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_failover.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# GoldenGate自动故障切换脚本
PRIMARY_HOST=”fgedu-primary.net.cn”
STANDBY_HOST=”fgedu-standby.net.cn”
GG_HOME=”/oracle/goldengate”
FAILOVER_LOG=”$GG_HOME/ha/failover.log”
PID_FILE=”$GG_HOME/ha/failover.pid”
# 检查是否已在运行
if [ -f $PID_FILE ]; then
OLD_PID=$(cat $PID_FILE)
if ps -p $OLD_PID > /dev/null 2>&1; then
echo “Failover script is already running”
exit 1
fi
fi
echo $$ > $PID_FILE
# 记录日志
log_message() {
echo “$(date ‘+%Y-%m-%d %H:%M:%S’): $1” >> $FAILOVER_LOG
}
# 检查主节点健康状态
check_primary_health() {
log_message “Checking primary node health…”
# 检查网络连通性
if ! ping -c 3 $PRIMARY_HOST > /dev/null 2>&1; then
log_message “Primary node network unreachable”
return 1
fi
# 检查Manager进程
MGR_STATUS=$(ssh $PRIMARY_HOST “echo ‘info mgr’ | $GG_HOME/ggsci 2>/dev/null | grep ‘Manager’ | awk ‘{print \$2}'”)
if [ “$MGR_STATUS” != “RUNNING” ]; then
log_message “Primary Manager is not running: $MGR_STATUS”
return 1
fi
# 检查Extract进程
EXT_STATUS=$(ssh $PRIMARY_HOST “echo ‘info extract ext01’ | $GG_HOME/ggsci 2>/dev/null | grep ‘EXTRACT’ | head -1 | awk ‘{print \$2}'”)
if [ “$EXT_STATUS” != “RUNNING” ]; then
log_message “Primary Extract is not running: $EXT_STATUS”
return 1
fi
log_message “Primary node is healthy”
return 0
}
# 执行故障切换
perform_failover() {
log_message “Starting failover process…”
# 1. 同步最新配置
log_message “Synchronizing configuration from primary…”
rsync -avz –delete $PRIMARY_HOST:$GG_HOME/dirprm/ $GG_HOME/dirprm/ 2>/dev/null
rsync -avz –delete $PRIMARY_HOST:$GG_HOME/dirchk/ $GG_HOME/dirchk/ 2>/dev/null
# 2. 启动Manager
log_message “Starting Manager on standby…”
echo “start mgr” | $GG_HOME/ggsci
sleep 5
# 3. 验证Manager状态
MGR_STATUS=$(echo “info mgr” | $GG_HOME/ggsci | grep “Manager” | awk ‘{print $2}’)
if [ “$MGR_STATUS” != “RUNNING” ]; then
log_message “Failed to start Manager on standby”
return 1
fi
# 4. 启动Extract
log_message “Starting Extract on standby…”
echo “start extract ext01” | $GG_HOME/ggsci
sleep 5
# 5. 验证Extract状态
EXT_STATUS=$(echo “info extract ext01” | $GG_HOME/ggsci | grep “EXTRACT” | head -1 | awk ‘{print $2}’)
if [ “$EXT_STATUS” != “RUNNING” ]; then
log_message “Failed to start Extract on standby”
return 1
fi
# 6. 启动数据泵
log_message “Starting Data Pump on standby…”
echo “start extract dp01” | $GG_HOME/ggsci
sleep 5
# 7. 验证切换结果
log_message “Verifying failover result…”
INFO_ALL=$(echo “info all” | $GG_HOME/ggsci)
log_message “Standby node status:\n$INFO_ALL”
# 8. 发送告警通知
echo “GoldenGate failover completed. Standby node is now active.” | \
mail -s “GoldenGate Failover Alert” dba@fgedu.net.cn
log_message “Failover completed successfully”
return 0
}
# 主程序
main() {
log_message “Failover monitoring started”
# 连续检测3次,每次间隔30秒
FAILURE_COUNT=0
for i in 1 2 3; do
if ! check_primary_health; then
FAILURE_COUNT=$((FAILURE_COUNT + 1))
fi
sleep 30
done
# 如果3次检测都失败,执行切换
if [ $FAILURE_COUNT -ge 3 ]; then
log_message “Primary node failed $FAILURE_COUNT times, initiating failover”
perform_failover
else
log_message “Primary node is stable, no failover needed”
fi
log_message “Failover monitoring completed”
}
# 执行主程序
main
# 清理PID文件
rm -f $PID_FILE
3.2.2 故障回切操作
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_failback.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# GoldenGate故障回切脚本
PRIMARY_HOST=”fgedu-primary.net.cn”
STANDBY_HOST=”fgedu-standby.net.cn”
GG_HOME=”/oracle/goldengate”
FAILBACK_LOG=”$GG_HOME/ha/failback.log”
# 记录日志
log_message() {
echo “$(date ‘+%Y-%m-%d %H:%M:%S’): $1” >> $FAILBACK_LOG
}
# 检查主节点是否恢复
check_primary_recovery() {
log_message “Checking if primary node has recovered…”
# 检查网络连通性
if ! ping -c 3 $PRIMARY_HOST > /dev/null 2>&1; then
log_message “Primary node still unreachable”
return 1
fi
# 检查数据库连接
DB_STATUS=$(ssh $PRIMARY_HOST “sqlplus -s / as sysdba << EOF
SET PAGESIZE 0 FEEDBACK OFF VERIFY OFF HEADING OFF ECHO OFF
SELECT status FROM v\$instance;
EXIT;
EOF” | tr -d ‘[:space:]’)
if [ “$DB_STATUS” != “OPEN” ]; then
log_message “Primary database is not open: $DB_STATUS”
return 1
fi
log_message “Primary node has recovered”
return 0
}
# 执行故障回切
perform_failback() {
log_message “Starting failback process…”
# 1. 停止备节点进程
log_message “Stopping standby node processes…”
echo “stop extract dp01” | $GG_HOME/ggsci
sleep 5
echo “stop extract ext01” | $GG_HOME/ggsci
sleep 5
echo “stop mgr” | $GG_HOME/ggsci
sleep 5
# 2. 同步检查点信息到主节点
log_message “Synchronizing checkpoint information…”
rsync -avz $GG_HOME/dirchk/ $PRIMARY_HOST:$GG_HOME/dirchk/
# 3. 启动主节点进程
log_message “Starting primary node processes…”
ssh $PRIMARY_HOST “echo ‘start mgr’ | $GG_HOME/ggsci”
sleep 5
ssh $PRIMARY_HOST “echo ‘start extract ext01’ | $GG_HOME/ggsci”
sleep 5
ssh $PRIMARY_HOST “echo ‘start extract dp01’ | $GG_HOME/ggsci”
sleep 5
# 4. 验证主节点状态
log_message “Verifying primary node status…”
PRIMARY_STATUS=$(ssh $PRIMARY_HOST “echo ‘info all’ | $GG_HOME/ggsci”)
log_message “Primary node status:\n$PRIMARY_STATUS”
# 5. 发送通知
echo “GoldenGate failback completed. Primary node is now active.” | \
mail -s “GoldenGate Failback Notification” dba@fgedu.net.cn
log_message “Failback completed successfully”
}
# 主程序
main() {
log_message “Failback check started”
if check_primary_recovery; then
# 等待用户确认
log_message “Primary node is ready for failback”
echo “Primary node has recovered. Do you want to perform failback? (yes/no)”
read CONFIRM
if [ “$CONFIRM” == “yes” ]; then
perform_failback
else
log_message “Failback cancelled by user”
fi
else
log_message “Primary node not ready for failback”
fi
log_message “Failback check completed”
}
main
3.3 Oracle GoldenGate负载均衡实施
3.3.1 双活架构配置
# 负责订单相关表
GGSCI (fgedu-nodea.net.cn) 1> edit params ext_orders
EXTRACT ext_orders
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
EXTTRAIL /oracle/goldengate/dirdat/eo
TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_order_items;
GGSCI (fgedu-nodea.net.cn) 2> edit params dp_orders
EXTRACT dp_orders
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
PASSTHRU
RMTHOST 192.168.1.100, MGRPORT 7809
RMTTRAIL /oracle/goldengate/dirdat/ro
TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_order_items;
# 节点B(fgedu-nodeb.net.cn)配置
# 负责客户相关表
GGSCI (fgedu-nodeb.net.cn) 1> edit params ext_customers
EXTRACT ext_customers
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
EXTTRAIL /oracle/goldengate/dirdat/ec
TABLE fgedu.fgedu_customers;
TABLE fgedu.fgedu_products;
GGSCI (fgedu-nodeb.net.cn) 2> edit params dp_customers
EXTRACT dp_customers
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
PASSTHRU
RMTHOST 192.168.1.100, MGRPORT 7809
RMTTRAIL /oracle/goldengate/dirdat/rc
TABLE fgedu.fgedu_customers;
TABLE fgedu.fgedu_products;
# 目标端配置
GGSCI (fgedu-target.net.cn) 1> edit params rep_orders
REPLICAT rep_orders
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin_target
DISCARDFILE /oracle/goldengate/dirrpt/rep_orders.dsc, APPEND, MEGABYTES 100
MAP fgedu.fgedu_orders, TARGET fgedu.fgedu_orders;
MAP fgedu.fgedu_order_items, TARGET fgedu.fgedu_order_items;
GGSCI (fgedu-target.net.cn) 2> edit params rep_customers
REPLICAT rep_customers
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin_target
DISCARDFILE /oracle/goldengate/dirrpt/rep_customers.dsc, APPEND, MEGABYTES 100
MAP fgedu.fgedu_customers, TARGET fgedu.fgedu_customers;
MAP fgedu.fgedu_products, TARGET fgedu.fgedu_products;
# 启动双活架构
# 节点A
GGSCI (fgedu-nodea.net.cn) 3> start mgr
GGSCI (fgedu-nodea.net.cn) 4> start extract ext_orders
GGSCI (fgedu-nodea.net.cn) 5> start extract dp_orders
# 节点B
GGSCI (fgedu-nodeb.net.cn) 3> start mgr
GGSCI (fgedu-nodeb.net.cn) 4> start extract ext_customers
GGSCI (fgedu-nodeb.net.cn) 5> start extract dp_customers
# 目标端
GGSCI (fgedu-target.net.cn) 3> start replicat rep_orders
GGSCI (fgedu-target.net.cn) 4> start replicat rep_customers
# 验证状态
GGSCI (fgedu-nodea.net.cn) 6> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EXT_ORDERS 00:00:00 00:00:03
EXTRACT RUNNING DP_ORDERS 00:00:00 00:00:05
GGSCI (fgedu-nodeb.net.cn) 6> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EXT_CUST 00:00:00 00:00:03
EXTRACT RUNNING DP_CUST 00:00:00 00:00:05
Part04-生产案例与实战讲解
4.1 Oracle GoldenGate双活架构案例
以下是一个完整的双活架构实施案例:
# 架构设计
# 节点A:处理订单、支付相关表
# 节点B:处理用户、商品相关表
# 目标端:统一接收所有数据
# 1. 表分组策略
# 节点A负责表:
# – fgedu_orders
# – fgedu_order_items
# – fgedu_payments
# – fgedu_invoices
# 节点B负责表:
# – fgedu_customers
# – fgedu_products
# – fgedu_inventory
# – fgedu_categories
# 2. 节点A配置
GGSCI (fgedu-nodea.net.cn) 1> edit params ext_a
EXTRACT ext_a
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
EXTTRAIL /oracle/goldengate/dirdat/ea
CACHEMGR CACHESIZE 4GB
TRANLOGOPTIONS INTEGRATEDPARAMS (parallelism 4)
TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_order_items;
TABLE fgedu.fgedu_payments;
TABLE fgedu.fgedu_invoices;
# 3. 节点B配置
GGSCI (fgedu-nodeb.net.cn) 1> edit params ext_b
EXTRACT ext_b
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
EXTTRAIL /oracle/goldengate/dirdat/eb
CACHEMGR CACHESIZE 4GB
TRANLOGOPTIONS INTEGRATEDPARAMS (parallelism 4)
TABLE fgedu.fgedu_customers;
TABLE fgedu.fgedu_products;
TABLE fgedu.fgedu_inventory;
TABLE fgedu.fgedu_categories;
# 4. 健康检查脚本
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# dual_active_health_check.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
NODE_A=”fgedu-nodea.net.cn”
NODE_B=”fgedu-nodeb.net.cn”
GG_HOME=”/oracle/goldengate”
# 检查节点A
check_node_a() {
echo “Checking Node A…”
STATUS_A=$(ssh $NODE_A “echo ‘info all’ | $GG_HOME/ggsci” 2>/dev/null)
if echo “$STATUS_A” | grep -q “ABENDED\|STOPPED”; then
echo “Node A has issues:”
echo “$STATUS_A”
return 1
fi
echo “Node A is healthy”
return 0
}
# 检查节点B
check_node_b() {
echo “Checking Node B…”
STATUS_B=$(ssh $NODE_B “echo ‘info all’ | $GG_HOME/ggsci” 2>/dev/null)
if echo “$STATUS_B” | grep -q “ABENDED\|STOPPED”; then
echo “Node B has issues:”
echo “$STATUS_B”
return 1
fi
echo “Node B is healthy”
return 0
}
# 负载均衡检查
check_load_balance() {
echo “Checking load balance…”
# 获取节点A统计
STATS_A=$(ssh $NODE_A “echo ‘stats extract ext_a’ | $GG_HOME/ggsci” 2>/dev/null)
OPS_A=$(echo “$STATS_A” | grep “Total operations” | tail -1 | awk ‘{print $NF}’)
# 获取节点B统计
STATS_B=$(ssh $NODE_B “echo ‘stats extract ext_b’ | $GG_HOME/ggsci” 2>/dev/null)
OPS_B=$(echo “$STATS_B” | grep “Total operations” | tail -1 | awk ‘{print $NF}’)
echo “Node A operations: $OPS_A”
echo “Node B operations: $OPS_B”
# 检查负载是否均衡(差异不超过20%)
if [ -n “$OPS_A” ] && [ -n “$OPS_B” ]; then
DIFF=$(echo “scale=2; ($OPS_A – $OPS_B) / $OPS_A * 100” | bc | tr -d ‘-‘)
if (( $(echo “$DIFF > 20” | bc -l) )); then
echo “WARNING: Load imbalance detected (${DIFF}%)”
return 1
fi
fi
echo “Load is balanced”
return 0
}
# 主程序
main() {
echo “=== Dual-Active Health Check ===”
echo “Time: $(date)”
check_node_a
check_node_b
check_load_balance
echo “Health check completed”
}
main
4.2 Oracle GoldenGate故障切换案例
以下是一个完整的故障切换案例:
# 1. 故障检测
# 监控系统检测到主节点异常
[2026-04-03 10:00:00] ALERT: Primary node fgedu-primary.net.cn is unreachable
[2026-04-03 10:00:30] ALERT: Primary node still unreachable after retry
[2026-04-03 10:01:00] ALERT: Primary node failed 3 consecutive checks
[2026-04-03 10:01:01] INFO: Initiating automatic failover
# 2. 执行故障切换
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# execute_failover.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
GG_HOME=”/oracle/goldengate”
FAILOVER_LOG=”$GG_HOME/ha/failover_$(date +%Y%m%d%H%M%S).log”
echo “$(date): Failover started” >> $FAILOVER_LOG
# 步骤1:同步配置
echo “$(date): Step 1 – Synchronizing configuration” >> $FAILOVER_LOG
rsync -avz fgedu-primary.net.cn:$GG_HOME/dirprm/ $GG_HOME/dirprm/ 2>/dev/null
echo “$(date): Configuration synchronized” >> $FAILOVER_LOG
# 步骤2:启动Manager
echo “$(date): Step 2 – Starting Manager” >> $FAILOVER_LOG
echo “start mgr” | $GG_HOME/ggsci >> $FAILOVER_LOG 2>&1
sleep 5
# 步骤3:启动Extract
echo “$(date): Step 3 – Starting Extract” >> $FAILOVER_LOG
echo “start extract ext01” | $GG_HOME/ggsci >> $FAILOVER_LOG 2>&1
sleep 5
# 步骤4:启动数据泵
echo “$(date): Step 4 – Starting Data Pump” >> $FAILOVER_LOG
echo “start extract dp01” | $GG_HOME/ggsci >> $FAILOVER_LOG 2>&1
sleep 5
# 步骤5:验证状态
echo “$(date): Step 5 – Verifying status” >> $FAILOVER_LOG
STATUS=$(echo “info all” | $GG_HOME/ggsci)
echo “$STATUS” >> $FAILOVER_LOG
# 检查是否成功
if echo “$STATUS” | grep -q “ABENDED\|STOPPED”; then
echo “$(date): FAILOVER FAILED – Some processes are not running” >> $FAILOVER_LOG
echo “GoldenGate failover FAILED” | mail -s “CRITICAL: Failover Failed” dba@fgedu.net.cn
exit 1
else
echo “$(date): FAILOVER SUCCESSFUL” >> $FAILOVER_LOG
echo “GoldenGate failover completed successfully” | mail -s “Failover Completed” dba@fgedu.net.cn
fi
# 3. 切换结果验证
GGSCI (fgedu-standby.net.cn) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EXT01 00:00:00 00:00:03
EXTRACT RUNNING DP01 00:00:00 00:00:05
# 4. 业务验证
# 检查数据复制是否正常
SQL> SELECT COUNT(*) FROM fgedu.fgedu_orders;
COUNT(*)
———-
10000000
# 数据复制正常,故障切换成功
# 5. 故障恢复后的回切
# 当主节点恢复后,执行回切操作
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# execute_failback.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
echo “Primary node has recovered. Executing failback…”
# 停止备节点
echo “stop extract dp01” | $GG_HOME/ggsci
echo “stop extract ext01” | $GG_HOME/ggsci
echo “stop mgr” | $GG_HOME/ggsci
# 同步检查点
rsync -avz $GG_HOME/dirchk/ fgedu-primary.net.cn:$GG_HOME/dirchk/
# 启动主节点
ssh fgedu-primary.net.cn “echo ‘start mgr’ | $GG_HOME/ggsci”
ssh fgedu-primary.net.cn “echo ‘start extract ext01’ | $GG_HOME/ggsci”
ssh fgedu-primary.net.cn “echo ‘start extract dp01’ | $GG_HOME/ggsci”
echo “Failback completed”
4.3 Oracle GoldenGate容灾切换案例
以下是一个完整的容灾切换案例:
# 架构说明
# 生产中心:fgedu-prod.net.cn
# 灾备中心:fgedu-dr.net.cn
# 复制模式:单向复制(生产->灾备)
# 1. 容灾切换前准备
# 确认灾备中心状态
GGSCI (fgedu-dr.net.cn) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT RUNNING REP_DR 00:00:00 00:00:03
# 确认延迟情况
GGSCI (fgedu-dr.net.cn) 2> lag replicat rep_dr
Sending LAG request to REPLICAT REP_DR …
Last record lag: 3 seconds.
At EOF, no more records to process.
# 2. 执行容灾切换
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# disaster_recovery_switch.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
DR_HOST=”fgedu-dr.net.cn”
GG_HOME=”/oracle/goldengate”
SWITCH_LOG=”$GG_HOME/dr/switch_$(date +%Y%m%d%H%M%S).log”
echo “$(date): Disaster recovery switch started” >> $SWITCH_LOG
# 步骤1:停止Replicat
echo “$(date): Step 1 – Stopping Replicat” >> $SWITCH_LOG
ssh $DR_HOST “echo ‘stop replicat rep_dr’ | $GG_HOME/ggsci” >> $SWITCH_LOG 2>&1
# 步骤2:验证数据一致性
echo “$(date): Step 2 – Verifying data consistency” >> $SWITCH_LOG
ssh $DR_HOST “sqlplus -s fgedu/password@fgedudb_dr << EOF
SET PAGESIZE 0 FEEDBACK OFF VERIFY OFF HEADING OFF ECHO OFF
SELECT 'ORDERS_COUNT:'||COUNT(*) FROM fgedu_orders;
SELECT ‘CUSTOMERS_COUNT:’||COUNT(*) FROM fgedu_customers;
EXIT;
EOF” >> $SWITCH_LOG 2>&1
# 步骤3:激活灾备数据库
echo “$(date): Step 3 – Activating DR database” >> $SWITCH_LOG
ssh $DR_HOST “sqlplus -s / as sysdba << EOF
ALTER DATABASE ACTIVATE STANDBY DATABASE;
EXIT;
EOF” >> $SWITCH_LOG 2>&1
# 步骤4:配置反向复制(可选)
echo “$(date): Step 4 – Configuring reverse replication” >> $SWITCH_LOG
# 创建从灾备到生产的Extract(当生产恢复后使用)
ssh $DR_HOST “cat > $GG_HOME/dirprm/ext_reverse.prm << 'EOF'
EXTRACT ext_reverse
SETENV (NLS_LANG=\"AMERICAN_AMERICA.AL32UTF8\")
USERIDALIAS fgedu_ggadmin_dr
EXTTRAIL /oracle/goldengate/dirdat/er
TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_customers;
EOF”
# 步骤5:启动灾备中心业务
echo “$(date): Step 5 – Starting DR business services” >> $SWITCH_LOG
# 启动应用服务器、负载均衡器等
# 步骤6:通知相关人员
echo “$(date): Step 6 – Sending notifications” >> $SWITCH_LOG
echo “Disaster recovery switch completed. DR site is now active.” | \
mail -s “DR Switch Completed” dba@fgedu.net.cn ops@fgedu.net.cn
echo “$(date): Disaster recovery switch completed” >> $SWITCH_LOG
# 3. 切换后监控
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# dr_monitoring.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
DR_HOST=”fgedu-dr.net.cn”
GG_HOME=”/oracle/goldengate”
# 监控数据库状态
DB_STATUS=$(ssh $DR_HOST “sqlplus -s / as sysdba << EOF
SET PAGESIZE 0 FEEDBACK OFF VERIFY OFF HEADING OFF ECHO OFF
SELECT status FROM v\$instance;
EXIT;
EOF” | tr -d ‘[:space:]’)
if [ “$DB_STATUS” != “OPEN” ]; then
echo “ALERT: DR database is not open: $DB_STATUS”
exit 1
fi
# 监控表空间使用率
ssh $DR_HOST “sqlplus -s / as sysdba << EOF
SET LINESIZE 200
SET PAGESIZE 100
COLUMN tablespace_name FORMAT A20
COLUMN used_pct FORMAT 999.99
SELECT tablespace_name,
ROUND((1 - free_space / total_space) * 100, 2) used_pct
FROM (SELECT tablespace_name,
SUM(bytes) / 1024 / 1024 free_space
FROM dba_free_space
GROUP BY tablespace_name) f,
(SELECT tablespace_name,
SUM(bytes) / 1024 / 1024 total_space
FROM dba_data_files
GROUP BY tablespace_name) t
WHERE f.tablespace_name = t.tablespace_name
ORDER BY used_pct DESC;
EXIT;
EOF”
# 4. 生产恢复后的回切
# 当生产中心恢复后,执行回切操作
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# dr_failback.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
PROD_HOST=”fgedu-prod.net.cn”
DR_HOST=”fgedu-dr.net.cn”
GG_HOME=”/oracle/goldengate”
echo “Production site has recovered. Preparing failback…”
# 步骤1:同步数据
echo “Synchronizing data from DR to Production…”
# 使用GoldenGate或Data Pump进行数据同步
# 步骤2:停止DR业务
echo “Stopping DR business services…”
# 步骤3:切换回生产中心
echo “Switching back to production site…”
# 步骤4:恢复GoldenGate复制
echo “Restoring GoldenGate replication…”
ssh $PROD_HOST “echo ‘start mgr’ | $GG_HOME/ggsci”
ssh $PROD_HOST “echo ‘start extract ext01’ | $GG_HOME/ggsci”
# 步骤5:启动生产业务
echo “Starting production business services…”
echo “Failback completed”
Part05-风哥经验总结与分享
5.1 Oracle GoldenGate高可用总结
Oracle GoldenGate高可用架构的关键要点:
- 架构选择:根据业务需求选择合适的高可用架构
- 自动切换:实现自动化的故障检测和切换
- 数据一致性:确保切换后数据一致性
- 定期演练:定期进行故障切换演练
- 监控告警:建立完善的监控告警体系
5.2 Oracle GoldenGate高可用检查清单
Oracle GoldenGate高可用检查清单:
□ 主备节点软件版本一致
□ 配置文件同步机制建立
□ 故障检测脚本部署
□ 自动切换脚本测试
□ 告警通知配置完成
□ 回切流程验证
# 日常维护检查清单
□ 主备节点状态检查
□ 配置同步状态检查
□ 故障检测脚本运行状态
□ 切换演练计划执行
# 故障切换检查清单
□ 故障确认和评估
□ 切换前数据备份
□ 执行切换操作
□ 验证切换结果
□ 通知相关人员
□ 记录切换过程
5.3 Oracle GoldenGate高可用工具推荐
Oracle GoldenGate高可用推荐工具:
- Keepalived:实现VIP漂移和故障切换
- Pacemaker:集群资源管理工具
- 自定义脚本:根据需求定制的切换脚本
- 监控工具:Zabbix、Prometheus等
- 告警工具:钉钉、企业微信、短信网关
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
