1. 首页 > Oracle教程 > 正文

Oracle教程FG254-Oracle GoldenGate高可用架构实战

本文档风哥主要介绍Oracle GoldenGate高可用架构的实战操作,包括Oracle GoldenGate高可用的概念、架构、场景、实施方案、故障切换等内容,由风哥教程参考Oracle官方文档GoldenGate文档,适合DBA人员在学习和测试中使用,如果要应用于生产环境则需要自行确认。

Part01-基础概念与理论知识

1.1 Oracle GoldenGate高可用概念

Oracle GoldenGate高可用是指通过架构设计和配置优化,确保GoldenGate数据复制服务在硬件故障、软件故障、网络故障等情况下能够持续提供服务的能力。高可用架构能够最大程度地减少服务中断时间,保证业务连续性。更多视频教程www.fgedu.net.cn

Oracle GoldenGate高可用的主要目标:

  • 消除单点故障
  • 实现快速故障切换
  • 保证数据复制连续性
  • 最小化服务中断时间
  • 支持灾难恢复

1.2 Oracle GoldenGate高可用架构

Oracle GoldenGate高可用架构主要包括:

  • 主备架构:主节点运行GoldenGate进程,备节点待命
  • 双活架构:两个节点同时运行GoldenGate进程,互为备份
  • 集群架构:多个节点组成集群,共享存储
  • 分布式架构:多个独立GoldenGate实例,负载均衡

1.3 Oracle GoldenGate高可用场景

Oracle GoldenGate高可用主要应用场景:

  • 生产环境容灾:确保生产环境GoldenGate服务不中断
  • 大数据量复制:通过负载均衡提高复制性能
  • 关键业务系统:保证关键业务数据复制的连续性
  • 跨地域部署:在不同地域部署GoldenGate实例
风哥提示:高可用架构要根据业务需求和成本预算进行设计。不是所有系统都需要最高级别的高可用,要权衡可用性和成本。学习交流加群风哥微信: itpux-com

Part02-生产环境规划与建议

2.1 Oracle GoldenGate高可用规划

Oracle GoldenGate高可用规划要点:

# 高可用需求分析
业务重要性评估:
– 关键业务系统:需要99.99%可用性
– 重要业务系统:需要99.9%可用性
– 一般业务系统:需要99%可用性

RTO/RPO要求:
– RTO(恢复时间目标):故障发生后恢复服务的时间
– RPO(恢复点目标):允许丢失的数据量

# 高可用架构选择
单节点+冷备:
– 适用场景:一般业务系统
– 成本:低
– 可用性:99%

主备双机:
– 适用场景:重要业务系统
– 成本:中
– 可用性:99.9%

双活架构:
– 适用场景:关键业务系统
– 成本:高
– 可用性:99.99%

集群架构:
– 适用场景:大规模部署
– 成本:很高
– 可用性:99.999%

2.2 Oracle GoldenGate高可用设计

Oracle GoldenGate高可用设计要点:

# 主备架构设计
主节点(Primary):
– 运行所有GoldenGate进程
– 处理所有数据复制任务
– 实时监控状态

备节点(Standby):
– 安装GoldenGate软件
– 同步配置文件
– 定期健康检查
– 故障时接管服务

# 双活架构设计
节点A:
– 运行Extract进程
– 处理部分表的数据复制
– 监控节点B状态

节点B:
– 运行Extract进程
– 处理部分表的数据复制
– 监控节点A状态

负载均衡:
– 按表分组分配
– 按业务模块分配
– 动态负载调整

# 故障检测机制
心跳检测:
– 每30秒检测一次
– 连续3次失败判定故障
– 自动触发切换

状态同步:
– 实时同步配置
– 定期同步检查点
– 确保数据一致性

2.3 Oracle GoldenGate高可用最佳实践

Oracle GoldenGate高可用最佳实践:

  • 消除单点故障:所有组件都要有备份
  • 自动故障检测:实现自动化的故障检测机制
  • 快速故障切换:切换时间要控制在分钟级
  • 数据一致性:确保切换后数据一致性
  • 定期演练:定期进行故障切换演练
  • 监控告警:建立完善的监控告警体系
生产环境建议:高可用架构要综合考虑可用性、性能、成本等因素。建议从简单的主备架构开始,根据业务发展逐步升级。学习交流加群风哥QQ113257174

Part03-生产环境项目实施方案

3.1 Oracle GoldenGate主备架构实施

3.1.1 主节点配置

# 主节点(fgedu-primary.net.cn)配置

# 1. 安装GoldenGate软件
# 参考Oracle GoldenGate安装文档

# 2. 配置Manager
GGSCI (fgedu-primary.net.cn) 1> edit params mgr

PORT 7809
DYNAMICPORTLIST 7810-7820
AUTOSTART EXTRACT *
AUTOSTART REPLICAT *
AUTORESTART EXTRACT *, RETRIES 3, WAITMINUTES 5
AUTORESTART REPLICAT *, RETRIES 3, WAITMINUTES 5
LAGREPORTMINUTES 10
LAGINFOMINUTES 5
LAGCRITICALMINUTES 10
PURGEOLDEXTRACTS /oracle/goldengate/dirdat/*, USECHECKPOINTS, MINKEEPDAYS 7

# 3. 配置Extract
GGSCI (fgedu-primary.net.cn) 2> edit params ext01

EXTRACT ext01
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
EXTTRAIL /oracle/goldengate/dirdat/et

TRANLOGOPTIONS INTEGRATEDPARAMS (parallelism 4)

TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_customers;

# 4. 配置数据泵
GGSCI (fgedu-primary.net.cn) 3> edit params dp01

EXTRACT dp01
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
PASSTHRU
RMTHOST 192.168.1.100, MGRPORT 7809
RMTTRAIL /oracle/goldengate/dirdat/rt

TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_customers;

# 5. 启动进程
GGSCI (fgedu-primary.net.cn) 4> start mgr
GGSCI (fgedu-primary.net.cn) 5> start extract ext01
GGSCI (fgedu-primary.net.cn) 6> start extract dp01

# 6. 验证状态
GGSCI (fgedu-primary.net.cn) 7> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
EXTRACT RUNNING EXT01 00:00:00 00:00:03
EXTRACT RUNNING DP01 00:00:00 00:00:05

3.1.2 备节点配置

# 备节点(fgedu-standby.net.cn)配置

# 1. 安装GoldenGate软件
# 与主节点相同版本

# 2. 配置Manager
GGSCI (fgedu-standby.net.cn) 1> edit params mgr

PORT 7809
DYNAMICPORTLIST 7810-7820
LAGREPORTMINUTES 10
LAGINFOMINUTES 5
LAGCRITICALMINUTES 10

# 3. 同步配置文件
# 从主节点同步参数文件
$ rsync -avz fgedu-primary.net.cn:/oracle/goldengate/dirprm/ /oracle/goldengate/dirprm/
$ rsync -avz fgedu-primary.net.cn:/oracle/goldengate/dirdef/ /oracle/goldengate/dirdef/
$ rsync -avz fgedu-primary.net.cn:/oracle/goldengate/dircrd/ /oracle/goldengate/dircrd/
$ rsync -avz fgedu-primary.net.cn:/oracle/goldengate/GLOBALS /oracle/goldengate/

# 4. 修改参数文件(适配备节点)
# 修改Extract参数,使用备节点数据库连接
GGSCI (fgedu-standby.net.cn) 2> edit params ext01

EXTRACT ext01
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin_standby # 使用备节点数据库凭据
EXTTRAIL /oracle/goldengate/dirdat/et

TRANLOGOPTIONS INTEGRATEDPARAMS (parallelism 4)

TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_customers;

# 5. 配置健康检查脚本
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# ha_health_check.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

PRIMARY_HOST=”fgedu-primary.net.cn”
STANDBY_HOST=”fgedu-standby.net.cn”
GG_HOME=”/oracle/goldengate”

# 检查主节点状态
check_primary() {
# 检查主节点Manager
PRIMARY_MGR=$(ssh $PRIMARY_HOST “echo ‘info mgr’ | $GG_HOME/ggsci | grep ‘Manager’ | awk ‘{print \$2}'”)

if [ “$PRIMARY_MGR” != “RUNNING” ]; then
echo “Primary Manager is not running”
return 1
fi

# 检查主节点Extract
PRIMARY_EXT=$(ssh $PRIMARY_HOST “echo ‘info extract ext01’ | $GG_HOME/ggsci | grep ‘EXTRACT’ | awk ‘{print \$2}'”)

if [ “$PRIMARY_EXT” != “RUNNING” ]; then
echo “Primary Extract is not running”
return 1
fi

return 0
}

# 主程序
if ! check_primary; then
echo “Primary node check failed, standby node should take over”
# 触发切换逻辑
fi

3.2 Oracle GoldenGate故障切换实施

3.2.1 自动故障切换脚本

#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_failover.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

# GoldenGate自动故障切换脚本

PRIMARY_HOST=”fgedu-primary.net.cn”
STANDBY_HOST=”fgedu-standby.net.cn”
GG_HOME=”/oracle/goldengate”
FAILOVER_LOG=”$GG_HOME/ha/failover.log”
PID_FILE=”$GG_HOME/ha/failover.pid”

# 检查是否已在运行
if [ -f $PID_FILE ]; then
OLD_PID=$(cat $PID_FILE)
if ps -p $OLD_PID > /dev/null 2>&1; then
echo “Failover script is already running”
exit 1
fi
fi
echo $$ > $PID_FILE

# 记录日志
log_message() {
echo “$(date ‘+%Y-%m-%d %H:%M:%S’): $1” >> $FAILOVER_LOG
}

# 检查主节点健康状态
check_primary_health() {
log_message “Checking primary node health…”

# 检查网络连通性
if ! ping -c 3 $PRIMARY_HOST > /dev/null 2>&1; then
log_message “Primary node network unreachable”
return 1
fi

# 检查Manager进程
MGR_STATUS=$(ssh $PRIMARY_HOST “echo ‘info mgr’ | $GG_HOME/ggsci 2>/dev/null | grep ‘Manager’ | awk ‘{print \$2}'”)
if [ “$MGR_STATUS” != “RUNNING” ]; then
log_message “Primary Manager is not running: $MGR_STATUS”
return 1
fi

# 检查Extract进程
EXT_STATUS=$(ssh $PRIMARY_HOST “echo ‘info extract ext01’ | $GG_HOME/ggsci 2>/dev/null | grep ‘EXTRACT’ | head -1 | awk ‘{print \$2}'”)
if [ “$EXT_STATUS” != “RUNNING” ]; then
log_message “Primary Extract is not running: $EXT_STATUS”
return 1
fi

log_message “Primary node is healthy”
return 0
}

# 执行故障切换
perform_failover() {
log_message “Starting failover process…”

# 1. 同步最新配置
log_message “Synchronizing configuration from primary…”
rsync -avz –delete $PRIMARY_HOST:$GG_HOME/dirprm/ $GG_HOME/dirprm/ 2>/dev/null
rsync -avz –delete $PRIMARY_HOST:$GG_HOME/dirchk/ $GG_HOME/dirchk/ 2>/dev/null

# 2. 启动Manager
log_message “Starting Manager on standby…”
echo “start mgr” | $GG_HOME/ggsci
sleep 5

# 3. 验证Manager状态
MGR_STATUS=$(echo “info mgr” | $GG_HOME/ggsci | grep “Manager” | awk ‘{print $2}’)
if [ “$MGR_STATUS” != “RUNNING” ]; then
log_message “Failed to start Manager on standby”
return 1
fi

# 4. 启动Extract
log_message “Starting Extract on standby…”
echo “start extract ext01” | $GG_HOME/ggsci
sleep 5

# 5. 验证Extract状态
EXT_STATUS=$(echo “info extract ext01” | $GG_HOME/ggsci | grep “EXTRACT” | head -1 | awk ‘{print $2}’)
if [ “$EXT_STATUS” != “RUNNING” ]; then
log_message “Failed to start Extract on standby”
return 1
fi

# 6. 启动数据泵
log_message “Starting Data Pump on standby…”
echo “start extract dp01” | $GG_HOME/ggsci
sleep 5

# 7. 验证切换结果
log_message “Verifying failover result…”
INFO_ALL=$(echo “info all” | $GG_HOME/ggsci)
log_message “Standby node status:\n$INFO_ALL”

# 8. 发送告警通知
echo “GoldenGate failover completed. Standby node is now active.” | \
mail -s “GoldenGate Failover Alert” dba@fgedu.net.cn

log_message “Failover completed successfully”
return 0
}

# 主程序
main() {
log_message “Failover monitoring started”

# 连续检测3次,每次间隔30秒
FAILURE_COUNT=0
for i in 1 2 3; do
if ! check_primary_health; then
FAILURE_COUNT=$((FAILURE_COUNT + 1))
fi
sleep 30
done

# 如果3次检测都失败,执行切换
if [ $FAILURE_COUNT -ge 3 ]; then
log_message “Primary node failed $FAILURE_COUNT times, initiating failover”
perform_failover
else
log_message “Primary node is stable, no failover needed”
fi

log_message “Failover monitoring completed”
}

# 执行主程序
main

# 清理PID文件
rm -f $PID_FILE

3.2.2 故障回切操作

#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# gg_failback.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

# GoldenGate故障回切脚本

PRIMARY_HOST=”fgedu-primary.net.cn”
STANDBY_HOST=”fgedu-standby.net.cn”
GG_HOME=”/oracle/goldengate”
FAILBACK_LOG=”$GG_HOME/ha/failback.log”

# 记录日志
log_message() {
echo “$(date ‘+%Y-%m-%d %H:%M:%S’): $1” >> $FAILBACK_LOG
}

# 检查主节点是否恢复
check_primary_recovery() {
log_message “Checking if primary node has recovered…”

# 检查网络连通性
if ! ping -c 3 $PRIMARY_HOST > /dev/null 2>&1; then
log_message “Primary node still unreachable”
return 1
fi

# 检查数据库连接
DB_STATUS=$(ssh $PRIMARY_HOST “sqlplus -s / as sysdba << EOF SET PAGESIZE 0 FEEDBACK OFF VERIFY OFF HEADING OFF ECHO OFF SELECT status FROM v\$instance;
EXIT;
EOF” | tr -d ‘[:space:]’)

if [ “$DB_STATUS” != “OPEN” ]; then
log_message “Primary database is not open: $DB_STATUS”
return 1
fi

log_message “Primary node has recovered”
return 0
}

# 执行故障回切
perform_failback() {
log_message “Starting failback process…”

# 1. 停止备节点进程
log_message “Stopping standby node processes…”
echo “stop extract dp01” | $GG_HOME/ggsci
sleep 5
echo “stop extract ext01” | $GG_HOME/ggsci
sleep 5
echo “stop mgr” | $GG_HOME/ggsci
sleep 5

# 2. 同步检查点信息到主节点
log_message “Synchronizing checkpoint information…”
rsync -avz $GG_HOME/dirchk/ $PRIMARY_HOST:$GG_HOME/dirchk/

# 3. 启动主节点进程
log_message “Starting primary node processes…”
ssh $PRIMARY_HOST “echo ‘start mgr’ | $GG_HOME/ggsci”
sleep 5
ssh $PRIMARY_HOST “echo ‘start extract ext01’ | $GG_HOME/ggsci”
sleep 5
ssh $PRIMARY_HOST “echo ‘start extract dp01’ | $GG_HOME/ggsci”
sleep 5

# 4. 验证主节点状态
log_message “Verifying primary node status…”
PRIMARY_STATUS=$(ssh $PRIMARY_HOST “echo ‘info all’ | $GG_HOME/ggsci”)
log_message “Primary node status:\n$PRIMARY_STATUS”

# 5. 发送通知
echo “GoldenGate failback completed. Primary node is now active.” | \
mail -s “GoldenGate Failback Notification” dba@fgedu.net.cn

log_message “Failback completed successfully”
}

# 主程序
main() {
log_message “Failback check started”

if check_primary_recovery; then
# 等待用户确认
log_message “Primary node is ready for failback”
echo “Primary node has recovered. Do you want to perform failback? (yes/no)”
read CONFIRM

if [ “$CONFIRM” == “yes” ]; then
perform_failback
else
log_message “Failback cancelled by user”
fi
else
log_message “Primary node not ready for failback”
fi

log_message “Failback check completed”
}

main

3.3 Oracle GoldenGate负载均衡实施

3.3.1 双活架构配置

# 节点A(fgedu-nodea.net.cn)配置
# 负责订单相关表

GGSCI (fgedu-nodea.net.cn) 1> edit params ext_orders

EXTRACT ext_orders
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
EXTTRAIL /oracle/goldengate/dirdat/eo

TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_order_items;

GGSCI (fgedu-nodea.net.cn) 2> edit params dp_orders

EXTRACT dp_orders
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
PASSTHRU
RMTHOST 192.168.1.100, MGRPORT 7809
RMTTRAIL /oracle/goldengate/dirdat/ro

TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_order_items;

# 节点B(fgedu-nodeb.net.cn)配置
# 负责客户相关表

GGSCI (fgedu-nodeb.net.cn) 1> edit params ext_customers

EXTRACT ext_customers
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
EXTTRAIL /oracle/goldengate/dirdat/ec

TABLE fgedu.fgedu_customers;
TABLE fgedu.fgedu_products;

GGSCI (fgedu-nodeb.net.cn) 2> edit params dp_customers

EXTRACT dp_customers
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
PASSTHRU
RMTHOST 192.168.1.100, MGRPORT 7809
RMTTRAIL /oracle/goldengate/dirdat/rc

TABLE fgedu.fgedu_customers;
TABLE fgedu.fgedu_products;

# 目标端配置
GGSCI (fgedu-target.net.cn) 1> edit params rep_orders

REPLICAT rep_orders
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin_target
DISCARDFILE /oracle/goldengate/dirrpt/rep_orders.dsc, APPEND, MEGABYTES 100

MAP fgedu.fgedu_orders, TARGET fgedu.fgedu_orders;
MAP fgedu.fgedu_order_items, TARGET fgedu.fgedu_order_items;

GGSCI (fgedu-target.net.cn) 2> edit params rep_customers

REPLICAT rep_customers
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin_target
DISCARDFILE /oracle/goldengate/dirrpt/rep_customers.dsc, APPEND, MEGABYTES 100

MAP fgedu.fgedu_customers, TARGET fgedu.fgedu_customers;
MAP fgedu.fgedu_products, TARGET fgedu.fgedu_products;

# 启动双活架构
# 节点A
GGSCI (fgedu-nodea.net.cn) 3> start mgr
GGSCI (fgedu-nodea.net.cn) 4> start extract ext_orders
GGSCI (fgedu-nodea.net.cn) 5> start extract dp_orders

# 节点B
GGSCI (fgedu-nodeb.net.cn) 3> start mgr
GGSCI (fgedu-nodeb.net.cn) 4> start extract ext_customers
GGSCI (fgedu-nodeb.net.cn) 5> start extract dp_customers

# 目标端
GGSCI (fgedu-target.net.cn) 3> start replicat rep_orders
GGSCI (fgedu-target.net.cn) 4> start replicat rep_customers

# 验证状态
GGSCI (fgedu-nodea.net.cn) 6> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
EXTRACT RUNNING EXT_ORDERS 00:00:00 00:00:03
EXTRACT RUNNING DP_ORDERS 00:00:00 00:00:05

GGSCI (fgedu-nodeb.net.cn) 6> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
EXTRACT RUNNING EXT_CUST 00:00:00 00:00:03
EXTRACT RUNNING DP_CUST 00:00:00 00:00:05

风哥提示:双活架构能够提高GoldenGate的可用性和性能,但也增加了管理复杂度。建议根据业务需求和团队能力选择合适的架构。更多学习教程公众号风哥教程itpux_com

Part04-生产案例与实战讲解

4.1 Oracle GoldenGate双活架构案例

以下是一个完整的双活架构实施案例:

# 场景:大型电商系统需要高可用GoldenGate架构

# 架构设计
# 节点A:处理订单、支付相关表
# 节点B:处理用户、商品相关表
# 目标端:统一接收所有数据

# 1. 表分组策略
# 节点A负责表:
# – fgedu_orders
# – fgedu_order_items
# – fgedu_payments
# – fgedu_invoices

# 节点B负责表:
# – fgedu_customers
# – fgedu_products
# – fgedu_inventory
# – fgedu_categories

# 2. 节点A配置
GGSCI (fgedu-nodea.net.cn) 1> edit params ext_a

EXTRACT ext_a
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
EXTTRAIL /oracle/goldengate/dirdat/ea

CACHEMGR CACHESIZE 4GB
TRANLOGOPTIONS INTEGRATEDPARAMS (parallelism 4)

TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_order_items;
TABLE fgedu.fgedu_payments;
TABLE fgedu.fgedu_invoices;

# 3. 节点B配置
GGSCI (fgedu-nodeb.net.cn) 1> edit params ext_b

EXTRACT ext_b
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
EXTTRAIL /oracle/goldengate/dirdat/eb

CACHEMGR CACHESIZE 4GB
TRANLOGOPTIONS INTEGRATEDPARAMS (parallelism 4)

TABLE fgedu.fgedu_customers;
TABLE fgedu.fgedu_products;
TABLE fgedu.fgedu_inventory;
TABLE fgedu.fgedu_categories;

# 4. 健康检查脚本
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# dual_active_health_check.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

NODE_A=”fgedu-nodea.net.cn”
NODE_B=”fgedu-nodeb.net.cn”
GG_HOME=”/oracle/goldengate”

# 检查节点A
check_node_a() {
echo “Checking Node A…”
STATUS_A=$(ssh $NODE_A “echo ‘info all’ | $GG_HOME/ggsci” 2>/dev/null)

if echo “$STATUS_A” | grep -q “ABENDED\|STOPPED”; then
echo “Node A has issues:”
echo “$STATUS_A”
return 1
fi

echo “Node A is healthy”
return 0
}

# 检查节点B
check_node_b() {
echo “Checking Node B…”
STATUS_B=$(ssh $NODE_B “echo ‘info all’ | $GG_HOME/ggsci” 2>/dev/null)

if echo “$STATUS_B” | grep -q “ABENDED\|STOPPED”; then
echo “Node B has issues:”
echo “$STATUS_B”
return 1
fi

echo “Node B is healthy”
return 0
}

# 负载均衡检查
check_load_balance() {
echo “Checking load balance…”

# 获取节点A统计
STATS_A=$(ssh $NODE_A “echo ‘stats extract ext_a’ | $GG_HOME/ggsci” 2>/dev/null)
OPS_A=$(echo “$STATS_A” | grep “Total operations” | tail -1 | awk ‘{print $NF}’)

# 获取节点B统计
STATS_B=$(ssh $NODE_B “echo ‘stats extract ext_b’ | $GG_HOME/ggsci” 2>/dev/null)
OPS_B=$(echo “$STATS_B” | grep “Total operations” | tail -1 | awk ‘{print $NF}’)

echo “Node A operations: $OPS_A”
echo “Node B operations: $OPS_B”

# 检查负载是否均衡(差异不超过20%)
if [ -n “$OPS_A” ] && [ -n “$OPS_B” ]; then
DIFF=$(echo “scale=2; ($OPS_A – $OPS_B) / $OPS_A * 100” | bc | tr -d ‘-‘)
if (( $(echo “$DIFF > 20” | bc -l) )); then
echo “WARNING: Load imbalance detected (${DIFF}%)”
return 1
fi
fi

echo “Load is balanced”
return 0
}

# 主程序
main() {
echo “=== Dual-Active Health Check ===”
echo “Time: $(date)”

check_node_a
check_node_b
check_load_balance

echo “Health check completed”
}

main

4.2 Oracle GoldenGate故障切换案例

以下是一个完整的故障切换案例:

# 场景:主节点硬件故障,需要切换到备节点

# 1. 故障检测
# 监控系统检测到主节点异常
[2026-04-03 10:00:00] ALERT: Primary node fgedu-primary.net.cn is unreachable
[2026-04-03 10:00:30] ALERT: Primary node still unreachable after retry
[2026-04-03 10:01:00] ALERT: Primary node failed 3 consecutive checks
[2026-04-03 10:01:01] INFO: Initiating automatic failover

# 2. 执行故障切换
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# execute_failover.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

GG_HOME=”/oracle/goldengate”
FAILOVER_LOG=”$GG_HOME/ha/failover_$(date +%Y%m%d%H%M%S).log”

echo “$(date): Failover started” >> $FAILOVER_LOG

# 步骤1:同步配置
echo “$(date): Step 1 – Synchronizing configuration” >> $FAILOVER_LOG
rsync -avz fgedu-primary.net.cn:$GG_HOME/dirprm/ $GG_HOME/dirprm/ 2>/dev/null
echo “$(date): Configuration synchronized” >> $FAILOVER_LOG

# 步骤2:启动Manager
echo “$(date): Step 2 – Starting Manager” >> $FAILOVER_LOG
echo “start mgr” | $GG_HOME/ggsci >> $FAILOVER_LOG 2>&1
sleep 5

# 步骤3:启动Extract
echo “$(date): Step 3 – Starting Extract” >> $FAILOVER_LOG
echo “start extract ext01” | $GG_HOME/ggsci >> $FAILOVER_LOG 2>&1
sleep 5

# 步骤4:启动数据泵
echo “$(date): Step 4 – Starting Data Pump” >> $FAILOVER_LOG
echo “start extract dp01” | $GG_HOME/ggsci >> $FAILOVER_LOG 2>&1
sleep 5

# 步骤5:验证状态
echo “$(date): Step 5 – Verifying status” >> $FAILOVER_LOG
STATUS=$(echo “info all” | $GG_HOME/ggsci)
echo “$STATUS” >> $FAILOVER_LOG

# 检查是否成功
if echo “$STATUS” | grep -q “ABENDED\|STOPPED”; then
echo “$(date): FAILOVER FAILED – Some processes are not running” >> $FAILOVER_LOG
echo “GoldenGate failover FAILED” | mail -s “CRITICAL: Failover Failed” dba@fgedu.net.cn
exit 1
else
echo “$(date): FAILOVER SUCCESSFUL” >> $FAILOVER_LOG
echo “GoldenGate failover completed successfully” | mail -s “Failover Completed” dba@fgedu.net.cn
fi

# 3. 切换结果验证
GGSCI (fgedu-standby.net.cn) 1> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
EXTRACT RUNNING EXT01 00:00:00 00:00:03
EXTRACT RUNNING DP01 00:00:00 00:00:05

# 4. 业务验证
# 检查数据复制是否正常
SQL> SELECT COUNT(*) FROM fgedu.fgedu_orders;

COUNT(*)
———-
10000000

# 数据复制正常,故障切换成功

# 5. 故障恢复后的回切
# 当主节点恢复后,执行回切操作
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# execute_failback.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

echo “Primary node has recovered. Executing failback…”

# 停止备节点
echo “stop extract dp01” | $GG_HOME/ggsci
echo “stop extract ext01” | $GG_HOME/ggsci
echo “stop mgr” | $GG_HOME/ggsci

# 同步检查点
rsync -avz $GG_HOME/dirchk/ fgedu-primary.net.cn:$GG_HOME/dirchk/

# 启动主节点
ssh fgedu-primary.net.cn “echo ‘start mgr’ | $GG_HOME/ggsci”
ssh fgedu-primary.net.cn “echo ‘start extract ext01’ | $GG_HOME/ggsci”
ssh fgedu-primary.net.cn “echo ‘start extract dp01’ | $GG_HOME/ggsci”

echo “Failback completed”

4.3 Oracle GoldenGate容灾切换案例

以下是一个完整的容灾切换案例:

# 场景:数据中心故障,需要切换到灾备中心

# 架构说明
# 生产中心:fgedu-prod.net.cn
# 灾备中心:fgedu-dr.net.cn
# 复制模式:单向复制(生产->灾备)

# 1. 容灾切换前准备
# 确认灾备中心状态
GGSCI (fgedu-dr.net.cn) 1> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
REPLICAT RUNNING REP_DR 00:00:00 00:00:03

# 确认延迟情况
GGSCI (fgedu-dr.net.cn) 2> lag replicat rep_dr

Sending LAG request to REPLICAT REP_DR …

Last record lag: 3 seconds.
At EOF, no more records to process.

# 2. 执行容灾切换
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# disaster_recovery_switch.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

DR_HOST=”fgedu-dr.net.cn”
GG_HOME=”/oracle/goldengate”
SWITCH_LOG=”$GG_HOME/dr/switch_$(date +%Y%m%d%H%M%S).log”

echo “$(date): Disaster recovery switch started” >> $SWITCH_LOG

# 步骤1:停止Replicat
echo “$(date): Step 1 – Stopping Replicat” >> $SWITCH_LOG
ssh $DR_HOST “echo ‘stop replicat rep_dr’ | $GG_HOME/ggsci” >> $SWITCH_LOG 2>&1

# 步骤2:验证数据一致性
echo “$(date): Step 2 – Verifying data consistency” >> $SWITCH_LOG
ssh $DR_HOST “sqlplus -s fgedu/password@fgedudb_dr << EOF SET PAGESIZE 0 FEEDBACK OFF VERIFY OFF HEADING OFF ECHO OFF SELECT 'ORDERS_COUNT:'||COUNT(*) FROM fgedu_orders;
SELECT ‘CUSTOMERS_COUNT:’||COUNT(*) FROM fgedu_customers;
EXIT;
EOF” >> $SWITCH_LOG 2>&1

# 步骤3:激活灾备数据库
echo “$(date): Step 3 – Activating DR database” >> $SWITCH_LOG
ssh $DR_HOST “sqlplus -s / as sysdba << EOF ALTER DATABASE ACTIVATE STANDBY DATABASE;
EXIT;
EOF” >> $SWITCH_LOG 2>&1

# 步骤4:配置反向复制(可选)
echo “$(date): Step 4 – Configuring reverse replication” >> $SWITCH_LOG
# 创建从灾备到生产的Extract(当生产恢复后使用)
ssh $DR_HOST “cat > $GG_HOME/dirprm/ext_reverse.prm << 'EOF' EXTRACT ext_reverse SETENV (NLS_LANG=\"AMERICAN_AMERICA.AL32UTF8\") USERIDALIAS fgedu_ggadmin_dr EXTTRAIL /oracle/goldengate/dirdat/er TABLE fgedu.fgedu_orders;
TABLE fgedu.fgedu_customers;
EOF”

# 步骤5:启动灾备中心业务
echo “$(date): Step 5 – Starting DR business services” >> $SWITCH_LOG
# 启动应用服务器、负载均衡器等

# 步骤6:通知相关人员
echo “$(date): Step 6 – Sending notifications” >> $SWITCH_LOG
echo “Disaster recovery switch completed. DR site is now active.” | \
mail -s “DR Switch Completed” dba@fgedu.net.cn ops@fgedu.net.cn

echo “$(date): Disaster recovery switch completed” >> $SWITCH_LOG

# 3. 切换后监控
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# dr_monitoring.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

DR_HOST=”fgedu-dr.net.cn”
GG_HOME=”/oracle/goldengate”

# 监控数据库状态
DB_STATUS=$(ssh $DR_HOST “sqlplus -s / as sysdba << EOF SET PAGESIZE 0 FEEDBACK OFF VERIFY OFF HEADING OFF ECHO OFF SELECT status FROM v\$instance;
EXIT;
EOF” | tr -d ‘[:space:]’)

if [ “$DB_STATUS” != “OPEN” ]; then
echo “ALERT: DR database is not open: $DB_STATUS”
exit 1
fi

# 监控表空间使用率
ssh $DR_HOST “sqlplus -s / as sysdba << EOF SET LINESIZE 200 SET PAGESIZE 100 COLUMN tablespace_name FORMAT A20 COLUMN used_pct FORMAT 999.99 SELECT tablespace_name, ROUND((1 - free_space / total_space) * 100, 2) used_pct FROM (SELECT tablespace_name, SUM(bytes) / 1024 / 1024 free_space FROM dba_free_space GROUP BY tablespace_name) f, (SELECT tablespace_name, SUM(bytes) / 1024 / 1024 total_space FROM dba_data_files GROUP BY tablespace_name) t WHERE f.tablespace_name = t.tablespace_name ORDER BY used_pct DESC;
EXIT;
EOF”

# 4. 生产恢复后的回切
# 当生产中心恢复后,执行回切操作
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# dr_failback.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn

PROD_HOST=”fgedu-prod.net.cn”
DR_HOST=”fgedu-dr.net.cn”
GG_HOME=”/oracle/goldengate”

echo “Production site has recovered. Preparing failback…”

# 步骤1:同步数据
echo “Synchronizing data from DR to Production…”
# 使用GoldenGate或Data Pump进行数据同步

# 步骤2:停止DR业务
echo “Stopping DR business services…”

# 步骤3:切换回生产中心
echo “Switching back to production site…”

# 步骤4:恢复GoldenGate复制
echo “Restoring GoldenGate replication…”
ssh $PROD_HOST “echo ‘start mgr’ | $GG_HOME/ggsci”
ssh $PROD_HOST “echo ‘start extract ext01’ | $GG_HOME/ggsci”

# 步骤5:启动生产业务
echo “Starting production business services…”

echo “Failback completed”

生产环境建议:高可用架构要定期进行演练,确保故障切换流程能够正常执行。同时要建立完善的监控和告警机制,及时发现和处理故障。from oracle:www.itpux.com

Part05-风哥经验总结与分享

5.1 Oracle GoldenGate高可用总结

Oracle GoldenGate高可用架构的关键要点:

  • 架构选择:根据业务需求选择合适的高可用架构
  • 自动切换:实现自动化的故障检测和切换
  • 数据一致性:确保切换后数据一致性
  • 定期演练:定期进行故障切换演练
  • 监控告警:建立完善的监控告警体系

5.2 Oracle GoldenGate高可用检查清单

Oracle GoldenGate高可用检查清单:

# 高可用部署检查清单
□ 主备节点软件版本一致
□ 配置文件同步机制建立
□ 故障检测脚本部署
□ 自动切换脚本测试
□ 告警通知配置完成
□ 回切流程验证

# 日常维护检查清单
□ 主备节点状态检查
□ 配置同步状态检查
□ 故障检测脚本运行状态
□ 切换演练计划执行

# 故障切换检查清单
□ 故障确认和评估
□ 切换前数据备份
□ 执行切换操作
□ 验证切换结果
□ 通知相关人员
□ 记录切换过程

5.3 Oracle GoldenGate高可用工具推荐

Oracle GoldenGate高可用推荐工具:

  • Keepalived:实现VIP漂移和故障切换
  • Pacemaker:集群资源管理工具
  • 自定义脚本:根据需求定制的切换脚本
  • 监控工具:Zabbix、Prometheus等
  • 告警工具:钉钉、企业微信、短信网关
风哥提示:高可用架构不是一蹴而就的,需要根据业务发展逐步完善。建议从简单的主备架构开始,逐步升级到更复杂的架构。同时要重视演练和文档,确保团队能够熟练处理故障。更多视频教程www.fgedu.net.cn

本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html

联系我们

在线咨询:点击这里给我发消息

微信号:itpux-com

工作日:9:30-18:30,节假日休息