本文档风哥主要介绍Oracle GoldenGate故障处理与恢复的实战操作,包括Oracle GoldenGate故障处理的概念、类型、流程、处理方法、恢复操作等内容,由风哥教程参考Oracle官方文档GoldenGate文档,适合DBA人员在学习和测试中使用,如果要应用于生产环境则需要自行确认。
Part01-基础概念与理论知识
1.1 Oracle GoldenGate故障处理概念
Oracle GoldenGate故障处理是指在GoldenGate运行过程中,当发生异常情况时,通过诊断、分析和修复等操作,使GoldenGate恢复正常运行的过程。故障处理是GoldenGate运维管理的重要组成部分。更多视频教程www.fgedu.net.cn
- 快速诊断故障原因
- 最小化业务影响
- 恢复数据复制服务
- 防止故障再次发生
- 积累故障处理经验
1.2 Oracle GoldenGate常见故障类型
Oracle GoldenGate常见故障类型:
- 进程故障:Extract或Replicat进程ABENDED或STOPPED
- 网络故障:网络连接中断或延迟过高
- 数据库故障:源端或目标端数据库不可用
- 存储故障:Trail文件存储空间不足或损坏
- 数据故障:数据不一致或数据损坏
- 配置故障:参数配置错误或权限不足
1.3 Oracle GoldenGate故障处理流程
Oracle GoldenGate故障处理标准流程:
- 故障发现:通过监控告警或巡检发现故障
- 信息收集:收集进程状态、日志、报告等信息
- 故障诊断:分析故障原因,定位问题根源
- 故障处理:根据诊断结果执行修复操作
- 验证恢复:验证故障是否已解决,服务是否恢复
- 总结记录:记录故障处理过程,总结经验教训
Part02-生产环境规划与建议
2.1 Oracle GoldenGate故障处理规划
Oracle GoldenGate故障处理规划要点:
一级故障:复制完全中断
– 影响:数据复制停止
– 响应时间:立即响应
– 处理时间:30分钟内恢复
二级故障:复制延迟过高
– 影响:数据延迟超过阈值
– 响应时间:15分钟内响应
– 处理时间:2小时内恢复
三级故障:偶发错误
– 影响:个别记录处理失败
– 响应时间:1小时内响应
– 处理时间:4小时内恢复
# 应急预案规划
– 建立故障响应团队
– 制定故障升级机制
– 准备故障处理手册
– 定期进行故障演练
2.2 Oracle GoldenGate故障预防策略
Oracle GoldenGate故障预防策略:
- 监控告警:建立完善的监控体系,及时发现异常
- 定期检查:定期检查进程状态、延迟、日志等
- 容量规划:合理规划存储空间,避免空间不足
- 备份配置:定期备份GoldenGate配置文件
- 变更管理:规范变更流程,避免配置错误
- 演练测试:定期进行故障演练,提高处理能力
2.3 Oracle GoldenGate故障处理最佳实践
Oracle GoldenGate故障处理最佳实践:
- 保留现场:故障发生时保留现场信息,便于分析
- 查看日志:首先查看进程报告和日志文件
- 逐步排查:按照网络、数据库、GoldenGate顺序排查
- 谨慎操作:避免盲目重启,防止数据丢失
- 记录过程:详细记录故障处理过程
- 及时通报:及时向相关人员通报故障情况
Part03-生产环境项目实施方案
3.1 Oracle GoldenGate Extract故障处理
3.1.1 Extract进程ABENDED处理
GGSCI (fgedu.net.cn) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT ABENDED EXT01 00:00:00 00:15:00
# 2. 查看错误报告
GGSCI (fgedu.net.cn) 2> view report ext01
2026-04-03 18:00:00 ERROR OGG-00664 Oracle GoldenGate Capture for Oracle, ext01.prm: OCI Error beginning session (status = 28009-ORA-28009: connection as SYS should be as SYSDBA or SYSOPER).
2026-04-03 18:00:00 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, ext01.prm: PROCESS ABENDING.
# 3. 诊断原因:数据库连接失败
# 检查数据库连接
$ sqlplus fgedu_ggadmin/password@fgedudb
SQL*Plus: Release 19.0.0.0.0 – Production on Thu Apr 3 18:15:00 2026
Copyright (c) 1982, 2019, Oracle. All rights reserved.
ERROR:
ORA-28009: connection as SYS should be as SYSDBA or SYSOPER
# 4. 解决方案
# 检查tnsnames.ora配置
# 检查数据库监听器状态
$ lsnrctl status
LSNRCTL for Linux: Version 19.0.0.0.0 – Production on 03-APR-2026 18:16:00
Copyright (c) 1991, 2019, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=fgedu.net.cn)(PORT=1521)))
STATUS of the LISTENER
————————
Alias LISTENER
Version TNSLSNR for Linux: Version 19.0.0.0.0
Start Date 03-APR-2026 08:00:00
Uptime 0 days 10 hr. 16 min. 0 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /oracle/app/oracle/product/19c/dbhome_1/network/admin/listener.ora
Listener Log File /oracle/app/oracle/diag/tnslsnr/fgedu/listener/alert/log.xml
Listening Endpoints Summary…
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=fgedu.net.cn)(PORT=1521)))
Services Summary…
Service “fgedudb” has 1 instance(s).
Instance “fgedudb”, status READY, has 1 handler(s) for this service…
The command completed successfully
# 5. 修复后重启Extract
GGSCI (fgedu.net.cn) 3> start extract ext01
Sending START request to MANAGER …
EXTRACT EXT01 starting
# 6. 验证恢复
GGSCI (fgedu.net.cn) 4> info extract ext01
EXTRACT EXT01 Last Started 2026-04-03 18:20:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)
3.1.2 Extract延迟过高处理
GGSCI (fgedu.net.cn) 1> info extract ext01
EXTRACT EXT01 Last Started 2026-04-03 19:00:00 Status RUNNING
Checkpoint Lag 02:30:00 (updated 00:00:05 ago)
# 2. 查看统计信息
GGSCI (fgedu.net.cn) 2> stats extract ext01
Sending STATS request to EXTRACT EXT01 …
Start of Statistics at 2026-04-03 21:30:00.
Extracting from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS:
*** Total statistics since 2026-04-03 19:00:00 ***
Total inserts 100.00
Total updates 50.00
Total deletes 10.00
Total operations 160.00
# 3. 诊断原因:数据库日志生成量过大
SQL> SELECT thread#, sequence#, bytes/1024/1024 mb, first_time
FROM v$archived_log
WHERE first_time > sysdate – 1/24
ORDER BY first_time DESC;
THREAD# SEQUENCE# MB FIRST_TIME
———- ———- ———- ——————-
1 1234 1000 2026-04-03 21:30:00
1 1233 1000 2026-04-03 21:29:00
1 1232 1000 2026-04-03 21:28:00
# 4. 解决方案
# 增加Extract并行度
GGSCI (fgedu.net.cn) 3> edit params ext01
# 修改参数:TRANLOGOPTIONS INTEGRATEDPARAMS (parallelism 4)
# 重启Extract
GGSCI (fgedu.net.cn) 4> stop extract ext01
GGSCI (fgedu.net.cn) 5> start extract ext01
# 5. 监控效果
GGSCI (fgedu.net.cn) 6> info extract ext01
EXTRACT EXT01 Last Started 2026-04-03 21:35:00 Status RUNNING
Checkpoint Lag 00:05:00 (updated 00:00:03 ago)
3.2 Oracle GoldenGate Replicat故障处理
3.2.1 Replicat进程ABENDED处理
GGSCI (fgedu-target.net.cn) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT ABENDED REP01 00:00:00 00:30:00
# 2. 查看错误报告
GGSCI (fgedu-target.net.cn) 2> view report rep01
2026-04-03 20:00:00 ERROR OGG-01296 Oracle GoldenGate Delivery for Oracle, rep01.prm: Error mapping from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS.
2026-04-03 20:00:00 ERROR OGG-01161 Oracle GoldenGate Delivery for Oracle, rep01.prm: Bad column index (5) specified for table FGEDU.FGEDU_ORDERS, max columns = 4.
# 3. 诊断原因:表结构不一致
SQL> desc fgedu.fgedu_orders
Name Null? Type
—————————————– ——– —————————-
ORDER_ID NOT NULL NUMBER(10)
CUSTOMER_ID NUMBER(10)
ORDER_DATE DATE
ORDER_AMOUNT NUMBER(10,2)
# 4. 解决方案
# 添加缺失列
SQL> ALTER TABLE fgedu.fgedu_orders ADD (order_status VARCHAR2(20));
Table altered.
# 5. 重启Replicat
GGSCI (fgedu-target.net.cn) 3> start replicat rep01
Sending START request to MANAGER …
REPLICAT REP01 starting
# 6. 验证恢复
GGSCI (fgedu-target.net.cn) 4> info replicat rep01
REPLICAT REP01 Last Started 2026-04-03 20:30:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:02 ago)
3.2.2 Replicat数据冲突处理
GGSCI (fgedu-target.net.cn) 1> view report rep01
2026-04-03 22:00:00 ERROR OGG-01296 Oracle GoldenGate Delivery for Oracle, rep01.prm: Error mapping from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS.
2026-04-03 22:00:00 ERROR OGG-01163 Oracle GoldenGate Delivery for Oracle, rep01.prm: BAD record.
2026-04-03 22:00:00 ERROR OGG-01020 Oracle GoldenGate Delivery for Oracle, rep01.prm: Processed record: order_id=10001.
# 2. 查看Discard文件
GGSCI (fgedu-target.net.cn) 2> shell cat /oracle/goldengate/dirdat/rep01.dsc
Oracle GoldenGate Delivery for Oracle process started
Error: ORA-00001: unique constraint (FGEDU.PK_FGEDU_ORDERS) violated
Record: order_id=10001, customer_id=1000, order_amount=1500.00
# 3. 诊断原因:主键冲突
# 目标端已存在相同主键的记录
# 4. 解决方案1:跳过冲突记录
GGSCI (fgedu-target.net.cn) 3> start replicat rep01, skiptransaction
Sending START request to MANAGER …
REPLICAT REP01 starting
# 解决方案2:删除目标端重复记录
SQL> DELETE FROM fgedu.fgedu_orders WHERE order_id = 10001;
SQL> COMMIT;
GGSCI (fgedu-target.net.cn) 4> start replicat rep01
# 5. 验证恢复
GGSCI (fgedu-target.net.cn) 5> info replicat rep01
REPLICAT REP01 Last Started 2026-04-03 22:30:00 Status RUNNING
3.3 Oracle GoldenGate故障恢复操作
3.3.1 从检查点恢复
# 1. 查看当前检查点
GGSCI (fgedu.net.cn) 1> info extract ext01, showch
EXTRACT EXT01 Last Started 2026-04-03 23:00:00 Status ABENDED
Checkpoint Lag 00:00:00 (updated 00:00:00 ago)
Read Checkpoint #1
Oracle Integrated Redo Logs
Startup Checkpoint (starting position in the data source):
Sequence #: 100
RBA: 12345678
Timestamp: 2026-04-03 23:00:00.000000
SCN: 12345678
Redo File: /oradata/fgedudb/redo01.log
Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
Sequence #: 100
RBA: 12345678
Timestamp: 2026-04-03 23:00:00.000000
SCN: 12345678
Redo File: /oradata/fgedudb/redo01.log
Current Checkpoint (position of last record read in the data source):
Sequence #: 100
RBA: 12345678
Timestamp: 2026-04-03 23:00:00.000000
SCN: 12345678
Redo File: /oradata/fgedudb/redo01.log
# 2. 从检查点重启
GGSCI (fgedu.net.cn) 2> start extract ext01
Sending START request to MANAGER …
EXTRACT EXT01 starting
# 3. 验证恢复
GGSCI (fgedu.net.cn) 3> info extract ext01
EXTRACT EXT01 Last Started 2026-04-03 23:05:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)
3.3.2 从指定时间恢复
# 1. 停止进程
GGSCI (fgedu.net.cn) 1> stop extract ext01
Sending STOP request to EXTRACT EXT01 …
Request processed.
# 2. 修改检查点到指定时间
GGSCI (fgedu.net.cn) 2> alter extract ext01, begin 2026-04-03 20:00:00
EXTRACT altered.
# 3. 重新注册Extract(集成模式)
GGSCI (fgedu.net.cn) 3> dblogin useridalias fgedu_ggadmin
Successfully logged into database.
GGSCI (fgedu.net.cn) 4> unregister extract ext01 database
GGSCI (fgedu.net.cn) 5> register extract ext01 database
Extract EXT01 successfully registered with database at SCN 12346000.
# 4. 启动进程
GGSCI (fgedu.net.cn) 6> start extract ext01
Sending START request to MANAGER …
EXTRACT EXT01 starting
# 5. 验证恢复
GGSCI (fgedu.net.cn) 7> info extract ext01
EXTRACT EXT01 Last Started 2026-04-03 23:30:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:02 ago)
Log Read Checkpoint Oracle Integrated Redo Logs
2026-04-03 20:00:00 SCN 12346000
Part04-生产案例与实战讲解
4.1 Oracle GoldenGate进程ABENDED处理案例
以下是一个完整的进程ABENDED处理案例:
# 1. 发现故障
GGSCI (fgedu.net.cn) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT ABENDED EXT01 00:00:00 00:45:00
# 2. 收集信息
GGSCI (fgedu.net.cn) 2> view report ext01
2026-04-03 23:45:00 ERROR OGG-00664 Oracle GoldenGate Capture for Oracle, ext01.prm: OCI Error beginning session (status = 1017-ORA-01017: invalid username/password; logon denied).
2026-04-03 23:45:00 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, ext01.prm: PROCESS ABENDING.
GGSCI (fgedu.net.cn) 3> view ggsevt
2026-04-03 23:45:00 ERROR OGG-00664 Oracle GoldenGate Capture for Oracle, ext01.prm: OCI Error beginning session.
# 3. 诊断分析
# 错误原因:数据库密码过期或错误
# 4. 验证数据库连接
$ sqlplus fgedu_ggadmin/wrong_password@fgedudb
SQL*Plus: Release 19.0.0.0.0 – Production on Thu Apr 3 23:50:00 2026
Copyright (c) 1982, 2019, Oracle. All rights reserved.
ERROR:
ORA-01017: invalid username/password; logon denied
# 5. 解决方案
# 重置数据库密码
SQL> ALTER USER fgedu_ggadmin IDENTIFIED BY new_password;
User altered.
# 更新GoldenGate凭证
GGSCI (fgedu.net.cn) 4> alter credentialstore replace user fgedu_ggadmin password new_password alias fgedu_ggadmin
Credential store altered.
# 6. 重启进程
GGSCI (fgedu.net.cn) 5> start extract ext01
Sending START request to MANAGER …
EXTRACT EXT01 starting
# 7. 验证恢复
GGSCI (fgedu.net.cn) 6> info extract ext01
EXTRACT EXT01 Last Started 2026-04-03 23:55:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)
GGSCI (fgedu.net.cn) 7> stats extract ext01
Sending STATS request to EXTRACT EXT01 …
Start of Statistics at 2026-04-04 00:00:00.
Extracting from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS:
Total inserts 500.00
Total updates 200.00
Total deletes 50.00
Total operations 750.00
End of Statistics.
4.2 Oracle GoldenGate数据延迟处理案例
以下是一个完整的数据延迟处理案例:
# 1. 发现延迟
GGSCI (fgedu-target.net.cn) 1> info replicat rep01
REPLICAT REP01 Last Started 2026-04-03 20:00:00 Status RUNNING
Checkpoint Lag 02:30:00 (updated 00:00:05 ago)
# 2. 收集信息
GGSCI (fgedu-target.net.cn) 2> stats replicat rep01
Sending STATS request to REPLICAT REP01 …
Start of Statistics at 2026-04-03 22:30:00.
Replicating from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS:
*** Total statistics since 2026-04-03 20:00:00 ***
Total inserts 100.00
Total updates 50.00
Total deletes 10.00
Total operations 160.00
# 3. 诊断分析
# 处理速度过慢,检查目标数据库性能
SQL> SELECT event, total_waits, time_waited
FROM v$system_event
ORDER BY time_waited DESC;
EVENT TOTAL_WAITS TIME_WAITED
—————————————- ———– ———–
db file scattered read 123456 1234567
db file sequential read 234567 2345678
log file sync 34567 345678
# 4. 解决方案
# 优化目标数据库
# – 增加索引
SQL> CREATE INDEX idx_fgedu_orders_customer ON fgedu.fgedu_orders(customer_id);
Index created.
# – 调整Replicat参数
GGSCI (fgedu-target.net.cn) 3> edit params rep01
# 增加参数:
BATCHTRANS 2000
MAXTRANS 1000
# 5. 重启Replicat
GGSCI (fgedu-target.net.cn) 4> stop replicat rep01
GGSCI (fgedu-target.net.cn) 5> start replicat rep01
# 6. 监控效果
GGSCI (fgedu-target.net.cn) 6> info replicat rep01
REPLICAT REP01 Last Started 2026-04-03 22:35:00 Status RUNNING
Checkpoint Lag 00:10:00 (updated 00:00:03 ago)
# 7. 验证恢复
GGSCI (fgedu-target.net.cn) 7> lag replicat rep01
Sending LAG request to REPLICAT REP01 …
Last record lag: 10 seconds.
At EOF, no more records to process.
4.3 Oracle GoldenGate数据不一致恢复案例
以下是一个完整的数据不一致恢复案例:
# 1. 发现不一致
SQL> SELECT COUNT(*) FROM fgedu.fgedu_orders@source;
COUNT(*)
———-
10000000
SQL> SELECT COUNT(*) FROM fgedu.fgedu_orders;
COUNT(*)
———-
9999900
# 差异:目标端少100条记录
# 2. 定位缺失数据
SQL> SELECT order_id FROM fgedu.fgedu_orders@source
MINUS
SELECT order_id FROM fgedu.fgedu_orders;
ORDER_ID
———-
99001
99002
…
99100
# 3. 分析原因
# 查看Replicat报告
GGSCI (fgedu-target.net.cn) 1> view report rep01
2026-04-03 18:00:00 WARNING OGG-01004 Oracle GoldenGate Delivery for Oracle, rep01.prm: Aborted grouped transaction.
2026-04-03 18:00:00 ERROR OGG-01296 Oracle GoldenGate Delivery for Oracle, rep01.prm: Error mapping from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS.
# 4. 解决方案
# 使用GoldenGate修复缺失数据
# 创建修复Extract
GGSCI (fgedu.net.cn) 2> add extract fix_ext, sourceistable
GGSCI (fgedu.net.cn) 3> edit params fix_ext
EXTRACT fix_ext
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
RMTHOST 192.168.1.100, MGRPORT 7809
RMTTASK REPLICAT, GROUP fix_rep
TABLE fgedu.fgedu_orders, FILTER (order_id >= 99001 AND order_id <= 99100);
# 创建修复Replicat
GGSCI (fgedu-target.net.cn) 2> add replicat fix_rep, specialrun
GGSCI (fgedu-target.net.cn) 3> edit params fix_rep
REPLICAT fix_rep
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin_target
HANDLECOLLISIONS
MAP fgedu.fgedu_orders, TARGET fgedu.fgedu_orders;
# 执行修复
GGSCI (fgedu.net.cn) 4> start extract fix_ext
# 5. 验证修复结果
SQL> SELECT COUNT(*) FROM fgedu.fgedu_orders;
COUNT(*)
———-
10000000
# 6. 重新校验
SQL> SELECT order_id FROM fgedu.fgedu_orders@source
MINUS
SELECT order_id FROM fgedu.fgedu_orders;
no rows selected
# 修复成功
Part05-风哥经验总结与分享
5.1 Oracle GoldenGate故障处理总结
Oracle GoldenGate故障处理的关键要点:
- 快速响应:建立快速响应机制,及时发现和处理故障
- 保留现场:故障发生时保留现场信息,便于分析
- 系统排查:按照网络、数据库、GoldenGate顺序系统排查
- 谨慎操作:避免盲目操作,防止问题扩大
- 记录总结:详细记录故障处理过程,总结经验
5.2 Oracle GoldenGate故障处理检查清单
Oracle GoldenGate故障处理检查清单:
□ 进程状态是否正常
□ 延迟是否在正常范围
□ 报告文件是否有错误
□ 系统资源是否正常
# 故障诊断检查清单
□ 查看进程报告和日志
□ 检查网络连接
□ 检查数据库状态
□ 检查存储空间
□ 检查配置参数
# 故障处理检查清单
□ 备份相关配置和数据
□ 制定修复方案
□ 执行修复操作
□ 验证修复结果
□ 记录处理过程
5.3 Oracle GoldenGate故障处理工具推荐
Oracle GoldenGate故障处理推荐工具:
- GGSCI:GoldenGate命令行管理工具
- Logdump:Trail文件分析工具
- GoldenGate Monitor:图形化监控工具
- Oracle Enterprise Manager:综合管理平台
- 自定义脚本:自动化故障检测和处理脚本
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
