1. 首页 > Oracle教程 > 正文

Oracle教程FG250-Oracle GoldenGate故障处理与恢复实战

本文档风哥主要介绍Oracle GoldenGate故障处理与恢复的实战操作,包括Oracle GoldenGate故障处理的概念、类型、流程、处理方法、恢复操作等内容,由风哥教程参考Oracle官方文档GoldenGate文档,适合DBA人员在学习和测试中使用,如果要应用于生产环境则需要自行确认。

Part01-基础概念与理论知识

1.1 Oracle GoldenGate故障处理概念

Oracle GoldenGate故障处理是指在GoldenGate运行过程中,当发生异常情况时,通过诊断、分析和修复等操作,使GoldenGate恢复正常运行的过程。故障处理是GoldenGate运维管理的重要组成部分。更多视频教程www.fgedu.net.cn

Oracle GoldenGate故障处理的主要目标:

  • 快速诊断故障原因
  • 最小化业务影响
  • 恢复数据复制服务
  • 防止故障再次发生
  • 积累故障处理经验

1.2 Oracle GoldenGate常见故障类型

Oracle GoldenGate常见故障类型:

  • 进程故障:Extract或Replicat进程ABENDED或STOPPED
  • 网络故障:网络连接中断或延迟过高
  • 数据库故障:源端或目标端数据库不可用
  • 存储故障:Trail文件存储空间不足或损坏
  • 数据故障:数据不一致或数据损坏
  • 配置故障:参数配置错误或权限不足

1.3 Oracle GoldenGate故障处理流程

Oracle GoldenGate故障处理标准流程:

  • 故障发现:通过监控告警或巡检发现故障
  • 信息收集:收集进程状态、日志、报告等信息
  • 故障诊断:分析故障原因,定位问题根源
  • 故障处理:根据诊断结果执行修复操作
  • 验证恢复:验证故障是否已解决,服务是否恢复
  • 总结记录:记录故障处理过程,总结经验教训
风哥提示:故障处理要遵循”先恢复、后分析”的原则。首先要尽快恢复服务,减少业务影响,然后再进行详细的故障分析和根因定位。学习交流加群风哥微信: itpux-com

Part02-生产环境规划与建议

2.1 Oracle GoldenGate故障处理规划

Oracle GoldenGate故障处理规划要点:

# 故障分级规划
一级故障:复制完全中断
– 影响:数据复制停止
– 响应时间:立即响应
– 处理时间:30分钟内恢复

二级故障:复制延迟过高
– 影响:数据延迟超过阈值
– 响应时间:15分钟内响应
– 处理时间:2小时内恢复

三级故障:偶发错误
– 影响:个别记录处理失败
– 响应时间:1小时内响应
– 处理时间:4小时内恢复

# 应急预案规划
– 建立故障响应团队
– 制定故障升级机制
– 准备故障处理手册
– 定期进行故障演练

2.2 Oracle GoldenGate故障预防策略

Oracle GoldenGate故障预防策略:

  • 监控告警:建立完善的监控体系,及时发现异常
  • 定期检查:定期检查进程状态、延迟、日志等
  • 容量规划:合理规划存储空间,避免空间不足
  • 备份配置:定期备份GoldenGate配置文件
  • 变更管理:规范变更流程,避免配置错误
  • 演练测试:定期进行故障演练,提高处理能力

2.3 Oracle GoldenGate故障处理最佳实践

Oracle GoldenGate故障处理最佳实践:

  • 保留现场:故障发生时保留现场信息,便于分析
  • 查看日志:首先查看进程报告和日志文件
  • 逐步排查:按照网络、数据库、GoldenGate顺序排查
  • 谨慎操作:避免盲目重启,防止数据丢失
  • 记录过程:详细记录故障处理过程
  • 及时通报:及时向相关人员通报故障情况
生产环境建议:建立完善的故障处理流程和应急预案,定期进行故障演练,确保团队具备快速响应和处理故障的能力。学习交流加群风哥QQ113257174

Part03-生产环境项目实施方案

3.1 Oracle GoldenGate Extract故障处理

3.1.1 Extract进程ABENDED处理

# 1. 查看进程状态
GGSCI (fgedu.net.cn) 1> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
EXTRACT ABENDED EXT01 00:00:00 00:15:00

# 2. 查看错误报告
GGSCI (fgedu.net.cn) 2> view report ext01

2026-04-03 18:00:00 ERROR OGG-00664 Oracle GoldenGate Capture for Oracle, ext01.prm: OCI Error beginning session (status = 28009-ORA-28009: connection as SYS should be as SYSDBA or SYSOPER).
2026-04-03 18:00:00 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, ext01.prm: PROCESS ABENDING.

# 3. 诊断原因:数据库连接失败
# 检查数据库连接
$ sqlplus fgedu_ggadmin/password@fgedudb

SQL*Plus: Release 19.0.0.0.0 – Production on Thu Apr 3 18:15:00 2026

Copyright (c) 1982, 2019, Oracle. All rights reserved.

ERROR:
ORA-28009: connection as SYS should be as SYSDBA or SYSOPER

# 4. 解决方案
# 检查tnsnames.ora配置
# 检查数据库监听器状态
$ lsnrctl status

LSNRCTL for Linux: Version 19.0.0.0.0 – Production on 03-APR-2026 18:16:00

Copyright (c) 1991, 2019, Oracle. All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=fgedu.net.cn)(PORT=1521)))
STATUS of the LISTENER
————————
Alias LISTENER
Version TNSLSNR for Linux: Version 19.0.0.0.0
Start Date 03-APR-2026 08:00:00
Uptime 0 days 10 hr. 16 min. 0 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /oracle/app/oracle/product/19c/dbhome_1/network/admin/listener.ora
Listener Log File /oracle/app/oracle/diag/tnslsnr/fgedu/listener/alert/log.xml
Listening Endpoints Summary…
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=fgedu.net.cn)(PORT=1521)))
Services Summary…
Service “fgedudb” has 1 instance(s).
Instance “fgedudb”, status READY, has 1 handler(s) for this service…
The command completed successfully

# 5. 修复后重启Extract
GGSCI (fgedu.net.cn) 3> start extract ext01

Sending START request to MANAGER …
EXTRACT EXT01 starting

# 6. 验证恢复
GGSCI (fgedu.net.cn) 4> info extract ext01

EXTRACT EXT01 Last Started 2026-04-03 18:20:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)

3.1.2 Extract延迟过高处理

# 1. 查看进程延迟
GGSCI (fgedu.net.cn) 1> info extract ext01

EXTRACT EXT01 Last Started 2026-04-03 19:00:00 Status RUNNING
Checkpoint Lag 02:30:00 (updated 00:00:05 ago)

# 2. 查看统计信息
GGSCI (fgedu.net.cn) 2> stats extract ext01

Sending STATS request to EXTRACT EXT01 …

Start of Statistics at 2026-04-03 21:30:00.

Extracting from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS:
*** Total statistics since 2026-04-03 19:00:00 ***
Total inserts 100.00
Total updates 50.00
Total deletes 10.00
Total operations 160.00

# 3. 诊断原因:数据库日志生成量过大
SQL> SELECT thread#, sequence#, bytes/1024/1024 mb, first_time
FROM v$archived_log
WHERE first_time > sysdate – 1/24
ORDER BY first_time DESC;

THREAD# SEQUENCE# MB FIRST_TIME
———- ———- ———- ——————-
1 1234 1000 2026-04-03 21:30:00
1 1233 1000 2026-04-03 21:29:00
1 1232 1000 2026-04-03 21:28:00

# 4. 解决方案
# 增加Extract并行度
GGSCI (fgedu.net.cn) 3> edit params ext01
# 修改参数:TRANLOGOPTIONS INTEGRATEDPARAMS (parallelism 4)

# 重启Extract
GGSCI (fgedu.net.cn) 4> stop extract ext01
GGSCI (fgedu.net.cn) 5> start extract ext01

# 5. 监控效果
GGSCI (fgedu.net.cn) 6> info extract ext01

EXTRACT EXT01 Last Started 2026-04-03 21:35:00 Status RUNNING
Checkpoint Lag 00:05:00 (updated 00:00:03 ago)

3.2 Oracle GoldenGate Replicat故障处理

3.2.1 Replicat进程ABENDED处理

# 1. 查看进程状态
GGSCI (fgedu-target.net.cn) 1> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
REPLICAT ABENDED REP01 00:00:00 00:30:00

# 2. 查看错误报告
GGSCI (fgedu-target.net.cn) 2> view report rep01

2026-04-03 20:00:00 ERROR OGG-01296 Oracle GoldenGate Delivery for Oracle, rep01.prm: Error mapping from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS.
2026-04-03 20:00:00 ERROR OGG-01161 Oracle GoldenGate Delivery for Oracle, rep01.prm: Bad column index (5) specified for table FGEDU.FGEDU_ORDERS, max columns = 4.

# 3. 诊断原因:表结构不一致
SQL> desc fgedu.fgedu_orders
Name Null? Type
—————————————– ——– —————————-
ORDER_ID NOT NULL NUMBER(10)
CUSTOMER_ID NUMBER(10)
ORDER_DATE DATE
ORDER_AMOUNT NUMBER(10,2)

# 4. 解决方案
# 添加缺失列
SQL> ALTER TABLE fgedu.fgedu_orders ADD (order_status VARCHAR2(20));

Table altered.

# 5. 重启Replicat
GGSCI (fgedu-target.net.cn) 3> start replicat rep01

Sending START request to MANAGER …
REPLICAT REP01 starting

# 6. 验证恢复
GGSCI (fgedu-target.net.cn) 4> info replicat rep01

REPLICAT REP01 Last Started 2026-04-03 20:30:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:02 ago)

3.2.2 Replicat数据冲突处理

# 1. 查看错误报告
GGSCI (fgedu-target.net.cn) 1> view report rep01

2026-04-03 22:00:00 ERROR OGG-01296 Oracle GoldenGate Delivery for Oracle, rep01.prm: Error mapping from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS.
2026-04-03 22:00:00 ERROR OGG-01163 Oracle GoldenGate Delivery for Oracle, rep01.prm: BAD record.
2026-04-03 22:00:00 ERROR OGG-01020 Oracle GoldenGate Delivery for Oracle, rep01.prm: Processed record: order_id=10001.

# 2. 查看Discard文件
GGSCI (fgedu-target.net.cn) 2> shell cat /oracle/goldengate/dirdat/rep01.dsc

Oracle GoldenGate Delivery for Oracle process started
Error: ORA-00001: unique constraint (FGEDU.PK_FGEDU_ORDERS) violated
Record: order_id=10001, customer_id=1000, order_amount=1500.00

# 3. 诊断原因:主键冲突
# 目标端已存在相同主键的记录

# 4. 解决方案1:跳过冲突记录
GGSCI (fgedu-target.net.cn) 3> start replicat rep01, skiptransaction

Sending START request to MANAGER …
REPLICAT REP01 starting

# 解决方案2:删除目标端重复记录
SQL> DELETE FROM fgedu.fgedu_orders WHERE order_id = 10001;
SQL> COMMIT;

GGSCI (fgedu-target.net.cn) 4> start replicat rep01

# 5. 验证恢复
GGSCI (fgedu-target.net.cn) 5> info replicat rep01

REPLICAT REP01 Last Started 2026-04-03 22:30:00 Status RUNNING

3.3 Oracle GoldenGate故障恢复操作

3.3.1 从检查点恢复

# 场景:Extract进程异常终止,需要从检查点恢复

# 1. 查看当前检查点
GGSCI (fgedu.net.cn) 1> info extract ext01, showch

EXTRACT EXT01 Last Started 2026-04-03 23:00:00 Status ABENDED
Checkpoint Lag 00:00:00 (updated 00:00:00 ago)

Read Checkpoint #1
Oracle Integrated Redo Logs
Startup Checkpoint (starting position in the data source):
Sequence #: 100
RBA: 12345678
Timestamp: 2026-04-03 23:00:00.000000
SCN: 12345678
Redo File: /oradata/fgedudb/redo01.log

Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
Sequence #: 100
RBA: 12345678
Timestamp: 2026-04-03 23:00:00.000000
SCN: 12345678
Redo File: /oradata/fgedudb/redo01.log

Current Checkpoint (position of last record read in the data source):
Sequence #: 100
RBA: 12345678
Timestamp: 2026-04-03 23:00:00.000000
SCN: 12345678
Redo File: /oradata/fgedudb/redo01.log

# 2. 从检查点重启
GGSCI (fgedu.net.cn) 2> start extract ext01

Sending START request to MANAGER …
EXTRACT EXT01 starting

# 3. 验证恢复
GGSCI (fgedu.net.cn) 3> info extract ext01

EXTRACT EXT01 Last Started 2026-04-03 23:05:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)

3.3.2 从指定时间恢复

# 场景:需要重新从指定时间开始复制

# 1. 停止进程
GGSCI (fgedu.net.cn) 1> stop extract ext01

Sending STOP request to EXTRACT EXT01 …
Request processed.

# 2. 修改检查点到指定时间
GGSCI (fgedu.net.cn) 2> alter extract ext01, begin 2026-04-03 20:00:00
EXTRACT altered.

# 3. 重新注册Extract(集成模式)
GGSCI (fgedu.net.cn) 3> dblogin useridalias fgedu_ggadmin
Successfully logged into database.

GGSCI (fgedu.net.cn) 4> unregister extract ext01 database
GGSCI (fgedu.net.cn) 5> register extract ext01 database
Extract EXT01 successfully registered with database at SCN 12346000.

# 4. 启动进程
GGSCI (fgedu.net.cn) 6> start extract ext01

Sending START request to MANAGER …
EXTRACT EXT01 starting

# 5. 验证恢复
GGSCI (fgedu.net.cn) 7> info extract ext01

EXTRACT EXT01 Last Started 2026-04-03 23:30:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:02 ago)
Log Read Checkpoint Oracle Integrated Redo Logs
2026-04-03 20:00:00 SCN 12346000

风哥提示:故障恢复操作要谨慎,特别是涉及数据一致性的操作。建议在执行恢复操作前备份相关配置和数据,确保可以回滚。更多学习教程公众号风哥教程itpux_com

Part04-生产案例与实战讲解

4.1 Oracle GoldenGate进程ABENDED处理案例

以下是一个完整的进程ABENDED处理案例:

# 场景:Extract进程ABENDED,错误代码OGG-00664

# 1. 发现故障
GGSCI (fgedu.net.cn) 1> info all

Program Status Group Lag at Chkpt Time Since Chkpt

MANAGER RUNNING
EXTRACT ABENDED EXT01 00:00:00 00:45:00

# 2. 收集信息
GGSCI (fgedu.net.cn) 2> view report ext01

2026-04-03 23:45:00 ERROR OGG-00664 Oracle GoldenGate Capture for Oracle, ext01.prm: OCI Error beginning session (status = 1017-ORA-01017: invalid username/password; logon denied).
2026-04-03 23:45:00 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, ext01.prm: PROCESS ABENDING.

GGSCI (fgedu.net.cn) 3> view ggsevt

2026-04-03 23:45:00 ERROR OGG-00664 Oracle GoldenGate Capture for Oracle, ext01.prm: OCI Error beginning session.

# 3. 诊断分析
# 错误原因:数据库密码过期或错误

# 4. 验证数据库连接
$ sqlplus fgedu_ggadmin/wrong_password@fgedudb

SQL*Plus: Release 19.0.0.0.0 – Production on Thu Apr 3 23:50:00 2026

Copyright (c) 1982, 2019, Oracle. All rights reserved.

ERROR:
ORA-01017: invalid username/password; logon denied

# 5. 解决方案
# 重置数据库密码
SQL> ALTER USER fgedu_ggadmin IDENTIFIED BY new_password;

User altered.

# 更新GoldenGate凭证
GGSCI (fgedu.net.cn) 4> alter credentialstore replace user fgedu_ggadmin password new_password alias fgedu_ggadmin
Credential store altered.

# 6. 重启进程
GGSCI (fgedu.net.cn) 5> start extract ext01

Sending START request to MANAGER …
EXTRACT EXT01 starting

# 7. 验证恢复
GGSCI (fgedu.net.cn) 6> info extract ext01

EXTRACT EXT01 Last Started 2026-04-03 23:55:00 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:03 ago)

GGSCI (fgedu.net.cn) 7> stats extract ext01

Sending STATS request to EXTRACT EXT01 …

Start of Statistics at 2026-04-04 00:00:00.

Extracting from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS:
Total inserts 500.00
Total updates 200.00
Total deletes 50.00
Total operations 750.00

End of Statistics.

4.2 Oracle GoldenGate数据延迟处理案例

以下是一个完整的数据延迟处理案例:

# 场景:Replicat延迟超过2小时

# 1. 发现延迟
GGSCI (fgedu-target.net.cn) 1> info replicat rep01

REPLICAT REP01 Last Started 2026-04-03 20:00:00 Status RUNNING
Checkpoint Lag 02:30:00 (updated 00:00:05 ago)

# 2. 收集信息
GGSCI (fgedu-target.net.cn) 2> stats replicat rep01

Sending STATS request to REPLICAT REP01 …

Start of Statistics at 2026-04-03 22:30:00.

Replicating from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS:
*** Total statistics since 2026-04-03 20:00:00 ***
Total inserts 100.00
Total updates 50.00
Total deletes 10.00
Total operations 160.00

# 3. 诊断分析
# 处理速度过慢,检查目标数据库性能

SQL> SELECT event, total_waits, time_waited
FROM v$system_event
ORDER BY time_waited DESC;

EVENT TOTAL_WAITS TIME_WAITED
—————————————- ———– ———–
db file scattered read 123456 1234567
db file sequential read 234567 2345678
log file sync 34567 345678

# 4. 解决方案
# 优化目标数据库
# – 增加索引
SQL> CREATE INDEX idx_fgedu_orders_customer ON fgedu.fgedu_orders(customer_id);

Index created.

# – 调整Replicat参数
GGSCI (fgedu-target.net.cn) 3> edit params rep01
# 增加参数:
BATCHTRANS 2000
MAXTRANS 1000

# 5. 重启Replicat
GGSCI (fgedu-target.net.cn) 4> stop replicat rep01
GGSCI (fgedu-target.net.cn) 5> start replicat rep01

# 6. 监控效果
GGSCI (fgedu-target.net.cn) 6> info replicat rep01

REPLICAT REP01 Last Started 2026-04-03 22:35:00 Status RUNNING
Checkpoint Lag 00:10:00 (updated 00:00:03 ago)

# 7. 验证恢复
GGSCI (fgedu-target.net.cn) 7> lag replicat rep01

Sending LAG request to REPLICAT REP01 …

Last record lag: 10 seconds.
At EOF, no more records to process.

4.3 Oracle GoldenGate数据不一致恢复案例

以下是一个完整的数据不一致恢复案例:

# 场景:数据校验发现目标端数据缺失

# 1. 发现不一致
SQL> SELECT COUNT(*) FROM fgedu.fgedu_orders@source;

COUNT(*)
———-
10000000

SQL> SELECT COUNT(*) FROM fgedu.fgedu_orders;

COUNT(*)
———-
9999900

# 差异:目标端少100条记录

# 2. 定位缺失数据
SQL> SELECT order_id FROM fgedu.fgedu_orders@source
MINUS
SELECT order_id FROM fgedu.fgedu_orders;

ORDER_ID
———-
99001
99002

99100

# 3. 分析原因
# 查看Replicat报告
GGSCI (fgedu-target.net.cn) 1> view report rep01

2026-04-03 18:00:00 WARNING OGG-01004 Oracle GoldenGate Delivery for Oracle, rep01.prm: Aborted grouped transaction.
2026-04-03 18:00:00 ERROR OGG-01296 Oracle GoldenGate Delivery for Oracle, rep01.prm: Error mapping from FGEDU.FGEDU_ORDERS to FGEDU.FGEDU_ORDERS.

# 4. 解决方案
# 使用GoldenGate修复缺失数据

# 创建修复Extract
GGSCI (fgedu.net.cn) 2> add extract fix_ext, sourceistable

GGSCI (fgedu.net.cn) 3> edit params fix_ext

EXTRACT fix_ext
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin
RMTHOST 192.168.1.100, MGRPORT 7809
RMTTASK REPLICAT, GROUP fix_rep

TABLE fgedu.fgedu_orders, FILTER (order_id >= 99001 AND order_id <= 99100);

# 创建修复Replicat
GGSCI (fgedu-target.net.cn) 2> add replicat fix_rep, specialrun

GGSCI (fgedu-target.net.cn) 3> edit params fix_rep

REPLICAT fix_rep
SETENV (NLS_LANG=”AMERICAN_AMERICA.AL32UTF8″)
USERIDALIAS fgedu_ggadmin_target

HANDLECOLLISIONS
MAP fgedu.fgedu_orders, TARGET fgedu.fgedu_orders;

# 执行修复
GGSCI (fgedu.net.cn) 4> start extract fix_ext

# 5. 验证修复结果
SQL> SELECT COUNT(*) FROM fgedu.fgedu_orders;

COUNT(*)
———-
10000000

# 6. 重新校验
SQL> SELECT order_id FROM fgedu.fgedu_orders@source
MINUS
SELECT order_id FROM fgedu.fgedu_orders;

no rows selected

# 修复成功

生产环境建议:数据不一致修复要谨慎,建议在修复前备份目标表数据。对于大量数据不一致,建议重新进行初始加载。from oracle:www.itpux.com

Part05-风哥经验总结与分享

5.1 Oracle GoldenGate故障处理总结

Oracle GoldenGate故障处理的关键要点:

  • 快速响应:建立快速响应机制,及时发现和处理故障
  • 保留现场:故障发生时保留现场信息,便于分析
  • 系统排查:按照网络、数据库、GoldenGate顺序系统排查
  • 谨慎操作:避免盲目操作,防止问题扩大
  • 记录总结:详细记录故障处理过程,总结经验

5.2 Oracle GoldenGate故障处理检查清单

Oracle GoldenGate故障处理检查清单:

# 故障发现检查清单
□ 进程状态是否正常
□ 延迟是否在正常范围
□ 报告文件是否有错误
□ 系统资源是否正常

# 故障诊断检查清单
□ 查看进程报告和日志
□ 检查网络连接
□ 检查数据库状态
□ 检查存储空间
□ 检查配置参数

# 故障处理检查清单
□ 备份相关配置和数据
□ 制定修复方案
□ 执行修复操作
□ 验证修复结果
□ 记录处理过程

5.3 Oracle GoldenGate故障处理工具推荐

Oracle GoldenGate故障处理推荐工具:

  • GGSCI:GoldenGate命令行管理工具
  • Logdump:Trail文件分析工具
  • GoldenGate Monitor:图形化监控工具
  • Oracle Enterprise Manager:综合管理平台
  • 自定义脚本:自动化故障检测和处理脚本
风哥提示:GoldenGate故障处理需要丰富的经验和系统的知识。建议建立完善的故障处理流程和知识库,定期进行故障演练,提高团队的故障处理能力。更多视频教程www.fgedu.net.cn

本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html

联系我们

在线咨询:点击这里给我发消息

微信号:itpux-com

工作日:9:30-18:30,节假日休息