1. 首页 > MongoDB教程 > 正文

MongoDB教程FG094-MongoDB数据库故障处理实战

本文主要介绍MongoDB数据库的故障处理策略,包括常见故障类型、故障诊断方法、故障恢复流程和最佳实践等内容。风哥教程参考MongoDB官方文档Troubleshooting相关章节。

目录大纲

Part01-基础概念与理论知识

Part02-生产环境规划与建议

Part03-生产环境项目实施方案

Part04-生产案例与实战讲解

Part05-风哥经验总结与分享

Part01-基础概念与理论知识

1.1 故障处理概述

故障处理是MongoDB数据库运维的重要组成部分,通过及时发现和解决故障,可以确保数据库的稳定运行,减少业务中断的风险。故障处理包括故障诊断、故障恢复和故障预防等环节。

故障处理的目标是:

  • 快速发现故障
  • 准确诊断故障原因
  • 及时恢复服务
  • 预防类似故障再次发生

故障处理的基本原则:

  • 保持冷静,有序处理
  • 优先恢复服务
  • 记录故障过程和解决方案
  • 分析故障原因,制定预防措施

学习交流加群风哥微信: itpux-com

1.2 常见故障类型

MongoDB常见的故障类型包括:

  • 服务故障:MongoDB服务无法启动或崩溃
  • 数据损坏:数据文件损坏或丢失
  • 网络故障:网络连接中断或延迟
  • 性能故障:查询速度慢或系统响应时间长
  • 安全故障:未授权访问或数据泄露
  • 复制集故障:复制集成员无法同步或选举失败
  • 分片集群故障:分片集群无法正常工作

更多视频教程www.fgedu.net.cn

Part02-生产环境规划与建议

2.1 故障预防策略

故障预防策略:

  • 监控与告警:设置监控系统,及时发现异常
  • 定期备份:定期备份数据,确保数据安全
  • 高可用架构:部署复制集或分片集群,提高系统可用性
  • 定期维护:定期进行数据库维护,如索引优化、空间清理等
  • 版本管理:及时更新MongoDB版本,修复安全漏洞
  • 容量规划:合理规划存储容量,避免空间不足
  • 故障演练:定期进行故障演练,提高应对故障的能力

风哥提示:故障预防是故障处理的重要环节,通过预防措施可以减少故障的发生。

2.2 故障处理流程

故障处理流程:

  1. 故障发现:通过监控系统或用户反馈发现故障
  2. 故障诊断:收集信息,分析故障原因
  3. 故障评估:评估故障的影响范围和严重程度
  4. 故障恢复:采取措施恢复服务
  5. 故障验证:验证服务是否恢复正常
  6. 故障分析:分析故障原因,制定预防措施
  7. 故障记录:记录故障过程和解决方案

更多学习教程公众号风哥教程itpux_com

Part03-生产环境项目实施方案

3.1 故障诊断

故障诊断工具和方法:

# 1. 查看MongoDB日志
tail -f /mongodb/logs/mongod.log

# 2. 检查MongoDB服务状态
systemctl status mongod

# 3. 查看数据库状态
mongosh –eval “db.serverStatus()”

# 4. 查看复制集状态
mongosh –eval “rs.status()”

# 5. 查看当前操作
mongosh –eval “db.currentOp()”

# 6. 检查系统资源使用情况
top
free -h
df -h

# 7. 检查网络连接
ping 192.168.1.100
netstat -tuln

3.2 故障恢复

服务故障恢复:

# 1. 重启MongoDB服务
systemctl restart mongod

# 2. 修复数据文件
mongod –repair –dbpath /mongodb/fgdata

# 3. 从备份恢复
mongorestore –host 192.168.1.100 –port 27017 –username fgedu –password fgedu123 –authenticationDatabase admin /mongodb/backup/20260408

# 4. 复制集故障恢复
# 重新初始化复制集
mongosh –eval “rs.initiate({ _id: ‘fgedu-repl’, members: [{ _id: 0, host: ‘192.168.1.100:27017’ }, { _id: 1, host: ‘192.168.1.101:27017’ }, { _id: 2, host: ‘192.168.1.102:27017’ }] })”

# 5. 分片集群故障恢复
# 重启config server
systemctl restart mongod-config
# 重启mongos
systemctl restart mongos

3.3 故障演练

故障演练:

# 1. 模拟服务故障
# 停止MongoDB服务
systemctl stop mongod
# 验证服务停止
systemctl status mongod
# 启动MongoDB服务
systemctl start mongod
# 验证服务启动
systemctl status mongod

# 2. 模拟复制集故障
# 停止一个复制集成员
systemctl stop mongod-member1
# 查看复制集状态
mongosh –eval “rs.status()”
# 启动复制集成员
systemctl start mongod-member1
# 查看复制集状态
mongosh –eval “rs.status()”

# 3. 模拟数据损坏
# 备份数据
mongodump –out /mongodb/backup/test
# 模拟数据损坏(删除数据文件)
rm -f /mongodb/fgdata/*.wt
# 从备份恢复
mongorestore /mongodb/backup/test

Part04-生产案例与实战讲解

4.1 服务故障处理

服务故障处理实战:

# 1. 故障现象:MongoDB服务无法启动
# 检查日志
tail -f /mongodb/logs/mongod.log
# 输出:
2026-04-08T10:00:00.000+0800 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify –sslDisabledProtocols ‘none’
2026-04-08T10:00:00.000+0800 I CONTROL [main] MongoDB starting : pid=12345 port=27017 dbpath=/mongodb/fgdata 64-bit host=fgedu.net.cn
2026-04-08T10:00:00.000+0800 I CONTROL [main] db version v5.0.0
2026-04-08T10:00:00.000+0800 E STORAGE [main] WiredTiger error (17) [1617820800:0:0]: __open_file: 8594: /mongodb/fgdata/WiredTiger.wt: handle-open: open: File exists
2026-04-08T10:00:00.000+0800 E STORAGE [main] WiredTiger error (0) [1617820800:0:0]: WiredTiger_open config: 0x10: 256: the process must exit and restart: WT_PANIC: WiredTiger library panic
2026-04-08T10:00:00.000+0800 F – [main] Fatal Assertion 28559 at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp 706

# 2. 故障原因:WiredTiger数据文件损坏

# 3. 故障处理:
# 修复数据文件
mongod –repair –dbpath /mongodb/fgdata
# 输出:
2026-04-08T10:05:00.000+0800 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify –sslDisabledProtocols ‘none’
2026-04-08T10:05:00.000+0800 I CONTROL [main] MongoDB starting : pid=12346 port=27017 dbpath=/mongodb/fgdata 64-bit host=fgedu.net.cn
2026-04-08T10:05:00.000+0800 I CONTROL [main] db version v5.0.0
2026-04-08T10:05:00.000+0800 I STORAGE [main] Detected data files in /mongodb/fgdata created by the ‘wiredTiger’ storage engine, so setting the active storage engine to ‘wiredTiger’.
2026-04-08T10:05:00.000+0800 I STORAGE [main] wiredtiger_open config: create,cache_size=16G,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress],
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Main recovery loop:
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Recovering log 100000000000000001
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Recovering log 100000000000000002
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Recovering log 100000000000000003
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Set global recovery timestamp: (0, 0)
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Recovery complete
2026-04-08T10:05:00.000+0800 I CONTROL [main] MongoDB repair complete

# 4. 启动MongoDB服务
systemctl start mongod
systemctl status mongod
# 输出:
mongod.service – MongoDB Database Server
Loaded: loaded (/usr/lib/systemd/system/mongod.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2026-04-08 10:10:00 CST; 1min ago
Main PID: 12347 (mongod)
CGroup: /system.slice/mongod.service
└─12347 /mongodb/app/bin/mongod -f /mongodb/app/mongod.conf

4.2 数据损坏处理

数据损坏处理实战:

# 1. 故障现象:数据文件损坏,查询报错
mongosh –eval “db.fgedu_users.find()”
# 输出:
Error: error doing query: failed: Location40415: BSON field ‘insert.documents.0’ is the wrong type ‘array’, expected type ‘object’

# 2. 故障原因:数据文件损坏

# 3. 故障处理:
# 方法一:从备份恢复
# 停止MongoDB服务
systemctl stop mongod
# 删除损坏的数据文件
rm -rf /mongodb/fgdata/*
# 从备份恢复
mongorestore –host 192.168.1.100 –port 27017 –username fgedu –password fgedu123 –authenticationDatabase admin /mongodb/backup/20260408
# 输出:
2026-04-08T10:15:00.000+0800 preparing collections to restore from
2026-04-08T10:15:00.000+0800 reading metadata for fgedudb.fgedu_users from /mongodb/backup/20260408/fgedudb/fgedu_users.metadata.json
2026-04-08T10:15:00.000+0800 restoring fgedudb.fgedu_users from /mongodb/backup/20260408/fgedudb/fgedu_users.bson
2026-04-08T10:15:00.000+0800 finished restoring fgedudb.fgedu_users (10000 documents)
2026-04-08T10:15:00.000+0800 done

# 方法二:使用mongodump和mongorestore
# 导出数据
mongodump –host 192.168.1.100 –port 27017 –username fgedu –password fgedu123 –authenticationDatabase admin –out /mongodb/backup/emergency
# 导入数据到新实例
mongorestore –host 192.168.1.101 –port 27017 –username fgedu –password fgedu123 –authenticationDatabase admin /mongodb/backup/emergency

from MongoDB视频:www.itpux.com

4.3 网络故障处理

网络故障处理实战:

# 1. 故障现象:复制集成员无法连接
mongosh –eval “rs.status()”
# 输出:
{
“set”: “fgedu-repl”,
“date”: ISODate(“2026-04-08T10:20:00.000Z”),
“myState”: 1,
“term”: NumberLong(1),
“syncSource”: “”,
“syncSourceHost”: “”,
“heartbeatIntervalMillis”: NumberLong(2000),
“majorityVoteCount”: 2,
“writeMajorityCount”: 2,
“votingMembersCount”: 3,
“writableVotingMembersCount”: 2,
“optimes”: {
“lastAppliedOpTime”: { “ts”: Timestamp(1617822000, 1), “t”: NumberLong(1) },
“lastDurableOpTime”: { “ts”: Timestamp(1617822000, 1), “t”: NumberLong(1) },
“lastAppliedWallTime”: ISODate(“2026-04-08T10:20:00.000Z”),
“lastDurableWallTime”: ISODate(“2026-04-08T10:20:00.000Z”)
},
“lastStableRecoveryTimestamp”: Timestamp(1617822000, 1),
“electionCandidateMetrics”: {
“lastElectionReason”: “electionTimeout”,
“lastElectionDate”: ISODate(“2026-04-08T10:00:00.000Z”),
“electionTerm”: NumberLong(1),
“lastCommittedOpTimeAtElection”: { “ts”: Timestamp(1617820800, 1), “t”: NumberLong(-1) },
“lastSeenOpTimeAtElection”: { “ts”: Timestamp(1617820800, 1), “t”: NumberLong(-1) },
“numVotesNeeded”: 2,
“priorityAtElection”: 1,
“electionTimeoutMillis”: NumberLong(10000),
“numCatchUpOps”: NumberLong(0),
“newTermStartDate”: ISODate(“2026-04-08T10:00:00.000Z”),
“wMajorityWriteAvailabilityDate”: ISODate(“2026-04-08T10:00:00.000Z”)
},
“members”: [
{
“_id”: 0,
“name”: “192.168.1.100:27017”,
“health”: 1,
“state”: 1,
“stateStr”: “PRIMARY”,
“uptime”: 1200,
“optime”: { “ts”: Timestamp(1617822000, 1), “t”: NumberLong(1) },
“optimeDate”: ISODate(“2026-04-08T10:20:00.000Z”),
“syncSource”: “”,
“syncSourceHost”: “”,
“infoMessage”: “”,
“electionTime”: Timestamp(1617820800, 1),
“electionDate”: ISODate(“2026-04-08T10:00:00.000Z”),
“configVersion”: 1,
“configTerm”: 1,
“self”: true,
“lastHeartbeatMessage”: “”
},
{
“_id”: 1,
“name”: “192.168.1.101:27017”,
“health”: 0,
“state”: 8,
“stateStr”: “(not reachable/healthy)”,
“uptime”: 0,
“optime”: { “ts”: Timestamp(0, 0), “t”: NumberLong(-1) },
“optimeDurable”: { “ts”: Timestamp(0, 0), “t”: NumberLong(-1) },
“optimeDate”: ISODate(“1970-01-01T00:00:00.000Z”),
“optimeDurableDate”: ISODate(“1970-01-01T00:00:00.000Z”),
“lastAppliedWallTime”: ISODate(“1970-01-01T00:00:00.000Z”),
“lastDurableWallTime”: ISODate(“1970-01-01T00:00:00.000Z”),
“lastHeartbeat”: ISODate(“2026-04-08T10:19:58.000Z”),
“lastHeartbeatRecv”: ISODate(“2026-04-08T10:19:58.000Z”),
“pingMs”: NumberLong(0),
“lastHeartbeatMessage”: “Connection refused”,
“syncSource”: “”,
“syncSourceHost”: “”,
“infoMessage”: “”,
“configVersion”: -1,
“configTerm”: -1
},
{
“_id”: 2,
“name”: “192.168.1.102:27017”,
“health”: 1,
“state”: 2,
“stateStr”: “SECONDARY”,
“uptime”: 1200,
“optime”: { “ts”: Timestamp(1617822000, 1), “t”: NumberLong(1) },
“optimeDurable”: { “ts”: Timestamp(1617822000, 1), “t”: NumberLong(1) },
“optimeDate”: ISODate(“2026-04-08T10:20:00.000Z”),
“optimeDurableDate”: ISODate(“2026-04-08T10:20:00.000Z”),
“lastAppliedWallTime”: ISODate(“2026-04-08T10:20:00.000Z”),
“lastDurableWallTime”: ISODate(“2026-04-08T10:20:00.000Z”),
“lastHeartbeat”: ISODate(“2026-04-08T10:20:00.000Z”),
“lastHeartbeatRecv”: ISODate(“2026-04-08T10:20:00.000Z”),
“pingMs”: NumberLong(0),
“syncSource”: “192.168.1.100:27017”,
“syncSourceHost”: “192.168.1.100:27017”,
“infoMessage”: “”,
“configVersion”: 1,
“configTerm”: 1
}
],
“ok”: 1,
“$clusterTime”: {
“clusterTime”: Timestamp(1617822000, 1),
“signature”: {
“hash”: BinData(0, “AAAAAAAAAAAAAAAAAAAAAAAAAAA=”),
“keyId”: NumberLong(0)
}
},
“operationTime”: Timestamp(1617822000, 1)
}

# 2. 故障原因:192.168.1.101网络连接失败

# 3. 故障处理:
# 检查网络连接
ping 192.168.1.101
# 输出:
PING 192.168.1.101 (192.168.1.101) 56(84) bytes of data.
From 192.168.1.100 icmp_seq=1 Destination Host Unreachable

# 检查防火墙
firewall-cmd –list-all
# 输出:
public (active)
target: default
icmp-block-inversion: no
interfaces: eth0
sources:
services: ssh dhcpv6-client
ports: 27017/tcp
protocols:
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:

# 开放MongoDB端口
firewall-cmd –permanent –add-port=27017/tcp
firewall-cmd –reload

# 检查MongoDB服务状态
ssh 192.168.1.101 “systemctl status mongod”
# 输出:
mongod.service – MongoDB Database Server
Loaded: loaded (/usr/lib/systemd/system/mongod.service; enabled; vendor preset: disabled)
Active: inactive (dead)

# 启动MongoDB服务
ssh 192.168.1.101 “systemctl start mongod”

# 验证复制集状态
mongosh –eval “rs.status()”
# 输出:
{
“set”: “fgedu-repl”,
“date”: ISODate(“2026-04-08T10:25:00.000Z”),
“myState”: 1,
“term”: NumberLong(1),
“syncSource”: “”,
“syncSourceHost”: “”,
“heartbeatIntervalMillis”: NumberLong(2000),
“majorityVoteCount”: 2,
“writeMajorityCount”: 2,
“votingMembersCount”: 3,
“writableVotingMembersCount”: 3,
“optimes”: {
“lastAppliedOpTime”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“lastDurableOpTime”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“lastAppliedWallTime”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastDurableWallTime”: ISODate(“2026-04-08T10:25:00.000Z”)
},
“lastStableRecoveryTimestamp”: Timestamp(1617822300, 1),
“members”: [
{
“_id”: 0,
“name”: “192.168.1.100:27017”,
“health”: 1,
“state”: 1,
“stateStr”: “PRIMARY”,
“uptime”: 1500,
“optime”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“optimeDate”: ISODate(“2026-04-08T10:25:00.000Z”),
“syncSource”: “”,
“syncSourceHost”: “”,
“infoMessage”: “”,
“electionTime”: Timestamp(1617820800, 1),
“electionDate”: ISODate(“2026-04-08T10:00:00.000Z”),
“configVersion”: 1,
“configTerm”: 1,
“self”: true,
“lastHeartbeatMessage”: “”
},
{
“_id”: 1,
“name”: “192.168.1.101:27017”,
“health”: 1,
“state”: 2,
“stateStr”: “SECONDARY”,
“uptime”: 300,
“optime”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“optimeDurable”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“optimeDate”: ISODate(“2026-04-08T10:25:00.000Z”),
“optimeDurableDate”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastAppliedWallTime”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastDurableWallTime”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastHeartbeat”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastHeartbeatRecv”: ISODate(“2026-04-08T10:25:00.000Z”),
“pingMs”: NumberLong(0),
“syncSource”: “192.168.1.100:27017”,
“syncSourceHost”: “192.168.1.100:27017”,
“infoMessage”: “”,
“configVersion”: 1,
“configTerm”: 1
},
{
“_id”: 2,
“name”: “192.168.1.102:27017”,
“health”: 1,
“state”: 2,
“stateStr”: “SECONDARY”,
“uptime”: 1500,
“optime”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“optimeDurable”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“optimeDate”: ISODate(“2026-04-08T10:25:00.000Z”),
“optimeDurableDate”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastAppliedWallTime”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastDurableWallTime”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastHeartbeat”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastHeartbeatRecv”: ISODate(“2026-04-08T10:25:00.000Z”),
“pingMs”: NumberLong(0),
“syncSource”: “192.168.1.100:27017”,
“syncSourceHost”: “192.168.1.100:27017”,
“infoMessage”: “”,
“configVersion”: 1,
“configTerm”: 1
}
],
“ok”: 1,
“$clusterTime”: {
“clusterTime”: Timestamp(1617822300, 1),
“signature”: {
“hash”: BinData(0, “AAAAAAAAAAAAAAAAAAAAAAAAAAA=”),
“keyId”: NumberLong(0)
}
},
“operationTime”: Timestamp(1617822300, 1)
}

风哥提示:网络故障处理的关键是检查网络连接、防火墙配置和服务状态。

Part05-风哥经验总结与分享

5.1 故障处理最佳实践

风哥建议的故障处理最佳实践:

  • 定期备份:定期备份数据,确保数据安全
  • 监控告警:设置监控系统,及时发现异常
  • 高可用架构:部署复制集或分片集群,提高系统可用性
  • 故障演练:定期进行故障演练,提高应对故障的能力
  • 文档记录:记录故障处理过程和解决方案
  • 快速响应:快速响应故障,减少业务中断时间
  • 根本原因分析:分析故障根本原因,制定预防措施
  • 持续改进:持续改进故障处理流程和预防措施

学习交流加群风哥QQ113257174

5.2 常见故障与解决方案

常见故障与解决方案:

  • 问题:MongoDB服务无法启动
  • 解决方案:检查日志,修复数据文件,检查配置文件
  • 问题:数据文件损坏
  • 解决方案:从备份恢复,使用mongodump/mongorestore
  • 问题:复制集成员无法同步
  • 解决方案:检查网络连接,检查防火墙配置,重新初始化复制集
  • 问题:查询速度慢
  • 解决方案:创建合适的索引,优化查询语句,增加硬件资源
  • 问题:内存使用高
  • 解决方案:调整WiredTiger缓存大小,清理无用数据

更多视频教程www.fgedu.net.cn

注意事项

  • 定期备份数据,确保数据安全
  • 设置监控系统,及时发现异常
  • 部署高可用架构,提高系统可用性
  • 定期进行故障演练,提高应对故障的能力
  • 记录故障处理过程和解决方案
  • 快速响应故障,减少业务中断时间
  • 分析故障根本原因,制定预防措施
  • 持续改进故障处理流程和预防措施

本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html

联系我们

在线咨询:点击这里给我发消息

微信号:itpux-com

工作日:9:30-18:30,节假日休息