本文主要介绍MongoDB数据库的故障处理策略,包括常见故障类型、故障诊断方法、故障恢复流程和最佳实践等内容。风哥教程参考MongoDB官方文档Troubleshooting相关章节。
目录大纲
Part01-基础概念与理论知识
1.1 故障处理概述
故障处理是MongoDB数据库运维的重要组成部分,通过及时发现和解决故障,可以确保数据库的稳定运行,减少业务中断的风险。故障处理包括故障诊断、故障恢复和故障预防等环节。
故障处理的目标是:
- 快速发现故障
- 准确诊断故障原因
- 及时恢复服务
- 预防类似故障再次发生
故障处理的基本原则:
- 保持冷静,有序处理
- 优先恢复服务
- 记录故障过程和解决方案
- 分析故障原因,制定预防措施
学习交流加群风哥微信: itpux-com
1.2 常见故障类型
MongoDB常见的故障类型包括:
- 服务故障:MongoDB服务无法启动或崩溃
- 数据损坏:数据文件损坏或丢失
- 网络故障:网络连接中断或延迟
- 性能故障:查询速度慢或系统响应时间长
- 安全故障:未授权访问或数据泄露
- 复制集故障:复制集成员无法同步或选举失败
- 分片集群故障:分片集群无法正常工作
更多视频教程www.fgedu.net.cn
Part02-生产环境规划与建议
2.1 故障预防策略
故障预防策略:
- 监控与告警:设置监控系统,及时发现异常
- 定期备份:定期备份数据,确保数据安全
- 高可用架构:部署复制集或分片集群,提高系统可用性
- 定期维护:定期进行数据库维护,如索引优化、空间清理等
- 版本管理:及时更新MongoDB版本,修复安全漏洞
- 容量规划:合理规划存储容量,避免空间不足
- 故障演练:定期进行故障演练,提高应对故障的能力
风哥提示:故障预防是故障处理的重要环节,通过预防措施可以减少故障的发生。
2.2 故障处理流程
故障处理流程:
- 故障发现:通过监控系统或用户反馈发现故障
- 故障诊断:收集信息,分析故障原因
- 故障评估:评估故障的影响范围和严重程度
- 故障恢复:采取措施恢复服务
- 故障验证:验证服务是否恢复正常
- 故障分析:分析故障原因,制定预防措施
- 故障记录:记录故障过程和解决方案
更多学习教程公众号风哥教程itpux_com
Part03-生产环境项目实施方案
3.1 故障诊断
故障诊断工具和方法:
# 1. 查看MongoDB日志
tail -f /mongodb/logs/mongod.log
# 2. 检查MongoDB服务状态
systemctl status mongod
# 3. 查看数据库状态
mongosh –eval “db.serverStatus()”
# 4. 查看复制集状态
mongosh –eval “rs.status()”
# 5. 查看当前操作
mongosh –eval “db.currentOp()”
# 6. 检查系统资源使用情况
top
free -h
df -h
# 7. 检查网络连接
ping 192.168.1.100
netstat -tuln
3.2 故障恢复
服务故障恢复:
# 1. 重启MongoDB服务
systemctl restart mongod
# 2. 修复数据文件
mongod –repair –dbpath /mongodb/fgdata
# 3. 从备份恢复
mongorestore –host 192.168.1.100 –port 27017 –username fgedu –password fgedu123 –authenticationDatabase admin /mongodb/backup/20260408
# 4. 复制集故障恢复
# 重新初始化复制集
mongosh –eval “rs.initiate({ _id: ‘fgedu-repl’, members: [{ _id: 0, host: ‘192.168.1.100:27017’ }, { _id: 1, host: ‘192.168.1.101:27017’ }, { _id: 2, host: ‘192.168.1.102:27017’ }] })”
# 5. 分片集群故障恢复
# 重启config server
systemctl restart mongod-config
# 重启mongos
systemctl restart mongos
3.3 故障演练
故障演练:
# 1. 模拟服务故障
# 停止MongoDB服务
systemctl stop mongod
# 验证服务停止
systemctl status mongod
# 启动MongoDB服务
systemctl start mongod
# 验证服务启动
systemctl status mongod
# 2. 模拟复制集故障
# 停止一个复制集成员
systemctl stop mongod-member1
# 查看复制集状态
mongosh –eval “rs.status()”
# 启动复制集成员
systemctl start mongod-member1
# 查看复制集状态
mongosh –eval “rs.status()”
# 3. 模拟数据损坏
# 备份数据
mongodump –out /mongodb/backup/test
# 模拟数据损坏(删除数据文件)
rm -f /mongodb/fgdata/*.wt
# 从备份恢复
mongorestore /mongodb/backup/test
Part04-生产案例与实战讲解
4.1 服务故障处理
服务故障处理实战:
# 1. 故障现象:MongoDB服务无法启动
# 检查日志
tail -f /mongodb/logs/mongod.log
# 输出:
2026-04-08T10:00:00.000+0800 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify –sslDisabledProtocols ‘none’
2026-04-08T10:00:00.000+0800 I CONTROL [main] MongoDB starting : pid=12345 port=27017 dbpath=/mongodb/fgdata 64-bit host=fgedu.net.cn
2026-04-08T10:00:00.000+0800 I CONTROL [main] db version v5.0.0
2026-04-08T10:00:00.000+0800 E STORAGE [main] WiredTiger error (17) [1617820800:0:0]: __open_file: 8594: /mongodb/fgdata/WiredTiger.wt: handle-open: open: File exists
2026-04-08T10:00:00.000+0800 E STORAGE [main] WiredTiger error (0) [1617820800:0:0]: WiredTiger_open config: 0x10: 256: the process must exit and restart: WT_PANIC: WiredTiger library panic
2026-04-08T10:00:00.000+0800 F – [main] Fatal Assertion 28559 at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp 706
# 2. 故障原因:WiredTiger数据文件损坏
# 3. 故障处理:
# 修复数据文件
mongod –repair –dbpath /mongodb/fgdata
# 输出:
2026-04-08T10:05:00.000+0800 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify –sslDisabledProtocols ‘none’
2026-04-08T10:05:00.000+0800 I CONTROL [main] MongoDB starting : pid=12346 port=27017 dbpath=/mongodb/fgdata 64-bit host=fgedu.net.cn
2026-04-08T10:05:00.000+0800 I CONTROL [main] db version v5.0.0
2026-04-08T10:05:00.000+0800 I STORAGE [main] Detected data files in /mongodb/fgdata created by the ‘wiredTiger’ storage engine, so setting the active storage engine to ‘wiredTiger’.
2026-04-08T10:05:00.000+0800 I STORAGE [main] wiredtiger_open config: create,cache_size=16G,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress],
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Main recovery loop:
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Recovering log 100000000000000001
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Recovering log 100000000000000002
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Recovering log 100000000000000003
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Set global recovery timestamp: (0, 0)
2026-04-08T10:05:00.000+0800 I STORAGE [main] WiredTiger message [1617821100:0:0], txn-recover: Recovery complete
2026-04-08T10:05:00.000+0800 I CONTROL [main] MongoDB repair complete
# 4. 启动MongoDB服务
systemctl start mongod
systemctl status mongod
# 输出:
mongod.service – MongoDB Database Server
Loaded: loaded (/usr/lib/systemd/system/mongod.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2026-04-08 10:10:00 CST; 1min ago
Main PID: 12347 (mongod)
CGroup: /system.slice/mongod.service
└─12347 /mongodb/app/bin/mongod -f /mongodb/app/mongod.conf
4.2 数据损坏处理
数据损坏处理实战:
# 1. 故障现象:数据文件损坏,查询报错
mongosh –eval “db.fgedu_users.find()”
# 输出:
Error: error doing query: failed: Location40415: BSON field ‘insert.documents.0’ is the wrong type ‘array’, expected type ‘object’
# 2. 故障原因:数据文件损坏
# 3. 故障处理:
# 方法一:从备份恢复
# 停止MongoDB服务
systemctl stop mongod
# 删除损坏的数据文件
rm -rf /mongodb/fgdata/*
# 从备份恢复
mongorestore –host 192.168.1.100 –port 27017 –username fgedu –password fgedu123 –authenticationDatabase admin /mongodb/backup/20260408
# 输出:
2026-04-08T10:15:00.000+0800 preparing collections to restore from
2026-04-08T10:15:00.000+0800 reading metadata for fgedudb.fgedu_users from /mongodb/backup/20260408/fgedudb/fgedu_users.metadata.json
2026-04-08T10:15:00.000+0800 restoring fgedudb.fgedu_users from /mongodb/backup/20260408/fgedudb/fgedu_users.bson
2026-04-08T10:15:00.000+0800 finished restoring fgedudb.fgedu_users (10000 documents)
2026-04-08T10:15:00.000+0800 done
# 方法二:使用mongodump和mongorestore
# 导出数据
mongodump –host 192.168.1.100 –port 27017 –username fgedu –password fgedu123 –authenticationDatabase admin –out /mongodb/backup/emergency
# 导入数据到新实例
mongorestore –host 192.168.1.101 –port 27017 –username fgedu –password fgedu123 –authenticationDatabase admin /mongodb/backup/emergency
from MongoDB视频:www.itpux.com
4.3 网络故障处理
网络故障处理实战:
# 1. 故障现象:复制集成员无法连接
mongosh –eval “rs.status()”
# 输出:
{
“set”: “fgedu-repl”,
“date”: ISODate(“2026-04-08T10:20:00.000Z”),
“myState”: 1,
“term”: NumberLong(1),
“syncSource”: “”,
“syncSourceHost”: “”,
“heartbeatIntervalMillis”: NumberLong(2000),
“majorityVoteCount”: 2,
“writeMajorityCount”: 2,
“votingMembersCount”: 3,
“writableVotingMembersCount”: 2,
“optimes”: {
“lastAppliedOpTime”: { “ts”: Timestamp(1617822000, 1), “t”: NumberLong(1) },
“lastDurableOpTime”: { “ts”: Timestamp(1617822000, 1), “t”: NumberLong(1) },
“lastAppliedWallTime”: ISODate(“2026-04-08T10:20:00.000Z”),
“lastDurableWallTime”: ISODate(“2026-04-08T10:20:00.000Z”)
},
“lastStableRecoveryTimestamp”: Timestamp(1617822000, 1),
“electionCandidateMetrics”: {
“lastElectionReason”: “electionTimeout”,
“lastElectionDate”: ISODate(“2026-04-08T10:00:00.000Z”),
“electionTerm”: NumberLong(1),
“lastCommittedOpTimeAtElection”: { “ts”: Timestamp(1617820800, 1), “t”: NumberLong(-1) },
“lastSeenOpTimeAtElection”: { “ts”: Timestamp(1617820800, 1), “t”: NumberLong(-1) },
“numVotesNeeded”: 2,
“priorityAtElection”: 1,
“electionTimeoutMillis”: NumberLong(10000),
“numCatchUpOps”: NumberLong(0),
“newTermStartDate”: ISODate(“2026-04-08T10:00:00.000Z”),
“wMajorityWriteAvailabilityDate”: ISODate(“2026-04-08T10:00:00.000Z”)
},
“members”: [
{
“_id”: 0,
“name”: “192.168.1.100:27017”,
“health”: 1,
“state”: 1,
“stateStr”: “PRIMARY”,
“uptime”: 1200,
“optime”: { “ts”: Timestamp(1617822000, 1), “t”: NumberLong(1) },
“optimeDate”: ISODate(“2026-04-08T10:20:00.000Z”),
“syncSource”: “”,
“syncSourceHost”: “”,
“infoMessage”: “”,
“electionTime”: Timestamp(1617820800, 1),
“electionDate”: ISODate(“2026-04-08T10:00:00.000Z”),
“configVersion”: 1,
“configTerm”: 1,
“self”: true,
“lastHeartbeatMessage”: “”
},
{
“_id”: 1,
“name”: “192.168.1.101:27017”,
“health”: 0,
“state”: 8,
“stateStr”: “(not reachable/healthy)”,
“uptime”: 0,
“optime”: { “ts”: Timestamp(0, 0), “t”: NumberLong(-1) },
“optimeDurable”: { “ts”: Timestamp(0, 0), “t”: NumberLong(-1) },
“optimeDate”: ISODate(“1970-01-01T00:00:00.000Z”),
“optimeDurableDate”: ISODate(“1970-01-01T00:00:00.000Z”),
“lastAppliedWallTime”: ISODate(“1970-01-01T00:00:00.000Z”),
“lastDurableWallTime”: ISODate(“1970-01-01T00:00:00.000Z”),
“lastHeartbeat”: ISODate(“2026-04-08T10:19:58.000Z”),
“lastHeartbeatRecv”: ISODate(“2026-04-08T10:19:58.000Z”),
“pingMs”: NumberLong(0),
“lastHeartbeatMessage”: “Connection refused”,
“syncSource”: “”,
“syncSourceHost”: “”,
“infoMessage”: “”,
“configVersion”: -1,
“configTerm”: -1
},
{
“_id”: 2,
“name”: “192.168.1.102:27017”,
“health”: 1,
“state”: 2,
“stateStr”: “SECONDARY”,
“uptime”: 1200,
“optime”: { “ts”: Timestamp(1617822000, 1), “t”: NumberLong(1) },
“optimeDurable”: { “ts”: Timestamp(1617822000, 1), “t”: NumberLong(1) },
“optimeDate”: ISODate(“2026-04-08T10:20:00.000Z”),
“optimeDurableDate”: ISODate(“2026-04-08T10:20:00.000Z”),
“lastAppliedWallTime”: ISODate(“2026-04-08T10:20:00.000Z”),
“lastDurableWallTime”: ISODate(“2026-04-08T10:20:00.000Z”),
“lastHeartbeat”: ISODate(“2026-04-08T10:20:00.000Z”),
“lastHeartbeatRecv”: ISODate(“2026-04-08T10:20:00.000Z”),
“pingMs”: NumberLong(0),
“syncSource”: “192.168.1.100:27017”,
“syncSourceHost”: “192.168.1.100:27017”,
“infoMessage”: “”,
“configVersion”: 1,
“configTerm”: 1
}
],
“ok”: 1,
“$clusterTime”: {
“clusterTime”: Timestamp(1617822000, 1),
“signature”: {
“hash”: BinData(0, “AAAAAAAAAAAAAAAAAAAAAAAAAAA=”),
“keyId”: NumberLong(0)
}
},
“operationTime”: Timestamp(1617822000, 1)
}
# 2. 故障原因:192.168.1.101网络连接失败
# 3. 故障处理:
# 检查网络连接
ping 192.168.1.101
# 输出:
PING 192.168.1.101 (192.168.1.101) 56(84) bytes of data.
From 192.168.1.100 icmp_seq=1 Destination Host Unreachable
# 检查防火墙
firewall-cmd –list-all
# 输出:
public (active)
target: default
icmp-block-inversion: no
interfaces: eth0
sources:
services: ssh dhcpv6-client
ports: 27017/tcp
protocols:
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
# 开放MongoDB端口
firewall-cmd –permanent –add-port=27017/tcp
firewall-cmd –reload
# 检查MongoDB服务状态
ssh 192.168.1.101 “systemctl status mongod”
# 输出:
mongod.service – MongoDB Database Server
Loaded: loaded (/usr/lib/systemd/system/mongod.service; enabled; vendor preset: disabled)
Active: inactive (dead)
# 启动MongoDB服务
ssh 192.168.1.101 “systemctl start mongod”
# 验证复制集状态
mongosh –eval “rs.status()”
# 输出:
{
“set”: “fgedu-repl”,
“date”: ISODate(“2026-04-08T10:25:00.000Z”),
“myState”: 1,
“term”: NumberLong(1),
“syncSource”: “”,
“syncSourceHost”: “”,
“heartbeatIntervalMillis”: NumberLong(2000),
“majorityVoteCount”: 2,
“writeMajorityCount”: 2,
“votingMembersCount”: 3,
“writableVotingMembersCount”: 3,
“optimes”: {
“lastAppliedOpTime”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“lastDurableOpTime”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“lastAppliedWallTime”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastDurableWallTime”: ISODate(“2026-04-08T10:25:00.000Z”)
},
“lastStableRecoveryTimestamp”: Timestamp(1617822300, 1),
“members”: [
{
“_id”: 0,
“name”: “192.168.1.100:27017”,
“health”: 1,
“state”: 1,
“stateStr”: “PRIMARY”,
“uptime”: 1500,
“optime”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“optimeDate”: ISODate(“2026-04-08T10:25:00.000Z”),
“syncSource”: “”,
“syncSourceHost”: “”,
“infoMessage”: “”,
“electionTime”: Timestamp(1617820800, 1),
“electionDate”: ISODate(“2026-04-08T10:00:00.000Z”),
“configVersion”: 1,
“configTerm”: 1,
“self”: true,
“lastHeartbeatMessage”: “”
},
{
“_id”: 1,
“name”: “192.168.1.101:27017”,
“health”: 1,
“state”: 2,
“stateStr”: “SECONDARY”,
“uptime”: 300,
“optime”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“optimeDurable”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“optimeDate”: ISODate(“2026-04-08T10:25:00.000Z”),
“optimeDurableDate”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastAppliedWallTime”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastDurableWallTime”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastHeartbeat”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastHeartbeatRecv”: ISODate(“2026-04-08T10:25:00.000Z”),
“pingMs”: NumberLong(0),
“syncSource”: “192.168.1.100:27017”,
“syncSourceHost”: “192.168.1.100:27017”,
“infoMessage”: “”,
“configVersion”: 1,
“configTerm”: 1
},
{
“_id”: 2,
“name”: “192.168.1.102:27017”,
“health”: 1,
“state”: 2,
“stateStr”: “SECONDARY”,
“uptime”: 1500,
“optime”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“optimeDurable”: { “ts”: Timestamp(1617822300, 1), “t”: NumberLong(1) },
“optimeDate”: ISODate(“2026-04-08T10:25:00.000Z”),
“optimeDurableDate”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastAppliedWallTime”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastDurableWallTime”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastHeartbeat”: ISODate(“2026-04-08T10:25:00.000Z”),
“lastHeartbeatRecv”: ISODate(“2026-04-08T10:25:00.000Z”),
“pingMs”: NumberLong(0),
“syncSource”: “192.168.1.100:27017”,
“syncSourceHost”: “192.168.1.100:27017”,
“infoMessage”: “”,
“configVersion”: 1,
“configTerm”: 1
}
],
“ok”: 1,
“$clusterTime”: {
“clusterTime”: Timestamp(1617822300, 1),
“signature”: {
“hash”: BinData(0, “AAAAAAAAAAAAAAAAAAAAAAAAAAA=”),
“keyId”: NumberLong(0)
}
},
“operationTime”: Timestamp(1617822300, 1)
}
风哥提示:网络故障处理的关键是检查网络连接、防火墙配置和服务状态。
Part05-风哥经验总结与分享
5.1 故障处理最佳实践
风哥建议的故障处理最佳实践:
- 定期备份:定期备份数据,确保数据安全
- 监控告警:设置监控系统,及时发现异常
- 高可用架构:部署复制集或分片集群,提高系统可用性
- 故障演练:定期进行故障演练,提高应对故障的能力
- 文档记录:记录故障处理过程和解决方案
- 快速响应:快速响应故障,减少业务中断时间
- 根本原因分析:分析故障根本原因,制定预防措施
- 持续改进:持续改进故障处理流程和预防措施
学习交流加群风哥QQ113257174
5.2 常见故障与解决方案
常见故障与解决方案:
- 问题:MongoDB服务无法启动
- 解决方案:检查日志,修复数据文件,检查配置文件
- 问题:数据文件损坏
- 解决方案:从备份恢复,使用mongodump/mongorestore
- 问题:复制集成员无法同步
- 解决方案:检查网络连接,检查防火墙配置,重新初始化复制集
- 问题:查询速度慢
- 解决方案:创建合适的索引,优化查询语句,增加硬件资源
- 问题:内存使用高
- 解决方案:调整WiredTiger缓存大小,清理无用数据
更多视频教程www.fgedu.net.cn
注意事项
- 定期备份数据,确保数据安全
- 设置监控系统,及时发现异常
- 部署高可用架构,提高系统可用性
- 定期进行故障演练,提高应对故障的能力
- 记录故障处理过程和解决方案
- 快速响应故障,减少业务中断时间
- 分析故障根本原因,制定预防措施
- 持续改进故障处理流程和预防措施
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
