1. Troubleshooting Basics
Troubleshooting a NetBackup (NBU) backup system starts with a small set of basic methods and tools that let you locate and resolve problems quickly.
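1.1 Log File Locations
NBU log files are the primary evidence when troubleshooting. The main log locations are:
# List the top-level log directory
# ls -la /usr/openv/netbackup/logs/
# List all log subdirectories
# find /usr/openv/netbackup/logs -type d | sort
# View recent job logs
# ls -la /usr/openv/netbackup/logs/user_ops/root/logs/
# View the bpdbm (database manager) logs
# ls -la /usr/openv/netbackup/logs/bpdbm/
# View the bpbrm (backup and restore manager) logs
# ls -la /usr/openv/netbackup/logs/bpbrm/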
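When a problem has just occurred, the most recently written files are usually the ones worth reading first. The following is a minimal sketch of a helper that lists the newest files under a log directory. The function name recent_logs and its arguments are hypothetical and not part of NetBackup, and the sketch assumes file names without newlines.

```shell
#!/bin/sh
# recent_logs DIR [COUNT]
# Print the COUNT most recently modified regular files under DIR, newest first.
# Hypothetical helper, not a NetBackup command.
recent_logs() {
    dir=$1
    count=${2:-10}
    [ -d "$dir" ] || { echo "recent_logs: no such directory: $dir" >&2; return 1; }
    # ls -t sorts by modification time; -d prints the names exactly as find passes them.
    # Caveat: if the file list is too long for one ls invocation, find runs ls
    # several times and the ordering is then only correct within each batch.
    find "$dir" -type f -exec ls -dt {} + 2>/dev/null | head -n "$count"
}
```

For example, recent_logs /usr/openv/netbackup/logs 20 would show the 20 most recently modified log files across all daemon subdirectories.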
1.2 Common Troubleshooting Commands
The following commands are commonly used when troubleshooting NBU:
# Check the client name and IP address as seen by the master server
# /usr/openv/netbackup/bin/bpclntcmd -pn
client_name = client1, client_ip = 192.168.1.10
client_name = client1.fgedu.net, client_ip = 192.168.1.10
# Test connectivity to the master server
# /usr/openv/netbackup/bin/bpclntcmd -hosts master_server
master_server 192.168.1.1
# Check the NBU version
# /usr/openv/netbackup/bin/nbuversion
NetBackup 10.1.1
Build info: 10.1.1.0_202309151200
# View job status
# /usr/openv/netbackup/bin/admincmd/bpdbjobs -report
Job ID Type Policy Schedule Client State Status Start Time End Time Duration
----------------------------------------------------------------------------------------------------
12345 BACKUP Oracle Full oracle1 EXIT STATUS 0 2023-03-30 22:00:00 2023-03-30 22:30:00 0:30:00
12346 BACKUP SQL Full sql1 EXIT STATUS 0 2023-03-30 23:00:00 2023-03-30 23:45:00 0:45:00
12347 BACKUP File Full client1 EXIT STATUS 196 2023-03-31 00:00:00 2023-03-31 00:05:00 0:05:00
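In a long job report, only the jobs with a non-zero status need attention. The sketch below filters them out. It assumes the exact column layout of the sample report above (job ID in field 1, client in field 5, then the words EXIT STATUS followed by the code); real report layouts vary with version and options, so the field numbers would need adjusting. The function name failed_jobs is hypothetical.

```shell
#!/bin/sh
# failed_jobs
# Read a job report on stdin and print "job_id client status" for every job
# whose status code is non-zero.
# Assumption: whitespace-separated fields laid out as in the sample report:
#   $1 = job ID, $5 = client, $6 $7 $8 = "EXIT STATUS <code>"
failed_jobs() {
    awk '$1 ~ /^[0-9]+$/ && $6 == "EXIT" && $7 == "STATUS" && $8 != 0 {
        print $1, $5, $8
    }'
}
```

Piping the report command into failed_jobs would, for the sample above, leave only job 12347 on client1 with status 196.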
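2. Common Failure Types and How to Troubleshoot Them
Common NBU failures include failed backups, failed restores, and services that will not start. The sections below describe how to troubleshoot each.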
2.1 Backup Failures
Backup failure is the most common failure type and can have many causes.
# View the details of the failed job
# /usr/openv/netbackup/bin/admincmd/bpdbjobs -jobid 12347 -details
Job ID: 12347
Job Type: BACKUP
Policy: File
Schedule: Full
Client: client1
State: EXIT STATUS 196
Start time: 2023-03-31 00:00:00
End time: 2023-03-31 00:05:00
Status: client backup failed 196: client connection rejected
# Check the connection between the client and the master server
# /usr/openv/netbackup/bin/bpclntcmd -pn -client client1
client_name = client1, client_ip = 192.168.1.10
client_name = client1.fgedu.net, client_ip = 192.168.1.10
# Check the client service status (the [b] keeps grep from matching itself)
# ssh client1 "ps aux | grep '[b]pbkar'"
bpbkar32 1234 0.0 0.1 12345 6789 ? S 00:00 0:00 /usr/openv/netbackup/bin/bpbkar32
# Check the firewall settings
# iptables -L -n | grep 13782
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:13782
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:13783
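Listing firewall rules shows what is configured on one host, but not whether a connection actually gets through from the other side. As a complementary check, the sketch below attempts a real TCP connection. It assumes bash and the coreutils timeout command are installed; the function name port_open is hypothetical, and the ports to test are the ones from the firewall output above.

```shell
#!/bin/sh
# port_open HOST PORT [TIMEOUT_SECONDS]
# Return 0 if a TCP connection to HOST:PORT succeeds within the timeout,
# non-zero otherwise (refused, filtered, unresolvable, or tools missing).
# Assumption: bash (for its /dev/tcp redirection) and coreutils timeout exist.
port_open() {
    host=$1
    port=$2
    secs=${3:-3}
    # Inside bash -c, $0 is the host and $1 is the port passed after the script.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (run from the master server toward the client):
#   for p in 13782 13783; do
#       port_open client1 "$p" && echo "$p open" || echo "$p blocked"
#   done
```

A port that is allowed in iptables but reported as blocked here points to another device in the path, or to no daemon listening on the client.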
2.2 Restore Failures
Restore failures can be caused by corrupted backup data, permission problems, and similar issues.
# View the details of the failed restore job
# /usr/openv/netbackup/bin/admincmd/bpdbjobs -jobid 12348 -details
Job ID: 12348
Job Type: RESTORE
Policy: File
Client: client1
State: EXIT STATUS 2840
Start time: 2023-03-31 10:00:00
End time: 2023-03-31 10:05:00
Status: restore failed 2840: no images were found for the specified client, policy, and schedule
# Check the backup images
# /usr/openv/netbackup/bin/admincmd/bpimagelist -client client1 -policy File -l
# Check the permissions on the restore directory
# ls -la /restore/path/
drwxr-xr-x 2 root root 4096 Mar 31 10:00 .
drwxr-xr-x 3 root root 4096 Mar 31 09:00 ..
# Check the restore log
# cat /usr/openv/netbackup/logs/user_ops/root/logs/12348.log
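Since permission problems are one of the causes named above, it can help to test the restore target before starting the job rather than reading ls output by eye. The following is a rough sketch; check_restore_target is a hypothetical helper, and it tests access for the user who runs it, which may differ from the account the restore actually writes as.

```shell
#!/bin/sh
# check_restore_target DIR
# Verify that DIR exists and that the current user can write into it, then
# print the free space so an undersized target is noticed before the restore.
# Return codes: 0 = usable, 1 = not a directory, 2 = not writable.
check_restore_target() {
    target=$1
    if [ ! -d "$target" ]; then
        echo "not a directory: $target" >&2
        return 1
    fi
    # Creating files in a directory needs both write and search (x) permission.
    if [ ! -w "$target" ] || [ ! -x "$target" ]; then
        echo "not writable by $(id -un): $target" >&2
        return 2
    fi
    # df -Pk prints one POSIX-format line per filesystem; field 4 is available KB.
    df -Pk "$target" | awk 'NR == 2 { print "available_kb=" $4 }'
}
```

Running check_restore_target /restore/path/ before the restore separates target-side problems from the image-side problem shown in the job details.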
2.3 Services Fail to Start
NBU services may fail to start because of configuration errors, port conflicts, and similar causes.
# Stop and restart the NBU services
# /usr/openv/netbackup/bin/bp.kill_all
# /usr/openv/netbackup/bin/bp.start_all
Starting NetBackup services:
NetBackup master server daemon (nbmaster) failed to start
NetBackup database manager (bpdbm) failed to start
NetBackup request daemon (bprd) failed to start
# Check the nbmaster log
# tail -50 /usr/openv/netbackup/logs/nbmaster/log.1
# Check whether the port is already in use
# netstat -tulpn | grep 13782
# Check the database status
# /usr/openv/db/bin/nbdb_admin -info
Database server is not responding.
# Start the database
# /usr/openv/db/bin/nbdb_start
Starting NetBackup database server... done.
# Start the services again
# /usr/openv/netbackup/bin/bp.start_all
Starting NetBackup services:
NetBackup master server daemon (nbmaster) started
NetBackup database manager (bpdbm) started
NetBackup request daemon (bprd) started
3. Recovery Procedures
When the NBU backup system fails, recover it using the following steps:
3.1 Recovering from a Service Failure
# 1. Stop the NBU services
# /usr/openv/netbackup/bin/bp.kill_all
# 2. Check the database status
# /usr/openv/db/bin/nbdb_admin -info
Database server is not responding.
# 3. Start the database
# /usr/openv/db/bin/nbdb_start
Starting NetBackup database server... done.
# 4. Verify the database connection
# /usr/openv/db/bin/nbdb_admin -info
Database server is alive and well.
# 5. Start the NBU services
# /usr/openv/netbackup/bin/bp.start_all
Starting NetBackup services:
NetBackup master server daemon (nbmaster) started
NetBackup database manager (bpdbm) started
NetBackup request daemon (bprd) started
NetBackup media manager (nbmm) started
NetBackup storage lifecycle manager (nbstl) started
# 6. Verify the service status
# /usr/openv/netbackup/bin/bpclntcmd -pn
client_name = master_server, client_ip = 192.168.1.1
client_name = master_server.fgedu.net, client_ip = 192.168.1.1
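The steps above only make sense in order: starting the services before the database is up just repeats the failure. One way to enforce that when scripting the procedure is a small wrapper that labels each step and stops at the first one that fails. This is a sketch only; run_step is a hypothetical helper, and the commands in the usage comment are the ones listed in this section.

```shell
#!/bin/sh
# run_step DESCRIPTION COMMAND [ARGS...]
# Print the step description, run the command, and on failure report the
# exit code on stderr and return it so that a && chain stops there.
run_step() {
    desc=$1
    shift
    printf '==> %s\n' "$desc"
    "$@" || {
        rc=$?
        printf 'FAILED (exit %s): %s\n' "$rc" "$desc" >&2
        return "$rc"
    }
}

# Example, chaining two of the steps from this section:
#   run_step "start database" /usr/openv/db/bin/nbdb_start &&
#   run_step "start NBU services" /usr/openv/netbackup/bin/bp.start_all
```

Because the steps are joined with &&, a failed database start leaves the services untouched and the failing step is named in the output.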
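3.2 Restoring Backup Data
# 1. Find the available backup images
# /usr/openv/netbackup/bin/admincmd/bpimagelist -client client1 -policy File -s 03/25/2023 -e 03/31/2023 -l
# 2. Run the restore (-R takes a rename file that maps the original path to the restore path; /tmp/restore_rename.txt is an example name)
# echo "change /original/path/ to /restore/path/" > /tmp/restore_rename.txt
# /usr/openv/netbackup/bin/bprestore -C client1 -t 0 -R /tmp/restore_rename.txt -s 03/29/2023 -e 03/30/2023 /original/path/
# 3. Monitor the restore job
# /usr/openv/netbackup/bin/admincmd/bpdbjobs -jobid 12349 -progress
Job ID: 12349
Job Type: RESTORE
State: ACTIVE
Progress: 45% completed
# 4. Verify the restore result
# ls -la /restore/path/
# diff -r /restore/path/ /original/path/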
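The final verification step can be wrapped so that it gives a clear pass or fail result that a script can act on. The sketch below is one way to do that with diff. The name verify_restore is hypothetical, and the comparison is only meaningful while the original directory still exists and has not changed since the backup was taken.

```shell
#!/bin/sh
# verify_restore RESTORED_DIR ORIGINAL_DIR
# Compare two directory trees recursively.
# Return codes: 0 = identical, 1 = differences found, 2 = a directory is missing.
verify_restore() {
    restored=$1
    original=$2
    for d in "$restored" "$original"; do
        [ -d "$d" ] || { echo "missing directory: $d" >&2; return 2; }
    done
    # diff -rq prints one line per differing or one-sided file and exits
    # non-zero if there is any difference.
    if diff -rq "$restored" "$original"; then
        echo "restore matches original"
    else
        echo "restore differs from original" >&2
        return 1
    fi
}
```

Usage would be verify_restore /restore/path/ /original/path/ with the paths from the restore command above.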
3.3 Recovering from Database Corruption
# 1. Stop the NBU services
# /usr/openv/netbackup/bin/bp.kill_all
# 2. Check the database status
# /usr/openv/db/bin/nbdb_admin -info
Database server is not responding.
# 3. Attempt to repair the database
# /usr/openv/db/bin/nbdb_admin -repair
Repairing NetBackup database... done.
# 4. Verify the database status
# /usr/openv/db/bin/nbdb_admin -info
Database server is alive and well.
# 5. Start the NBU services
# /usr/openv/netbackup/bin/bp.start_all
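4. Preventive Measures
To reduce the number of NBU backup system failures, take the following preventive measures:
4.1 Regular Maintenance
# Delete log files older than 30 days
# find /usr/openv/netbackup/logs -name "*.log*" -mtime +30 -delete
# Check the database status regularly
# /usr/openv/db/bin/nbdb_admin -info
Database server is alive and well.
# Check the storage devices regularly
# /usr/openv/volmgr/bin/vmoprcmd -d list
# Review the backup policies regularly
# /usr/openv/netbackup/bin/admincmd/bppllist -U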
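A find command that ends in a delete action is easy to get wrong, and deleted logs cannot be recovered for a later investigation. The sketch below wraps the same find pattern used above so that it only lists matches by default and deletes when explicitly asked. The function name prune_logs and the --delete switch are hypothetical.

```shell
#!/bin/sh
# prune_logs DIR DAYS [--delete]
# List files under DIR matching *.log* that were last modified more than
# DAYS days ago. Without --delete this is a dry run; with it, the files
# are removed.
prune_logs() {
    dir=$1
    days=$2
    mode=${3:-}
    [ -d "$dir" ] || { echo "prune_logs: no such directory: $dir" >&2; return 1; }
    case $days in
        ''|*[!0-9]*)
            echo "prune_logs: DAYS must be a non-negative integer" >&2
            return 1 ;;
    esac
    if [ "$mode" = "--delete" ]; then
        find "$dir" -type f -name '*.log*' -mtime +"$days" -exec rm -f {} +
    else
        find "$dir" -type f -name '*.log*' -mtime +"$days" -print
    fi
}
```

Running prune_logs /usr/openv/netbackup/logs 30 first shows what would be removed; adding --delete as a third argument then performs the cleanup.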
4.2 Monitoring Setup
# Configure e-mail notification
# /usr/openv/netbackup/bin/admincmd/nbsetconfig
Enter the following:
NOTIFICATION_EMAIL = admin@fgedu.net
NOTIFICATION_LEVEL = 3
# Configure SNMP traps
# /usr/openv/netbackup/bin/admincmd/nbsetconfig
Enter the following:
SNMP_TRAP_DESTINATION = 192.168.1.100
SNMP_TRAP_LEVEL = 2
# Configure system monitoring (entries in /etc/crontab need a user field, here root)
# vi /etc/crontab
0 0 * * * root /usr/openv/netbackup/bin/admincmd/bpdbjobs -summary | mail -s "NBU Backup Summary" admin@fgedu.net
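4.3 Backup Verification
# Run a test restore (-R takes a rename file that maps the original path to the test path; /tmp/test_rename.txt is an example name)
# echo "change /original/path/ to /test/restore/" > /tmp/test_rename.txt
# /usr/openv/netbackup/bin/bprestore -C test_client -t 0 -R /tmp/test_rename.txt -s 03/29/2023 -e 03/30/2023 /original/path/
# Verify backup integrity
# /usr/openv/netbackup/bin/admincmd/bpverify -backupid client1_1234567890
# Count the backup images
# /usr/openv/netbackup/bin/admincmd/bpimagelist -client client1 -policy File -l | wc -l
10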
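5. Failure Case Studies
The following are several real-world case studies of NBU backup system failures:
5.1 Case 1: Backup Job Fails with Status Code 196
Symptom: A client backup job fails with status code 196: client connection rejected
Troubleshooting steps:
# 1. Check the client service status
# ssh client1 "ps aux | grep '[b]pbkar'"
# 2. Check communication between the client and the master server
# /usr/openv/netbackup/bin/bpclntcmd -pn -client client1
# 3. Check the firewall settings
# iptables -L -n | grep 13782
# 4. Check the client configuration
# cat /usr/openv/netbackup/bp.conf
Resolution: The client firewall was not allowing port 13782. After a firewall rule was added, the problem was resolved.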
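When reviewing the client configuration in a case like this, the entries that matter most are the SERVER lines in bp.conf, because they name the servers the client accepts connections from, with the first entry being the master server. The sketch below extracts just those entries. The function name bpconf_servers is hypothetical, and the parsing assumes the usual "KEY = value" layout of bp.conf.

```shell
#!/bin/sh
# bpconf_servers [FILE]
# Print the host name from every "SERVER = host" line of a bp.conf file,
# in file order. FILE defaults to the standard client location.
bpconf_servers() {
    conf=${1:-/usr/openv/netbackup/bp.conf}
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    # Anchor on the SERVER keyword at the start of the line so that entries
    # such as MEDIA_SERVER or CLIENT_NAME are not picked up.
    sed -n 's/^[[:space:]]*SERVER[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$conf"
}
```

If the master server's name as the client resolves it is missing from this list, the client configuration is worth correcting before looking further at the network.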
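5.2 Case 2: Restore Job Fails with Status Code 2840
Symptom: A restore job fails with status code 2840: no images were found for the specified client, policy, and schedule
Troubleshooting steps:
# 1. Check the backup images
# /usr/openv/netbackup/bin/admincmd/bpimagelist -client client1 -policy File -l
# 2. Check the backup policy configuration
# /usr/openv/netbackup/bin/admincmd/bppllist File -U
# 3. Check the permissions on the restore directory
# ls -la /restore/path/
Resolution: The backup policy was misconfigured. After the policy was corrected and the backup was rerun, the restore succeeded.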
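5.3 Case 3: NBU Services Fail to Start
Symptom: The NBU services fail to start with the error "NetBackup database manager (bpdbm) failed to start"
Troubleshooting steps:
# 1. Check the database status
# /usr/openv/db/bin/nbdb_admin -info
# 2. Check the database log
# tail -50 /usr/openv/db/log/server.log
# 3. Attempt to repair the database
# /usr/openv/db/bin/nbdb_admin -repair
Resolution: The database was corrupted. After the repair was run, the services started successfully.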
This article was compiled and published by 风哥教程 for study and testing purposes only. When reposting, please credit the source: http://www.fgedu.net.cn/10327.html
