
IT Tutorial FG203: NBU Backup System Troubleshooting and Recovery

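1. Troubleshooting Basics

Troubleshooting an NBU (NetBackup) backup system requires a working knowledge of the basic diagnostic methods and tools, so that problems can be located and resolved quickly.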

1.1 Log File Locations

NBU log files are the primary evidence when troubleshooting. The main locations are:

# Main NBU log directory
# ls -la /usr/openv/netbackup/logs/

# List all log subdirectories
# find /usr/openv/netbackup/logs -type d | sort

# View recent progress logs for operations started by root
# ls -la /usr/openv/netbackup/logs/user_ops/root/logs/

# View bpdbm logs (database manager)
# ls -la /usr/openv/netbackup/logs/bpdbm/

# View bpbrm logs (backup and restore manager)
# ls -la /usr/openv/netbackup/logs/bpbrm/
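
With many log subdirectories it helps to see which files changed most recently. The following is a minimal sketch, assuming GNU find (for -printf); the function name newest_logs is hypothetical and not part of NBU.

```shell
# newest_logs DIR [COUNT]: print the COUNT most recently modified files
# under DIR, newest first (default 10). Requires GNU find for -printf.
newest_logs() {
    dir=$1
    count=${2:-10}
    # Emit "mtime path", sort numerically in descending order, then drop the mtime.
    find "$dir" -type f -printf '%T@ %p\n' | sort -rn | head -n "$count" | cut -d' ' -f2-
}
```

For example, newest_logs /usr/openv/netbackup/logs 20 would show the twenty files written last, which is usually where a failing daemon left its trace.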

1.2 Common Troubleshooting Commands

The following commands are commonly used when troubleshooting NBU:

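# Ask the master server which name and address it sees for this host (a connectivity and name-resolution check, not a service status check)
# /usr/openv/netbackup/bin/bpclntcmd -pn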
client_name = client1, client_ip = 192.168.1.10
client_name = client1.fgedu.net, client_ip = 192.168.1.10

# Check that the master server's host name resolves
# /usr/openv/netbackup/bin/bpclntcmd -hn master_server
master_server 192.168.1.1

# View the NBU version
# cat /usr/openv/netbackup/bin/version
NetBackup 10.1.1
Build info: 10.1.1.0_202309151200

# View job status
# /usr/openv/netbackup/bin/admincmd/bpdbjobs -report
Job ID  Type    Policy  Schedule  Client   State        Status  Start Time           End Time             Duration
------  ------  ------  --------  -------  -----------  ------  -------------------  -------------------  --------
12345   BACKUP  Oracle  Full      oracle1  EXIT STATUS  0       2023-03-30 22:00:00  2023-03-30 22:30:00  0:30:00
12346   BACKUP  SQL     Full      sql1     EXIT STATUS  0       2023-03-30 23:00:00  2023-03-30 23:45:00  0:45:00
12347   BACKUP  File    Full      client1  EXIT STATUS  196     2023-03-31 00:00:00  2023-03-31 00:05:00  0:05:00
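
When the job list is long, filtering for failed jobs is quicker than reading the table. The sketch below works on the comma-separated form produced by bpdbjobs -report -most_columns. It assumes the leading columns are jobid, jobtype, state, status, policy, schedule, client and that state 3 means "done"; confirm that layout against your release. The function name failed_jobs is hypothetical.

```shell
# failed_jobs: read comma-separated job records on stdin and print
# "jobid status policy client" for finished jobs with a non-zero status.
# Assumed column order: jobid,jobtype,state,status,policy,schedule,client,...
# Assumed state value: 3 = done.
failed_jobs() {
    awk -F, '$3 == 3 && $4 != 0 { print $1, $4, $5, $7 }'
}
```

A possible invocation is /usr/openv/netbackup/bin/admincmd/bpdbjobs -report -most_columns | failed_jobs.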

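2. Common Fault Types and How to Diagnose Them

Common NBU faults include backup failures, restore failures, and services that fail to start. The diagnostic steps for each are described below.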

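2.1 Backup Failures

Backup failures are the most common fault type and can have many causes. The example below follows a job that ended with a non-zero status. Note that the NetBackup status code reference lists 196 as "client backup was not attempted because backup window closed", while client connection problems are usually reported as 58 (can't connect to client) or 59 (access to the client was not allowed), so match the code from your own job details against the reference before working through the connection checks.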

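# View details of the failed backup job (the raw output is comma-separated; the fields are shown below in labelled form)
# /usr/openv/netbackup/bin/admincmd/bpdbjobs -report -all_columns -jobid 12347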
Job ID: 12347
Job Type: BACKUP
Policy: File
Schedule: Full
Client: client1
State: EXIT STATUS 196
Start time: 2023-03-31 00:00:00
End time: 2023-03-31 00:05:00
Status: client backup failed 196: client connection rejected

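# Check the connection between the client and the master server (run on client1; from the master, admincmd/bptestbpcd -client client1 tests the reverse direction)
# /usr/openv/netbackup/bin/bpclntcmd -pn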
client_name = client1, client_ip = 192.168.1.10
client_name = client1.fgedu.net, client_ip = 192.168.1.10

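# Check client processes (bpbkar runs only while a backup is active; bpcd and vnetd are the long-running client daemons)
# ssh client1 "ps aux | grep bpbkar"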
bpbkar32 1234 0.0 0.1 12345 6789 ? S 00:00 0:00 /usr/openv/netbackup/bin/bpbkar32

# Check the firewall rules for the NetBackup ports
# iptables -L -n | grep -E 'dpt:(13782|13783)'
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:13782
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:13783
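
The firewall check can be turned around to report which required ports have no ACCEPT rule. This is a sketch only: it matches single-port "dpt:" rules as printed by iptables -L -n and does not understand multiport or port-range rules. The function name ports_without_accept is hypothetical. Ports 1556 (PBX) and 13724 (vnetd) are included in the usage example because current NetBackup releases rely on them in addition to 13782 (bpcd).

```shell
# ports_without_accept PORT...: read `iptables -L -n` text on stdin and print
# each listed port that has no ACCEPT rule with a matching "dpt:" field.
ports_without_accept() {
    rules=$(cat)
    for port in "$@"; do
        # Anchor the port so that 1378 does not match dpt:13782.
        if ! printf '%s\n' "$rules" | grep -Eq "^ACCEPT .*dpt:$port( |\$)"; then
            echo "$port"
        fi
    done
}
```

A possible invocation is iptables -L -n | ports_without_accept 1556 13724 13782; an empty result means every listed port has a rule.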

2.2 Restore Failures

Restore failures can be caused by damaged backup data, permission problems, and similar issues.

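# View details of the restore job (the raw output is comma-separated; the fields are shown below in labelled form)
# /usr/openv/netbackup/bin/admincmd/bpdbjobs -report -all_columns -jobid 12348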
Job ID: 12348
Job Type: RESTORE
Policy: File
Client: client1
State: EXIT STATUS 2840
Start time: 2023-03-31 10:00:00
End time: 2023-03-31 10:05:00
Status: restore failed 2840: no images were found for the specified client, policy, and schedule

# Check the backup images
# /usr/openv/netbackup/bin/admincmd/bpimagelist -client client1 -policy File -l

# Check the permissions of the restore directory
# ls -la /restore/path/
drwxr-xr-x 2 root root 4096 Mar 31 10:00 .
drwxr-xr-x 3 root root 4096 Mar 31 09:00 ..

# Check the restore log
# cat /usr/openv/netbackup/logs/user_ops/root/logs/12348.log
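
Reading the ls output by eye only shows permission bits. A more direct test is to try writing to the restore target before starting the job. The following is a minimal sketch; the function name check_restore_target and the probe file name are hypothetical.

```shell
# check_restore_target DIR: return 0 if DIR exists and a file can be created
# in it, 1 if DIR is missing, 2 if it cannot be written to.
check_restore_target() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    # Creating a scratch file tests the effective write access directly.
    if ! probe=$(mktemp "$dir/.restore_probe.XXXXXX" 2>/dev/null); then
        echo "cannot write to: $dir" >&2
        return 2
    fi
    rm -f -- "$probe"
    return 0
}
```

Run it as the user that performs the restore, for example check_restore_target /restore/path/.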

2.3 Services Fail to Start

NBU services may fail to start because of configuration errors, port conflicts, and similar causes.

# Restart the NBU services and watch for startup failures
# /usr/openv/netbackup/bin/bp.kill_all
# /usr/openv/netbackup/bin/bp.start_all
Starting NetBackup services:
NetBackup master server daemon (nbmaster) failed to start
NetBackup database manager (bpdbm) failed to start
NetBackup request daemon (bprd) failed to start

# Check the nbmaster log
# tail -50 /usr/openv/netbackup/logs/nbmaster/log.1

# Check whether the port is already in use
# netstat -tulpn | grep 13782

# Check the database status
# /usr/openv/db/bin/nbdb_ping
Database server is not responding.

# Start the database server
# /usr/openv/db/bin/nbdbms_start_server
Starting NetBackup database server… done.

# Start the services again
# /usr/openv/netbackup/bin/bp.start_all
Starting NetBackup services:
NetBackup master server daemon (nbmaster) started
NetBackup database manager (bpdbm) started
NetBackup request daemon (bprd) started

3. Recovery Procedures

When the NBU backup system fails, recover it with the following procedures:

3.1 Recovering from a Service Failure

# 1. Stop all NBU services
# /usr/openv/netbackup/bin/bp.kill_all

# 2. Check the database status
# /usr/openv/db/bin/nbdb_ping
Database server is not responding.

# 3. Start the database server
# /usr/openv/db/bin/nbdbms_start_server
Starting NetBackup database server… done.

# 4. Verify the database connection
# /usr/openv/db/bin/nbdb_ping
Database server is alive and well.

# 5. Start the NBU services
# /usr/openv/netbackup/bin/bp.start_all
Starting NetBackup services:
NetBackup master server daemon (nbmaster) started
NetBackup database manager (bpdbm) started
NetBackup request daemon (bprd) started
NetBackup media manager (nbmm) started
NetBackup storage lifecycle manager (nbstl) started

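# 6. Verify name resolution through the master server (use /usr/openv/netbackup/bin/bpps -x to list the running daemons)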
# /usr/openv/netbackup/bin/bpclntcmd -pn
client_name = master_server, client_ip = 192.168.1.1
client_name = master_server.fgedu.net, client_ip = 192.168.1.1
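
In a scripted version of this procedure, the database may need some time after it is started before a status check succeeds. A small polling helper avoids starting the remaining services too early. This is a generic sketch; the function name wait_until is hypothetical, and it assumes that the status command you pass in exits non-zero while the database is unavailable.

```shell
# wait_until TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds between attempts. Returns 0 on success, 1 otherwise.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$attempt" -lt "$tries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example, wait_until 12 5 /usr/openv/db/bin/nbdb_ping && /usr/openv/netbackup/bin/bp.start_all would wait up to about a minute before starting the services, under the exit-code assumption stated above.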

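3.2 Restoring Backup Data

# 1. List the available backup images
# /usr/openv/netbackup/bin/admincmd/bpimagelist -client client1 -policy File -d 03/25/2023 -e 03/31/2023 -l

# 2. Run the restore (-R takes a rename file that maps the original path to the restore path; /tmp/rename.txt is an example name)
# echo "change /original/path/ to /restore/path/" > /tmp/rename.txt
# /usr/openv/netbackup/bin/bprestore -C client1 -t 0 -R /tmp/rename.txt -s 03/29/2023 -e 03/30/2023 /original/path/

# 3. Monitor the restore job (fields shown below in labelled form)
# /usr/openv/netbackup/bin/admincmd/bpdbjobs -report -jobid 12349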
Job ID: 12349
Job Type: RESTORE
State: ACTIVE
Progress: 45% completed

# 4. Verify the restore result
# ls -la /restore/path/
# diff -r /restore/path/ /original/path/
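
diff -r compares against the live original, which may have changed since the backup was taken. An alternative is to compare checksum manifests, which can also be saved at backup time and checked later. The sketch below assumes sha256sum from GNU coreutils is available; the function names tree_checksums and same_content are hypothetical.

```shell
# tree_checksums DIR: print "sha256  ./relative/path" for every file under DIR,
# sorted by path so that two trees can be compared line by line.
tree_checksums() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# same_content A B: succeed when both trees hold the same files with the same content.
same_content() {
    [ "$(tree_checksums "$1")" = "$(tree_checksums "$2")" ]
}
```

For example, same_content /restore/path/ /original/path/ && echo "restore verified". Ownership, permissions, and empty directories are not compared by this sketch.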

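3.3 Recovering from Database Corruption

# 1. Stop the NBU services
# /usr/openv/netbackup/bin/bp.kill_all

# 2. Check the database status
# /usr/openv/db/bin/nbdb_ping
Database server is not responding.

# 3. Attempt a database repair (verify that -repair exists in your release; the documented integrity check is nbdb_admin -validate -full, and a damaged NBDB is normally recovered from a catalog backup)
# /usr/openv/db/bin/nbdb_admin -repair
Repairing NetBackup database… done.

# 4. Verify the database status
# /usr/openv/db/bin/nbdb_ping
Database server is alive and well.

# 5. Start the NBU services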
# /usr/openv/netbackup/bin/bp.start_all

4. Preventive Measures

The following measures reduce the likelihood of NBU backup system failures:

4.1 Regular Maintenance

# Clean up old log files regularly
# find /usr/openv/netbackup/logs -name "*.log*" -mtime +30 -delete

# Check the database status regularly
# /usr/openv/db/bin/nbdb_ping
Database server is alive and well.

# Check the storage devices regularly (drive status)
# /usr/openv/volmgr/bin/vmoprcmd -d ds

# Review the backup policies regularly
# /usr/openv/netbackup/bin/admincmd/bppllist -allpolicies -U
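
Because find with -delete removes files immediately, it can be safer to list the candidates first and delete in a second step. This is a minimal sketch, assuming log file names contain no newlines; the function names old_logs and purge_old_logs are hypothetical.

```shell
# old_logs DIR DAYS: list files under DIR matching *.log* that were last
# modified more than DAYS days ago. Nothing is deleted.
old_logs() {
    find "$1" -type f -name '*.log*' -mtime +"$2" -print
}

# purge_old_logs DIR DAYS: delete the files reported by old_logs and
# print how many were removed.
purge_old_logs() {
    old_logs "$1" "$2" | while IFS= read -r f; do
        rm -f -- "$f" && echo "$f"
    done | wc -l
}
```

Review the output of old_logs /usr/openv/netbackup/logs 30 first, then run purge_old_logs with the same arguments once the list looks right.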

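4.2 Monitoring Setup

# Configure email notification (confirm the option names in this section against the configuration reference for your release before applying them)
# /usr/openv/netbackup/bin/admincmd/nbsetconfig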
Enter the following:
NOTIFICATION_EMAIL = admin@fgedu.net
NOTIFICATION_LEVEL = 3

# Configure SNMP traps
# /usr/openv/netbackup/bin/admincmd/nbsetconfig
Enter the following:
SNMP_TRAP_DESTINATION = 192.168.1.100
SNMP_TRAP_LEVEL = 2

# Configure a scheduled summary (entries in /etc/crontab need a user field, here root)
# vi /etc/crontab
0 0 * * * root /usr/openv/netbackup/bin/admincmd/bpdbjobs -summary | mail -s "NBU Backup Summary" admin@fgedu.net
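
A daily mail is easier to triage when its subject already says whether anything failed. The sketch below condenses comma-separated job records into one line that could be used as a subject. It assumes field 3 is the job state (3 = done) and field 4 the exit status, as in bpdbjobs -report -most_columns output; verify that layout on your release. The function name job_summary is hypothetical.

```shell
# job_summary: read comma-separated job records on stdin and print one line
# of the form "done=N ok=N failed=N other=N".
# Assumed columns: field 3 = job state (3 means done), field 4 = exit status.
job_summary() {
    awk -F, '
        $3 == 3 { fin++; if ($4 == 0) ok++; else failed++; next }
        { other++ }
        END { printf "done=%d ok=%d failed=%d other=%d\n", fin, ok, failed, other }
    '
}
```

A cron script could then call mail -s "NBU: $(bpdbjobs -report -most_columns | job_summary)" with the full report as the message body.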

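4.3 Backup Verification

# Run a test restore (-R takes a rename file; /tmp/test_rename.txt is an example name)
# echo "change /original/path/ to /test/restore/" > /tmp/test_rename.txt
# /usr/openv/netbackup/bin/bprestore -C test_client -t 0 -R /tmp/test_rename.txt -s 03/29/2023 -e 03/30/2023 /original/path/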

# Verify backup integrity
# /usr/openv/netbackup/bin/admincmd/bpverify -backupid client1_1234567890

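# Count the backup images (-idonly prints one line per image, whereas -l prints several records per image)
# /usr/openv/netbackup/bin/admincmd/bpimagelist -client client1 -policy File -idonly | wc -l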
10

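5. Case Studies

The following cases walk through several NBU backup system failures:

5.1 Case 1: Backup Job Fails with Status 196

Symptom: a client backup job fails with status 196: client connection rejected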

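Diagnostic steps:

# 1. Check the client processes
# ssh client1 "ps aux | grep bpbkar"

# 2. Check communication between the client and the master server (run on client1)
# /usr/openv/netbackup/bin/bpclntcmd -pn

# 3. Check the firewall rules
# iptables -L -n | grep 13782

# 4. Check the client configuration
# cat /usr/openv/netbackup/bp.conf

Resolution: the client firewall had not opened port 13782. The problem was resolved after adding the firewall rule.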
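
When reviewing bp.conf in step 4, the entries that matter most for connection problems are the SERVER lines, since a client only accepts connections from hosts listed there and the first SERVER entry names the master server. The following sketch extracts them; the function names conf_servers and lists_server are hypothetical.

```shell
# conf_servers: read bp.conf text on stdin and print the value of each
# SERVER entry, in file order. MEDIA_SERVER and other keys are ignored.
conf_servers() {
    awk -F'=' '$1 ~ /^[ \t]*SERVER[ \t]*$/ { gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2 }'
}

# lists_server NAME: succeed if NAME appears as a SERVER entry on stdin.
lists_server() {
    conf_servers | grep -Fxq -- "$1"
}
```

For example, lists_server master_server < /usr/openv/netbackup/bp.conf || echo "master_server is not listed" flags a client whose configuration does not name the expected master. The match is exact, so short and fully qualified names are treated as different hosts.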

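5.2 Case 2: Restore Job Fails with Status 2840

Symptom: a restore job fails with status 2840: no images were found for the specified client, policy, and schedule

Diagnostic steps:

# 1. Check the backup images
# /usr/openv/netbackup/bin/admincmd/bpimagelist -client client1 -policy File -l

# 2. Check the backup policy configuration
# /usr/openv/netbackup/bin/admincmd/bppllist File -U

# 3. Check the permissions of the restore directory
# ls -la /restore/path/

Resolution: the backup policy was misconfigured. After the policy was corrected and the backup was run again, the restore succeeded.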

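5.3 Case 3: NBU Services Fail to Start

Symptom: the NBU services fail to start and report "NetBackup database manager (bpdbm) failed to start"

Diagnostic steps:

# 1. Check the database status
# /usr/openv/db/bin/nbdb_ping

# 2. Check the database log
# tail -50 /usr/openv/db/log/server.log

# 3. Attempt a database repair (verify that -repair exists in your release; the documented integrity check is nbdb_admin -validate -full)
# /usr/openv/db/bin/nbdb_admin -repair

Resolution: the database was corrupted. After the repair, the services started successfully.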

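Production advice from Fengge: run failure drills regularly and maintain a detailed recovery plan, so that you can respond to and recover from failures quickly.

Fengge's tip: when troubleshooting, collect complete log information, including job logs, service logs, and system logs, so that the root cause can be analyzed accurately.

This article was compiled and published by Fengge Tutorials for study and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html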
