风哥教程参考DB2官方文档High Availability、Disaster Recovery等内容,详细介绍核心系统稳定性保障、故障预防、应急响应等。更多视频教程www.fgedu.net.cn
目录大纲
Part01-稳定性保障概述
1.1 稳定性目标
核心系统稳定性目标:
- 可用性:99.99%以上
- 性能:响应时间<100ms
- 数据完整性:零数据丢失
- 故障恢复:RTO<1小时
1.2 保障体系
- 架构设计:高可用、容灾
- 监控预警:实时监控、自动告警
- 故障预防:定期巡检、压力测试
- 应急响应:预案、演练、快速恢复
Part02-故障预防
2.1 定期巡检
#!/bin/bash
# daily_check.sh
LOG_FILE=/db2/logs/daily_check_$(date +%Y%m%d).log
echo “=== DB2 Daily Check ===” >> $LOG_FILE
echo “Date: $(date)” >> $LOG_FILE
# 1. 检查实例状态
db2pd -inst >> $LOG_FILE
# 2. 检查数据库状态
db2pd -db FGEDB -inst >> $LOG_FILE
# 3. 检查表空间使用率
db2 connect to FGEDB
db2 “SELECT tbsp_name, round(tbsp_used_pages*100.0/tbsp_total_pages,2) as usage
FROM sysibmadm.tbsp_utilization
WHERE tbsp_used_pages*100.0/tbsp_total_pages > 80” >> $LOG_FILE
# 4. 检查缓冲池命中率
db2 “SELECT bpname, data_hit_ratio_percent
FROM sysibmadm.bp_hitratio
WHERE data_hit_ratio_percent < 95" >> $LOG_FILE
# 5. 检查锁等待
db2 “SELECT COUNT(*) FROM sysibmadm.lockwaits” >> $LOG_FILE
# 6. 检查HADR状态
db2pd -db FGEDB -hadr >> $LOG_FILE
db2 connect reset
echo “=== Check Complete ===” >> $LOG_FILE
# 发送巡检报告
mail -s “DB2 Daily Check Report” admin@example.com < $LOG_FILE
2.2 压力测试
#!/bin/bash
# stress_test.sh
DBNAME=FGEDB
THREADS=100
DURATION=3600
echo “Starting stress test…”
echo “Threads: $THREADS”
echo “Duration: $DURATION seconds”
# 创建测试表
db2 connect to $DBNAME
db2 “CREATE TABLE STRESS_TEST (ID BIGINT NOT NULL, DATA VARCHAR(1000))”
db2 connect reset
# 运行压力测试
for i in $(seq 1 $THREADS); do
(
db2 connect to $DBNAME
START_TIME=$(date +%s)
while [ $(($(date +%s) – START_TIME)) -lt $DURATION ]; do
db2 “INSERT INTO STRESS_TEST VALUES (RAND()*1000000, ‘TEST DATA’)”
db2 “SELECT * FROM STRESS_TEST WHERE ID = RAND()*1000000 FETCH FIRST 1 ROW ONLY”
db2 “UPDATE STRESS_TEST SET DATA = ‘UPDATED’ WHERE ID = RAND()*1000000”
done
db2 connect reset
) &
done
wait
# 清理测试数据
db2 connect to $DBNAME
db2 “DROP TABLE STRESS_TEST”
db2 connect reset
echo “Stress test completed”
Part03-监控预警
3.1 实时监控
#!/bin/bash
# monitor_db2.sh
DBNAME=FGEDB
ALERT_THRESHOLD_CPU=80
ALERT_THRESHOLD_MEM=90
ALERT_THRESHOLD_DISK=85
while true; do
# 监控CPU使用率
CPU_USAGE=$(top -bn1 | grep “Cpu(s)” | awk ‘{print $2}’ | cut -d% -f1)
if [ $(echo “$CPU_USAGE > $ALERT_THRESHOLD_CPU” | bc) -eq 1 ]; then
echo “CPU usage alert: $CPU_USAGE%” | mail -s “DB2 CPU Alert” admin@example.com
fi
# 监控内存使用率
MEM_USAGE=$(free | grep Mem | awk ‘{print ($3/$2) * 100.0}’)
if [ $(echo “$MEM_USAGE > $ALERT_THRESHOLD_MEM” | bc) -eq 1 ]; then
echo “Memory usage alert: $MEM_USAGE%” | mail -s “DB2 Memory Alert” admin@example.com
fi
# 监控磁盘使用率
DISK_USAGE=$(df -h /db2data | tail -1 | awk ‘{print $5}’ | cut -d% -f1)
if [ $DISK_USAGE -gt $ALERT_THRESHOLD_DISK ]; then
echo “Disk usage alert: $DISK_USAGE%” | mail -s “DB2 Disk Alert” admin@example.com
fi
# 监控数据库连接数
db2 connect to $DBNAME
CONN_COUNT=$(db2 “SELECT COUNT(*) FROM sysibmadm.applications” | tail -1 | awk ‘{print $1}’)
if [ $CONN_COUNT -gt 500 ]; then
echo “Connection count alert: $CONN_COUNT” | mail -s “DB2 Connection Alert” admin@example.com
fi
db2 connect reset
# 监控表空间使用率
db2 connect to $DBNAME
db2 “SELECT tbsp_name, round(tbsp_used_pages*100.0/tbsp_total_pages,2)
FROM sysibmadm.tbsp_utilization
WHERE tbsp_used_pages*100.0/tbsp_total_pages > 85” > /tmp/tablespace_alert.txt
if [ -s /tmp/tablespace_alert.txt ]; then
mail -s “DB2 Tablespace Alert” admin@example.com < /tmp/tablespace_alert.txt
fi
db2 connect reset
sleep 60
done
3.2 自动告警
CREATE OR REPLACE PROCEDURE SP_CHECK_AND_ALERT()
LANGUAGE SQL
BEGIN
DECLARE v_cpu_usage DECIMAL(5, 2);
DECLARE v_mem_usage DECIMAL(5, 2);
DECLARE v_disk_usage DECIMAL(5, 2);
DECLARE v_conn_count INTEGER;
DECLARE v_tablespace_usage DECIMAL(5, 2);
— 检查CPU使用率
SELECT CPU_USAGE INTO v_cpu_usage
FROM SYSIBMADM.SNAPDBM
WHERE CPU_USAGE > 80;
IF v_cpu_usage > 80 THEN
INSERT INTO ALERT_LOG (ALERT_TIME, ALERT_TYPE, ALERT_MESSAGE)
VALUES (CURRENT TIMESTAMP, ‘CPU’, ‘CPU usage exceeds 80%’);
END IF;
— 检查内存使用率
SELECT MEMORY_USAGE INTO v_mem_usage
FROM SYSIBMADM.SNAPDBM
WHERE MEMORY_USAGE > 90;
IF v_mem_usage > 90 THEN
INSERT INTO ALERT_LOG (ALERT_TIME, ALERT_TYPE, ALERT_MESSAGE)
VALUES (CURRENT TIMESTAMP, ‘MEMORY’, ‘Memory usage exceeds 90%’);
END IF;
— 检查表空间使用率
SELECT MAX(TBSP_USED_PAGES * 100.0 / TBSP_TOTAL_PAGES)
INTO v_tablespace_usage
FROM SYSIBMADM.TBSP_UTILIZATION;
IF v_tablespace_usage > 85 THEN
INSERT INTO ALERT_LOG (ALERT_TIME, ALERT_TYPE, ALERT_MESSAGE)
VALUES (CURRENT TIMESTAMP, ‘TABLESPACE’, ‘Tablespace usage exceeds 85%’);
END IF;
— 检查连接数
SELECT COUNT(*) INTO v_conn_count
FROM SYSIBMADM.APPLICATIONS;
IF v_conn_count > 500 THEN
INSERT INTO ALERT_LOG (ALERT_TIME, ALERT_TYPE, ALERT_MESSAGE)
VALUES (CURRENT TIMESTAMP, ‘CONNECTION’, ‘Connection count exceeds 500’);
END IF;
END;
# 定时执行告警检查
# crontab -e
# */5 * * * * /db2/scripts/check_and_alert.sh
Part04-应急响应
4.1 应急预案
# DB2核心系统应急响应预案
## 1. 实例故障
### 症状
– 实例无法启动
– 数据库无法连接
### 处理步骤
1. 检查实例状态
db2pd -inst
2. 查看实例日志
tail -f /db2home/db2inst1/sqllib/db2dump/db2diag.log
3. 重启实例
db2stop
db2start
4. 检查数据库状态
db2 connect to FGEDB
## 2. 表空间满
### 症状
– 表空间使用率超过95%
– 插入操作失败
### 处理步骤
1. 检查表空间使用
db2 “SELECT * FROM sysibmadm.tbsp_utilization”
2. 扩展表空间
ALTER TABLESPACE USERSPACE1 EXTEND (FILE ‘/db2data/userspace1.dms’ 10G)
3. 清理历史数据
DELETE FROM LOGS WHERE LOG_TIME < CURRENT DATE - 30 DAYS
4. 重组表回收空间
REORG TABLE LARGE_TABLE
## 3. 性能故障
### 症状
- 响应时间慢
- CPU使用率高
### 处理步骤
1. 查看慢查询
db2 "SELECT * FROM sysibmadm.top_dynamic_sql ORDER BY total_exec_time DESC"
2. 查看锁等待
db2 "SELECT * FROM sysibmadm.lockwaits"
3. 终止慢查询
db2 "FORCE APPLICATION (agent_id)"
4. 优化SQL
-- 添加索引
-- 更新统计信息
4.2 故障演练
#!/bin/bash
# disaster_drill.sh
echo “=== DB2 Disaster Drill ===”
echo “Date: $(date)”
# 1. 模拟实例故障
echo “1. Simulating instance failure…”
db2stop force
sleep 10
db2start
echo “Instance recovered”
# 2. 模拟表空间满
echo “2. Simulating tablespace full…”
db2 connect to FGEDB
db2 “CREATE TABLESPACE TEST_FULL PAGESIZE 4K MANAGED BY DATABASE USING (FILE ‘/db2data/test_full.dms’ 10M)”
db2 “CREATE TABLE TEST_TABLE (ID INT) IN TEST_FULL”
for i in {1..10000}; do
db2 “INSERT INTO TEST_TABLE VALUES ($i)” 2>/dev/null
done
db2 “DROP TABLE TEST_TABLE”
db2 “DROP TABLESPACE TEST_FULL”
db2 connect reset
echo “Tablespace test completed”
# 3. 模拟HADR切换
echo “3. Simulating HADR failover…”
db2pd -db FGEDB -hadr
echo “HADR status checked”
# 4. 模拟备份恢复
echo “4. Simulating backup and restore…”
BACKUP_FILE=$(ls -t /db2backup/*.0 | head -1)
echo “Latest backup: $BACKUP_FILE”
echo “Backup and restore test completed”
echo “=== Disaster Drill Complete ===”
Part05-风哥经验总结与分享
5.1 稳定性保障要点
- 建立完善的监控体系
- 定期进行巡检
- 定期压力测试
- 建立应急预案
- 定期故障演练
- 持续优化改进
5.2 保障建议
| 保障项 | 频率 | 责任人 |
|---|---|---|
| 日常巡检 | 每天 | 运维人员 |
| 性能监控 | 实时 | 监控系统 |
| 压力测试 | 每月 | 测试团队 |
| 故障演练 | 每季度 | 运维团队 |
5.3 运维要点
- 建立监控告警机制
- 定期执行巡检
- 定期压力测试
- 建立应急预案
- 定期故障演练
- 持续优化改进
学习交流加群风哥微信: itpux-com
风哥Oracle/MySQL/PostgreSQL/Greenplum/DB2/Redis等数据库培训课程,10年一线实战经验,企业级培训,真正掌握数据库核心技术!
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
