1. 首页 > DB2教程 > 正文

DB2教程FG089-DB2核心系统稳定性保障实战

风哥教程参考DB2官方文档High Availability、Disaster Recovery等内容,详细介绍核心系统稳定性保障、故障预防、应急响应等。更多视频教程www.fgedu.net.cn

目录大纲

Part01-稳定性保障概述

1.1 稳定性目标

核心系统稳定性目标:

  • 可用性:99.99%以上
  • 性能:响应时间<100ms
  • 数据完整性:零数据丢失
  • 故障恢复:RTO<1小时

1.2 保障体系

  • 架构设计:高可用、容灾
  • 监控预警:实时监控、自动告警
  • 故障预防:定期巡检、压力测试
  • 应急响应:预案、演练、快速恢复

Part02-故障预防

2.1 定期巡检

# 每日巡检脚本
#!/bin/bash
# daily_check.sh

LOG_FILE=/db2/logs/daily_check_$(date +%Y%m%d).log

echo “=== DB2 Daily Check ===” >> $LOG_FILE
echo “Date: $(date)” >> $LOG_FILE

# 1. 检查实例状态
db2pd -inst >> $LOG_FILE

# 2. 检查数据库状态
db2pd -db FGEDB -inst >> $LOG_FILE

# 3. 检查表空间使用率
db2 connect to FGEDB
db2 “SELECT tbsp_name, round(tbsp_used_pages*100.0/tbsp_total_pages,2) as usage
FROM sysibmadm.tbsp_utilization
WHERE tbsp_used_pages*100.0/tbsp_total_pages > 80” >> $LOG_FILE

# 4. 检查缓冲池命中率
db2 “SELECT bpname, data_hit_ratio_percent
FROM sysibmadm.bp_hitratio
WHERE data_hit_ratio_percent < 95" >> $LOG_FILE

# 5. 检查锁等待
db2 “SELECT COUNT(*) FROM sysibmadm.lockwaits” >> $LOG_FILE

# 6. 检查HADR状态
db2pd -db FGEDB -hadr >> $LOG_FILE

db2 connect reset

echo “=== Check Complete ===” >> $LOG_FILE

# 发送巡检报告
mail -s “DB2 Daily Check Report” admin@example.com < $LOG_FILE

2.2 压力测试

# 压力测试脚本
#!/bin/bash
# stress_test.sh

DBNAME=FGEDB
THREADS=100
DURATION=3600

echo “Starting stress test…”
echo “Threads: $THREADS”
echo “Duration: $DURATION seconds”

# 创建测试表
db2 connect to $DBNAME
db2 “CREATE TABLE STRESS_TEST (ID BIGINT NOT NULL, DATA VARCHAR(1000))”
db2 connect reset

# 运行压力测试
for i in $(seq 1 $THREADS); do
(
db2 connect to $DBNAME
START_TIME=$(date +%s)
while [ $(($(date +%s) – START_TIME)) -lt $DURATION ]; do
db2 “INSERT INTO STRESS_TEST VALUES (RAND()*1000000, ‘TEST DATA’)”
db2 “SELECT * FROM STRESS_TEST WHERE ID = RAND()*1000000 FETCH FIRST 1 ROW ONLY”
db2 “UPDATE STRESS_TEST SET DATA = ‘UPDATED’ WHERE ID = RAND()*1000000”
done
db2 connect reset
) &
done

wait

# 清理测试数据
db2 connect to $DBNAME
db2 “DROP TABLE STRESS_TEST”
db2 connect reset

echo “Stress test completed”

Part03-监控预警

3.1 实时监控

# 监控脚本
#!/bin/bash
# monitor_db2.sh

DBNAME=FGEDB
ALERT_THRESHOLD_CPU=80
ALERT_THRESHOLD_MEM=90
ALERT_THRESHOLD_DISK=85

while true; do
# 监控CPU使用率
CPU_USAGE=$(top -bn1 | grep “Cpu(s)” | awk ‘{print $2}’ | cut -d% -f1)
if [ $(echo “$CPU_USAGE > $ALERT_THRESHOLD_CPU” | bc) -eq 1 ]; then
echo “CPU usage alert: $CPU_USAGE%” | mail -s “DB2 CPU Alert” admin@example.com
fi

# 监控内存使用率
MEM_USAGE=$(free | grep Mem | awk ‘{print ($3/$2) * 100.0}’)
if [ $(echo “$MEM_USAGE > $ALERT_THRESHOLD_MEM” | bc) -eq 1 ]; then
echo “Memory usage alert: $MEM_USAGE%” | mail -s “DB2 Memory Alert” admin@example.com
fi

# 监控磁盘使用率
DISK_USAGE=$(df -h /db2data | tail -1 | awk ‘{print $5}’ | cut -d% -f1)
if [ $DISK_USAGE -gt $ALERT_THRESHOLD_DISK ]; then
echo “Disk usage alert: $DISK_USAGE%” | mail -s “DB2 Disk Alert” admin@example.com
fi

# 监控数据库连接数
db2 connect to $DBNAME
CONN_COUNT=$(db2 “SELECT COUNT(*) FROM sysibmadm.applications” | tail -1 | awk ‘{print $1}’)
if [ $CONN_COUNT -gt 500 ]; then
echo “Connection count alert: $CONN_COUNT” | mail -s “DB2 Connection Alert” admin@example.com
fi
db2 connect reset

# 监控表空间使用率
db2 connect to $DBNAME
db2 “SELECT tbsp_name, round(tbsp_used_pages*100.0/tbsp_total_pages,2)
FROM sysibmadm.tbsp_utilization
WHERE tbsp_used_pages*100.0/tbsp_total_pages > 85” > /tmp/tablespace_alert.txt
if [ -s /tmp/tablespace_alert.txt ]; then
mail -s “DB2 Tablespace Alert” admin@example.com < /tmp/tablespace_alert.txt fi db2 connect reset sleep 60 done

3.2 自动告警

# 创建告警存储过程
CREATE OR REPLACE PROCEDURE SP_CHECK_AND_ALERT()
LANGUAGE SQL
BEGIN
DECLARE v_cpu_usage DECIMAL(5, 2);
DECLARE v_mem_usage DECIMAL(5, 2);
DECLARE v_disk_usage DECIMAL(5, 2);
DECLARE v_conn_count INTEGER;
DECLARE v_tablespace_usage DECIMAL(5, 2);

— 检查CPU使用率
SELECT CPU_USAGE INTO v_cpu_usage
FROM SYSIBMADM.SNAPDBM
WHERE CPU_USAGE > 80;

IF v_cpu_usage > 80 THEN
INSERT INTO ALERT_LOG (ALERT_TIME, ALERT_TYPE, ALERT_MESSAGE)
VALUES (CURRENT TIMESTAMP, ‘CPU’, ‘CPU usage exceeds 80%’);
END IF;

— 检查内存使用率
SELECT MEMORY_USAGE INTO v_mem_usage
FROM SYSIBMADM.SNAPDBM
WHERE MEMORY_USAGE > 90;

IF v_mem_usage > 90 THEN
INSERT INTO ALERT_LOG (ALERT_TIME, ALERT_TYPE, ALERT_MESSAGE)
VALUES (CURRENT TIMESTAMP, ‘MEMORY’, ‘Memory usage exceeds 90%’);
END IF;

— 检查表空间使用率
SELECT MAX(TBSP_USED_PAGES * 100.0 / TBSP_TOTAL_PAGES)
INTO v_tablespace_usage
FROM SYSIBMADM.TBSP_UTILIZATION;

IF v_tablespace_usage > 85 THEN
INSERT INTO ALERT_LOG (ALERT_TIME, ALERT_TYPE, ALERT_MESSAGE)
VALUES (CURRENT TIMESTAMP, ‘TABLESPACE’, ‘Tablespace usage exceeds 85%’);
END IF;

— 检查连接数
SELECT COUNT(*) INTO v_conn_count
FROM SYSIBMADM.APPLICATIONS;

IF v_conn_count > 500 THEN
INSERT INTO ALERT_LOG (ALERT_TIME, ALERT_TYPE, ALERT_MESSAGE)
VALUES (CURRENT TIMESTAMP, ‘CONNECTION’, ‘Connection count exceeds 500’);
END IF;
END;

# 定时执行告警检查
# crontab -e
# */5 * * * * /db2/scripts/check_and_alert.sh

Part04-应急响应

4.1 应急预案

# 应急预案文档
# DB2核心系统应急响应预案

## 1. 实例故障
### 症状
– 实例无法启动
– 数据库无法连接

### 处理步骤
1. 检查实例状态
db2pd -inst

2. 查看实例日志
tail -f /db2home/db2inst1/sqllib/db2dump/db2diag.log

3. 重启实例
db2stop
db2start

4. 检查数据库状态
db2 connect to FGEDB

## 2. 表空间满
### 症状
– 表空间使用率超过95%
– 插入操作失败

### 处理步骤
1. 检查表空间使用
db2 “SELECT * FROM sysibmadm.tbsp_utilization”

2. 扩展表空间
ALTER TABLESPACE USERSPACE1 EXTEND (FILE ‘/db2data/userspace1.dms’ 10G)

3. 清理历史数据
DELETE FROM LOGS WHERE LOG_TIME < CURRENT DATE - 30 DAYS 4. 重组表回收空间 REORG TABLE LARGE_TABLE ## 3. 性能故障 ### 症状 - 响应时间慢 - CPU使用率高 ### 处理步骤 1. 查看慢查询 db2 "SELECT * FROM sysibmadm.top_dynamic_sql ORDER BY total_exec_time DESC" 2. 查看锁等待 db2 "SELECT * FROM sysibmadm.lockwaits" 3. 终止慢查询 db2 "FORCE APPLICATION (agent_id)" 4. 优化SQL -- 添加索引 -- 更新统计信息

4.2 故障演练

# 故障演练脚本
#!/bin/bash
# disaster_drill.sh

echo “=== DB2 Disaster Drill ===”
echo “Date: $(date)”

# 1. 模拟实例故障
echo “1. Simulating instance failure…”
db2stop force
sleep 10
db2start
echo “Instance recovered”

# 2. 模拟表空间满
echo “2. Simulating tablespace full…”
db2 connect to FGEDB
db2 “CREATE TABLESPACE TEST_FULL PAGESIZE 4K MANAGED BY DATABASE USING (FILE ‘/db2data/test_full.dms’ 10M)”
db2 “CREATE TABLE TEST_TABLE (ID INT) IN TEST_FULL”
for i in {1..10000}; do
db2 “INSERT INTO TEST_TABLE VALUES ($i)” 2>/dev/null
done
db2 “DROP TABLE TEST_TABLE”
db2 “DROP TABLESPACE TEST_FULL”
db2 connect reset
echo “Tablespace test completed”

# 3. 模拟HADR切换
echo “3. Simulating HADR failover…”
db2pd -db FGEDB -hadr
echo “HADR status checked”

# 4. 模拟备份恢复
echo “4. Simulating backup and restore…”
BACKUP_FILE=$(ls -t /db2backup/*.0 | head -1)
echo “Latest backup: $BACKUP_FILE”
echo “Backup and restore test completed”

echo “=== Disaster Drill Complete ===”

Part05-风哥经验总结与分享

5.1 稳定性保障要点

  • 建立完善的监控体系
  • 定期进行巡检
  • 定期压力测试
  • 建立应急预案
  • 定期故障演练
  • 持续优化改进

5.2 保障建议

保障项 频率 责任人
日常巡检 每天 运维人员
性能监控 实时 监控系统
压力测试 每月 测试团队
故障演练 每季度 运维团队

5.3 运维要点

  • 建立监控告警机制
  • 定期执行巡检
  • 定期压力测试
  • 建立应急预案
  • 定期故障演练
  • 持续优化改进
更多视频教程www.fgedu.net.cn
学习交流加群风哥微信: itpux-com
风哥Oracle/MySQL/PostgreSQL/Greenplum/DB2/Redis等数据库培训课程,10年一线实战经验,企业级培训,真正掌握数据库核心技术!

本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html

联系我们

在线咨询:点击这里给我发消息

微信号:itpux-com

工作日:9:30-18:30,节假日休息