PostgreSQL教程FG114-PG健康检查实战教程
本文档详细介绍PostgreSQL数据库健康检查的方法和流程,包括数据库状态检查、性能检查、安全检查等,风哥教程参考PostgreSQL官方文档Monitoring内容,适合DBA在生产环境中进行数据库健康评估。
Part01-基础概念与理论知识
1.1 PostgreSQL健康检查概念
PostgreSQL健康检查是对数据库系统进行全面评估的过程,通过检查各项指标判断数据库的健康状况。健康检查包括可用性检查、性能检查、安全检查、容量检查等多个维度,是保障数据库稳定运行的重要手段。更多视频教程www.fgedu.net.cn
- 可用性检查:服务状态、连接状态、复制状态
- 性能检查:响应时间、吞吐量、资源利用率
- 安全检查:权限配置、审计日志、备份状态
- 容量检查:存储空间、连接数、表大小
- 一致性检查:数据完整性、索引有效性
1.2 PostgreSQL健康检查架构
# 1. 检查层次
# 基础层:服务可用性、网络连通性
# 资源层:CPU、内存、磁盘、I/O
# 数据库层:连接、事务、锁、缓冲区
# 应用层:查询性能、业务指标
# 2. 健康评分体系
# 优秀(90-100):各项指标正常
# 良好(80-89):大部分指标正常,有小问题
# 一般(70-79):存在一些问题,需要关注
# 较差(60-69):存在较多问题,需要处理
# 危险(<60):存在严重问题,需要立即处理
# 3. 检查流程
┌─────────────────────────────────────────────────┐
│ 健康检查启动 │
└─────────────────────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│可用性 │ │性能 │ │安全 │
│检查 │ │检查 │ │检查 │
└────┬───┘ └────┬───┘ └────┬───┘
│ │ │
└───────────────┼───────────────┘
│
▼
┌─────────────────────┐
│ 综合评分与报告 │
└─────────────────────┘
# 4. 检查工具
# pg_stat_activity:会话和连接状态
# pg_stat_database:数据库统计信息
# pg_stat_bgwriter:后台写入统计
# pg_stat_replication:复制状态
# pg_stat_user_tables:表统计信息
1.3 PostgreSQL健康检查项目
PostgreSQL健康检查项目:
- 服务可用性:数据库服务状态、监听端口、连接响应
- 连接健康:连接数、活跃会话、空闲事务
- 复制健康:主从状态、复制延迟、WAL状态
- 存储健康:磁盘空间、表空间、表膨胀
- 性能健康:缓存命中率、检查点、锁等待
- 安全健康:用户权限、密码策略、审计日志
Part02-生产环境规划与建议
2.1 PostgreSQL健康检查规划
# 1. 检查频率规划
# 快速检查:每5分钟(服务可用性)
# 标准检查:每小时(关键指标)
# 全面检查:每天(所有指标)
# 深度检查:每周(性能分析)
# 2. 检查项目规划
# 快速检查:
# – 数据库服务状态
# – 连接响应时间
# – 主从复制状态
# 标准检查:
# – 连接数使用率
# – 复制延迟
# – 锁等待数量
# – 磁盘使用率
# 全面检查:
# – 所有标准检查项目
# – 表空间使用情况
# – 表膨胀率
# – 索引使用率
# – 缓存命中率
# 深度检查:
# – 所有全面检查项目
# – 慢查询分析
# – 执行计划分析
# – 安全审计
# – 备份验证
# 3. 告警阈值规划
# 紧急(红色):
# – 数据库服务不可用
# – 磁盘使用率 > 95%
# – 复制中断
# 警告(黄色):
# – 连接数使用率 > 80%
# – 磁盘使用率 > 85%
# – 复制延迟 > 10MB
# 提示(蓝色):
# – 表膨胀率 > 20%
# – 索引未使用
# – 慢查询增加
2.2 PostgreSQL健康检查阈值
PostgreSQL健康检查阈值:
- 连接数使用率:正常<70%,警告70-90%,危险>90%
- 磁盘使用率:正常<80%,警告80-90%,危险>90%
- 复制延迟:正常<1MB,警告1-10MB,危险>10MB
- 缓存命中率:正常>95%,警告90-95%,危险<90%
- 表膨胀率:正常<10%,警告10-30%,危险>30%
2.3 PostgreSQL健康检查报告
# ========================================
# PostgreSQL数据库健康检查报告
# ========================================
# 基本信息
检查时间:YYYY-MM-DD HH:MM:SS
数据库主机:192.168.1.100
数据库版本:PostgreSQL 18.0
数据库名称:fgedudb
# 健康评分
总评分:85分(良好)
# 分类评分
可用性评分:95分(优秀)
性能评分:80分(良好)
安全评分:75分(一般)
容量评分:90分(优秀)
# 详细检查结果
## 1. 可用性检查
[✓] 数据库服务状态:正常
[✓] 连接响应时间:5ms
[✓] 主从复制状态:正常
## 2. 性能检查
[✓] 缓存命中率:98.5%
[!] 连接数使用率:75%
[!] 锁等待数量:5
## 3. 安全检查
[✓] 超级用户数量:2
[!] 密码过期用户:3
[✓] 审计日志状态:正常
## 4. 容量检查
[✓] 磁盘使用率:65%
[!] 表膨胀率:15%
[✓] 表空间使用率:60%
# 问题与建议
1. 连接数使用率较高,建议增加max_connections
2. 存在密码过期用户,建议及时更新
3. 部分表存在膨胀,建议执行VACUUM
# 处理计划
1. 本周内调整连接数配置
2. 通知用户更新密码
3. 安排VACUUM维护窗口
Part03-生产环境项目实施方案
3.1 PostgreSQL数据库健康检查实战
3.1.1 服务可用性检查
# 1. 检查数据库服务状态
$ pg_isready -h fgedu.localhost -p 5432 -U fgedu -d fgedudb
fgedu.localhost:5432 – accepting connections
# 2. 检查数据库进程
$ ps -ef | grep postgres | grep -v grep
pgsql 12345 1 0 Apr01 ? 00:20:00 /postgresql/fgapp/bin/postgres -D /postgresql/fgdata
pgsql 12346 12345 0 Apr01 ? 00:15:00 postgres: checkpointer
pgsql 12347 12345 0 Apr01 ? 00:10:00 postgres: background writer
pgsql 12348 12345 0 Apr01 ? 00:05:00 postgres: walwriter
pgsql 12349 12345 0 Apr01 ? 00:08:00 postgres: autovacuum launcher
pgsql 12350 12345 0 Apr01 ? 00:02:00 postgres: stats collector
pgsql 12351 12345 0 Apr01 ? 00:01:00 postgres: logical replication launcher
# 3. 检查监听端口
$ netstat -tlnp | grep 5432
tcp 0 0 192.168.1.100:5432 0.0.0.0:* LISTEN 12345/postgres
tcp 0 0 127.0.0.1:5432 0.0.0.0:* LISTEN 12345/postgres
# 4. 测试连接响应时间
$ time psql -h fgedu.localhost -p 5432 -U fgedu -d fgedudb -c “SELECT 1;” > /dev/null
real 0m0.005s
user 0m0.001s
sys 0m0.002s
# 5. 检查数据库运行时间
$ psql -U fgedu -d fgedudb -c ”
SELECT
pg_postmaster_start_time() as start_time,
now() – pg_postmaster_start_time() as uptime;
”
start_time | uptime
————————+——————-
2026-04-01 10:00:00 | 6 days 02:30:00
(1 row)
3.1.2 连接健康检查
# 1. 检查连接数使用情况
$ psql -U fgedu -d fgedudb -c ”
SELECT
setting::int as max_connections,
(SELECT count(*) FROM pg_stat_activity) as current_connections,
round((SELECT count(*) FROM pg_stat_activity)::numeric / setting::int * 100, 2) as usage_percent,
(SELECT count(*) FROM pg_stat_activity WHERE state = ‘active’) as active_connections,
(SELECT count(*) FROM pg_stat_activity WHERE state = ‘idle’) as idle_connections,
(SELECT count(*) FROM pg_stat_activity WHERE state = ‘idle in transaction’) as idle_in_transaction
FROM pg_settings
WHERE name = ‘max_connections’;
”
max_connections | current_connections | usage_percent | active_connections | idle_connections | idle_in_transaction
—————–+———————+—————+——————–+——————+———————
300 | 50 | 16.67 | 10 | 35 | 5
(1 row)
# 2. 检查长时间空闲事务
$ psql -U fgedu -d fgedudb -c ”
SELECT
pid,
usename,
fgapplication_name,
client_addr,
now() – state_change as idle_duration,
left(query, 50) as query_preview
FROM pg_stat_activity
WHERE state = ‘idle in transaction’
AND now() – state_change > interval ‘5 minutes’
ORDER BY idle_duration DESC;
”
pid | usename | fgapplication_name | client_addr | idle_duration | query_preview
——+———+——————+—————+—————-+—————————————————-
9876 | fgapp_usr | fgedu_fgapp | 192.168.1.51 | 00:15:00.12345 | UPDATE fgedu_orders SET status = ‘completed’ WHER
(1 row)
# 3. 检查长时间运行的查询
$ psql -U fgedu -d fgedudb -c ”
SELECT
pid,
usename,
fgapplication_name,
client_addr,
now() – query_start as duration,
state,
left(query, 50) as query_preview
FROM pg_stat_activity
WHERE state = ‘active’
AND now() – query_start > interval ‘5 minutes’
AND query NOT LIKE ‘%pg_stat_activity%’
ORDER BY duration DESC;
”
pid | usename | fgapplication_name | client_addr | duration | state | query_preview
——+———+——————+—————+—————–+——–+—————————————————-
8765 | fgapp_usr | fgedu_fgapp | 192.168.1.50 | 00:10:30.123456 | active | SELECT * FROM fgedu_orders WHERE create_time > ‘2
(1 row)
# 4. 检查连接来源分布
$ psql -U fgedu -d fgedudb -c ”
SELECT
client_addr,
count(*) as connection_count,
count(*) FILTER (WHERE state = ‘active’) as active,
count(*) FILTER (WHERE state = ‘idle’) as idle
FROM pg_stat_activity
WHERE client_addr IS NOT NULL
GROUP BY client_addr
ORDER BY connection_count DESC;
”
client_addr | connection_count | active | idle
—————+——————+——–+——
192.168.1.50 | 20 | 5 | 15
192.168.1.51 | 15 | 3 | 12
192.168.1.52 | 10 | 2 | 8
(3 rows)
3.2 PostgreSQL性能健康检查实战
3.2.1 缓存和缓冲区检查
# 1. 检查缓存命中率
$ psql -U fgedu -d fgedudb -c ”
SELECT
datname,
blks_read,
blks_hit,
round(100.0 * blks_hit / (blks_hit + blks_read), 2) as cache_hit_ratio
FROM pg_stat_database
WHERE datname = ‘fgedudb’;
”
datname | blks_read | blks_hit | cache_hit_ratio
———-+———–+———–+—————–
fgedudb | 1234567 | 123456789 | 99.01
(1 row)
# 2. 检查后台写入统计
$ psql -U fgedu -d fgedudb -c ”
SELECT
checkpoints_timed,
checkpoints_req,
round(100.0 * checkpoints_req / (checkpoints_timed + checkpoints_req), 2) as checkpoint_req_ratio,
buffers_checkpoint,
buffers_clean,
buffers_backend
FROM pg_stat_bgwriter;
”
checkpoints_timed | checkpoints_req | checkpoint_req_ratio | buffers_checkpoint | buffers_clean | buffers_backend
——————-+—————–+———————-+——————–+—————+—————–
1234 | 56 | 4.34 | 123456789 | 1234567 | 12345678
(1 row)
# 3. 检查检查点频率
$ psql -U fgedu -d fgedudb -c ”
SELECT
name,
setting,
unit
FROM pg_settings
WHERE name IN (‘checkpoint_timeout’, ‘max_wal_size’, ‘checkpoint_completion_target’);
”
name | setting | unit
—————————–+———+——
checkpoint_timeout | 300 | s
max_wal_size | 2048 | MB
checkpoint_completion_target| 0.9 |
(3 rows)
# 4. 检查WAL写入统计
$ psql -U fgedu -d fgedudb -c ”
SELECT
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), ‘0/0’)) as wal_written,
count(*) as wal_files
FROM pg_ls_waldir();
”
wal_written | wal_files
————-+———–
50 GB | 25
(1 row)
3.2.2 锁和事务检查
# 1. 检查锁等待情况
$ psql -U fgedu -d fgedudb -c ”
SELECT
locktype,
mode,
count(*) as lock_count,
count(*) FILTER (WHERE NOT granted) as waiting_count
FROM pg_locks
GROUP BY locktype, mode
ORDER BY lock_count DESC;
”
locktype | mode | lock_count | waiting_count
————–+———————+————+—————
relation | AccessShareLock | 150 | 0
transactionid| ExclusiveLock | 50 | 0
tuple | ExclusiveLock | 20 | 0
relation | RowExclusiveLock | 15 | 0
relation | ShareUpdateExclusiv | 5 | 0
(5 rows)
# 2. 检查阻塞会话
$ psql -U fgedu -d fgedudb -c ”
SELECT
blocked.pid as blocked_pid,
blocked.usename as blocked_user,
blocking.pid as blocking_pid,
blocking.usename as blocking_user,
blocked.mode as blocked_mode,
now() – blocked.query_start as wait_duration
FROM pg_locks blocked
JOIN pg_stat_activity blocked ON blocked.pid = blocked.pid
JOIN pg_locks blocking ON blocked.locktype = blocking.locktype
AND blocked.database IS NOT DISTINCT FROM blocking.database
AND blocked.relation IS NOT DISTINCT FROM blocking.relation
AND blocked.page IS NOT DISTINCT FROM blocking.page
AND blocked.tuple IS NOT DISTINCT FROM blocking.tuple
AND blocked.pid != blocking.pid
JOIN pg_stat_activity blocking ON blocking.pid = blocking.pid
WHERE NOT blocked.granted;
”
blocked_pid | blocked_user | blocking_pid | blocking_user | blocked_mode | wait_duration
————-+————–+————–+—————+————–+—————
12345 | fgapp_usr | 12346 | fgapp_usr | Exclusive | 00:00:30
(1 row)
# 3. 检查事务统计
$ psql -U fgedu -d fgedudb -c ”
SELECT
datname,
xact_commit,
xact_rollback,
round(100.0 * xact_rollback / NULLIF(xact_commit + xact_rollback, 0), 2) as rollback_ratio
FROM pg_stat_database
WHERE datname = ‘fgedudb’;
”
datname | xact_commit | xact_rollback | rollback_ratio
———-+————-+—————+—————-
fgedudb | 123456789 | 123456 | 0.10
(1 row)
# 4. 检查死锁情况
$ psql -U fgedu -d fgedudb -c ”
SELECT
datname,
deadlocks,
conflicts
FROM pg_stat_database
WHERE datname = ‘fgedudb’;
”
datname | deadlocks | conflicts
———-+———–+———–
fgedudb | 0 | 0
(1 row)
3.3 PostgreSQL安全健康检查实战
# 1. 检查超级用户数量
$ psql -U fgedu -d fgedudb -c ”
SELECT
rolname,
rolsuper,
rolcreaterole,
rolcreatedb,
rolvaliduntil
FROM pg_roles
WHERE rolsuper = true;
”
rolname | rolsuper | rolcreaterole | rolcreatedb | rolvaliduntil
———+———-+—————+————-+—————
pgsql | t | t | t |
fgedu | t | t | t |
(2 rows)
# 2. 检查密码过期用户
$ psql -U fgedu -d fgedudb -c ”
SELECT
rolname,
rolvaliduntil,
CASE
WHEN rolvaliduntil IS NULL THEN ‘永不过期’
WHEN rolvaliduntil < now() THEN '已过期'
ELSE '正常'
END as password_status
FROM pg_roles
WHERE rolcanlogin = true
ORDER BY rolvaliduntil;
"
rolname | rolvaliduntil | password_status
----------+------------------------+-----------------
fgapp_usr1 | 2026-03-01 00:00:00 | 已过期
fgapp_usr2 | 2026-04-01 00:00:00 | 已过期
fgapp_usr3 | 2026-05-01 00:00:00 | 正常
fgedu | | 永不过期
(4 rows)
# 3. 检查public模式权限
$ psql -U fgedu -d fgedudb -c "
SELECT
nspname,
has_schema_privilege('public', nspname, 'CREATE') as public_create,
has_schema_privilege('public', nspname, 'USAGE') as public_usage
FROM pg_namespace
WHERE nspname NOT LIKE 'pg_%'
AND nspname != 'information_schema';
"
nspname | public_create | public_usage
---------+---------------+--------------
public | t | t
(1 row)
# 4. 检查SSL连接配置
$ psql -U fgedu -d fgedudb -c "
SELECT
name,
setting,
short_desc
FROM pg_settings
WHERE name LIKE '%ssl%';
"
name | setting | short_desc
------------------+---------+--------------------------------------------------
ssl | on | Enables SSL connections.
ssl_cert_file | server.crt | Location of the SSL server certificate file.
ssl_key_file | server.key | Location of the SSL server private key file.
(3 rows)
# 5. 检查审计日志配置
$ psql -U fgedu -d fgedudb -c "
SELECT
name,
setting,
short_desc
FROM pg_settings
WHERE name IN ('log_connections', 'log_disconnections', 'log_statement', 'log_destination');
"
name | setting | short_desc
-------------------+---------+--------------------------------------------------
log_connections | on | Logs each successful connection.
log_disconnections| on | Logs end of a session, including duration.
log_statement | ddl | Sets the type of statements logged.
log_destination | csvlog | Sets the destination for server log output.
(3 rows)
Part04-生产案例与实战讲解
4.1 PostgreSQL全面健康检查案例
# pg_health_check.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# PostgreSQL全面健康检查脚本
PGHOME=/postgresql/fgapp
PGHOST=fgedu.localhost
PGPORT=5432
PGUSER=fgedu
PGDATABASE=fgedudb
REPORT_FILE=/postgresql/scripts/reports/health_check_$(date +%Y%m%d).log
# 初始化评分
TOTAL_SCORE=100
SCORE_DEDUCT=0
log_message() {
echo “[$(date ‘+%Y-%m-%d %H:%M:%S’)] $1” | tee -a ${REPORT_FILE}
}
check_service() {
log_message “=== 服务可用性检查 ===”
# 检查服务状态
if ${PGHOME}/bin/pg_isready -h ${PGHOST} -p ${PGPORT} -U ${PGUSER} -d ${PGDATABASE} > /dev/null 2>&1; then
log_message “[✓] 数据库服务状态:正常”
else
log_message “[✗] 数据库服务状态:异常”
SCORE_DEDUCT=$((SCORE_DEDUCT + 20))
fi
# 检查连接响应时间
RESPONSE_TIME=$(${PGHOME}/bin/psql -h ${PGHOST} -p ${PGPORT} -U ${PGUSER} -d ${PGDATABASE} -c “SELECT 1;” 2>&1 | grep -o “real.*” | awk ‘{print $2}’)
log_message “[✓] 连接响应时间:${RESPONSE_TIME}”
}
check_connections() {
log_message “”
log_message “=== 连接健康检查 ===”
# 检查连接数使用率
CONN_USAGE=$(${PGHOME}/bin/psql -h ${PGHOST} -p ${PGPORT} -U ${PGUSER} -d ${PGDATABASE} -t -c ”
SELECT round((SELECT count(*) FROM pg_stat_activity)::numeric / setting::numeric * 100, 2)
FROM pg_settings WHERE name = ‘max_connections’;
” 2>/dev/null | tr -d ‘ ‘)
if [ $(echo “$CONN_USAGE > 90” | bc) -eq 1 ]; then
log_message “[✗] 连接数使用率:${CONN_USAGE}% (危险)”
SCORE_DEDUCT=$((SCORE_DEDUCT + 15))
elif [ $(echo “$CONN_USAGE > 70” | bc) -eq 1 ]; then
log_message “[!] 连接数使用率:${CONN_USAGE}% (警告)”
SCORE_DEDUCT=$((SCORE_DEDUCT + 5))
else
log_message “[✓] 连接数使用率:${CONN_USAGE}% (正常)”
fi
# 检查空闲事务
IDLE_TRANS=$(${PGHOME}/bin/psql -h ${PGHOST} -p ${PGPORT} -U ${PGUSER} -d ${PGDATABASE} -t -c ”
SELECT count(*) FROM pg_stat_activity
WHERE state = ‘idle in transaction’
AND now() – state_change > interval ‘5 minutes’;
” 2>/dev/null | tr -d ‘ ‘)
if [ “$IDLE_TRANS” -gt 5 ]; then
log_message “[!] 长时间空闲事务:${IDLE_TRANS}个”
SCORE_DEDUCT=$((SCORE_DEDUCT + 5))
else
log_message “[✓] 长时间空闲事务:${IDLE_TRANS}个”
fi
}
check_replication() {
log_message “”
log_message “=== 复制健康检查 ===”
IS_PRIMARY=$(${PGHOME}/bin/psql -h ${PGHOST} -p ${PGPORT} -U ${PGUSER} -d ${PGDATABASE} -t -c “SELECT NOT pg_is_in_recovery();” 2>/dev/null | tr -d ‘ ‘)
if [ “${IS_PRIMARY}” = “t” ]; then
log_message “[✓] 角色:主库”
REPL_COUNT=$(${PGHOME}/bin/psql -h ${PGHOST} -p ${PGPORT} -U ${PGUSER} -d ${PGDATABASE} -t -c “SELECT count(*) FROM pg_stat_replication;” 2>/dev/null | tr -d ‘ ‘)
log_message “[✓] 备库数量:${REPL_COUNT}”
REPL_LAG=$(${PGHOME}/bin/psql -h ${PGHOST} -p ${PGPORT} -U ${PGUSER} -d ${PGDATABASE} -t -c ”
SELECT COALESCE(max(pg_wal_lsn_diff(sent_lsn, replay_lsn)), 0) FROM pg_stat_replication;
” 2>/dev/null | tr -d ‘ ‘)
if [ “$REPL_LAG” -gt 10485760 ]; then
log_message “[!] 复制延迟:${REPL_LAG} bytes (警告)”
SCORE_DEDUCT=$((SCORE_DEDUCT + 10))
else
log_message “[✓] 复制延迟:${REPL_LAG} bytes (正常)”
fi
else
log_message “[✓] 角色:备库”
fi
}
check_storage() {
log_message “”
log_message “=== 存储健康检查 ===”
# 检查磁盘使用率
DISK_USAGE=$(df -h /postgresql | tail -1 | awk ‘{print $5}’ | tr -d ‘%’)
if [ “$DISK_USAGE” -gt 90 ]; then
log_message “[✗] 磁盘使用率:${DISK_USAGE}% (危险)”
SCORE_DEDUCT=$((SCORE_DEDUCT + 15))
elif [ “$DISK_USAGE” -gt 80 ]; then
log_message “[!] 磁盘使用率:${DISK_USAGE}% (警告)”
SCORE_DEDUCT=$((SCORE_DEDUCT + 5))
else
log_message “[✓] 磁盘使用率:${DISK_USAGE}% (正常)”
fi
}
check_performance() {
log_message “”
log_message “=== 性能健康检查 ===”
# 检查缓存命中率
CACHE_HIT=$(${PGHOME}/bin/psql -h ${PGHOST} -p ${PGPORT} -U ${PGUSER} -d ${PGDATABASE} -t -c ”
SELECT round(100.0 * blks_hit / (blks_hit + blks_read), 2)
FROM pg_stat_database WHERE datname = ‘${PGDATABASE}’;
” 2>/dev/null | tr -d ‘ ‘)
if [ $(echo “$CACHE_HIT < 90" | bc) -eq 1 ]; then log_message "[!] 缓存命中率:${CACHE_HIT}% (警告)" SCORE_DEDUCT=$((SCORE_DEDUCT + 10)) else log_message "[✓] 缓存命中率:${CACHE_HIT}% (正常)" fi # 检查锁等待 LOCK_WAIT=$(${PGHOME}/bin/psql -h ${PGHOST} -p ${PGPORT} -U ${PGUSER} -d ${PGDATABASE} -t -c " SELECT count(*) FROM pg_locks WHERE NOT granted; " 2>/dev/null | tr -d ‘ ‘)
if [ “$LOCK_WAIT” -gt 10 ]; then
log_message “[!] 锁等待数量:${LOCK_WAIT}个 (警告)”
SCORE_DEDUCT=$((SCORE_DEDUCT + 5))
else
log_message “[✓] 锁等待数量:${LOCK_WAIT}个 (正常)”
fi
}
# 主函数
main() {
log_message “==========================================”
log_message “PostgreSQL数据库健康检查报告”
log_message “检查时间: $(date)”
log_message “数据库主机: $(hostname)”
log_message “==========================================”
check_service
check_connections
check_replication
check_storage
check_performance
# 计算最终评分
FINAL_SCORE=$((TOTAL_SCORE – SCORE_DEDUCT))
log_message “”
log_message “==========================================”
log_message “健康评分: ${FINAL_SCORE}分”
if [ “$FINAL_SCORE” -ge 90 ]; then
log_message “健康状态: 优秀”
elif [ “$FINAL_SCORE” -ge 80 ]; then
log_message “健康状态: 良好”
elif [ “$FINAL_SCORE” -ge 70 ]; then
log_message “健康状态: 一般”
elif [ “$FINAL_SCORE” -ge 60 ]; then
log_message “健康状态: 较差”
else
log_message “健康状态: 危险”
fi
log_message “==========================================”
}
main
4.2 PostgreSQL问题诊断案例
# 场景:数据库响应缓慢问题诊断
# 1. 问题现象
# 应用反馈数据库响应缓慢,查询超时
# 2. 检查当前会话状态
$ psql -U fgedu -d fgedudb -c ”
SELECT
state,
count(*) as session_count,
count(*) FILTER (WHERE wait_event IS NOT NULL) as waiting_count
FROM pg_stat_activity
GROUP BY state;
”
state | session_count | waiting_count
————+—————+—————
active | 50 | 30
idle | 100 | 0
idle in tr | 20 | 0
(3 rows)
# 3. 检查等待事件
$ psql -U fgedu -d fgedudb -c ”
SELECT
wait_event_type,
wait_event,
count(*) as wait_count
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY wait_count DESC;
”
wait_event_type | wait_event | wait_count
—————–+——————–+————
Lock | relation | 20
IO | DataFileRead | 10
(3 rows)
# 4. 检查锁详情
$ psql -U fgedu -d fgedudb -c ”
SELECT
l.locktype,
l.mode,
l.granted,
a.query,
a.pid
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT l.granted
ORDER BY a.query_start;
”
locktype | mode | granted | query | pid
———-+——————–+———+————————————————–+——-
relation | RowExclusiveLock | f | UPDATE fgedu_orders SET status = ‘completed’ … | 12345
relation | RowExclusiveLock | f | UPDATE fgedu_orders SET status = ‘completed’ … | 12346
(3 rows)
# 5. 检查阻塞源
$ psql -U fgedu -d fgedudb -c ”
SELECT
blocking.pid as blocking_pid,
blocking.query as blocking_query,
blocked.pid as blocked_pid,
blocked.query as blocked_query
FROM pg_locks blocked
JOIN pg_stat_activity blocked ON blocked.pid = blocked.pid
JOIN pg_locks blocking ON blocked.locktype = blocking.locktype
AND blocked.relation = blocking.relation
AND blocked.pid != blocking.pid
JOIN pg_stat_activity blocking ON blocking.pid = blocking.pid
WHERE NOT blocked.granted;
”
blocking_pid | blocking_query | blocked_pid | blocked_query
————–+——————————————————+————-+——————————————————
12340 | VACUUM FULL fgedu_orders | 12345 | UPDATE fgedu_orders SET status = ‘completed’ WHERE
12340 | VACUUM FULL fgedu_orders | 12346 | UPDATE fgedu_orders SET status = ‘completed’ WHERE
(2 rows)
# 6. 问题原因
# 发现VACUUM FULL操作阻塞了业务查询
# 7. 解决方案
# 取消VACUUM FULL操作
$ psql -U fgedu -d fgedudb -c “SELECT pg_cancel_backend(12340);”
pg_cancel_backend
——————-
t
(1 row)
# 8. 验证问题解决
$ psql -U fgedu -d fgedudb -c ”
SELECT count(*) FROM pg_locks WHERE NOT granted;
”
count
——-
0
(1 row)
4.3 PostgreSQL优化建议案例
# 场景:根据健康检查结果给出优化建议
# 1. 连接数优化建议
# 问题:连接数使用率75%,接近警告阈值
# 建议:
$ psql -U fgedu -d fgedudb -c ”
— 增加最大连接数
ALTER SYSTEM SET max_connections = 400;
— 配置连接池
— 建议使用PgBouncer连接池
”
# 2. 缓存命中率优化建议
# 问题:缓存命中率95%,略低于理想值
# 建议:
$ psql -U fgedu -d fgedudb -c ”
— 增加共享缓冲区
ALTER SYSTEM SET shared_buffers = ’16GB’;
— 增加有效缓存大小
ALTER SYSTEM SET effective_cache_size = ’48GB’;
”
# 3. 表膨胀优化建议
# 问题:部分表膨胀率超过15%
# 建议:
$ psql -U fgedu -d fgedudb -c ”
— 执行VACUUM清理
VACUUM VERBOSE ANALYZE fgedu_orders;
— 调整autovacuum参数
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.1;
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.05;
”
# 4. 索引优化建议
# 问题:存在未使用索引
# 建议:
$ psql -U fgedu -d fgedudb -c ”
— 查看未使用索引
SELECT
schemaname || ‘.’ || relname as table_name,
indexrelname as index_name,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size,
idx_scan as index_scans
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE ‘%_pkey’
ORDER BY pg_relation_size(indexrelid) DESC;
— 删除未使用索引(确认后执行)
— DROP INDEX public.idx_unused_index;
”
# 5. 查询优化建议
# 问题:存在慢查询
# 建议:
$ psql -U fgedu -d fgedudb -c ”
— 启用pg_stat_statements扩展
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
— 查看慢查询
SELECT
query,
calls,
total_exec_time / 1000 / 60 as total_minutes,
mean_exec_time / 1000 as mean_seconds
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
— 建议对慢查询进行优化
— 1. 添加合适的索引
— 2. 优化查询语句
— 3. 更新统计信息
”
Part05-风哥经验总结与分享
5.1 PostgreSQL健康检查最佳实践
PostgreSQL健康检查最佳实践:
- 定期检查:建立定期健康检查制度
- 全面覆盖:检查所有关键指标
- 量化评分:建立健康评分体系
- 问题跟踪:记录问题并跟踪处理
- 趋势分析:对比历史数据发现趋势
- 持续改进:根据实际情况优化检查内容
5.2 PostgreSQL健康检查脚本
/postgresql/scripts/health_check/
├── pg_health_check.sh # 全面健康检查
├── pg_quick_check.sh # 快速健康检查
├── pg_security_check.sh # 安全健康检查
├── pg_performance_check.sh # 性能健康检查
├── pg_replication_check.sh # 复制健康检查
└── conf/
└── health_check.conf # 健康检查配置
5.3 PostgreSQL健康检查清单
# 服务可用性
– [ ] 数据库服务状态
– [ ] 监听端口状态
– [ ] 连接响应时间
– [ ] 数据库运行时间
# 连接健康
– [ ] 连接数使用率
– [ ] 活跃会话数量
– [ ] 空闲事务数量
– [ ] 长时间运行查询
# 复制健康
– [ ] 主从复制状态
– [ ] 复制延迟大小
– [ ] WAL状态
– [ ] 复制槽状态
# 存储健康
– [ ] 磁盘使用率
– [ ] 表空间使用率
– [ ] 表膨胀率
– [ ] 索引膨胀率
# 性能健康
– [ ] 缓存命中率
– [ ] 检查点频率
– [ ] 锁等待数量
– [ ] 死锁数量
# 安全健康
– [ ] 超级用户数量
– [ ] 密码过期用户
– [ ] SSL连接配置
– [ ] 审计日志状态
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
