PostgreSQL教程FG176-PG复制进度跟踪:监控与自定义统计
本文档风哥主要介绍PostgreSQL数据库复制进度跟踪的监控与自定义统计,包括复制进度视图、监控指标、告警策略、自定义统计函数等内容,风哥教程参考PostgreSQL官方文档Monitoring、Replication Progress Tracking等内容,适合数据库运维人员在生产环境中使用。
Part01-基础概念与理论知识
1.1 PostgreSQL数据库复制进度跟踪概述
复制进度跟踪是PostgreSQL提供的监控机制,用于实时监控主从复制状态、延迟情况和数据一致性。更多视频教程www.fgedu.net.cn。通过复制进度跟踪,管理员可以及时发现复制延迟、复制中断等问题,确保数据同步的可靠性和实时性。PostgreSQL提供了丰富的系统视图和函数用于复制监控。
- 实时监控主从复制状态
- 计算复制延迟时间和数据量
- 跟踪WAL发送和接收进度
- 监控复制槽状态
- 记录复制历史数据
1.2 PostgreSQL数据库复制进度视图
PostgreSQL提供了多个系统视图用于复制监控:pg_stat_replication显示WAL发送进程状态;pg_stat_wal_receiver显示WAL接收进程状态;pg_replication_slots显示复制槽信息;pg_stat_subscription显示订阅状态。学习交流加群风哥微信: itpux-com。
SELECT
pid,
usename,
fgapplication_name,
client_addr,
state,
sync_state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn
FROM pg_stat_replication;
— 输出结果
pid | usename | fgapplication_name | client_addr | state | sync_state | sent_lsn | write_lsn | flush_lsn | replay_lsn
——-+———+——————+—————+———–+————+———–+———–+———–+————
12345 | fgedu | fgedu_standby1 | 192.168.1.101 | streaming | sync | 0/1A2B3C8 | 0/1A2B3C8 | 0/1A2B3C8 | 0/1A2B3C8
12346 | fgedu | fgedu_standby2 | 192.168.1.102 | streaming | async | 0/1A2B3C8 | 0/1A2B3C0 | 0/1A2B3C0 | 0/1A2B3B8
(2 rows)
— 查看从库接收状态
SELECT
pid,
status,
received_lsn,
latest_end_lsn,
sender_host,
sender_port
FROM pg_stat_wal_receiver;
— 输出结果
pid | status | received_lsn | latest_end_lsn | sender_host | sender_port
——-+———-+————–+—————-+————-+————-
23456 | streaming| 0/1A2B3C8 | 0/1A2B3C8 | 192.168.1.100| 5432
(1 row)
— 查看复制槽状态
SELECT
slot_name,
slot_type,
active,
restart_lsn,
confirmed_flush_lsn
FROM pg_replication_slots;
— 输出结果
slot_name | slot_type | active | restart_lsn | confirmed_flush_lsn
—————–+———–+——–+————-+——————–
fgedu_standby1 | physical | t | 0/1A2B3C8 |
fgedu_logical1 | logical | t | 0/1A2B3C0 | 0/1A2B3C0
(2 rows)
1.3 PostgreSQL数据库复制进度指标
关键监控指标包括:复制延迟(时间延迟和数据延迟)、WAL发送速度、WAL接收速度、复制槽堆积量、连接状态等。这些指标可以帮助管理员全面了解复制健康状况。
SELECT
client_addr,
fgapplication_name,
state,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
pg_wal_lsn_diff(sent_lsn, flush_lsn) AS flush_lag_bytes,
pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
CASE
WHEN pg_wal_lsn_diff(sent_lsn, replay_lsn) > 1024*1024*100 THEN ‘CRITICAL’
WHEN pg_wal_lsn_diff(sent_lsn, replay_lsn) > 1024*1024*10 THEN ‘WARNING’
ELSE ‘OK’
END AS lag_status
FROM pg_stat_replication;
— 输出结果
client_addr | fgapplication_name | state | lag_bytes | flush_lag_bytes | write_lag_bytes | lag_status
—————+——————+———–+———–+—————–+—————–+————
192.168.1.101 | fgedu_standby1 | streaming | 0 | 0 | 0 | OK
192.168.1.102 | fgedu_standby2 | streaming | 16777216 | 16777216 | 16777216 | WARNING
(2 rows)
— 计算时间延迟
SELECT
client_addr,
fgapplication_name,
state,
reply_time,
NOW() – reply_time AS time_lag,
CASE
WHEN NOW() – reply_time > INTERVAL ‘5 minutes’ THEN ‘CRITICAL’
WHEN NOW() – reply_time > INTERVAL ‘1 minute’ THEN ‘WARNING’
ELSE ‘OK’
END AS time_lag_status
FROM pg_stat_replication;
— 输出结果
client_addr | fgapplication_name | state | reply_time | time_lag | time_lag_status
—————+——————+———–+————————–+———–+—————–
192.168.1.101 | fgedu_standby1 | streaming | 2026-04-07 16:30:00.123 | 00:00:01 | OK
192.168.1.102 | fgedu_standby2 | streaming | 2026-04-07 16:25:00.123 | 00:05:01 | CRITICAL
(2 rows)
Part02-生产环境规划与建议
2.1 PostgreSQL数据库复制监控架构设计
复制监控架构设计要点:建立集中监控平台收集复制状态;设计合理的监控频率避免性能影响;建立历史数据存储便于趋势分析;设计分级告警机制及时响应异常。
2.2 PostgreSQL数据库复制告警策略
告警策略设计:复制延迟超过阈值告警;复制连接中断告警;复制槽堆积告警;WAL文件堆积告警。建议设置多级告警,区分紧急和普通告警。
- 复制延迟:WARNING > 10MB,CRITICAL > 100MB
- 时间延迟:WARNING > 1分钟,CRITICAL > 5分钟
- 复制槽堆积:WARNING > 1GB,CRITICAL > 10GB
- WAL文件数:WARNING > 100个,CRITICAL > 500个
2.3 PostgreSQL数据库复制报表设计
报表设计包括:实时复制状态报表、历史趋势报表、异常事件报表、容量规划报表。报表应支持按时间范围、数据库、应用等维度筛选。
Part03-生产环境项目实施方案
3.1 PostgreSQL数据库复制进度监控实现
3.1.1 创建复制监控表
CREATE TABLE fgedu_replication_history (
id SERIAL PRIMARY KEY,
check_time TIMESTAMP DEFAULT NOW(),
fgapplication_name VARCHAR(100),
client_addr VARCHAR(50),
state VARCHAR(50),
sync_state VARCHAR(10),
sent_lsn TEXT,
write_lsn TEXT,
flush_lsn TEXT,
replay_lsn TEXT,
lag_bytes BIGINT,
time_lag INTERVAL
);
CREATE INDEX idx_replication_history_time ON fgedu_replication_history(check_time);
CREATE INDEX idx_replication_history_fgapp ON fgedu_replication_history(fgapplication_name);
— 创建复制槽监控历史表
CREATE TABLE fgedu_slot_history (
id SERIAL PRIMARY KEY,
check_time TIMESTAMP DEFAULT NOW(),
slot_name VARCHAR(100),
slot_type VARCHAR(20),
active BOOLEAN,
restart_lsn TEXT,
confirmed_flush_lsn TEXT,
wal_backlog BIGINT
);
CREATE INDEX idx_slot_history_time ON fgedu_slot_history(check_time);
— 创建告警记录表
CREATE TABLE fgedu_replication_alerts (
id SERIAL PRIMARY KEY,
alert_time TIMESTAMP DEFAULT NOW(),
alert_type VARCHAR(50),
alert_level VARCHAR(20),
fgapplication_name VARCHAR(100),
message TEXT,
resolved BOOLEAN DEFAULT FALSE,
resolved_time TIMESTAMP
);
CREATE INDEX idx_alerts_time ON fgedu_replication_alerts(alert_time);
CREATE INDEX idx_alerts_unresolved ON fgedu_replication_alerts(resolved) WHERE resolved = FALSE;
3.1.2 创建监控采集函数
CREATE OR REPLACE FUNCTION fgedu_collect_replication_status()
RETURNS void
AS $$
BEGIN
INSERT INTO fgedu_replication_history(
fgapplication_name,
client_addr,
state,
sync_state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
lag_bytes,
time_lag
)
SELECT
fgapplication_name,
client_addr::TEXT,
state,
sync_state,
sent_lsn::TEXT,
write_lsn::TEXT,
flush_lsn::TEXT,
replay_lsn::TEXT,
pg_wal_lsn_diff(sent_lsn, replay_lsn),
NOW() – reply_time
FROM pg_stat_replication;
INSERT INTO fgedu_slot_history(
slot_name,
slot_type,
active,
restart_lsn,
confirmed_flush_lsn,
wal_backlog
)
SELECT
slot_name,
slot_type,
active,
restart_lsn::TEXT,
confirmed_flush_lsn::TEXT,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
FROM pg_replication_slots;
END;
$$ LANGUAGE plpgsql;
— 测试采集函数
SELECT fgedu_collect_replication_status();
— 输出结果
fgedu_collect_replication_status
———————————-
(1 row)
— 查看采集结果
SELECT * FROM fgedu_replication_history ORDER BY check_time DESC LIMIT 5;
— 输出结果
id | check_time | fgapplication_name | client_addr | state | sync_state | sent_lsn | write_lsn | flush_lsn | replay_lsn | lag_bytes | time_lag
—-+————————–+——————+—————+———–+————+———–+———–+———–+————+———–+———-
1 | 2026-04-07 16:30:00.123 | fgedu_standby1 | 192.168.1.101 | streaming | sync | 0/1A2B3C8 | 0/1A2B3C8 | 0/1A2B3C8 | 0/1A2B3C8 | 0 | 00:00:01
2 | 2026-04-07 16:30:00.234 | fgedu_standby2 | 192.168.1.102 | streaming | async | 0/1A2B3C8 | 0/1A2B3C0 | 0/1A2B3C0 | 0/1A2B3B8 | 16777216 | 00:00:02
(2 rows)
3.2 PostgreSQL数据库自定义统计函数
3.2.1 复制延迟统计函数
CREATE OR REPLACE FUNCTION fgedu_replication_lag_stats(
p_start_time TIMESTAMP DEFAULT NOW() – INTERVAL ‘1 hour’,
p_end_time TIMESTAMP DEFAULT NOW()
)
RETURNS TABLE (
fgapplication_name VARCHAR(100),
avg_lag_bytes NUMERIC,
max_lag_bytes BIGINT,
min_lag_bytes BIGINT,
avg_time_lag INTERVAL,
max_time_lag INTERVAL,
sample_count BIGINT
)
AS $$
BEGIN
RETURN QUERY
SELECT
h.fgapplication_name,
AVG(h.lag_bytes)::NUMERIC AS avg_lag_bytes,
MAX(h.lag_bytes) AS max_lag_bytes,
MIN(h.lag_bytes) AS min_lag_bytes,
AVG(EXTRACT(EPOCH FROM h.time_lag)) * INTERVAL ‘1 second’ AS avg_time_lag,
MAX(h.time_lag) AS max_time_lag,
COUNT(*) AS sample_count
FROM fgedu_replication_history h
WHERE h.check_time BETWEEN p_start_time AND p_end_time
GROUP BY h.fgapplication_name
ORDER BY avg_lag_bytes DESC;
END;
$$ LANGUAGE plpgsql;
— 测试统计函数
SELECT * FROM fgedu_replication_lag_stats();
— 输出结果
fgapplication_name | avg_lag_bytes | max_lag_bytes | min_lag_bytes | avg_time_lag | max_time_lag | sample_count
——————+—————+—————+—————+————–+————–+————–
fgedu_standby2 | 15728640.00 | 16777216 | 0 | 00:00:05 | 00:00:10 | 60
fgedu_standby1 | 0.00 | 0 | 0 | 00:00:01 | 00:00:02 | 60
(2 rows)
— 创建复制健康检查函数
CREATE OR REPLACE FUNCTION fgedu_replication_health_check()
RETURNS TABLE (
fgapplication_name VARCHAR(100),
client_addr VARCHAR(50),
state VARCHAR(50),
sync_state VARCHAR(10),
lag_bytes BIGINT,
time_lag INTERVAL,
health_status VARCHAR(20),
health_message TEXT
)
AS $$
BEGIN
RETURN QUERY
SELECT
r.fgapplication_name::VARCHAR(100),
r.client_addr::VARCHAR(50),
r.state::VARCHAR(50),
r.sync_state::VARCHAR(10),
pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn) AS lag_bytes,
NOW() – r.reply_time AS time_lag,
CASE
WHEN r.state != ‘streaming’ THEN ‘CRITICAL’
WHEN pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn) > 1024*1024*100 THEN ‘CRITICAL’
WHEN pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn) > 1024*1024*10 THEN ‘WARNING’
WHEN NOW() – r.reply_time > INTERVAL ‘5 minutes’ THEN ‘CRITICAL’
WHEN NOW() – r.reply_time > INTERVAL ‘1 minute’ THEN ‘WARNING’
ELSE ‘OK’
END::VARCHAR(20) AS health_status,
CASE
WHEN r.state != ‘streaming’ THEN ‘复制状态异常: ‘ || r.state
WHEN pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn) > 1024*1024*100 THEN ‘复制延迟严重: ‘ || pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn))
WHEN pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn) > 1024*1024*10 THEN ‘复制延迟警告: ‘ || pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn))
WHEN NOW() – r.reply_time > INTERVAL ‘5 minutes’ THEN ‘时间延迟严重: ‘ || (NOW() – r.reply_time)
WHEN NOW() – r.reply_time > INTERVAL ‘1 minute’ THEN ‘时间延迟警告: ‘ || (NOW() – r.reply_time)
ELSE ‘复制正常’
END::TEXT AS health_message
FROM pg_stat_replication r;
END;
$$ LANGUAGE plpgsql;
— 测试健康检查
SELECT * FROM fgedu_replication_health_check();
— 输出结果
fgapplication_name | client_addr | state | sync_state | lag_bytes | time_lag | health_status | health_message
——————+—————+———–+————+———–+———-+—————+—————————-
fgedu_standby1 | 192.168.1.101 | streaming | sync | 0 | 00:00:01 | OK | 复制正常
fgedu_standby2 | 192.168.1.102 | streaming | async | 16777216 | 00:00:02 | WARNING | 复制延迟警告: 16 MB
(2 rows)
3.3 PostgreSQL数据库复制历史记录
3.3.1 创建历史数据清理函数
CREATE OR REPLACE FUNCTION fgedu_cleanup_replication_history(
p_retention_days INTEGER DEFAULT 30
)
RETURNS void
AS $$
DECLARE
v_deleted INTEGER;
BEGIN
DELETE FROM fgedu_replication_history
WHERE check_time < NOW() - (p_retention_days || ' days')::INTERVAL; GET DIAGNOSTICS v_deleted = ROW_COUNT; RAISE NOTICE 'Deleted % rows from fgedu_replication_history', v_deleted; DELETE FROM fgedu_slot_history WHERE check_time < NOW() - (p_retention_days || ' days')::INTERVAL; GET DIAGNOSTICS v_deleted = ROW_COUNT; RAISE NOTICE 'Deleted % rows from fgedu_slot_history', v_deleted; END; $$ LANGUAGE plpgsql; -- 测试清理函数 SELECT fgedu_cleanup_replication_history(7); -- 输出结果 NOTICE: Deleted 0 rows from fgedu_replication_history NOTICE: Deleted 0 rows from fgedu_slot_history fgedu_cleanup_replication_history ----------------------------------- (1 row)
Part04-生产案例与实战讲解
4.1 PostgreSQL数据库复制监控仪表盘实战
本案例演示如何创建一个完整的复制监控仪表盘。学习交流加群风哥QQ113257174。
CREATE OR REPLACE VIEW fgedu_replication_dashboard AS
SELECT
‘复制状态概览’ AS section,
jsonb_build_object(
‘total_standbys’, (SELECT COUNT(*) FROM pg_stat_replication),
‘sync_standbys’, (SELECT COUNT(*) FROM pg_stat_replication WHERE sync_state = ‘sync’),
‘async_standbys’, (SELECT COUNT(*) FROM pg_stat_replication WHERE sync_state = ‘async’),
‘streaming’, (SELECT COUNT(*) FROM pg_stat_replication WHERE state = ‘streaming’),
‘total_slots’, (SELECT COUNT(*) FROM pg_replication_slots),
‘active_slots’, (SELECT COUNT(*) FROM pg_replication_slots WHERE active),
‘inactive_slots’, (SELECT COUNT(*) FROM pg_replication_slots WHERE NOT active)
) AS data
UNION ALL
SELECT
‘复制延迟统计’ AS section,
jsonb_build_object(
‘max_lag_bytes’, COALESCE((SELECT MAX(pg_wal_lsn_diff(sent_lsn, replay_lsn)) FROM pg_stat_replication), 0),
‘avg_lag_bytes’, COALESCE((SELECT AVG(pg_wal_lsn_diff(sent_lsn, replay_lsn))::BIGINT FROM pg_stat_replication), 0),
‘max_time_lag’, COALESCE((SELECT MAX(NOW() – reply_time) FROM pg_stat_replication), ’00:00:00′::INTERVAL),
‘critical_count’, (SELECT COUNT(*) FROM pg_stat_replication WHERE pg_wal_lsn_diff(sent_lsn, replay_lsn) > 1024*1024*100),
‘warning_count’, (SELECT COUNT(*) FROM pg_stat_replication WHERE pg_wal_lsn_diff(sent_lsn, replay_lsn) BETWEEN 1024*1024*10 AND 1024*1024*100)
) AS data
UNION ALL
SELECT
‘WAL状态’ AS section,
jsonb_build_object(
‘current_lsn’, pg_current_wal_lsn()::TEXT,
‘wal_files’, (SELECT COUNT(*) FROM pg_ls_waldir()),
‘wal_size’, (SELECT pg_size_pretty(SUM(size)) FROM pg_ls_waldir()),
‘wal_rate’, (SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), ‘0/0’)::BIGINT / EXTRACT(EPOCH FROM (NOW() – pg_postmaster_start_time()))::BIGINT) || ‘/s’)
) AS data;
— 查看仪表盘
SELECT * FROM fgedu_replication_dashboard;
— 输出结果
section | data
——————+————————————————————————————————————————-
复制状态概览 | {“total_standbys”: 2, “sync_standbys”: 1, “async_standbys”: 1, “streaming”: 2, “total_slots”: 2, “active_slots”: 2, “inactive_slots”: 0}
复制延迟统计 | {“max_lag_bytes”: 16777216, “avg_lag_bytes”: 8388608, “max_time_lag”: “00:00:02”, “critical_count”: 0, “warning_count”: 1}
WAL状态 | {“current_lsn”: “0/1A2B3C8”, “wal_files”: 15, “wal_size”: “240 MB”, “wal_rate”: “1.2 MB/s”}
(3 rows)
— 创建详细监控视图
CREATE OR REPLACE VIEW fgedu_replication_detail AS
SELECT
r.pid,
r.usename,
r.fgapplication_name,
r.client_addr::TEXT AS client_addr,
r.state,
r.sync_state,
r.sent_lsn::TEXT AS sent_lsn,
r.write_lsn::TEXT AS write_lsn,
r.flush_lsn::TEXT AS flush_lsn,
r.replay_lsn::TEXT AS replay_lsn,
pg_size_pretty(pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn)) AS lag_size,
pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn) AS lag_bytes,
NOW() – r.reply_time AS time_lag,
CASE
WHEN r.state != ‘streaming’ THEN ‘CRITICAL’
WHEN pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn) > 1024*1024*100 THEN ‘CRITICAL’
WHEN pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn) > 1024*1024*10 THEN ‘WARNING’
ELSE ‘OK’
END AS status
FROM pg_stat_replication r;
— 查看详细监控
SELECT * FROM fgedu_replication_detail;
— 输出结果
pid | usename | fgapplication_name | client_addr | state | sync_state | sent_lsn | write_lsn | flush_lsn | replay_lsn | lag_size | lag_bytes | time_lag | status
——-+———+——————+—————+———–+————+———–+———–+———–+————+———-+———–+———-+——–
12345 | fgedu | fgedu_standby1 | 192.168.1.101 | streaming | sync | 0/1A2B3C8 | 0/1A2B3C8 | 0/1A2B3C8 | 0/1A2B3C8 | 0 bytes | 0 | 00:00:01 | OK
12346 | fgedu | fgedu_standby2 | 192.168.1.102 | streaming | async | 0/1A2B3C8 | 0/1A2B3C0 | 0/1A2B3C0 | 0/1A2B3B8 | 16 MB | 16777216 | 00:00:02 | WARNING
(2 rows)
4.2 PostgreSQL数据库复制槽监控实战
本案例演示如何监控复制槽状态,避免WAL堆积。更多学习教程公众号风哥教程itpux_com。
CREATE OR REPLACE VIEW fgedu_slot_monitor AS
SELECT
s.slot_name,
s.slot_type,
s.active,
s.active_pid,
s.database,
s.temporary,
s.restart_lsn::TEXT AS restart_lsn,
s.confirmed_flush_lsn::TEXT AS confirmed_flush_lsn,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn)) AS wal_backlog_size,
pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) AS wal_backlog_bytes,
CASE
WHEN NOT s.active THEN ‘INACTIVE’
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) > 1024*1024*1024*10 THEN ‘CRITICAL’
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) > 1024*1024*1024 THEN ‘WARNING’
ELSE ‘OK’
END AS status,
CASE
WHEN NOT s.active THEN ‘复制槽未激活,可能导致WAL堆积’
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) > 1024*1024*1024*10 THEN ‘WAL堆积严重,请检查消费进程’
WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) > 1024*1024*1024 THEN ‘WAL堆积警告,请关注消费进度’
ELSE ‘正常’
END AS message
FROM pg_replication_slots s;
— 查看复制槽监控
SELECT * FROM fgedu_slot_monitor;
— 输出结果
slot_name | slot_type | active | active_pid | database | temporary | restart_lsn | confirmed_flush_lsn | wal_backlog_size | wal_backlog_bytes | status | message
—————–+———–+——–+————+———-+———–+————-+———————+——————+——————-+———-+——————————
fgedu_standby1 | physical | t | 12345 | | f | 0/1A2B3C8 | NULL | 0 bytes | 0 | OK | 正常
fgedu_logical1 | logical | t | 12346 | fgedudb | f | 0/1A2B3C0 | 0/1A2B3C0 | 8 bytes | 8 | OK | 正常
fgedu_inactive | logical | f | | fgedudb | f | 0/1A00000 | 0/1A00000 | 50 GB | 53687091200 | CRITICAL | WAL堆积严重,请检查消费进程
(3 rows)
— 创建复制槽告警函数
CREATE OR REPLACE FUNCTION fgedu_slot_alert_check()
RETURNS void
AS $$
DECLARE
slot_record RECORD;
BEGIN
FOR slot_record IN SELECT * FROM fgedu_slot_monitor WHERE status != ‘OK’
LOOP
INSERT INTO fgedu_replication_alerts(
alert_type,
alert_level,
fgapplication_name,
message
)
VALUES(
‘SLOT_BACKLOG’,
slot_record.status,
slot_record.slot_name,
slot_record.message || ‘ – WAL堆积: ‘ || slot_record.wal_backlog_size
);
END LOOP;
END;
$$ LANGUAGE plpgsql;
— 测试告警检查
SELECT fgedu_slot_alert_check();
— 查看告警记录
SELECT * FROM fgedu_replication_alerts WHERE resolved = FALSE ORDER BY alert_time DESC;
— 输出结果
id | alert_time | alert_type | alert_level | fgapplication_name | message | resolved | resolved_time
—-+————————–+—————+————-+——————+——————————————————-+———-+—————
1 | 2026-04-07 16:35:00.123 | SLOT_BACKLOG | CRITICAL | fgedu_inactive | WAL堆积严重,请检查消费进程 – WAL堆积: 50 GB | f |
(1 row)
4.3 PostgreSQL数据库复制延迟分析实战
本案例演示如何分析复制延迟原因和趋势。from PostgreSQL视频:www.itpux.com。
CREATE OR REPLACE VIEW fgedu_lag_trend AS
SELECT
fgapplication_name,
check_time,
lag_bytes,
time_lag,
LAG(lag_bytes) OVER (PARTITION BY fgapplication_name ORDER BY check_time) AS prev_lag_bytes,
lag_bytes – LAG(lag_bytes) OVER (PARTITION BY fgapplication_name ORDER BY check_time) AS lag_change,
CASE
WHEN lag_bytes – LAG(lag_bytes) OVER (PARTITION BY fgapplication_name ORDER BY check_time) > 0 THEN ‘INCREASING’
WHEN lag_bytes – LAG(lag_bytes) OVER (PARTITION BY fgapplication_name ORDER BY check_time) < 0 THEN 'DECREASING' ELSE 'STABLE' END AS trend FROM fgedu_replication_history WHERE check_time > NOW() – INTERVAL ‘1 hour’
ORDER BY fgapplication_name, check_time DESC;
— 查看延迟趋势
SELECT * FROM fgedu_lag_trend LIMIT 10;
— 输出结果
fgapplication_name | check_time | lag_bytes | time_lag | prev_lag_bytes | lag_change | trend
——————+————————–+———–+———-+—————-+————+———–
fgedu_standby1 | 2026-04-07 16:40:00.123 | 0 | 00:00:01 | 0 | 0 | STABLE
fgedu_standby1 | 2026-04-07 16:39:00.123 | 0 | 00:00:01 | 0 | 0 | STABLE
fgedu_standby2 | 2026-04-07 16:40:00.123 | 16777216 | 00:00:02 | 15728640 | 1048576 | INCREASING
fgedu_standby2 | 2026-04-07 16:39:00.123 | 15728640 | 00:00:02 | 14680064 | 1048576 | INCREASING
(4 rows)
— 创建延迟分析函数
CREATE OR REPLACE FUNCTION fgedu_analyze_replication_lag(
p_fgapplication_name VARCHAR DEFAULT NULL
)
RETURNS TABLE (
fgapplication_name VARCHAR(100),
analysis_period INTERVAL,
total_samples BIGINT,
avg_lag_bytes NUMERIC,
max_lag_bytes BIGINT,
current_lag_bytes BIGINT,
lag_trend VARCHAR(20),
estimated_catchup INTERVAL,
recommendations TEXT
)
AS $$
BEGIN
RETURN QUERY
WITH stats AS (
SELECT
h.fgapplication_name,
MAX(h.check_time) – MIN(h.check_time) AS period,
COUNT(*) AS samples,
AVG(h.lag_bytes)::NUMERIC AS avg_lag,
MAX(h.lag_bytes) AS max_lag,
(SELECT lag_bytes FROM fgedu_replication_history h2
WHERE h2.fgapplication_name = h.fgapplication_name
ORDER BY h2.check_time DESC LIMIT 1) AS current_lag,
(SELECT lag_bytes FROM fgedu_replication_history h2
WHERE h2.fgapplication_name = h.fgapplication_name
ORDER BY h2.check_time DESC LIMIT 1) –
(SELECT lag_bytes FROM fgedu_replication_history h2
WHERE h2.fgapplication_name = h.fgapplication_name
ORDER BY h2.check_time DESC LIMIT 1 OFFSET 10) AS lag_change
FROM fgedu_replication_history h
WHERE h.check_time > NOW() – INTERVAL ‘1 hour’
AND (p_fgapplication_name IS NULL OR h.fgapplication_name = p_fgapplication_name)
GROUP BY h.fgapplication_name
)
SELECT
s.fgapplication_name::VARCHAR(100),
s.period,
s.samples,
s.avg_lag,
s.max_lag,
s.current_lag,
CASE
WHEN s.lag_change > 0 THEN ‘INCREASING’
WHEN s.lag_change < 0 THEN 'DECREASING'
ELSE 'STABLE'
END::VARCHAR(20),
CASE
WHEN s.lag_change < 0 THEN (s.current_lag::NUMERIC / ABS(s.lag_change::NUMERIC)) * INTERVAL '1 minute'
ELSE NULL
END AS estimated_catchup,
CASE
WHEN s.lag_change > 0 THEN ‘延迟正在增加,请检查从库性能’
WHEN s.lag_change = 0 AND s.current_lag > 1024*1024*10 THEN ‘延迟稳定但较高,请检查网络和从库负载’
WHEN s.lag_change < 0 THEN '延迟正在减少,预计' || (s.current_lag::NUMERIC / ABS(s.lag_change::NUMERIC))::INTEGER || '分钟内追上'
ELSE '复制状态正常'
END::TEXT
FROM stats s;
END;
$$ LANGUAGE plpgsql;
-- 执行延迟分析
SELECT * FROM fgedu_analyze_replication_lag();
-- 输出结果
fgapplication_name | analysis_period | total_samples | avg_lag_bytes | max_lag_bytes | current_lag_bytes | lag_trend | estimated_catchup | recommendations
------------------+-----------------+---------------+---------------+---------------+-------------------+------------+-------------------+------------------------------------------
fgedu_standby1 | 01:00:00 | 60 | 0.00 | 0 | 0 | STABLE | NULL | 复制状态正常
fgedu_standby2 | 01:00:00 | 60 | 15728640.00 | 16777216 | 16777216 | INCREASING | NULL | 延迟正在增加,请检查从库性能
(2 rows)
Part05-风哥经验总结与分享
5.1 PostgreSQL数据库复制进度跟踪最佳实践
复制进度跟踪最佳实践:建立完善的监控体系,定期采集复制状态;设置合理的告警阈值,及时发现异常;保留历史数据,便于趋势分析;定期检查复制槽状态,避免WAL堆积。
- 检查复制连接状态
- 检查复制延迟(字节和时间)
- 检查复制槽状态
- 检查WAL文件数量
- 检查磁盘空间使用
- 检查网络带宽使用
- 检查从库性能指标
5.2 PostgreSQL数据库复制问题排查技巧
问题排查技巧:检查pg_stat_replication视图了解复制状态;检查pg_replication_slots视图了解槽状态;检查WAL文件目录了解堆积情况;检查网络连接和带宽;检查从库性能和负载。
5.3 PostgreSQL数据库复制常见问题
常见问题:复制延迟过大、复制连接中断、复制槽堆积、WAL文件堆积、从库性能瓶颈。
— 问题1:复制延迟过大
SELECT
fgapplication_name,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag_size,
state,
sync_state
FROM pg_stat_replication
WHERE pg_wal_lsn_diff(sent_lsn, replay_lsn) > 1024*1024*10;
— 输出结果
fgapplication_name | lag_size | state | sync_state
——————+———-+———–+————
fgedu_standby2 | 16 MB | streaming | async
(1 row)
— 问题2:复制连接中断
SELECT
fgapplication_name,
state,
NOW() – reply_time AS time_since_last_reply
FROM pg_stat_replication
WHERE state != ‘streaming’ OR NOW() – reply_time > INTERVAL ‘1 minute’;
— 输出结果
fgapplication_name | state | time_since_last_reply
——————+———–+———————–
fgedu_standby3 | catchup | 00:05:00
(1 row)
— 问题3:复制槽堆积
SELECT
slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS backlog
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1024*1024*1024;
— 输出结果
slot_name | active | backlog
—————–+——–+———
fgedu_inactive | f | 50 GB
(1 row)
— 解决方案:删除不需要的复制槽
SELECT pg_drop_replication_slot(‘fgedu_inactive’);
— 输出结果
pg_drop_replication_slot
————————–
(1 row)
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
