PostgreSQL教程FG111-PG高可用Patroni实战教程
本文档详细介绍PostgreSQL高可用解决方案Patroni的部署和配置方法,包括etcd集群搭建、Patroni配置、故障切换等操作,风哥教程参考PostgreSQL官方文档High Availability内容,适合DBA在生产环境中实施高可用架构。
Part01-基础概念与理论知识
1.1 PostgreSQL Patroni概念
Patroni是一个PostgreSQL高可用解决方案,基于Python开发。它使用分布式配置存储(DCS)来管理集群状态,自动处理主库故障检测和切换。Patroni支持多种DCS后端,包括etcd、ZooKeeper、Consul等,是目前最流行的PostgreSQL高可用方案之一。更多视频教程www.fgedu.net.cn
- 自动故障检测和切换
- 自动重建故障节点
- 支持同步和异步复制
- 支持级联复制
- 提供REST API管理接口
- 支持多种DCS后端
1.2 Patroni架构原理
# 1. 核心组件
# Patroni:运行在每个PostgreSQL节点上的代理
# DCS:分布式配置存储(etcd/ZooKeeper/Consul)
# PostgreSQL:数据库实例
# HAProxy:可选的负载均衡器
# 2. 工作原理
# – 每个节点运行Patroni进程
# – Patroni通过DCS协调集群状态
# – 主库持有leader锁
# – 主库故障时,其他节点竞争leader锁
# – 获得锁的节点提升为主库
# 3. 架构图
┌─────────────┐
│ DCS │
│ etcd集群 │
└──────┬──────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Patroni │ │ Patroni │ │ Patroni │
│ PostgreSQL │ │ PostgreSQL │ │ PostgreSQL │
│ node1 │ │ node2 │ │ node3 │
│ Leader │ │ Replica │ │ Replica │
└─────────────┘ └─────────────┘ └─────────────┘
# 4. 故障切换流程
# 步骤1:主库Patroni停止更新leader锁
# 步骤2:其他节点检测到leader锁过期
# 步骤3:符合条件的节点竞争leader锁
# 步骤4:获得锁的节点提升为主库
# 步骤5:其他节点跟随新主库
# 5. 关键参数
# ttl:leader锁的生存时间
# loop_wait:检查间隔
# retry_timeout:重试超时
# maximum_lag_on_failover:切换允许的最大延迟
1.3 Patroni组件介绍
Patroni组件介绍:
- patroni:主程序,管理PostgreSQL实例
- patronictl:命令行管理工具
- patroni_raft_controller:Raft模式控制器
- etcd:推荐的DCS后端
- HAProxy:可选的负载均衡器
- keepalived:可选的VIP管理
Part02-生产环境规划与建议
2.1 Patroni部署规划
# 1. 节点规划
# node1:192.168.1.100 – PostgreSQL + Patroni + etcd
# node2:192.168.1.101 – PostgreSQL + Patroni + etcd
# node3:192.168.1.102 – PostgreSQL + Patroni + etcd
# 2. 端口规划
# PostgreSQL:5432
# Patroni REST API:8008
# etcd客户端:2379
# etcd对等:2380
# 3. 目录规划
# PostgreSQL数据:/postgresql/fgdata
# Patroni配置:/etc/patroni
# Patroni日志:/var/log/patroni
# etcd数据:/var/lib/etcd
# 4. 用户规划
# PostgreSQL超级用户:fgedu
# PostgreSQL复制用户:fgedu_repl
# etcd用户:etcd
# 5. 集群名称
# scope:fgedudb_cluster
2.2 etcd集群规划
etcd集群规划:
- 节点数量:奇数个,推荐3或5个
- 网络要求:低延迟网络,节点间通信
- 存储要求:SSD存储,确保性能
- 备份策略:定期备份etcd数据
2.3 高可用架构规划
# 1. 架构模式
# 3节点Patroni集群
# 3节点etcd集群
# HAProxy负载均衡
# Keepalived VIP管理
# 2. 故障切换策略
# 自动切换:主库故障时自动切换
# 手动切换:计划维护时手动切换
# 切换时间:30秒内完成
# 3. 数据一致性策略
# 同步复制:核心业务
# 异步复制:一般业务
# 最大延迟:10MB
# 4. 监控告警策略
# 监控指标:主库状态、复制延迟、节点健康
# 告警方式:钉钉、邮件、短信
# 告警级别:紧急、重要、一般
# 5. 备份恢复策略
# 全量备份:每天
# WAL归档:持续
# 恢复演练:每月
Part03-生产环境项目实施方案
3.1 etcd集群安装配置
3.1.1 安装etcd
# 1. 下载etcd
$ wget https://github.com/etcd-io/etcd/releases/download/v3.5.12/etcd-v3.5.12-linux-amd64.tar.gz
$ tar -xzf etcd-v3.5.12-linux-amd64.tar.gz
$ mv etcd-v3.5.12-linux-amd64/etcd* /usr/local/bin/
# 2. 创建etcd用户
$ useradd -r -s /sbin/nologin etcd
# 3. 创建数据目录
$ mkdir -p /var/lib/etcd
$ chown etcd:etcd /var/lib/etcd
# 4. 验证安装
$ etcd –version
etcd Version: 3.5.12
3.1.2 配置etcd集群
# 在node1(192.168.1.100)上配置
$ vi /etc/etcd/etcd.conf
ETCD_NAME=etcd1
ETCD_DATA_DIR=/var/lib/etcd
ETCD_LISTEN_CLIENT_URLS=http://192.168.1.100:2379,http://127.0.0.1:2379
ETCD_ADVERTISE_CLIENT_URLS=http://192.168.1.100:2379
ETCD_LISTEN_PEER_URLS=http://192.168.1.100:2380
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.1.100:2380
ETCD_INITIAL_CLUSTER=etcd1=http://192.168.1.100:2380,etcd2=http://192.168.1.101:2380,etcd3=http://192.168.1.102:2380
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster
ETCD_INITIAL_CLUSTER_STATE=new
# 在node2(192.168.1.101)上配置
$ vi /etc/etcd/etcd.conf
ETCD_NAME=etcd2
ETCD_DATA_DIR=/var/lib/etcd
ETCD_LISTEN_CLIENT_URLS=http://192.168.1.101:2379,http://127.0.0.1:2379
ETCD_ADVERTISE_CLIENT_URLS=http://192.168.1.101:2379
ETCD_LISTEN_PEER_URLS=http://192.168.1.101:2380
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.1.101:2380
ETCD_INITIAL_CLUSTER=etcd1=http://192.168.1.100:2380,etcd2=http://192.168.1.101:2380,etcd3=http://192.168.1.102:2380
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster
ETCD_INITIAL_CLUSTER_STATE=new
# 在node3(192.168.1.102)上配置
$ vi /etc/etcd/etcd.conf
ETCD_NAME=etcd3
ETCD_DATA_DIR=/var/lib/etcd
ETCD_LISTEN_CLIENT_URLS=http://192.168.1.102:2379,http://127.0.0.1:2379
ETCD_ADVERTISE_CLIENT_URLS=http://192.168.1.102:2379
ETCD_LISTEN_PEER_URLS=http://192.168.1.102:2380
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.1.102:2380
ETCD_INITIAL_CLUSTER=etcd1=http://192.168.1.100:2380,etcd2=http://192.168.1.101:2380,etcd3=http://192.168.1.102:2380
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster
ETCD_INITIAL_CLUSTER_STATE=new
# 创建systemd服务
$ vi /etc/systemd/system/etcd.service
[Unit]
Description=etcd service
After=network.target
[Service]
Type=notify
User=etcd
ExecStart=/usr/local/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000
[Install]
WantedBy=multi-user.target
# 启动etcd(在所有节点执行)
$ systemctl daemon-reload
$ systemctl start etcd
$ systemctl enable etcd
# 验证etcd集群状态
$ etcdctl member list
3a57933972cb5131, started, etcd1, http://192.168.1.100:2380, http://192.168.1.100:2379
f98dc20bce6145e6, started, etcd2, http://192.168.1.101:2380, http://192.168.1.101:2379
edbf798743e44e4f, started, etcd3, http://192.168.1.102:2380, http://192.168.1.102:2379
$ etcdctl endpoint health
192.168.1.100:2379 is healthy: successfully committed proposal: took = 2.123456ms
192.168.1.101:2379 is healthy: successfully committed proposal: took = 2.234567ms
192.168.1.102:2379 is healthy: successfully committed proposal: took = 2.345678ms
3.2 Patroni安装配置实战
3.2.1 安装Patroni
# 1. 安装依赖
$ yum install -y python3 python3-pip python3-devel
# 2. 安装Patroni
$ pip3 install patroni[etcd]
# 3. 安装PostgreSQL Python驱动
$ pip3 install psycopg2-binary
# 4. 验证安装
$ patroni –version
patroni 3.2.0
$ patronictl –version
patronictl 3.2.0
3.2.2 配置Patroni
# 创建配置目录
$ mkdir -p /etc/patroni
$ mkdir -p /var/log/patroni
# 创建配置文件(在node1上)
$ vi /etc/patroni/fgedudb.yml
scope: fgedudb_cluster
namespace: /db/
name: node1
restapi:
listen: 192.168.1.100:8008
connect_address: 192.168.1.100:8008
etcd:
hosts: 192.168.1.100:2379,192.168.1.101:2379,192.168.1.102:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 10485760
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
wal_level: replica
hot_standby: on
max_wal_senders: 10
max_replication_slots: 10
wal_keep_size: 1GB
initdb:
– encoding: UTF8
– data-checksums
pg_hba:
– host replication fgedu_repl 192.168.1.0/24 scram-sha-256
– host all all 192.168.1.0/24 scram-sha-256
– local all all trust
users:
fgedu:
password: Fgedu@2026
options:
– createrole
– createdb
fgedu_repl:
password: Fgedu@Repl2026
options:
– replication
postgresql:
listen: 192.168.1.100:5432
connect_address: 192.168.1.100:5432
data_dir: /postgresql/fgdata
bin_dir: /postgresql/fgapp/bin
authentication:
replication:
username: fgedu_repl
password: Fgedu@Repl2026
superuser:
username: fgedu
password: Fgedu@2026
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
# 在node2和node3上创建类似配置,修改name、listen、connect_address
# 创建systemd服务
$ vi /etc/systemd/system/patroni.service
[Unit]
Description=Patroni service
After=network.target etcd.service
[Service]
Type=simple
User=pgsql
Group=fgedudb
ExecStart=/usr/local/bin/patroni /etc/patroni/fgedudb.yml
Restart=always
RestartSec=10s
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
# 启动Patroni(在所有节点执行)
$ systemctl daemon-reload
$ systemctl start patroni
$ systemctl enable patroni
3.3 Patroni集群验证实战
3.3.1 验证集群状态
# 1. 查看集群状态
$ patronictl -c /etc/patroni/fgedudb.yml list
+ Cluster: fgedudb_cluster ——-+—-+———–+
| Member | Host | Role | State | TL | Lag in MB |
+——–+—————+———+———+—-+———–+
| node1 | 192.168.1.100 | Leader | running | 1 | |
| node2 | 192.168.1.101 | Replica | running | 1 | 0 |
| node3 | 192.168.1.102 | Replica | running | 1 | 0 |
+——–+—————+———+———+—-+———–+
# 2. 查看详细状态
$ patronictl -c /etc/patroni/fgedudb.yml show-config
# 3. 查看节点信息
$ patronictl -c /etc/patroni/fgedudb.yml topology
+ Cluster: fgedudb_cluster ——-+—-+———–+
| Member | Host | Role | State | TL | Lag in MB |
+——–+—————+———+———+—-+———–+
| node1 | 192.168.1.100 | Leader | running | 1 | |
| + node2| 192.168.1.101 | Replica | running | 1 | 0 |
| + node3| 192.168.1.102 | Replica | running | 1 | 0 |
+——–+—————+———+———+—-+———–+
# 4. 通过REST API查看状态
$ curl -s http://192.168.1.100:8008/patroni | jq
{
“state”: “running”,
“postmaster_start_time”: “2026-04-07 10:00:00.123456+08”,
“role”: “master”,
“server_version”: 180000,
“cluster_unlocked”: false,
“xlog”: {
“location”: 65000128
},
“timeline”: 1,
“database_system_identifier”: “7286447892263292736”,
“patroni”: {
“version”: “3.2.0”,
“scope”: “fgedudb_cluster”
}
}
# 5. 验证数据库连接
$ psql -h 192.168.1.100 -U fgedu -d fgedudb -c “SELECT pg_is_in_recovery();”
pg_is_in_recovery
——————-
f
(1 row)
$ psql -h 192.168.1.101 -U fgedu -d fgedudb -c “SELECT pg_is_in_recovery();”
pg_is_in_recovery
——————-
t
(1 row)
Part04-生产案例与实战讲解
4.1 Patroni集群搭建案例
# 场景:搭建3节点Patroni高可用集群
# 1. 环境准备
# – 3台服务器:192.168.1.100, 192.168.1.101, 192.168.1.102
# – 操作系统:Oracle Linux 9.3
# – PostgreSQL 18.0
# – etcd 3.5.12
# – Patroni 3.2.0
# 2. 安装etcd集群(参考3.1节)
# 3. 安装Patroni(参考3.2节)
# 4. 初始化集群
# 在第一个节点启动Patroni,它会初始化数据库
$ systemctl start patroni
# 查看日志
$ journalctl -u patroni -f
Apr 07 10:00:00 node1 patroni[12345]: Initializing a new cluster
Apr 07 10:00:01 node1 patroni[12345]: Leader key acquired
Apr 07 10:00:02 node1 patroni[12345]: I am the leader
# 5. 在其他节点启动Patroni
$ systemctl start patroni
# 查看日志
$ journalctl -u patroni -f
Apr 07 10:00:05 node2 patroni[12346]: I am a replica
Apr 07 10:00:06 node2 patroni[12346]: Replica has been started
# 6. 验证集群
$ patronictl -c /etc/patroni/fgedudb.yml list
+ Cluster: fgedudb_cluster ——-+—-+———–+
| Member | Host | Role | State | TL | Lag in MB |
+——–+—————+———+———+—-+———–+
| node1 | 192.168.1.100 | Leader | running | 1 | |
| node2 | 192.168.1.101 | Replica | running | 1 | 0 |
| node3 | 192.168.1.102 | Replica | running | 1 | 0 |
+——–+—————+———+———+—-+———–+
# 7. 创建业务数据库
$ psql -h 192.168.1.100 -U fgedu -d postgres -c “CREATE DATABASE fgedudb;”
CREATE DATABASE
# 8. 验证数据同步
$ psql -h 192.168.1.100 -U fgedu -d fgedudb -c “CREATE TABLE fgedu_test (id serial, data text);”
CREATE TABLE
$ psql -h 192.168.1.101 -U fgedu -d fgedudb -c “SELECT * FROM fgedu_test;”
id | data
—-+——
(0 rows)
4.2 Patroni故障切换案例
# 场景:模拟主库故障,验证自动切换
# 1. 当前集群状态
$ patronictl -c /etc/patroni/fgedudb.yml list
+ Cluster: fgedudb_cluster ——-+—-+———–+
| Member | Host | Role | State | TL | Lag in MB |
+——–+—————+———+———+—-+———–+
| node1 | 192.168.1.100 | Leader | running | 1 | |
| node2 | 192.168.1.101 | Replica | running | 1 | 0 |
| node3 | 192.168.1.102 | Replica | running | 1 | 0 |
+——–+—————+———+———+—-+———–+
# 2. 模拟主库故障
$ ssh 192.168.1.100 “systemctl stop patroni”
# 3. 观察切换过程
$ patronictl -c /etc/patroni/fgedudb.yml list
+ Cluster: fgedudb_cluster ——-+—-+———–+
| Member | Host | Role | State | TL | Lag in MB |
+——–+—————+———+———+—-+———–+
| node2 | 192.168.1.101 | Leader | running | 2 | |
| node3 | 192.168.1.102 | Replica | running | 2 | 0 |
+——–+—————+———+———+—-+———–+
# 4. 查看切换日志
$ journalctl -u patroni -f
Apr 07 10:01:00 node2 patroni[12346]: Acquired leader lock
Apr 07 10:01:01 node2 patroni[12346]: Promoting to leader
Apr 07 10:01:02 node2 patroni[12346]: I am the leader
# 5. 验证新主库
$ psql -h 192.168.1.101 -U fgedu -d fgedudb -c “SELECT pg_is_in_recovery();”
pg_is_in_recovery
——————-
f
(1 row)
# 6. 恢复原主库
$ ssh 192.168.1.100 “systemctl start patroni”
# 7. 验证节点恢复
$ patronictl -c /etc/patroni/fgedudb.yml list
+ Cluster: fgedudb_cluster ——-+—-+———–+
| Member | Host | Role | State | TL | Lag in MB |
+——–+—————+———+———+—-+———–+
| node1 | 192.168.1.100 | Replica | running | 2 | 0 |
| node2 | 192.168.1.101 | Leader | running | 2 | |
| node3 | 192.168.1.102 | Replica | running | 2 | 0 |
+——–+—————+———+———+—-+———–+
# 8. 手动切换回原主库
$ patronictl -c /etc/patroni/fgedudb.yml switchover –master node2 –candidate node1
Current cluster topology
+ Cluster: fgedudb_cluster ——-+—-+———–+
| Member | Host | Role | State | TL | Lag in MB |
+——–+—————+———+———+—-+———–+
| node1 | 192.168.1.100 | Replica | running | 2 | 0 |
| node2 | 192.168.1.101 | Leader | running | 2 | |
| node3 | 192.168.1.102 | Replica | running | 2 | 0 |
+——–+—————+———+———+—-+———–+
Are you sure you want to switchover cluster fgedudb_cluster, demoting current master node2? [y/N]: y
2026-04-07 10:05:00.12345 Successfully switched over to “node1”
4.3 Patroni维护管理案例
# 1. 查看集群配置
$ patronictl -c /etc/patroni/fgedudb.yml show-config
# 2. 修改集群配置
$ patronictl -c /etc/patroni/fgedudb.yml edit-config
# 编辑配置后保存
# 3. 重新加载配置
$ patronictl -c /etc/patroni/fgedudb.yml reload
# 4. 重启节点
$ patronictl -c /etc/patroni/fgedudb.yml restart node1
# 5. 重建节点
$ patronictl -c /etc/patroni/fgedudb.yml reinit fgedudb_cluster node1
Are you sure you want to reinitialize members node1? [y/N]: y
Success: reinitialize for member node1
# 6. 暂停集群(维护模式)
$ patronictl -c /etc/patroni/fgedudb.yml pause
Success: cluster management is paused
# 7. 恢复集群
$ patronictl -c /etc/patroni/fgedudb.yml resume
Success: cluster management is resumed
# 8. 查看历史事件
$ patronictl -c /etc/patroni/fgedudb.yml history
+—-+———–+————————–+——————————-+
| TL | LSN | Reason | Timestamp |
+—-+———–+————————–+——————————-+
| 1 | 65000128 | no recovery target | 2026-04-07T10:00:00+08:00 |
| 2 | 75000128 | promoting because master | 2026-04-07T10:01:00+08:00 |
+—-+———–+————————–+——————————-+
# 9. 刷新DCS配置
$ patronictl -c /etc/patroni/fgedudb.yml flush fgedudb_cluster
# 10. 查看DCS内容
$ etcdctl get /db/fgedudb_cluster/ –prefix
/db/fgedudb_cluster/config
{“ttl”:30,”loop_wait”:10,”retry_timeout”:10,…}
/db/fgedudb_cluster/leader
node1
/db/fgedudb_cluster/members/node1
{“conn_url”:”postgres://192.168.1.100:5432/postgres”,…}
Part05-风哥经验总结与分享
5.1 Patroni最佳实践
Patroni最佳实践:
- 合理配置DCS:确保etcd集群稳定
- 设置合理TTL:避免误切换
- 监控集群状态:及时发现和处理问题
- 定期演练切换:确保故障切换正常
- 备份DCS数据:定期备份etcd数据
- 文档记录:记录详细的配置和操作流程
5.2 Patroni管理脚本
# patroni_monitor.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# Patroni监控脚本
PATRONI_CONFIG=/etc/patroni/fgedudb.yml
LOG_FILE=/postgresql/scripts/logs/patroni_monitor.log
log_message() {
echo “$(date ‘+%Y-%m-%d %H:%M:%S’) – $1” >> $LOG_FILE
}
# 检查集群状态
check_cluster_status() {
local result=$(patronictl -c $PATRONI_CONFIG list 2>/dev/null)
if echo “$result” | grep -q “Leader”; then
log_message “集群状态正常”
else
log_message “ERROR: 集群没有Leader”
return 1
fi
}
# 检查节点健康
check_node_health() {
local nodes=$(patronictl -c $PATRONI_CONFIG list 2>/dev/null | grep -c “running”)
if [ “$nodes” -lt 2 ]; then
log_message “WARNING: 可用节点数不足: $nodes”
else
log_message “可用节点数: $nodes”
fi
}
# 检查复制延迟
check_replication_lag() {
local lag=$(patronictl -c $PATRONI_CONFIG list 2>/dev/null | grep Replica | awk ‘{print $7}’ | grep -v “^$” | sort -n | tail -1)
if [ -n “$lag” ] && [ “$lag” -gt 10 ]; then
log_message “WARNING: 复制延迟过大: ${lag} MB”
else
log_message “复制延迟正常”
fi
}
# 主函数
main() {
log_message “==========================================”
log_message “开始Patroni监控检查”
log_message “==========================================”
check_cluster_status
check_node_health
check_replication_lag
log_message “==========================================”
log_message “监控检查完成”
log_message “==========================================”
}
main
5.3 Patroni检查清单
# 部署前检查
– [ ] 确认网络连通性
– [ ] 确认端口可用
– [ ] 确认用户权限
– [ ] 确认目录权限
# 部署后验证
– [ ] 验证etcd集群正常
– [ ] 验证Patroni集群正常
– [ ] 验证主库选举正常
– [ ] 验证复制正常
# 日常监控
– [ ] 监控集群状态
– [ ] 监控节点健康
– [ ] 监控复制延迟
– [ ] 监控DCS状态
# 故障处理
– [ ] 准备故障切换流程
– [ ] 准备节点重建流程
– [ ] 准备回滚方案
– [ ] 准备告警通知
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
