
Linux Tutorial FG311 - Installing a Pacemaker/Corosync High Availability Cluster

Summary: This 风哥 tutorial draws on the official documentation for Linux, Red Hat Enterprise Linux, Ansible Automation Platform, Docker, Kubernetes, and Podman, and describes the configuration and use of the relevant technologies in detail.

This document describes in detail how to install and configure a Pacemaker/Corosync high availability cluster, covering core topics such as installing the cluster software, preparing the environment, and initializing the cluster. It is aimed at operations staff for learning and testing; verify everything yourself before applying it to a production environment.


Part01 - Basic Concepts and Theory

1.1 High Availability Cluster Concepts

A high availability cluster is a system architecture in which multiple servers cooperate to provide service redundancy and failover capability, ensuring business continuity. When a node fails, its services automatically switch to another healthy node and continue running.

Core characteristics of a high availability cluster:

  • Automatic failure detection: continuously monitor node and service health
  • Automatic failover: switch services away from a failed node automatically
  • Data consistency: ensure data is neither lost nor corrupted
  • Service continuity: minimize service downtime

1.2 Pacemaker, the Cluster Resource Manager

Pacemaker is an enterprise-grade open source cluster resource manager. It manages cluster resources such as IP addresses, services, and storage, handling their start, stop, monitoring, and failover.

# Pacemaker core functions
- Resource management: manage IPs, services, storage, and other resources
- Failure detection: monitor the health of nodes and resources
- Failure recovery: automatically restart or migrate failed resources
- Constraint management: control where and in what order resources run
- Quorum handling: deal with cluster partition scenarios

1.3 Corosync, the Cluster Engine

Corosync is the cluster communication layer, providing inter-node messaging, membership management, and quorum services. Pacemaker relies on Corosync for coordination between nodes.

# Corosync core functions
- Messaging: reliable message delivery between nodes
- Membership: maintain the cluster membership list
- Quorum service: prevent cluster split-brain
- Heartbeat: monitor node liveness
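
The quorum service boils down to simple majority arithmetic: a partition keeps quorum only if it holds strictly more than half of the total votes. A minimal bash sketch of that rule (the function name and vote counts are illustrative, not part of corosync):

```shell
#!/usr/bin/env bash
# has_quorum TOTAL_VOTES PARTITION_VOTES
# Prints "yes" if the partition holds a strict majority of votes, else "no".
has_quorum() {
  local total=$1 partition=$2
  if (( partition * 2 > total )); then
    echo yes
  else
    echo no
  fi
}

has_quorum 3 2   # a 2-node partition of a 3-node cluster keeps quorum
has_quorum 2 1   # in a plain 2-node cluster, neither survivor has a majority
```

The second case is why a 2-node cluster needs either a third node, a quorum device, or the special `two_node` votequorum option.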

Part02 - Production Environment Planning and Recommendations

2.1 Cluster Environment Planning

Planning a production cluster involves the number of nodes, the network architecture, the storage design, and more.

# Cluster node planning
Node count: at least 2 nodes; 3 or another odd number is recommended
Node layout:
- node1: 192.168.1.10 (ha-node1.fgedu.net.cn)
- node2: 192.168.1.11 (ha-node2.fgedu.net.cn)
- node3: 192.168.1.12 (ha-node3.fgedu.net.cn) [optional]

# VIP planning
Virtual IP: 192.168.1.100 (used for service access)

# Storage planning
Shared storage: NFS/iSCSI/GFS2
Local storage: independent storage on each node

2.2 Network Planning Recommendations

Network planning should separate the business, heartbeat, and management networks.

# Network planning options
Option 1: single network (test environments)
- All traffic shares one network

Option 2: dual network (recommended for production)
- Business network: serves client traffic
- Heartbeat network: inter-node communication

Option 3: triple network (high-security environments)
- Business network: client-facing services
- Heartbeat network: cluster communication
- Management network: operations and administration

# Port planning
TCP 2224: pcs communication
UDP 5405-5406: corosync communication
TCP 3121: pacemaker remote connections

2.3 Storage Planning Recommendations

Choose a storage design that matches the needs of the workload.

# Storage options
1. Shared storage (recommended)
- NFS: simple and easy to use, good for file sharing
- iSCSI: block storage, good for databases
- GFS2: cluster file system, concurrent access from multiple nodes

2. Local storage
- Independent storage on each node
- Requires a data synchronization mechanism

3. Distributed storage
- GlusterFS: distributed file system
- Ceph: unified storage platform

Part03 - Production Implementation Plan

3.1 Install the Cluster Packages

3.1.1 Install the Packages on All Nodes

# Install Pacemaker and Corosync
[root@ha-node1 ~]# dnf install -y pacemaker corosync pcs

Updating Subscription Management repositories.
Last metadata expiration check: 0:05:23 ago on Fri Apr 4 10:00:00 2026.
Dependencies resolved.
================================================================================
Package Architecture Version Repository Size
================================================================================
Installing:
pacemaker x86_64 2.1.6-1.el9 appstream 1.2 M
corosync x86_64 3.1.7-1.el9 appstream 280 k
pcs x86_64 0.11.4-1.el9 appstream 3.5 M

Transaction Summary
================================================================================
Install 3 Packages

Total download size: 5.0 M
Installed size: 25 M
Downloading Packages:
(1/3): corosync-3.1.7-1.el9.x86_64.rpm 2.8 MB/s | 280 kB 00:00
(2/3): pacemaker-2.1.6-1.el9.x86_64.rpm 5.0 MB/s | 1.2 MB 00:00
(3/3): pcs-0.11.4-1.el9.x86_64.rpm 8.0 MB/s | 3.5 MB 00:00
--------------------------------------------------------------------------------
Total 5.0 MB/s | 5.0 MB 00:01
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : corosync-3.1.7-1.el9.x86_64 1/3
Installing : pacemaker-2.1.6-1.el9.x86_64 2/3
Installing : pcs-0.11.4-1.el9.x86_64 3/3
Running scriptlet: pcs-0.11.4-1.el9.x86_64 3/3
Verifying : corosync-3.1.7-1.el9.x86_64 1/3
Verifying : pacemaker-2.1.6-1.el9.x86_64 2/3
Verifying : pcs-0.11.4-1.el9.x86_64 3/3

Installed:
corosync-3.1.7-1.el9.x86_64 pacemaker-2.1.6-1.el9.x86_64 pcs-0.11.4-1.el9.x86_64

Complete!

# Verify the installation
[root@ha-node1 ~]# rpm -qa | grep -E "pacemaker|corosync|pcs"
pacemaker-libs-2.1.6-1.el9.x86_64
pacemaker-cluster-libs-2.1.6-1.el9.x86_64
pacemaker-2.1.6-1.el9.x86_64
corosync-3.1.7-1.el9.x86_64
pcs-0.11.4-1.el9.x86_64

3.2 Configure Hostnames and the hosts File

3.2.1 Set Hostnames

# Set the hostname on node1
[root@ha-node1 ~]# hostnamectl set-hostname ha-node1.fgedu.net.cn

# Verify the hostname
[root@ha-node1 ~]# hostnamectl
Static hostname: ha-node1.fgedu.net.cn
Icon name: computer-vm
Chassis: vm
Machine ID: abc123def456789
Boot ID: xyz789uvw123456
Virtualization: vmware
Operating System: Red Hat Enterprise Linux 9.4 (Plow)
CPE OS Name: cpe:/o:redhat:enterprise_linux:9::baseos
Kernel: Linux 5.14.0-427.13.1.el9_4.x86_64
Architecture: x86-64
Hardware Vendor: VMware, Inc.
Hardware Model: VMware Virtual Platform

# Set the hostname on node2
[root@ha-node2 ~]# hostnamectl set-hostname ha-node2.fgedu.net.cn

# Verify the hostname
[root@ha-node2 ~]# hostnamectl
Static hostname: ha-node2.fgedu.net.cn
Icon name: computer-vm
Chassis: vm
Machine ID: def456ghi789012
Boot ID: abc123def456789
Virtualization: vmware
Operating System: Red Hat Enterprise Linux 9.4 (Plow)
CPE OS Name: cpe:/o:redhat:enterprise_linux:9::baseos
Kernel: Linux 5.14.0-427.13.1.el9_4.x86_64
Architecture: x86-64
Hardware Vendor: VMware, Inc.
Hardware Model: VMware Virtual Platform

3.2.2 Configure the hosts File

# Configure the hosts file on all nodes
[root@ha-node1 ~]# cat >> /etc/hosts << EOF
192.168.1.10 ha-node1.fgedu.net.cn ha-node1
192.168.1.11 ha-node2.fgedu.net.cn ha-node2
EOF

# Verify the hosts file
[root@ha-node1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.10 ha-node1.fgedu.net.cn ha-node1
192.168.1.11 ha-node2.fgedu.net.cn ha-node2

# Test hostname resolution
[root@ha-node1 ~]# ping -c 2 ha-node2.fgedu.net.cn
PING ha-node2.fgedu.net.cn (192.168.1.11) 56(84) bytes of data.
64 bytes from ha-node2.fgedu.net.cn (192.168.1.11): icmp_seq=1 ttl=64 time=0.521 ms
64 bytes from ha-node2.fgedu.net.cn (192.168.1.11): icmp_seq=2 ttl=64 time=0.398 ms

--- ha-node2.fgedu.net.cn ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.398/0.459/0.521/0.061 ms
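
Before moving on, it can help to confirm that every node name the cluster will use actually appears in the hosts file. A small helper sketch (the `check_hosts` name and the idea of pointing it at a staged copy of the file are this document's illustration, not a pcs feature):

```shell
#!/usr/bin/env bash
# check_hosts FILE NODE... - verify each node name appears in a hosts-format file.
# Returns non-zero if any name is missing, so it can gate later steps.
check_hosts() {
  local file=$1; shift
  local rc=0 name
  for name in "$@"; do
    if grep -qw "$name" "$file"; then
      echo "$name: ok"
    else
      echo "$name: MISSING from $file"
      rc=1
    fi
  done
  return $rc
}

# check_hosts /etc/hosts ha-node1.fgedu.net.cn ha-node2.fgedu.net.cn
```

Running it against `/etc/hosts` on each node catches the common case where one node's file was updated and another's was not.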

3.3 Configure Passwordless SSH

3.3.1 Generate SSH Keys

# Generate an SSH key on node1
[root@ha-node1 ~]# ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:abc123def456ghi789jkl012mno345pqr678stu901vwx root@ha-node1.fgedu.net.cn
The key's randomart image is:
+---[RSA 3072]----+
| .o. |
| . o |
| . . o |
| . o .o . |
| + oo S |
| . =.+o . |
| = B.o. |
| o O.=.E |
| . =.++o |
+----[SHA256]-----+

# List the generated keys
[root@ha-node1 ~]# ls -la ~/.ssh/
total 12
drwx------. 2 root root 57 Apr 4 10:05 .
dr-xr-x---. 6 root root 243 Apr 4 10:00 ..
-rw-------. 1 root root 2602 Apr 4 10:05 id_rsa
-rw-r--r--. 1 root root 572 Apr 4 10:05 id_rsa.pub

3.3.2 Distribute the Public Key to the Other Nodes

# Copy the public key to node2
[root@ha-node1 ~]# ssh-copy-id ha-node2.fgedu.net.cn
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are
already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to
install the new keys
root@ha-node2.fgedu.net.cn's password:

Number of key(s) added: 1

Now try logging into the machine, with: "ssh 'ha-node2.fgedu.net.cn'"
and check to make sure that only the key(s) you wanted were added.

# Test passwordless login
[root@ha-node1 ~]# ssh ha-node2.fgedu.net.cn 'hostname'
ha-node2.fgedu.net.cn

# Repeat the same steps on node2
[root@ha-node2 ~]# ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
[root@ha-node2 ~]# ssh-copy-id ha-node1.fgedu.net.cn
[root@ha-node2 ~]# ssh ha-node1.fgedu.net.cn 'hostname'
ha-node1.fgedu.net.cn
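
With more than two nodes, repeating ssh-copy-id by hand for every pair gets error-prone. A sketch that prints the full-mesh distribution commands instead of running them (the node list and `mesh_copy_cmds` name are illustrative; review the output, then run the lines or pipe them to bash):

```shell
#!/usr/bin/env bash
# Print the ssh-copy-id runs needed for passwordless SSH between every node pair.
NODES="ha-node1.fgedu.net.cn ha-node2.fgedu.net.cn"

mesh_copy_cmds() {
  local src dst
  for src in $NODES; do
    for dst in $NODES; do
      [ "$src" = "$dst" ] && continue
      # On $src, install $src's key onto $dst (assumes $src already has a key).
      echo "ssh root@$src ssh-copy-id root@$dst"
    done
  done
}

mesh_copy_cmds
```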

3.4 Configure Firewall Rules

3.4.1 Open the Ports the Cluster Needs

# Configure the firewall on all nodes
[root@ha-node1 ~]# firewall-cmd --permanent --add-service=high-availability
success

[root@ha-node1 ~]# firewall-cmd --permanent --add-port=2224/tcp
success

[root@ha-node1 ~]# firewall-cmd --permanent --add-port=3121/tcp
success

[root@ha-node1 ~]# firewall-cmd --permanent --add-port=5405-5406/udp
success

[root@ha-node1 ~]# firewall-cmd --permanent --add-port=5405-5406/tcp
success

# Reload the firewall
[root@ha-node1 ~]# firewall-cmd --reload
success

# Verify firewall rules
[root@ha-node1 ~]# firewall-cmd --list-all
public (active)
target: default
icmp-block-inversion: no
interfaces: ens33
sources:
services: cockpit dhcpv6-client high-availability ssh
ports: 2224/tcp 3121/tcp 5405-5406/udp 5405-5406/tcp
protocols:
forward: no
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
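
The same rule set has to be applied identically on every node, so it is worth keeping the list in one place. A sketch that generates the commands once and (optionally) pushes them over the passwordless SSH set up in section 3.3; the SSH loop is left commented out so nothing runs until reviewed:

```shell
#!/usr/bin/env bash
# Emit the cluster firewall rules once; apply them on each node over SSH.
fw_rules() {
  cat <<'EOF'
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --permanent --add-port=2224/tcp
firewall-cmd --permanent --add-port=3121/tcp
firewall-cmd --permanent --add-port=5405-5406/udp
firewall-cmd --permanent --add-port=5405-5406/tcp
firewall-cmd --reload
EOF
}

# Apply on every node (uncomment after reviewing the output of fw_rules):
# for n in ha-node1.fgedu.net.cn ha-node2.fgedu.net.cn; do
#   fw_rules | ssh "root@$n" bash
# done
fw_rules
```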

Part04 - Production Cases and Hands-on Walkthrough

4.1 Initialize the Cluster

4.1.1 Set the hacluster User Password

# Set the hacluster user password on all nodes
[root@ha-node1 ~]# echo "hacluster:YourStrongPassword123!" | chpasswd

[root@ha-node2 ~]# echo "hacluster:YourStrongPassword123!" | chpasswd

# Start the pcsd service
[root@ha-node1 ~]# systemctl start pcsd
[root@ha-node1 ~]# systemctl enable pcsd
Created symlink /etc/systemd/system/multi-user.target.wants/pcsd.service →
/usr/lib/systemd/system/pcsd.service.

[root@ha-node2 ~]# systemctl start pcsd
[root@ha-node2 ~]# systemctl enable pcsd
Created symlink /etc/systemd/system/multi-user.target.wants/pcsd.service →
/usr/lib/systemd/system/pcsd.service.

# Verify pcsd service status
[root@ha-node1 ~]# systemctl status pcsd
● pcsd.service - PCS GUI and remote configuration interface
Loaded: loaded (/usr/lib/systemd/system/pcsd.service; enabled; preset: disabled)
Active: active (running) since Fri 2026-04-04 10:10:00 CST; 30s ago
Main PID: 12345 (pcsd)
Tasks: 1 (limit: 23456)
Memory: 25.6M
CPU: 120ms
CGroup: /system.slice/pcsd.service
└─12345 /usr/libexec/pcsd/pcsd

Apr 04 10:10:00 ha-node1.fgedu.net.cn systemd[1]: Started PCS GUI and remote configuration interface.

4.1.2 Authenticate Cluster Nodes

# Authenticate all cluster nodes from node1
[root@ha-node1 ~]# pcs host auth ha-node1.fgedu.net.cn ha-node2.fgedu.net.cn -u hacluster -p YourStrongPassword123!
ha-node1.fgedu.net.cn: Authorized
ha-node2.fgedu.net.cn: Authorized

# Verify authentication status
[root@ha-node1 ~]# pcs host list
ha-node1.fgedu.net.cn
ha-node2.fgedu.net.cn

4.1.3 Create the Cluster

# Create a cluster named hacluster
[root@ha-node1 ~]# pcs cluster setup hacluster ha-node1.fgedu.net.cn ha-node2.fgedu.net.cn
No addresses specified for host 'ha-node1.fgedu.net.cn', using 'ha-node1.fgedu.net.cn'
No addresses specified for host 'ha-node2.fgedu.net.cn', using 'ha-node2.fgedu.net.cn'
Destroying cluster on hosts: 'ha-node1.fgedu.net.cn', 'ha-node2.fgedu.net.cn'...
ha-node2.fgedu.net.cn: Successfully destroyed cluster
ha-node1.fgedu.net.cn: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'ha-node1.fgedu.net.cn', 'ha-node2.fgedu.net.cn'
ha-node1.fgedu.net.cn: successful removal of the file 'pcsd settings'
ha-node2.fgedu.net.cn: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'ha-node1.fgedu.net.cn', 'ha-node2.fgedu.net.cn'
ha-node1.fgedu.net.cn: successful distribution of the file 'corosync authkey'
ha-node1.fgedu.net.cn: successful distribution of the file 'pacemaker authkey'
ha-node2.fgedu.net.cn: successful distribution of the file 'corosync authkey'
ha-node2.fgedu.net.cn: successful distribution of the file 'pacemaker authkey'
Sending 'corosync.conf' to 'ha-node1.fgedu.net.cn', 'ha-node2.fgedu.net.cn'
ha-node1.fgedu.net.cn: successful distribution of the file 'corosync.conf'
ha-node2.fgedu.net.cn: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.

4.2 Start Cluster Services

4.2.1 Start the Cluster

# Start cluster services on all nodes
[root@ha-node1 ~]# pcs cluster start --all
ha-node1.fgedu.net.cn: Starting Cluster (corosync)...
ha-node1.fgedu.net.cn: Starting Cluster (pacemaker)...
ha-node2.fgedu.net.cn: Starting Cluster (corosync)...
ha-node2.fgedu.net.cn: Starting Cluster (pacemaker)...

# Enable cluster services at boot
[root@ha-node1 ~]# pcs cluster enable --all
ha-node1.fgedu.net.cn: Cluster Enabled
ha-node2.fgedu.net.cn: Cluster Enabled

# Verify service status
[root@ha-node1 ~]# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; preset: disabled)
Active: active (running) since Fri 2026-04-04 10:15:00 CST; 1min ago
Docs: man:corosync
Main PID: 23456 (corosync)
Tasks: 2 (limit: 23456)
Memory: 12.3M
CPU: 350ms
CGroup: /system.slice/corosync.service
└─23456 corosync

Apr 04 10:15:00 ha-node1.fgedu.net.cn systemd[1]: Started Corosync Cluster Engine.

[root@ha-node1 ~]# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; preset: disabled)
Active: active (running) since Fri 2026-04-04 10:15:01 CST; 1min ago
Docs: man:pacemakerd
Main PID: 24567 (pacemakerd)
Tasks: 6 (limit: 23456)
Memory: 45.2M
CPU: 890ms
CGroup: /system.slice/pacemaker.service
├─24567 /usr/sbin/pacemakerd -f
├─24568 /usr/libexec/pacemaker/pacemaker-based
├─24569 /usr/libexec/pacemaker/pacemaker-controld
├─24570 /usr/libexec/pacemaker/pacemaker-attrd
├─24571 /usr/libexec/pacemaker/pacemaker-schedulerd
└─24572 /usr/libexec/pacemaker/pacemaker-fenced

Apr 04 10:15:01 ha-node1.fgedu.net.cn systemd[1]: Started Pacemaker High Availability Cluster Manager.

4.3 Verify Cluster Status

4.3.1 View Cluster Status

# View cluster status
[root@ha-node1 ~]# pcs status cluster
Cluster Status:
Cluster Summary:
* Stack: corosync
* Current DC: ha-node1.fgedu.net.cn (version 2.1.6-1.el9-eab9cfc1ae) - partition with quorum
* Last updated: Fri Apr 4 10:20:00 2026
* Last change: Fri Apr 4 10:15:01 2026 by hacluster via crmd on ha-node1.fgedu.net.cn
* 2 nodes configured
* 0 resource instances configured

# View node status
[root@ha-node1 ~]# pcs status nodes
Pacemaker Nodes:
Online: ha-node1.fgedu.net.cn ha-node2.fgedu.net.cn
Standby:
Maintenance:
Offline:

# View the full status
[root@ha-node1 ~]# pcs status
Cluster name: hacluster
Cluster Summary:
* Stack: corosync
* Current DC: ha-node1.fgedu.net.cn (version 2.1.6-1.el9-eab9cfc1ae) - partition with quorum
* Last updated: Fri Apr 4 10:20:00 2026
* Last change: Fri Apr 4 10:15:01 2026 by hacluster via crmd on ha-node1.fgedu.net.cn
* 2 nodes configured
* 0 resource instances configured

Node List:
* Online: [ ha-node1.fgedu.net.cn ha-node2.fgedu.net.cn ]

Full List of Resources:
* No resources

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

4.3.2 Verify the Corosync Configuration

# View the corosync configuration
[root@ha-node1 ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: hacluster
    secauth: off
    transport: knet
}

nodelist {
    node {
        ring0_addr: ha-node1.fgedu.net.cn
        name: ha-node1.fgedu.net.cn
        nodeid: 1
    }

    node {
        ring0_addr: ha-node2.fgedu.net.cn
        name: ha-node2.fgedu.net.cn
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
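
Note the `two_node: 1` line that pcs generated in the quorum section. With only two votes in the cluster, a strict majority is impossible once one node dies, so votequorum relaxes the rule for the two-node case. The same fragment with explanatory comments added (the comments are this document's annotation, not pcs output):

```
quorum {
    provider: corosync_votequorum
    two_node: 1    # the surviving node keeps quorum when its peer fails;
                   # this implicitly enables wait_for_all, so both nodes
                   # must be seen once before the cluster first gains quorum
}
```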

# Verify corosync members
[root@ha-node1 ~]# corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.1.10)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.1.11)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined

风哥's recommendations for production environments:

  • Use at least 2 nodes; 3 or another odd number is recommended
  • Use a dedicated network for heartbeat communication
  • Keep node clocks synchronized (NTP/Chrony)
  • Back up the cluster configuration regularly
  • Monitor cluster status and resource health

Part05 - 风哥's Experience and Takeaways

5.1 High Availability Cluster Best Practices

Best practices for deploying a high availability cluster include:

  • Node planning: at least 2 nodes; 3 are recommended to avoid split-brain
  • Network planning: separate the business and heartbeat networks
  • Storage planning: use shared storage to ensure data consistency
  • Monitoring and alerting: watch cluster status and resource health in real time
  • Regular drills: test failover periodically

5.2 Common Problems and Solutions

# Problem 1: a node cannot join the cluster
Solution:
1. Check network connectivity
2. Check firewall rules
3. Check hosts file resolution
4. Check time synchronization
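
Several of these checks can be scripted. A diagnostic sketch (the `check_peer` name and the use of bash's /dev/tcp to probe the pcsd port are this document's illustration):

```shell
#!/usr/bin/env bash
# check_peer HOSTNAME - run quick pre-join diagnostics against a peer node.
check_peer() {
  local peer=$1

  # 1. Name resolution (hosts file or DNS)
  getent hosts "$peer" >/dev/null && echo "resolve: ok" || echo "resolve: FAIL"

  # 2. Basic reachability
  ping -c1 -W1 "$peer" >/dev/null 2>&1 && echo "ping: ok" || echo "ping: FAIL"

  # 3. pcsd port open (a firewall block or stopped pcsd shows FAIL)
  (exec 3<>"/dev/tcp/$peer/2224") 2>/dev/null \
    && echo "pcsd 2224/tcp: ok" || echo "pcsd 2224/tcp: FAIL"
}

# check_peer ha-node2.fgedu.net.cn
```

Time synchronization (item 4) is best checked separately with chronyc on each node.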

# Problem 2: cluster split-brain
Solution:
1. Use an odd number of nodes
2. Configure a quorum device
3. Configure fencing devices

# Problem 3: a resource will not start
Solution:
1. Check the resource configuration
2. Check resource dependencies
3. Review the cluster logs
4. Test the resource manually

5.3 Recommended Cluster Management Tools

Commonly used cluster management tools:

  • pcs: the primary tool for cluster configuration and management
  • crm_mon: real-time cluster status monitoring
  • pcs-web-ui: web-based cluster management interface
  • Hawk2: web management interface from SUSE

风哥's tip: deploying a high availability cluster requires careful consideration of the business requirements and the characteristics of the environment. Validate thoroughly in a test environment before deploying to production, and run failover drills regularly to confirm that the failover mechanism works.

This article was compiled and published by 风哥教程 for learning and testing purposes only. When republishing, credit the source: http://www.fgedu.net.cn/10327.html
