本文档风哥主要介绍Kafka高可用与ISR实战,包括Kafka高可用架构设计、Kafka ISR机制原理、Kafka Leader选举机制、Kafka故障转移处理等内容,风哥教程参考Kafka官方文档Design、Replication等内容,适合大数据开发运维人员在学习和测试中使用,如果要应用于生产环境则需要自行确认。更多视频教程www.fgedu.net.cn
Part01-基础概念与理论知识
1.1 Kafka高可用架构原理
Kafka通过副本机制实现高可用,每个分区可以有多个副本分布在不同Broker上,当Leader副本故障时,Follower副本可以接管成为新的Leader。学习交流加群风哥微信: itpux-com
- 副本机制:每个分区有多个副本,分布在不同Broker
- Leader/Follower:Leader处理读写,Follower同步数据
- ISR机制:同步副本集合,只有ISR中的副本可以选举为Leader
- Controller:集群控制器,负责分区Leader选举
- ZooKeeper:存储集群元数据,选举Controller
1.2 Kafka ISR机制详解
ISR(In-Sync Replicas)是Kafka保证数据一致性的核心机制:
1. ISR定义
– 与Leader保持同步的副本集合
– 包括Leader和所有同步中的Follower
– 由Leader维护并更新
2. ISR同步条件
– 副本在replica.lag.time.max.ms时间内同步
– 默认30秒内必须同步一次
– 超时则从ISR中移除
3. ISR作用
– 只有ISR中的副本可以成为Leader
– 保证新Leader数据完整性
– 配合acks=all保证数据不丢失
# ISR相关配置
replica.lag.time.max.ms=30000 # 同步超时时间
min.insync.replicas=2 # 最小ISR数量
unclean.leader.election.enable=false # 禁止非ISR选举
1.3 Kafka Leader选举机制
Kafka Leader选举机制详解:
- Controller选举:ZooKeeper选举集群Controller
- 分区Leader选举:Controller负责分区Leader选举
- 优先副本选举:优先选举AR中的第一个副本
- 非ISR选举:ISR为空时是否允许非ISR副本成为Leader
Part02-生产环境规划与建议
2.1 高可用集群规划
Kafka高可用集群规划建议:
– 最小生产集群:3 Broker + 3 ZooKeeper
– 推荐生产集群:5 Broker + 3 ZooKeeper
– 大规模集群:7+ Broker + 5 ZooKeeper
# 副本规划
– 副本因子:3(每个分区3个副本)
– 最小ISR:2(至少2个副本同步)
– 副本分布:不同Broker、不同机架
# 节点规划示例
节点1(192.168.1.51):Broker-1 + ZooKeeper-1
节点2(192.168.1.52):Broker-2 + ZooKeeper-2
节点3(192.168.1.53):Broker-3 + ZooKeeper-3
节点4(192.168.1.54):Broker-4
节点5(192.168.1.55):Broker-5
# 机架规划(跨机架部署)
机架A:Broker-1, Broker-2
机架B:Broker-3, Broker-4
机架C:Broker-5
2.2 ISR参数配置规划
ISR相关参数配置建议:
# ISR同步超时
replica.lag.time.max.ms=30000
# 最小ISR数量
min.insync.replicas=2
# 禁止非ISR选举
unclean.leader.election.enable=false
# 副本拉取配置
replica.fetch.min.bytes=1
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
# Topic级别配置
# 创建Topic时指定
$ kafka-topics.sh –create \
–topic fgedu-orders \
–partitions 6 \
–replication-factor 3 \
–config min.insync.replicas=2
# 动态修改Topic配置
$ kafka-configs.sh –alter \
–entity-type topics \
–entity-name fgedu-orders \
–add-config min.insync.replicas=2
2.3 机架感知配置规划
机架感知配置可以提高数据可靠性:
# 在server.properties中添加
broker.rack=rack-a
# 不同机架的Broker配置
# Broker-1
broker.id=1
broker.rack=rack-a
# Broker-2
broker.id=2
broker.rack=rack-a
# Broker-3
broker.id=3
broker.rack=rack-b
# Broker-4
broker.id=4
broker.rack=rack-b
# Broker-5
broker.id=5
broker.rack=rack-c
# 机架感知效果
# 副本会自动分布在不同机架
# 机架故障时数据仍然可用
Part03-生产环境项目实施方案
3.1 ISR状态监控实战
3.1.1 查看ISR状态
$ /bigdata/app/kafka/bin/kafka-topics.sh –describe \
–topic fgedu-orders \
–bootstrap-server 192.168.1.51:9092
Topic: fgedu-orders TopicId: xyz789 PartitionCount: 6 ReplicationFactor: 3
Topic: fgedu-orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: fgedu-orders Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
Topic: fgedu-orders Partition: 2 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
Topic: fgedu-orders Partition: 3 Leader: 1 Replicas: 1,3,2 Isr: 1,3,2
Topic: fgedu-orders Partition: 4 Leader: 2 Replicas: 2,1,3 Isr: 2,1,3
Topic: fgedu-orders Partition: 5 Leader: 3 Replicas: 3,2,1 Isr: 3,2,1
# 查看ISR收缩情况
$ /bigdata/app/kafka/bin/kafka-topics.sh –describe \
–topic fgedu-orders \
–bootstrap-server 192.168.1.51:9092 | grep -v “Isr:.*1,2,3\|Isr:.*2,3,1\|Isr:.*3,1,2”
# 使用JMX监控ISR
# JMX指标:kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
# JMX指标:kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
3.1.2 ISR监控脚本
# isr_check.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
KAFKA_HOME=/bigdata/app/kafka
BOOTSTRAP_SERVER=192.168.1.51:9092
echo “=== Kafka ISR Status Check ===”
echo “Time: $(date)”
echo “”
# 获取所有Topic
TOPICS=$($KAFKA_HOME/bin/kafka-topics.sh –list –bootstrap-server $BOOTSTRAP_SERVER)
for TOPIC in $TOPICS; do
echo “Topic: $TOPIC”
$KAFKA_HOME/bin/kafka-topics.sh –describe \
–topic $TOPIC \
–bootstrap-server $BOOTSTRAP_SERVER | \
awk ‘{if(NR>1) print ” Partition:”, $2, “Leader:”, $4, “ISR:”, $6}’
echo “”
done
# 检查ISR数量小于副本数量的分区
echo “=== ISR Shrink Warning ===”
for TOPIC in $TOPICS; do
$KAFKA_HOME/bin/kafka-topics.sh –describe \
–topic $TOPIC \
–bootstrap-server $BOOTSTRAP_SERVER | \
awk -F'[\\t ]+’ ‘{
if(NR>1) {
split($6, isr, “,”)
split($5, replicas, “,”)
if(length(isr) < length(replicas)) {
print "Warning: " $2 " ISR count < Replica count"
}
}
}'
done
# 执行脚本
$ chmod +x /bigdata/scripts/isr_check.sh
$ /bigdata/scripts/isr_check.sh
=== Kafka ISR Status Check ===
Time: Mon Apr 8 10:30:00 CST 2026
Topic: fgedu-orders
Partition: 0 Leader: 1 ISR: 1,2,3
Partition: 1 Leader: 2 ISR: 2,3,1
Partition: 2 Leader: 3 ISR: 3,1,2
=== ISR Shrink Warning ===
3.2 Leader故障转移实战
3.2.1 模拟Leader故障
$ /bigdata/app/kafka/bin/kafka-topics.sh –describe \
–topic fgedu-orders \
–bootstrap-server 192.168.1.51:9092
Topic: fgedu-orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
# 停止Broker-1模拟Leader故障
$ ssh 192.168.1.51 “systemctl stop kafka”
# 观察Leader变化
$ /bigdata/app/kafka/bin/kafka-topics.sh –describe \
–topic fgedu-orders \
–bootstrap-server 192.168.1.52:9092
Topic: fgedu-orders Partition: 0 Leader: 2 Replicas: 1,2,3 Isr: 2,3
# 可以看到Leader从1变为2,ISR从1,2,3变为2,3
# 恢复Broker-1
$ ssh 192.168.1.51 “systemctl start kafka”
# 观察ISR恢复
$ /bigdata/app/kafka/bin/kafka-topics.sh –describe \
–topic fgedu-orders \
–bootstrap-server 192.168.1.51:9092
Topic: fgedu-orders Partition: 0 Leader: 2 Replicas: 1,2,3 Isr: 2,3,1
3.2.2 手动触发Leader选举
$ /bigdata/app/kafka/bin/kafka-leader-election.sh \
–bootstrap-server 192.168.1.51:9092 \
–topic fgedu-orders \
–partition 0 \
–election-type preferred
Successfully completed leader election (PREFERRED) for partition fgedu-orders-0
# 批量重新选举所有分区
$ /bigdata/app/kafka/bin/kafka-leader-election.sh \
–bootstrap-server 192.168.1.51:9092 \
–all-topic-partitions \
–election-type preferred
# 验证选举结果
$ /bigdata/app/kafka/bin/kafka-topics.sh –describe \
–topic fgedu-orders \
–bootstrap-server 192.168.1.51:9092
Topic: fgedu-orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
3.3 Broker故障处理实战
3.3.1 Broker故障检测
$ /bigdata/app/kafka/bin/kafka-broker-api-versions.sh \
–bootstrap-server 192.168.1.51:9092
192.168.1.51:9092 (id: 1 rack: null) -> (
Produce(0): 0 to 9,
Fetch(1): 0 to 13,
…
)
192.168.1.52:9092 (id: 2 rack: null) -> (
Produce(0): 0 to 9,
Fetch(1): 0 to 13,
…
)
# 查看集群元数据
$ /bigdata/app/kafka/bin/kafka-metadata.sh \
–snapshot /bigdata/kafka-logs/__cluster_metadata-0/00000000000000000000.log \
–command “broker”
Broker 1: alive=true
Broker 2: alive=true
Broker 3: alive=true
# 检查Controller
$ /bigdata/app/kafka/bin/kafka-metadata.sh \
–snapshot /bigdata/kafka-logs/__cluster_metadata-0/00000000000000000000.log \
–command “controller”
Controller: Broker 1
3.3.2 Broker恢复后处理
# 1. Broker启动后会自动加入ISR
# 2. 从Leader同步数据
# 3. 追上进度后成为ISR成员
# 检查副本同步进度
$ /bigdata/app/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
–broker-list 192.168.1.51:9092 \
–topic fgedu-orders
fgedu-orders:0:10000
fgedu-orders:1:8000
fgedu-orders:2:12000
# 查看副本日志末端位置
$ /bigdata/app/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
–broker-list 192.168.1.52:9092 \
–topic fgedu-orders
fgedu-orders:0:10000
fgedu-orders:1:8000
fgedu-orders:2:12000
# 如果副本落后太多,可能需要手动干预
# 1. 检查网络连接
# 2. 检查磁盘IO
# 3. 调整replica.fetch.max.bytes
Part04-生产案例与实战讲解
4.1 ISR收缩问题处理
4.1.1 ISR收缩原因分析
$ /bigdata/app/kafka/bin/kafka-topics.sh –describe \
–topic fgedu-orders \
–bootstrap-server 192.168.1.51:9092
Topic: fgedu-orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2
# ISR收缩常见原因
1. Follower副本同步超时
– replica.lag.time.max.ms超时
– 网络延迟或带宽不足
2. Follower副本故障
– Broker宕机
– 进程异常退出
3. 磁盘IO瓶颈
– 磁盘写入速度慢
– 磁盘空间不足
4. GC停顿
– JVM Full GC时间过长
– 导致副本同步线程阻塞
# 排查步骤
# 1. 检查Broker日志
$ tail -100 /bigdata/app/kafka/logs/server.log | grep -i “isr\|replica”
# 2. 检查网络延迟
$ ping 192.168.1.52
$ traceroute 192.168.1.52
# 3. 检查磁盘IO
$ iostat -x 1 10
# 4. 检查JVM GC
$ jstat -gcutil
4.1.2 ISR恢复处理
# 1. 调整同步超时时间(谨慎)
# 在server.properties中修改
replica.lag.time.max.ms=60000
# 2. 优化磁盘IO
# 使用SSD
# 调整IO调度算法
$ echo deadline > /sys/block/sdb/queue/scheduler
# 3. 优化JVM参数
# 在kafka-server-start.sh中
export KAFKA_HEAP_OPTS=”-Xms8g -Xmx8g”
export KAFKA_JVM_PERFORMANCE_OPTS=”-XX:+UseG1GC -XX:MaxGCPauseMillis=20″
# 4. 检查并恢复故障Broker
$ systemctl status kafka
$ systemctl start kafka
# 5. 验证ISR恢复
$ /bigdata/app/kafka/bin/kafka-topics.sh –describe \
–topic fgedu-orders \
–bootstrap-server 192.168.1.51:9092
Topic: fgedu-orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
4.2 Leader卡死问题处理
$ /bigdata/app/kafka/bin/kafka-topics.sh –describe \
–topic fgedu-orders \
–bootstrap-server 192.168.1.51:9092
Topic: fgedu-orders Partition: 0 Leader: -1 Replicas: 1,2,3 Isr: 2,3
# 排查步骤
# 1. 检查Controller状态
$ /bigdata/app/kafka/bin/kafka-metadata.sh \
–snapshot /bigdata/kafka-logs/__cluster_metadata-0/00000000000000000000.log \
–command “controller”
# 2. 检查ZooKeeper连接
$ echo “stat” | nc localhost 2181
# 3. 检查Broker状态
$ jps | grep Kafka
# 解决方案
# 1. 触发Leader选举
$ /bigdata/app/kafka/bin/kafka-leader-election.sh \
–bootstrap-server 192.168.1.51:9092 \
–topic fgedu-orders \
–partition 0 \
–election-type preferred
# 2. 如果ISR为空,需要评估是否允许非ISR选举
# 注意:可能导致数据丢失
$ /bigdata/app/kafka/bin/kafka-leader-election.sh \
–bootstrap-server 192.168.1.51:9092 \
–topic fgedu-orders \
–partition 0 \
–election-type unclean
4.3 高可用常见问题处理
4.3.1 Controller故障
# 排查步骤
# 1. 检查ZooKeeper中的Controller节点
$ /bigdata/app/kafka/bin/zookeeper-shell.sh localhost:2181 << EOF
get /controller
EOF
# 输出示例
{"version":1,"brokerid":1,"timestamp":"1234567890"}
# 2. 如果Controller节点不存在,会自动选举
# 3. 检查Controller选举日志
$ grep "Controller" /bigdata/app/kafka/logs/server.log
# 4. 手动触发Controller选举
# 重启ZooKeeper或Kafka集群
4.3.2 ZooKeeper故障
# 1. 无法创建新Topic
# 2. 无法进行分区重分配
# 3. Controller选举失败
# 4. 消费者组管理异常
# 检查ZooKeeper集群状态
$ echo “stat” | nc 192.168.1.51 2181
Zookeeper version: 3.8.3
Mode: follower
Node count: 123
# 检查ZooKeeper集群连接
$ /bigdata/app/kafka/bin/zookeeper-shell.sh 192.168.1.51:2181 << EOF
ls /
EOF
# 恢复ZooKeeper
$ systemctl restart zookeeper
Part05-风哥经验总结与分享
5.1 高可用最佳实践
Kafka高可用最佳实践建议:
1. 至少3个Broker节点
2. 副本因子设置为3
3. 跨机架部署
4. 配置机架感知
# 参数配置最佳实践
1. min.insync.replicas=2
2. unclean.leader.election.enable=false
3. replica.lag.time.max.ms=30000
4. 生产者acks=all
# 运维最佳实践
1. 监控ISR状态
2. 监控Leader分布
3. 定期检查副本同步
4. 制定故障恢复预案
5.2 高可用检查清单
高可用检查清单:
- Broker数量是否满足高可用要求
- 副本因子是否设置为3
- ISR数量是否正常
- Leader分布是否均衡
- 机架感知是否配置正确
- ZooKeeper集群是否正常
- Controller是否正常
- 监控告警是否配置
5.3 监控工具推荐
高可用监控工具:
- JMX Exporter:导出ISR、Leader等JMX指标
- Burrow:消费者延迟监控
- Kafka Manager:集群可视化管理
- Prometheus + Grafana:指标监控和告警
- Kafka Eagle:综合监控管理平台
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
