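This article covers best practices for enterprise-grade Rancher operations, including day-to-day operational workflows, monitoring and alerting, incident handling, security hardening, and performance tuning. This 风哥教程 tutorial draws on the operations management and enterprise practice chapters of the official Rancher documentation.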
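Part01-Basic Concepts and Theory
1.1 Enterprise operations framework
An enterprise operations framework covers monitoring and alerting, incident handling, change management, capacity planning, security management, and backup and recovery. Its goals are high availability, high performance, strong security, and low cost. Its guiding principles are automation, standardization, visibility, and traceability.
1.2 Standardized operations processes
Standardized operations processes include routine inspection, incident response, change approval, release procedures, and emergency handling. Build an operations knowledge base that records common problems and their solutions, and write an operations runbook that spells out the steps and precautions for each procedure.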
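Part02-Production Environment Planning and Recommendations
2.1 Building the operations team
Operations team roles: operations manager, senior operations engineer, operations engineer, and operations intern. Required skills: Kubernetes, Docker, Linux, networking, storage, and monitoring. Training plan: regular training sessions, technical sharing, and hands-on drills.
2.2 Choosing operations tools
Tool selection: monitoring (Prometheus + Grafana), logging (ELK/EFK), CI/CD (Jenkins/GitLab CI), configuration management (Ansible/Terraform), and ticketing (Jira/ZenTao (禅道)). Tool integration: a unified alerting platform and an automated operations platform.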
Part03-Production Implementation Plan
3.1 Daily inspection script
Write a daily inspection script for Rancher.

#!/bin/bash
# Rancher daily inspection script.
# The original log path was lost in extraction; the default below is an assumption.
LOG_FILE="${LOG_FILE:-/tmp/rancher_inspection_$(date +%Y%m%d).log}"

echo "========================================" > "$LOG_FILE"
echo "Rancher daily inspection report - $(date)" >> "$LOG_FILE"
echo "========================================" >> "$LOG_FILE"
echo "" >> "$LOG_FILE"
echo "1. Rancher service status" >> "$LOG_FILE"
systemctl is-active rancher-server >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "2. Rancher container status" >> "$LOG_FILE"
docker ps | grep rancher >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "3. Cluster node status" >> "$LOG_FILE"
kubectl get nodes >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "4. System Pod status" >> "$LOG_FILE"
kubectl get pods -n kube-system >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "5. Rancher Pod status" >> "$LOG_FILE"
kubectl get pods -n cattle-system >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "6. ETCD cluster status" >> "$LOG_FILE"
# --cluster checks every member, which matches the three-endpoint output below.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.1.100:2379 \
  --cacert=/etc/kubernetes/ssl/kube-ca.pem \
  --cert=/etc/kubernetes/ssl/kube-etcd-192-168-1-100.pem \
  --key=/etc/kubernetes/ssl/kube-etcd-192-168-1-100-key.pem \
  endpoint health --cluster >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "7. Disk usage" >> "$LOG_FILE"
df -h >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "8. Memory usage" >> "$LOG_FILE"
free -h >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "9. CPU load" >> "$LOG_FILE"
uptime >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "10. Established network connections" >> "$LOG_FILE"
netstat -an | grep ESTABLISHED | wc -l >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "Inspection complete!" >> "$LOG_FILE"

Sample report:

========================================
Rancher daily inspection report - Fri Apr 10 22:00:00 CST 2026
========================================

1. Rancher service status
active

2. Rancher container status
abc123def456   rancher/rancher:latest   "entrypoint.sh"   30 days ago   Up 30 days   0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp   rancher

3. Cluster node status
NAME           STATUS   ROLES    AGE   VERSION
fgedu-node-1   Ready    <none>   30d   v1.28.5
fgedu-node-2   Ready    <none>   30d   v1.28.5
fgedu-node-3   Ready    <none>   30d   v1.28.5

4. System Pod status
NAME                                   READY   STATUS    RESTARTS   AGE
coredns-abc123def456-ghi78             1/1     Running   0          30d
coredns-abc123def456-jkl90             1/1     Running   0          30d
etcd-fgedu-node-1                      1/1     Running   0          30d
kube-apiserver-fgedu-node-1            1/1     Running   0          30d
kube-controller-manager-fgedu-node-1   1/1     Running   0          30d
kube-proxy-abc123def456                1/1     Running   0          30d
kube-scheduler-fgedu-node-1            1/1     Running   0          30d

5. Rancher Pod status
NAME                                      READY   STATUS    RESTARTS   AGE
cattle-cluster-agent-abc123def456-ghi78   1/1     Running   0          30d
cattle-node-agent-jkl012mno345            1/1     Running   0          30d
cattle-node-agent-pqr456stu789            1/1     Running   0          30d

6. ETCD cluster status
https://192.168.1.100:2379 is healthy: successfully committed proposal: took = 12.345678ms
https://192.168.1.101:2379 is healthy: successfully committed proposal: took = 11.234567ms
https://192.168.1.102:2379 is healthy: successfully committed proposal: took = 13.456789ms

7. Disk usage
Filesystem   Size   Used   Avail   Use%   Mounted on
/dev/sda1    500G   200G   300G    40%    /
/dev/sdb1    1.0T   500G   500G    50%    /Rancher/fgdata

8. Memory usage
        total   used   free   shared   buff/cache   available
Mem:    32Gi    16Gi   8Gi    2Gi      8Gi          14Gi
Swap:   16Gi    2Gi    14Gi

9. CPU load
22:00:00 up 30 days, 5:30, 3 users, load average: 0.50, 0.60, 0.70

10. Established network connections
1234

Inspection complete!
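The inspection script records raw df output and leaves it to the reader to spot a full filesystem. A minimal sketch of how that step could be automated follows; check_disk_usage is a hypothetical helper that is not part of the original script, and the 80% default threshold is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not from the original script): read `df -P` output
# on stdin and print every mount point at or above the given usage percentage.
check_disk_usage() {
    threshold="${1:-80}"              # assumed default threshold
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header row
            use = $5
            sub(/%/, "", use)         # "40%" -> "40"
            if (use + 0 >= limit + 0)
                printf "%s %s%%\n", $6, use
        }'
}

# Usage: df -P | check_disk_usage 80
```

The -P flag asks df for the POSIX single-line format, so each filesystem stays on one row and the column positions are stable. Mount points that contain spaces are not handled by this sketch.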
3.2 Monitoring and alerting configuration
Configure monitoring and alerting rules for Rancher.
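The memory and disk rules in this section's manifest come down to simple percentage arithmetic on node metrics. A minimal sketch of that arithmetic is given here so the thresholds can be checked by hand; mem_used_pct and disk_free_pct are hypothetical helper names, and the inputs can be in any unit as long as both arguments use the same one.

```shell
#!/bin/sh
# Hypothetical helpers that mirror the rule expressions for manual checking.

# (1 - MemAvailable / MemTotal) * 100, as in the HighMemoryUsage rule.
mem_used_pct() {
    awk -v avail="$1" -v total="$2" 'BEGIN {
        if (total <= 0) exit 1        # avoid division by zero
        printf "%.1f\n", (1 - avail / total) * 100
    }'
}

# avail / size * 100, as in the DiskSpaceLow rule.
disk_free_pct() {
    awk -v avail="$1" -v size="$2" 'BEGIN {
        if (size <= 0) exit 1         # avoid division by zero
        printf "%.1f\n", avail / size * 100
    }'
}
```

For example, the root filesystem in the inspection report has 300G available out of 500G, so disk_free_pct 300 500 prints 60.0, well above the 20% line at which DiskSpaceLow fires.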
      ... > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage"
        description: "Node {{ \$labels.instance }} CPU usage is above 80%"
    - alert: HighMemoryUsage
      expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage"
        description: "Node {{ \$labels.instance }} memory usage is above 80%"
    - alert: DiskSpaceLow
      expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Low disk space"
        description: "Node {{ \$labels.instance }} has less than 20% free space on the root filesystem"
EOF

prometheusrule.monitoring.coreos.com/rancher-alerts created
alertmanager.monitoring.coreos.com/fgedu-alertmanager created
alertmanagerconfig.monitoring.coreos.com/fgedu-alertconfig created

3.3 Security hardening configuration
Configure security hardening for Rancher.
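The manifests for this step did not survive in the text, only the names of the created objects. As an illustration of what a default-deny policy such as fgedu-default-deny might look like, here is a minimal sketch; render_default_deny is a hypothetical helper, the namespace argument is an assumption, and denying both ingress and egress is an assumption about the original policy's scope.

```shell
#!/bin/sh
# Hypothetical helper: print a default-deny NetworkPolicy for one namespace.
# The policy name comes from the article's output; everything else is assumed.
render_default_deny() {
    ns="${1:-default}"                # assumed namespace
    cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fgedu-default-deny
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
EOF
}

# Usage (namespace name is illustrative):
#   render_default_deny my-namespace | kubectl apply -f -
```

An empty podSelector selects every pod in the namespace, and listing a policy type without any matching rules blocks all traffic of that type, so further policies are needed to re-open the traffic that should be allowed.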
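podsecuritypolicy.policy/fgedu-restricted created
networkpolicy.networking.k8s.io/fgedu-default-deny created
networkpolicy.networking.k8s.io/fgedu-allow-ingress created

Note that PodSecurityPolicy was removed in Kubernetes v1.25. On clusters at that version or later, the restrictions in fgedu-restricted have to be expressed through Pod Security Admission or a policy engine instead.

Part04-Production Cases and Hands-on Walkthroughs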
4.1 Incident handling workflow
This walkthrough demonstrates the fault-handling flow for Rancher.
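A first triage step in a flow like this is finding out which pods are actually unhealthy before restarting anything. A minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS first); unhealthy_pods is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the name and status of every pod that is not fully ready and Running.
unhealthy_pods() {
    awk 'NR > 1 {                         # skip the header row
        if ($3 == "Completed") next       # finished jobs are not failures
        split($2, ready, "/")             # READY column, e.g. "0/1"
        if ($3 != "Running" || ready[1] != ready[2])
            print $1, $3
    }'
}

# Usage: kubectl get pods -n cattle-system | unhealthy_pods
```

Any pod it prints is a candidate for kubectl describe and kubectl logs before it is deleted and recreated.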
pod "cattle-cluster-agent-abc123def456-ghi78" deleted

NAME                                      READY   STATUS              RESTARTS   AGE
cattle-cluster-agent-abc123def456-ghi78   0/1     Terminating         0          30d
cattle-cluster-agent-abc123def456-jkl90   0/1     Pending             0          0s
cattle-cluster-agent-abc123def456-jkl90   0/1     Pending             0          0s
cattle-cluster-agent-abc123def456-jkl90   0/1     ContainerCreating   0          0s
cattle-cluster-agent-abc123def456-jkl90   0/1     ContainerCreating   0          0s
cattle-cluster-agent-abc123def456-jkl90   1/1     Running             0          5s

time="2026-04-10T22:30:00Z" level=info msg="Starting cattle-cluster-agent"
time="2026-04-10T22:30:01Z" level=info msg="Connecting to Rancher server"
time="2026-04-10T22:30:02Z" level=info msg="Connected to Rancher server successfully"
time="2026-04-10T22:30:03Z" level=info msg="Starting cluster sync"
time="2026-04-10T22:30:05Z" level=info msg="Cluster sync completed"

4.2 Performance tuning in practice
Tune Rancher performance.
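One of the tuning steps in this section is etcd defragmentation. etcd reports both the size of its database file and the size actually in use, and the gap between the two is the space defragmentation can reclaim. The sketch below turns that into a simple decision; etcd_frag_pct and should_defrag are hypothetical helpers, and the 50% default threshold is an assumption rather than a recommendation from the original article.

```shell
#!/bin/sh
# Hypothetical helpers for deciding when to defragment an etcd member.
# Both arguments are byte counts: total database size, then size in use.
etcd_frag_pct() {
    awk -v size="$1" -v used="$2" 'BEGIN {
        if (size <= 0) { print 0; exit }      # nothing to measure
        printf "%d\n", (size - used) * 100 / size
    }'
}

# Succeeds (exit status 0) when the reclaimable share of the database file
# is at or above the threshold percentage (assumed default: 50).
should_defrag() {
    [ "$(etcd_frag_pct "$1" "$2")" -ge "${3:-50}" ]
}

# Usage: should_defrag 1000 400 && echo "defragment this member"
```

Defragmenting one member at a time keeps the rest of the cluster serving requests while that member is briefly blocked.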
deployment.apps/rancher patched

Finished defragmenting etcd member[https://192.168.1.100:2379]

pod "fgedu-web-abc123" deleted
pod "fgedu-api-def456" deleted
pod "fgedu-job-ghi789" deleted

4.3 Emergency response in practice
This walkthrough demonstrates the emergency response flow.
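During an incident the event stream is usually dominated by Normal entries, so it helps to reduce it to the Warning events grouped by reason. A minimal sketch, assuming the default kubectl get events column layout (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); warning_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get events` output on stdin and print
# a count of Warning events per reason, most frequent first.
warning_summary() {
    awk 'NR > 1 && $2 == "Warning" { count[$3]++ }
         END { for (reason in count) print count[reason], reason }' |
        sort -rn
}

# Usage: kubectl get events -A | warning_summary
```

With -A the output gains a leading NAMESPACE column, in which case the field numbers in the awk program need to be shifted by one.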
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE                         ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health":"true","reason":""}
etcd-1               Healthy   {"health":"true","reason":""}
etcd-2               Healthy   {"health":"true","reason":""}

Name:               fgedu-node-1
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=fgedu-node-1
                    kubernetes.io/os=linux
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"00:11:22:33:44:55"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 10 Mar 2026 00:00:00 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                Reason                       Message
  ----             ------  -----------------                ------                       -------
  MemoryPressure   False   Fri, 10 Apr 2026 22:30:00 +0800  KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 10 Apr 2026 22:30:00 +0800  KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 10 Apr 2026 22:30:00 +0800  KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 10 Apr 2026 22:30:00 +0800  KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.1.100
  Hostname:    fgedu-node-1
Capacity:
  cpu:                8
  ephemeral-storage:  500Gi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32768000Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  500Gi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32768000Ki
  pods:               110

LAST SEEN   TYPE     REASON    OBJECT                            MESSAGE
5m          Normal   Pulling   pod/cattle-cluster-agent-abc123   Pulling image "rancher/rancher-agent:v2.8.0"
4m          Normal   Pulled    pod/cattle-cluster-agent-abc123   Successfully pulled image "rancher/rancher-agent:v2.8.0"
3m          Normal   Created   pod/cattle-cluster-agent-abc123   Created container cattle-cluster-agent
2m          Normal   Started   pod/cattle-cluster-agent-abc123   Started container cattle-cluster-agent

Part05-Lessons Learned from 风哥
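5.1 Production best practices
1. Build a comprehensive monitoring and alerting system
2. Write detailed operations runbooks and procedures
3. Run backup and restore drills regularly
4. Apply security hardening and access control
5. Configure resource limits and quota management
6. Maintain a knowledge base of common problems
7. Carry out performance tuning and capacity planning regularly
8. Establish an emergency response team and process
5.2 Common problems and solutions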
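1. Pod fails to start: check the image, resource limits, and network policies
2. Node NotReady: check kubelet, networking, and certificates
3. ETCD performance degradation: run defragmentation and add resources
4. Network unreachable: check network policies and the CNI configuration
5. Storage mount failure: check the PV/PVC and StorageClass configuration
6. Expired certificates: renew the certificates and restart the services
7. Out of memory: tune resource limits and scale out nodes
8. Missing monitoring data: check Prometheus storage and increase the retention period

This article was compiled and published by 风哥教程 for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html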
