
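Rancher Tutorial FG029: Rancher Enterprise Operations Best Practices and Handbook

This article covers best practices for enterprise-grade Rancher operations, including day-to-day operations workflows, monitoring and alerting, fault handling, security hardening, and performance tuning. This Fengge tutorial (风哥教程) draws on the operations management and enterprise practice chapters of the official Rancher documentation.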

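Outline

Part01-Basic Concepts and Theory

1.1 Enterprise Operations Framework

An enterprise operations framework covers monitoring and alerting, fault handling, change management, capacity planning, security management, and backup and recovery. Its goals are high availability, high performance, strong security, and low cost. Its guiding principles are automation, standardization, visibility, and traceability.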

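1.2 Standardized Operations Process

The standardized operations process covers routine inspections, incident response, change approval, release procedures, and emergency handling. Build an operations knowledge base that records common problems and their solutions, and write an operations handbook that spells out the steps and precautions for each procedure.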

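Part02-Production Environment Planning and Recommendations

2.1 Building the Operations Team

Operations team roles: operations manager, senior operations engineer, operations engineer, and operations intern. Required skills: Kubernetes, Docker, Linux, networking, storage, and monitoring. Training plan: regular training sessions, technical sharing, and hands-on drills.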

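2.2 Choosing Operations Tools

Recommended tooling: monitoring (Prometheus + Grafana), logging (ELK/EFK), CI/CD (Jenkins or GitLab CI), configuration management (Ansible, Terraform), and ticketing (Jira or ZenTao/禅道). For integration, aim for a unified alerting platform and an automated operations platform.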

Part03-Production Implementation Plan

3.1 Daily Inspection Script

Write a daily inspection script for Rancher.

#!/bin/bash
# Daily Rancher inspection: appends the output of each check to $LOG_FILE.
# LOG_FILE must be set by the caller; the script aborts if it is not.
: "${LOG_FILE:?set LOG_FILE to the report file path}"

echo "========================================" > "$LOG_FILE"
echo "Rancher daily inspection report - $(date)" >> "$LOG_FILE"
echo "========================================" >> "$LOG_FILE"
echo "" >> "$LOG_FILE"
echo "1. Check Rancher service status" >> "$LOG_FILE"
systemctl is-active rancher-server >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "2. Check Rancher container status" >> "$LOG_FILE"
docker ps | grep rancher >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "3. Check cluster node status" >> "$LOG_FILE"
kubectl get nodes >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "4. Check system Pod status" >> "$LOG_FILE"
kubectl get pods -n kube-system >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "5. Check Rancher Pod status" >> "$LOG_FILE"
kubectl get pods -n cattle-system >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "6. Check etcd cluster status" >> "$LOG_FILE"
ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.1.100:2379 \
  --cacert=/etc/kubernetes/ssl/kube-ca.pem \
  --cert=/etc/kubernetes/ssl/kube-etcd-192-168-1-100.pem \
  --key=/etc/kubernetes/ssl/kube-etcd-192-168-1-100-key.pem \
  endpoint health >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "7. Check disk usage" >> "$LOG_FILE"
df -h >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "8. Check memory usage" >> "$LOG_FILE"
free -h >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "9. Check CPU load" >> "$LOG_FILE"
uptime >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "10. Check network connections" >> "$LOG_FILE"
netstat -an | grep ESTABLISHED | wc -l >> "$LOG_FILE" 2>&1
echo "" >> "$LOG_FILE"
echo "Inspection complete!" >> "$LOG_FILE"



Example report:

========================================
Rancher daily inspection report - Fri Apr 10 22:00:00 CST 2026
========================================

1. Check Rancher service status
active

2. Check Rancher container status
abc123def456   rancher/rancher:latest   "entrypoint.sh"   30 days ago   Up 30 days   0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp   rancher

3. Check cluster node status
NAME            STATUS   ROLES    AGE   VERSION
fgedu-node-1    Ready    <none>   30d   v1.28.5
fgedu-node-2    Ready    <none>   30d   v1.28.5
fgedu-node-3    Ready    <none>   30d   v1.28.5

4. Check system Pod status
NAME                                    READY   STATUS    RESTARTS   AGE
coredns-abc123def456-ghi78              1/1     Running   0          30d
coredns-abc123def456-jkl90              1/1     Running   0          30d
etcd-fgedu-node-1                       1/1     Running   0          30d
kube-apiserver-fgedu-node-1             1/1     Running   0          30d
kube-controller-manager-fgedu-node-1    1/1     Running   0          30d
kube-proxy-abc123def456                 1/1     Running   0          30d
kube-scheduler-fgedu-node-1             1/1     Running   0          30d

5. Check Rancher Pod status
NAME                                    READY   STATUS    RESTARTS   AGE
cattle-cluster-agent-abc123def456-ghi78   1/1     Running   0          30d
cattle-node-agent-jkl012mno345           1/1     Running   0          30d
cattle-node-agent-pqr456stu789           1/1     Running   0          30d

6. Check etcd cluster status
https://192.168.1.100:2379 is healthy: successfully committed proposal: took = 12.345678ms
https://192.168.1.101:2379 is healthy: successfully committed proposal: took = 11.234567ms
https://192.168.1.102:2379 is healthy: successfully committed proposal: took = 13.456789ms

7. Check disk usage
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       500G  200G  300G  40% /
/dev/sdb1       1.0T  500G  500G  50% /Rancher/fgdata

8. Check memory usage
              total        used        free      shared  buff/cache   available
Mem:           32Gi       16Gi       8Gi       2Gi       8Gi       14Gi
Swap:         16Gi        2Gi       14Gi

9. Check CPU load
 22:00:00 up 30 days,  5:30,  3 users,  load average: 0.50, 0.60, 0.70

10. Check network connections
1234

Inspection complete!
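The report above is raw command output, so someone still has to read it to spot a problem. As a minimal sketch of how the disk check could be turned into a pass/fail signal, the helper below filters `df -P` output for mount points above a threshold. The function name `disk_over_threshold` and the 80% default are assumptions for illustration and are not part of the original script.

```shell
# Hypothetical helper: print mount points whose usage exceeds a threshold.
# Reads `df -P` output on stdin (POSIX format, one filesystem per line:
# Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on).
disk_over_threshold() {
    local threshold="${1:-80}"   # percent; 80 is an assumed default
    awk -v limit="$threshold" '
        NR > 1 {                         # skip the header line
            pct = $5
            sub(/%/, "", pct)            # "85%" -> "85"
            if (pct + 0 > limit + 0) print $6, $5
        }'
}

# Usage on a live host:
#   df -P | disk_over_threshold 80
```

An empty result means every filesystem is under the limit. Mount points containing spaces would be cut at the first space, which is acceptable for a sketch but worth handling in a real inspection job.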


3.2 Monitoring and Alerting Configuration

Configure monitoring and alerting rules for Rancher.

        ... > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Node {{ \$labels.instance }} CPU usage is above 80%"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Node {{ \$labels.instance }} memory usage is above 80%"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Node {{ \$labels.instance }} has less than 20% free space on the root partition"
EOF
prometheusrule.monitoring.coreos.com/rancher-alerts created
alertmanager.monitoring.coreos.com/fgedu-alertmanager created
alertmanagerconfig.monitoring.coreos.com/fgedu-alertconfig created
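Before relying on the HighMemoryUsage rule, it can help to compute the same quantity by hand on one node and compare it with what Prometheus reports. The sketch below evaluates (1 - MemAvailable / MemTotal) * 100 from `/proc/meminfo`-style input, the same kernel source node_exporter reads for these metrics. The function name `mem_used_percent` is an assumption for illustration.

```shell
# Hypothetical helper: compute (1 - MemAvailable / MemTotal) * 100,
# mirroring the HighMemoryUsage expression, from /proc/meminfo-style input.
mem_used_percent() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%.1f\n", (1 - avail / total) * 100
            else exit 1                  # MemTotal missing: report failure
        }'
}

# Usage on a live host:
#   mem_used_percent < /proc/meminfo
```

A value persistently above 80 for five minutes is what would trigger the alert as written.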

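3.3 Security Hardening Configuration

Configure security hardening for Rancher. Note that PodSecurityPolicy was removed in Kubernetes v1.25, so the PodSecurityPolicy in the output below applies only to older clusters; on v1.25 and later, Pod Security Admission fills that role.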

podsecuritypolicy.policy/fgedu-restricted created
networkpolicy.networking.k8s.io/fgedu-default-deny created
networkpolicy.networking.k8s.io/fgedu-allow-ingress created
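The manifests behind this output are not shown. As a rough sketch of what a default-deny policy with the name seen above commonly looks like, the helper below prints a NetworkPolicy that selects every Pod in a namespace and allows no ingress. The function name `render_default_deny`, the namespace argument, and the choice to restrict ingress only are assumptions, not the original configuration.

```shell
# Hypothetical helper: print a default-deny-ingress NetworkPolicy.
# An empty podSelector matches all Pods in the namespace; listing Ingress
# under policyTypes with no ingress rules blocks all inbound traffic.
render_default_deny() {
    local namespace="${1:?usage: render_default_deny <namespace>}"
    cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fgedu-default-deny
  namespace: ${namespace}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
}

# Usage (namespace name is a placeholder):
#   render_default_deny my-namespace | kubectl apply -f -
```

Printing the manifest rather than applying it directly makes it easy to review or diff before it reaches the cluster.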

Part04-Production Cases and Hands-On Walkthroughs

4.1 Fault Handling Workflow

Walk through the Rancher fault-handling workflow.

pod "cattle-cluster-agent-abc123def456-ghi78" deleted
NAME                                    READY   STATUS              RESTARTS   AGE
cattle-cluster-agent-abc123def456-ghi78   0/1     Terminating         0          30d
cattle-cluster-agent-abc123def456-jkl90   0/1     Pending             0          0s
cattle-cluster-agent-abc123def456-jkl90   0/1     Pending             0          0s
cattle-cluster-agent-abc123def456-jkl90   0/1     ContainerCreating   0          0s
cattle-cluster-agent-abc123def456-jkl90   0/1     ContainerCreating   0          0s
cattle-cluster-agent-abc123def456-jkl90   1/1     Running             0          5s
time="2026-04-10T22:30:00Z" level=info msg="Starting cattle-cluster-agent"
time="2026-04-10T22:30:01Z" level=info msg="Connecting to Rancher server"
time="2026-04-10T22:30:02Z" level=info msg="Connected to Rancher server successfully"
time="2026-04-10T22:30:03Z" level=info msg="Starting cluster sync"
time="2026-04-10T22:30:05Z" level=info msg="Cluster sync completed"
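The first step in a flow like this is finding which Pod needs attention. A minimal sketch, assuming the standard column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE): the filter below prints Pods that are not Running or whose ready count is short. The function name `unhealthy_pods` is hypothetical.

```shell
# Hypothetical helper: list Pods that are not fully ready.
# Reads `kubectl get pods --no-headers` output on stdin.
unhealthy_pods() {
    awk '
        {
            split($2, ready, "/")          # "0/1" -> ready[1]=0, ready[2]=1
            if ($3 == "Completed") next    # finished jobs are not failures
            if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
        }'
}

# Usage on a live cluster:
#   kubectl get pods -n cattle-system --no-headers | unhealthy_pods
```

Any Pod it prints is a candidate for `kubectl describe pod` and `kubectl logs` before deciding whether to delete it and let the controller recreate it.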

4.2 Performance Tuning in Practice

Tune Rancher performance.

deployment.apps/rancher patched
Finished defragmenting etcd member[https://192.168.1.100:2379]
pod "fgedu-web-abc123" deleted
pod "fgedu-api-def456" deleted
pod "fgedu-job-ghi789" deleted
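The output shows three Pods being deleted but not how they were chosen. One common cleanup, assumed here rather than taken from the original, is removing Pods left in a terminal state. The sketch below reads `kubectl get pods -A --no-headers` output and only prints the delete commands, so they can be reviewed first. The function name `stale_pod_delete_commands` and the set of statuses are illustrative choices.

```shell
# Hypothetical helper: print (not run) delete commands for Pods in a
# terminal state. Reads `kubectl get pods -A --no-headers` on stdin,
# whose columns are NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE.
stale_pod_delete_commands() {
    awk '$4 == "Evicted" || $4 == "Completed" || $4 == "Error" {
        printf "kubectl delete pod %s -n %s\n", $2, $1
    }'
}

# Usage on a live cluster (review the output before executing it):
#   kubectl get pods -A --no-headers | stale_pod_delete_commands
```

Generating commands instead of deleting directly keeps the step auditable, which matches the traceability principle from Part01.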

4.3 Emergency Response in Practice

Walk through the emergency response workflow.

Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE                         ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health":"true","reason":""}
etcd-1               Healthy   {"health":"true","reason":""}
etcd-2               Healthy   {"health":"true","reason":""}
Name:               fgedu-node-1
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=fgedu-node-1
                    kubernetes.io/os=linux
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"00:11:22:33:44:55"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 10 Mar 2026 00:00:00 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type                 Status  LastHeartbeatTime                 Reason                       Message
  ----                 ------  -----------------                 ------                       -------
  MemoryPressure       False   Fri, 10 Apr 2026 22:30:00 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Fri, 10 Apr 2026 22:30:00 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Fri, 10 Apr 2026 22:30:00 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Fri, 10 Apr 2026 22:30:00 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.1.100
  Hostname:    fgedu-node-1
Capacity:
  cpu:                8
  ephemeral-storage:  500Gi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32768000Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  500Gi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32768000Ki
  pods:               110
LAST SEEN   TYPE      REASON    OBJECT                             MESSAGE
5m          Normal    Pulling   pod/cattle-cluster-agent-abc123   Pulling image "rancher/rancher-agent:v2.8.0"
4m          Normal    Pulled    pod/cattle-cluster-agent-abc123   Successfully pulled image "rancher/rancher-agent:v2.8.0"
3m          Normal    Created   pod/cattle-cluster-agent-abc123   Created container cattle-cluster-agent
2m          Normal    Started   pod/cattle-cluster-agent-abc123   Started container cattle-cluster-agent
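During an incident, the Conditions table in the node description is the quickest health signal: Ready should be True and every other condition False. A minimal sketch that pulls out only the abnormal rows from `kubectl describe node` output follows. The function name `abnormal_node_conditions` is hypothetical.

```shell
# Hypothetical helper: print node conditions that indicate trouble.
# Reads `kubectl describe node <name>` output on stdin.
abnormal_node_conditions() {
    awk '
        /^Conditions:/ { in_table = 1; next }   # table starts after this line
        /^[A-Za-z]/    { in_table = 0 }         # next top-level section ends it
        in_table && NF >= 2 && $1 != "Type" && $1 !~ /^-+$/ {
            if ($1 == "Ready") { if ($2 != "True")  print $1 "=" $2 }
            else               { if ($2 != "False") print $1 "=" $2 }
        }'
}

# Usage on a live cluster:
#   kubectl describe node fgedu-node-1 | abnormal_node_conditions
```

For the healthy node shown above it prints nothing; a line such as `DiskPressure=True` or `Ready=Unknown` points to where to look next.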

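Part05-Fengge's Lessons Learned

5.1 Production Best Practices

1. Build a comprehensive monitoring and alerting system
2. Write detailed operations handbooks and procedures
3. Run backup and restore drills regularly
4. Apply security hardening and access control
5. Configure resource limits and quota management
6. Maintain a knowledge base of common problems
7. Carry out performance tuning and capacity planning regularly
8. Establish an emergency response team and process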
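As one concrete illustration of the resource limits and quota item, the sketch below prints a ResourceQuota that caps the total CPU and memory limits in a namespace. The function name `render_resource_quota`, the object name `compute-quota`, and the example values are assumptions; suitable numbers depend on the cluster's capacity plan.

```shell
# Hypothetical helper: print a ResourceQuota capping total CPU and memory
# limits for one namespace.
render_resource_quota() {
    local namespace="${1:?usage: render_resource_quota <namespace> <cpu> <memory>}"
    local cpu="${2:?missing cpu limit, e.g. 8}"
    local memory="${3:?missing memory limit, e.g. 16Gi}"
    cat <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: ${namespace}
spec:
  hard:
    limits.cpu: "${cpu}"
    limits.memory: ${memory}
EOF
}

# Usage (values are placeholders):
#   render_resource_quota my-namespace 8 16Gi | kubectl apply -f -
```

Once a quota on limits is active, Pods in that namespace must declare CPU and memory limits to be admitted, so it is usually paired with a LimitRange that supplies defaults.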

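5.2 Common Problems and Solutions

1. Pod fails to start: check the image, resource limits, and network policies
2. Node NotReady: check kubelet, networking, and certificates
3. etcd performance degradation: defragment and add resources
4. Network unreachable: check network policies and CNI configuration
5. Storage mount failure: check PV/PVC and StorageClass configuration
6. Expired certificates: renew the certificates and restart services
7. Out of memory: tune resource limits and scale out nodes
8. Missing monitoring data: check Prometheus storage and increase retention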
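A list like the one above can be encoded as a small lookup so that on-call engineers get a consistent first step. The sketch below maps a few common Pod status values to what to check first. The function name `first_checks_for_status` and the specific mapping are illustrative assumptions that extend item 1, not an exhaustive runbook.

```shell
# Hypothetical helper: suggest a first check for a given Pod status.
first_checks_for_status() {
    case "$1" in
        ImagePullBackOff|ErrImagePull)
            echo "image name, tag, and registry credentials" ;;
        OOMKilled)
            echo "memory limits and actual usage" ;;
        Pending)
            echo "resource requests, quotas, and node capacity" ;;
        CrashLoopBackOff)
            echo "container logs and startup configuration" ;;
        *)
            echo "kubectl describe pod and recent events" ;;
    esac
}

# Usage:
#   first_checks_for_status ImagePullBackOff
```

Keeping the mapping in a script under version control is one way to make the knowledge base from section 1.2 executable.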

Compiled and published by Fengge Tutorials (风哥教程) for study and testing only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html
