KubeSphere-049: Production Environment Best Practices and O&M Manual

HTML-GF-Middleware Training Document

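1. Basic Concepts

1.1 Production Environment Characteristics

The production environment is the core environment in which an enterprise's business runs. It has the following characteristics:

  • High availability: the system must sustain at least 99.99% availability
  • High performance: the system must handle highly concurrent access
  • High security: the system must protect data and enforce access control
  • Scalability: the system must support horizontal scaling
  • Observability: the system must provide complete monitoring and logging
  • Maintainability: the system must be easy to maintain and upgrade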
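
To make the 99.99% availability target concrete, the permitted downtime per year can be worked out directly. The following is a minimal sketch, assuming a 365-day year; the helper name `downtime_minutes` is illustrative and not part of any tool.

```shell
# Allowed downtime per year, in minutes, for a given availability percentage.
# Assumption: a 365-day year. downtime_minutes is a hypothetical helper name.
downtime_minutes() {
  awk -v a="$1" 'BEGIN { printf "%.2f\n", (100 - a) / 100 * 365 * 24 * 60 }'
}

downtime_minutes 99.99   # four nines
downtime_minutes 99.9    # three nines
```

At 99.99% the yearly budget is under an hour, so planned maintenance such as node drains and upgrades has to be performed without taking the service down.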

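1.2 O&M Best Practices

O&M best practices include:

  • Automated operations: use automation tools to reduce manual work
  • Standardized management: establish standardized O&M processes
  • Monitoring and alerting: build a comprehensive monitoring and alerting system
  • Backup and recovery: establish a robust backup and recovery mechanism
  • Documentation: maintain complete O&M documentation
  • Team collaboration: establish efficient collaboration mechanisms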

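1.3 KubeSphere in Production

A KubeSphere production environment needs to account for:

  • Cluster architecture: design a highly available cluster architecture
  • Resource planning: plan cluster resources appropriately
  • Security configuration: configure security policies and access control
  • Monitoring and alerting: configure monitoring and alerts
  • Backup and recovery: configure backup and restore
  • Upgrades and maintenance: plan the upgrade and maintenance process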

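2. Production Environment Planning

2.1 Cluster Architecture Planning

2.1.1 High-Availability Architecture

# High-availability architecture
# - 3 master nodes
# - 3 etcd nodes
# - 6 worker nodes
# - Load balancer
# - Multi-availability-zone deployment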
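
The choice of three etcd members follows from majority quorum: a cluster of N members needs floor(N/2)+1 of them healthy to accept writes. A minimal sketch of that arithmetic (the function names are illustrative):

```shell
# Quorum size and tolerated member failures for an etcd cluster of N members.
# etcd_quorum / etcd_fault_tolerance are hypothetical helper names.
etcd_quorum() { echo $(( $1 / 2 + 1 )); }
etcd_fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 3 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

With three members the cluster survives the loss of one; an even member count adds no extra tolerance, which is why odd sizes are the usual choice.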

2.1.2 Network Architecture

# Network architecture
# - Pod network: 10.233.0.0/16
# - Service network: 10.96.0.0/12
# - Network plugin: Calico
# - Network policy: enabled
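
The Pod and Service ranges must not overlap each other (or the node subnet). As a rough sanity check, the sketch below computes the size of each CIDR and tests two ranges for overlap using plain shell arithmetic. All function names are illustrative; only the two CIDR values come from the plan above.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the address on dots
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Number of addresses in a CIDR block, e.g. /16 -> 65536.
cidr_size()  { echo $(( 1 << (32 - ${1#*/}) )); }

# First and last address of the block, as integers.
cidr_first() { base=$(ip_to_int "${1%/*}"); size=$(cidr_size "$1"); echo $(( base - base % size )); }
cidr_last()  { echo $(( $(cidr_first "$1") + $(cidr_size "$1") - 1 )); }

# Succeeds (exit 0) when the two blocks share at least one address.
cidrs_overlap() {
  [ "$(cidr_first "$1")" -le "$(cidr_last "$2")" ] &&
  [ "$(cidr_first "$2")" -le "$(cidr_last "$1")" ]
}

echo "Pod addresses:     $(cidr_size 10.233.0.0/16)"
echo "Service addresses: $(cidr_size 10.96.0.0/12)"
if cidrs_overlap 10.233.0.0/16 10.96.0.0/12; then
  echo "Pod and Service CIDRs overlap"
else
  echo "Pod and Service CIDRs are disjoint"
fi
```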

2.2 Resource Planning

2.2.1 Node Resources

# Node resources
# - Master node: 4 CPU, 16GB RAM
# - Worker node: 8 CPU, 32GB RAM
# - Total (3 master + 6 worker nodes): 60 CPU, 240GB RAM
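
The totals can be reproduced from the per-node figures. This sketch assumes the node counts from section 2.1.1 and that etcd is co-located on the master nodes, so it adds no separate machines; the variable names are illustrative.

```shell
# Per-node sizing from the plan; etcd is assumed to run on the master nodes.
masters=3; master_cpu=4; master_ram=16
workers=6; worker_cpu=8; worker_ram=32

total_cpu=$(( masters * master_cpu + workers * worker_cpu ))
total_ram=$(( masters * master_ram + workers * worker_ram ))

echo "total: ${total_cpu} CPU, ${total_ram}GB RAM"
```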

2.2.2 Storage Resources

# Storage resources
# - System storage: 100GB
# - Data storage: 1TB
# - Log storage: 500GB
# - Backup storage: 1TB

2.3 Security Planning

2.3.1 Network Security

# Network security
# - Firewall rules
# - Network isolation
# - Access control
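2.3.2 Access Control

# Access control
# - RBAC permissions
# - Pod security (Pod Security Admission; PodSecurityPolicy was removed in Kubernetes v1.25)
# - Audit logs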
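
As one possible starting point for RBAC, the sketch below writes a namespaced read-only Role and a RoleBinding to a file. The namespace `myapp`, the user `ops-readonly`, and the object names are all assumptions for illustration.

```shell
# Write a read-only Role and RoleBinding for a hypothetical user "ops-readonly"
# in a hypothetical namespace "myapp".
cat > myapp-viewer-rbac.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-viewer
  namespace: myapp
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-viewer-binding
  namespace: myapp
subjects:
- kind: User
  name: ops-readonly
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-viewer
  apiGroup: rbac.authorization.k8s.io
EOF

# Apply once reviewed:
# kubectl apply -f myapp-viewer-rbac.yaml
# Pod security on current Kubernetes versions is set per namespace with a label:
# kubectl label namespace myapp pod-security.kubernetes.io/enforce=baseline
```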

3. Implementation Steps

3.1 Cluster Initialization

3.1.1 Node Preparation

# Node preparation
# Set the hostname (run only the matching command on each node)
hostnamectl set-hostname k8s-master-1   # on 192.168.1.101
hostnamectl set-hostname k8s-master-2   # on 192.168.1.102
hostnamectl set-hostname k8s-master-3   # on 192.168.1.103
hostnamectl set-hostname k8s-worker-1   # on 192.168.1.201
hostnamectl set-hostname k8s-worker-2   # on 192.168.1.202
hostnamectl set-hostname k8s-worker-3   # on 192.168.1.203

# Configure the hosts file (on every node)
cat >> /etc/hosts <<EOF
192.168.1.101 k8s-master-1
192.168.1.102 k8s-master-2
192.168.1.103 k8s-master-3
192.168.1.201 k8s-worker-1
192.168.1.202 k8s-worker-2
192.168.1.203 k8s-worker-3
EOF

# Disable swap now, and comment out swap entries so it stays off after reboot
swapoff -a
sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab

# Configure kernel parameters
# The net.bridge.* keys exist only after the br_netfilter module is loaded
modprobe br_netfilter
cat >> /etc/sysctl.conf <<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sysctl -p
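
Before moving on, it can help to confirm that the preparation actually took effect on each node. The following is a minimal pre-flight sketch; `check_node_prereqs` is a hypothetical helper, and the file paths are parameters only so that the function can be exercised against fixture files.

```shell
# Report whether swap is off and IPv4 forwarding is on.
# Arguments (optional): path to the swaps file, path to the ip_forward file.
check_node_prereqs() {
  swaps_file=${1:-/proc/swaps}
  fwd_file=${2:-/proc/sys/net/ipv4/ip_forward}
  rc=0

  # /proc/swaps has one header line; every further line is an active swap device.
  active=$(awk 'NR > 1 { n++ } END { print n + 0 }' "$swaps_file")
  if [ "$active" -gt 0 ]; then
    echo "FAIL: swap is still enabled ($active device(s))"; rc=1
  else
    echo "OK: swap is disabled"
  fi

  if [ "$(cat "$fwd_file")" = "1" ]; then
    echo "OK: net.ipv4.ip_forward = 1"
  else
    echo "FAIL: net.ipv4.ip_forward is not 1"; rc=1
  fi
  return "$rc"
}
```

Calling `check_node_prereqs` with no arguments on a node reads the live values from /proc and returns a non-zero status if either check fails.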

3.1.2 Install Docker

# Install Docker
yum install -y yum-utils device-mapper-persistent-data lvm2

yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

yum install -y docker-ce docker-ce-cli containerd.io

# Configure Docker (systemd cgroup driver, bounded JSON logs, overlay2 storage)
mkdir -p /etc/docker
cat > /etc/docker/daemon.json <<EOF
{
  "registry-mirrors": ["https://mirror.example.com"],
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF

systemctl enable docker
systemctl start docker

# Verify Docker
docker version
Client: Docker Engine - Community
 Version: 24.0.7
Server: Docker Engine - Community
 Version: 24.0.7

3.2 Install KubeSphere

3.2.1 Install KubeKey

# Install KubeKey
curl -sfL https://get-kk.kubesphere.io | VERSION=v3.0.13 sh -

# Create the configuration file
cat > config.yaml <<EOF
apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: sample
spec:
  # Example credentials only; in production prefer SSH keys (privateKeyPath) over passwords
  hosts:
  - {name: k8s-master-1, address: 192.168.1.101, internalAddress: 192.168.1.101, user: root, password: "root123"}
  - {name: k8s-master-2, address: 192.168.1.102, internalAddress: 192.168.1.102, user: root, password: "root123"}
  - {name: k8s-master-3, address: 192.168.1.103, internalAddress: 192.168.1.103, user: root, password: "root123"}
  - {name: k8s-worker-1, address: 192.168.1.201, internalAddress: 192.168.1.201, user: root, password: "root123"}
  - {name: k8s-worker-2, address: 192.168.1.202, internalAddress: 192.168.1.202, user: root, password: "root123"}
  - {name: k8s-worker-3, address: 192.168.1.203, internalAddress: 192.168.1.203, user: root, password: "root123"}
  roleGroups:
    etcd:
    - k8s-master-1
    - k8s-master-2
    - k8s-master-3
    control-plane:
    - k8s-master-1
    - k8s-master-2
    - k8s-master-3
    worker:
    - k8s-worker-1
    - k8s-worker-2
    - k8s-worker-3
  controlPlaneEndpoint:
    internalLoadbalancer: haproxy
    domain: lb.kubesphere.local
    address: ""
    port: 6443
  kubernetes:
    version: v1.28.0
    clusterName: cluster.local
    autoRenewCerts: true
    containerManager: docker
  etcd:
    type: kubekey
  network:
    plugin: calico
    kubePodsCIDR: 10.233.0.0/16
    kubeServiceCIDR: 10.96.0.0/12
  registry:
    privateRegistry: ""
    namespaceOverride: ""
    registryMirrors: []
    insecureRegistries: []
  addons: []

---
apiVersion: installer.kubesphere.io/v1alpha1
kind: ClusterConfiguration
metadata:
  name: ks-installer
  namespace: kubesphere-system
  labels:
    version: v3.4.1
spec:
  persistence:
    storageClass: ""
  authentication:
    jwtSecret: ""
  local_registry: ""
  namespace_override: ""
  etcd:
    monitoring: false
    endpointIps: 192.168.1.101,192.168.1.102,192.168.1.103
    port: 2379
    tlsEnable: true
  common:
    core:
      console:
        enableMultiLogin: true
        port: 30880
        type: NodePort

  alerting:
    enabled: true
  auditing:
    enabled: true
  devops:
    enabled: true
    jenkinsCpuReq: 0.5
    jenkinsCpuLim: 1
    jenkinsMemoryReq: 4Gi
    jenkinsMemoryLim: 4Gi
    jenkinsVolumeSize: 8Gi
  events:
    enabled: true
  logging:
    enabled: true
    logsidecar:
      enabled: true
      replicas: 2
  metrics_server:
    enabled: false
  monitoring:
    storageClass: ""
    prometheusMemoryRequest: 400Mi
    prometheusVolumeSize: 20Gi
    alertmanagerVolumeSize: 2Gi
  multicluster:
    clusterRole: none
  network:
    networkpolicy:
      enabled: true
    ippool:
      type: calico
    topology:
      type: none
  openpitrix:
    store:
      enabled: true
  servicemesh:
    enabled: true
    istio:
      components:
        ingressGateways:
        - name: istio-ingressgateway
          enabled: false
        cni:
          enabled: false
  edgeruntime:
    enabled: false
  gatekeeper:
    enabled: false
  terminal:
    timeout: 600
EOF

3.2.2 Install KubeSphere

# Install KubeSphere
./kk create cluster -f config.yaml

# Follow the installer log until it reports completion
kubectl logs -n kubesphere-system deployment/ks-installer -f
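
Once the installer finishes, a quick way to spot components that did not come up is to filter the Pod list by status. The sketch below is one way to do that; `unhealthy_pods` is a hypothetical helper that only parses text, so it never touches the cluster itself.

```shell
# Reads `kubectl get pods -A --no-headers` output on stdin and prints every Pod
# whose STATUS column (4th field) is neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Usage on a live cluster:
# kubectl get pods -A --no-headers | unhealthy_pods
```

Empty output means every Pod is either running or has completed; anything listed is a candidate for `kubectl describe pod` and `kubectl logs`.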

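4. Practical Cases

4.1 Cluster Operations

4.1.1 Node Maintenance

# Node maintenance
# Mark the node as unschedulable
kubectl cordon k8s-worker-1
node/k8s-worker-1 cordoned

# Evict the Pods running on the node
kubectl drain k8s-worker-1 --ignore-daemonsets --delete-emptydir-data
node/k8s-worker-1 drained

# Maintain the node
# ... perform maintenance ...

# Resume scheduling on the node
kubectl uncordon k8s-worker-1
node/k8s-worker-1 uncordoned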
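
The cordon, drain, maintain, uncordon sequence can be wrapped so that it runs the same way every time. The following is a sketch under stated assumptions: `run` and `maintain_node` are hypothetical helpers, the 300-second drain timeout is an arbitrary choice, and setting `DRY_RUN=1` prints the commands instead of executing them.

```shell
# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# maintain_node NODE [COMMAND...]
# Cordons and drains NODE, runs the optional maintenance COMMAND, then uncordons.
maintain_node() {
  node=$1; shift
  run kubectl cordon "$node" || return 1
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s || return 1
  # "$@" is the caller-supplied maintenance command (may be empty).
  if [ "$#" -gt 0 ]; then run "$@"; fi
  # Restore scheduling even if the maintenance command failed.
  run kubectl uncordon "$node"
}

# Preview the sequence without touching the cluster:
# DRY_RUN=1 maintain_node k8s-worker-1
```

If the drain itself fails, the function returns early and leaves the node cordoned, so that the operator can investigate before workloads are scheduled back onto it.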

4.1.2 Cluster Upgrade

# Cluster upgrade
# Back up etcd
# (the certificate paths below are the kubeadm defaults; KubeKey-managed etcd
#  typically keeps its certificates under /etc/ssl/etcd/ssl, so adjust as needed)
ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
Snapshot saved at snapshot.db

# Upgrade kubeadm first: `kubeadm upgrade apply v1.28.0` needs a v1.28.0 kubeadm binary
yum install -y kubeadm-1.28.0

# Review the upgrade plan
kubeadm upgrade plan
Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT       AVAILABLE
kubelet     1 x v1.27.0   v1.28.0

# Upgrade the control plane
kubeadm upgrade apply v1.28.0
[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.28.0". Enjoy!

# Upgrade kubelet and kubectl, then restart kubelet
yum install -y kubelet-1.28.0 kubectl-1.28.0
systemctl daemon-reload
systemctl restart kubelet
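
Taking an etcd snapshot before every upgrade quickly accumulates files, so some retention rule is useful. The sketch below keeps only the newest N snapshots in a directory. It assumes a naming convention of `etcd-YYYYmmdd-HHMMSS.db` (so that lexical order equals chronological order); both the convention and the `prune_snapshots` helper are illustrative.

```shell
# prune_snapshots DIR KEEP
# Deletes all but the KEEP newest files matching DIR/etcd-*.db.
prune_snapshots() {
  dir=$1; keep=$2
  ls -1 "$dir"/etcd-*.db 2>/dev/null |
    sort -r |                          # newest first, thanks to the timestamped names
    awk -v k="$keep" 'NR > k' |        # everything after the first KEEP entries
    while IFS= read -r f; do
      rm -f -- "$f"
    done
}

# Example (the backup directory is an assumption):
# etcdctl snapshot save "/var/backups/etcd/etcd-$(date +%Y%m%d-%H%M%S).db" ...
# prune_snapshots /var/backups/etcd 7
```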

4.2 Backup and Recovery

4.2.1 Back Up the Cluster

# Back up the cluster
# Install Velero
# (for a MinIO endpoint, --backup-location-config usually also needs
#  s3ForcePathStyle="true" and an s3Url pointing at the MinIO service)
velero install --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=true \
  --backup-location-config region=minio
Velero is installed!

# Create a backup
velero backup create myapp-backup --include-namespaces myapp
Backup request "myapp-backup" submitted successfully.
Run `velero backup describe myapp-backup` or `velero backup logs myapp-backup` for more details.

# List backups
velero backup get
NAME           STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
myapp-backup   Completed   0        0          2026-01-15 10:00:00 +0000 UTC   29d       default            <none>
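
One-off backups are usually complemented by a schedule. The sketch below only prints a `velero schedule create` command so it can be reviewed before being run; `velero_schedule_cmd` is a hypothetical helper, the `-daily` name suffix and the 02:00 cron expression are assumptions, and the 720h TTL matches Velero's default 30-day retention.

```shell
# velero_schedule_cmd NAMESPACE CRON [TTL]
# Prints (does not run) a velero schedule command for the given namespace.
velero_schedule_cmd() {
  ns=$1; cron=$2; ttl=${3:-720h}
  printf 'velero schedule create %s-daily --schedule="%s" --include-namespaces %s --ttl %s\n' \
    "$ns" "$cron" "$ns" "$ttl"
}

velero_schedule_cmd myapp "0 2 * * *"
```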

4.2.2 Restore the Cluster

# Restore the cluster
# Run the restore
velero restore create --from-backup myapp-backup
Restore request "myapp-backup-20260115102000" submitted successfully.
Run `velero restore describe myapp-backup-20260115102000` or `velero restore logs myapp-backup-20260115102000` for more details.

# Check the restore status
velero restore get
NAME                          BACKUP         STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
myapp-backup-20260115102000   myapp-backup   Completed   2026-01-15 10:20:00 +0000 UTC   2026-01-15 10:20:30 +0000 UTC   0        0          2026-01-15 10:20:00 +0000 UTC   <none>

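5. Lessons Learned

5.1 Best Practices

5.1.1 Cluster Management Best Practices

  • High-availability architecture: deploy a highly available cluster architecture
  • Resource planning: plan cluster resources appropriately
  • Security configuration: configure security policies and access control
  • Monitoring and alerting: configure monitoring and alerts
  • Backup and recovery: configure backup and restore

5.1.2 O&M Management Best Practices

  • Automated operations: use automation tools to reduce manual work
  • Standardized management: establish standardized O&M processes
  • Documentation: maintain complete O&M documentation
  • Team collaboration: establish efficient collaboration mechanisms
  • Continuous improvement: keep refining O&M processes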

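5.2 Common Problems

5.2.1 Cluster Problems

  • Problem 1: a node is NotReady
  • Solution: check kubelet and the container runtime
  • Problem 2: a Pod fails to start
  • Solution: check resource quotas and the image
  • Problem 3: network connectivity fails
  • Solution: check the network plugin and network policies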
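
For the NotReady case, the first step is usually to find out which nodes are affected. A minimal sketch: `not_ready_nodes` is a hypothetical helper that filters the text output of `kubectl get nodes`, and the follow-up commands in the comments are common first checks rather than a complete procedure.

```shell
# Reads `kubectl get nodes --no-headers` output on stdin and prints the names of
# nodes whose STATUS (2nd field) does not start with "Ready".
# Cordoned nodes show "Ready,SchedulingDisabled" and are therefore not reported.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage on a live cluster:
# kubectl get nodes --no-headers | not_ready_nodes
# Then, for each reported node:
# kubectl describe node <node>        # conditions and recent events
# journalctl -u kubelet --since "30 min ago"   # run on the node itself
# systemctl status docker             # container runtime health
```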

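5.2.2 O&M Problems

  • Problem 1: a backup fails
  • Solution: check storage and permissions
  • Problem 2: an upgrade fails
  • Solution: check version compatibility
  • Problem 3: a restore fails
  • Solution: check the integrity of the backup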
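
Backup integrity can be checked cheaply by recording a checksum when the backup file is written and verifying it before a restore. The sketch below does this for a single file such as an etcd snapshot. It assumes GNU `sha256sum` is available; `record_checksum` and `verify_checksum` are hypothetical helper names.

```shell
# Write FILE.sha256 next to FILE.
record_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Succeed only if FILE still matches the recorded checksum.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}

# Example:
# record_checksum snapshot.db      # right after the backup completes
# verify_checksum snapshot.db || echo "snapshot.db is corrupt or was modified"
```

A checksum only detects corruption in transit or at rest; it does not prove the backup is restorable, so periodic test restores are still worthwhile.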

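5.3 Security Recommendations

5.3.1 Network Security

  • Network isolation: use network policies to isolate workloads
  • Firewall rules: configure firewall rules to restrict access
  • Encrypted transport: use TLS to encrypt traffic
  • Certificate management: renew certificates regularly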
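
Renewing certificates regularly presupposes knowing when they expire. The following sketch uses `openssl x509 -checkend` to test whether a certificate is still valid a given number of days from now; `cert_valid_for_days` is a hypothetical helper, and the certificate path in the example is the kubeadm default, which may differ on other installations.

```shell
# cert_valid_for_days CERT_FILE DAYS
# Succeeds if the certificate will still be valid DAYS days from now.
cert_valid_for_days() {
  openssl x509 -in "$1" -noout -checkend $(( $2 * 86400 )) >/dev/null
}

# Example: warn when the API server certificate expires within 30 days.
# cert_valid_for_days /etc/kubernetes/pki/apiserver.crt 30 ||
#   echo "apiserver certificate expires within 30 days"
```

On kubeadm-managed control planes, `kubeadm certs check-expiration` gives a similar overview for all cluster certificates at once.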

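5.3.2 Access Control

  • RBAC permissions: configure RBAC to control permissions
  • Pod security: enforce the Pod Security Standards through Pod Security Admission (PodSecurityPolicy was removed in Kubernetes v1.25)
  • Secret management: use Secrets to manage sensitive information
  • Audit logs: enable audit logging to record operations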

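This article was compiled and published by 风哥教程 and is intended for learning and testing only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html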
