1. 自动化概述
系统运维自动化通过脚本和工具减少人工操作,提高效率和可靠性。更多学习教程www.fgedu.net.cn
运维自动化体系:
┌─────────────────────────────────────────────────────┐
│ 自动化平台 │
│ (Ansible/Jenkins/自研) │
└───────────────────────┬─────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
v v v
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ 配置管理 │ │ 部署管理 │ │ 监控管理 │
│ (Ansible) │ │ (Jenkins) │ │ (Prometheus) │
└───────────────┘ └───────────────┘ └───────────────┘
│ │ │
v v v
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ 批量执行 │ │ CI/CD │ │ 告警通知 │
│ 脚本库 │ │ 流水线 │ │ 自动处理 │
└───────────────┘ └───────────────┘ └───────────────┘
# 查看自动化工具
# which ansible
/usr/bin/ansible
# ansible –version
ansible 2.9.27
config file = /etc/ansible/ansible.cfg
configured module search path = [u’/root/.ansible/plugins/modules’, u’/usr/share/ansible/plugins/modules’]
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5
# 查看Jenkins状态
# systemctl status jenkins
● jenkins.service – LSB: Jenkins Automation Server
Loaded: loaded (/etc/rc.d/init.d/jenkins; generated)
Active: active (running) since Fri 2026-04-03 10:00:00 CST; 2h ago
# 查看定时任务
# crontab -l
0 2 * * * /opt/scripts/backup.sh
0 6 * * * /opt/scripts/monitor.sh
*/5 * * * * /opt/scripts/health_check.sh
2. Shell脚本自动化
Shell脚本是运维自动化的基础工具。学习交流加群风哥微信: itpux-com
# cat > /opt/scripts/system_check.sh << 'EOF' #!/bin/bash LOG_FILE="/var/log/system_check_$(date +%Y%m%d).log" echo "==========================================" >> $LOG_FILE
echo “系统巡检报告 – $(date)” >> $LOG_FILE
echo “==========================================” >> $LOG_FILE
# 1. 系统信息
echo “” >> $LOG_FILE
echo “【系统信息】” >> $LOG_FILE
echo “fgedu.net.cn: $(hostname)” >> $LOG_FILE
echo “系统版本: $(cat /etc/redhat-release)” >> $LOG_FILE
echo “内核版本: $(uname -r)” >> $LOG_FILE
echo “运行时间: $(uptime | awk -F’up ‘ ‘{print $2}’)” >> $LOG_FILE
# 2. CPU检查
echo “” >> $LOG_FILE
echo “【CPU状态】” >> $LOG_FILE
echo “CPU核心数: $(nproc)” >> $LOG_FILE
echo “CPU使用率: $(top -bn1 | grep “Cpu(s)” | awk ‘{print $2}’)%” >> $LOG_FILE
echo “负载均衡: $(cat /proc/loadavg | awk ‘{print $1,$2,$3}’)” >> $LOG_FILE
# 3. 内存检查
echo “” >> $LOG_FILE
echo “【内存状态】” >> $LOG_FILE
free -h >> $LOG_FILE
MEM_USED=$(free | grep Mem | awk ‘{printf “%.1f”, $3/$2 * 100.0}’)
echo “内存使用率: ${MEM_USED}%” >> $LOG_FILE
# 4. 磁盘检查
echo “” >> $LOG_FILE
echo “【磁盘状态】” >> $LOG_FILE
df -h >> $LOG_FILE
DISK_USAGE=$(df -h / | tail -1 | awk ‘{print $5}’ | tr -d ‘%’)
if [ $DISK_USAGE -gt 80 ]; then
echo “警告: 根分区使用率超过80%” >> $LOG_FILE
fi
# 5. 网络检查
echo “” >> $LOG_FILE
echo “【网络状态】” >> $LOG_FILE
echo “网络连接数: $(netstat -an | wc -l)” >> $LOG_FILE
echo “ESTABLISHED连接: $(netstat -an | grep ESTABLISHED | wc -l)” >> $LOG_FILE
# 6. 进程检查
echo “” >> $LOG_FILE
echo “【关键进程】” >> $LOG_FILE
for proc in nginx java mysql; do
if pgrep -x “$proc” > /dev/null; then
echo “$proc: 运行中” >> $LOG_FILE
else
echo “$proc: 未运行” >> $LOG_FILE
fi
done
# 7. 端口检查
echo “” >> $LOG_FILE
echo “【端口状态】” >> $LOG_FILE
for port in 22 80 443 3306; do
if netstat -tuln | grep -q “:$port “; then
echo “端口 $port: 监听中” >> $LOG_FILE
else
echo “端口 $port: 未监听” >> $LOG_FILE
fi
done
echo “” >> $LOG_FILE
echo “==========================================” >> $LOG_FILE
echo “系统巡检完成,报告已保存到 $LOG_FILE”
EOF
# chmod +x /opt/scripts/system_check.sh
# /opt/scripts/system_check.sh
系统巡检完成,报告已保存到 /var/log/system_check_20260403.log
# 查看巡检报告
# cat /var/log/system_check_20260403.log
==========================================
系统巡检报告 – Fri Apr 3 10:00:00 CST 2026
==========================================
【系统信息】
fgedu.net.cn: fgedu-app01
系统版本: Red Hat Enterprise Linux Server release 7.9
内核版本: 3.10.0-1160.el7.x86_64
运行时间: 30 days, 4:05,
【CPU状态】
CPU核心数: 8
CPU使用率: 15.3%
负载均衡: 0.52 0.58 0.59
【内存状态】
total used free shared buff/cache available
Mem: 31G 2.1G 28G 128M 1.2G 28G
Swap: 8.0G 0B 8.0G
内存使用率: 6.8%
【磁盘状态】
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 15G 36G 30% /
/dev/sda2 200G 80G 120G 40% /data
tmpfs 16G 0 16G 0% /dev/shm
【网络状态】
网络连接数: 256
ESTABLISHED连接: 45
【关键进程】
nginx: 运行中
java: 运行中
mysql: 运行中
【端口状态】
端口 22: 监听中
端口 80: 监听中
端口 443: 监听中
端口 3306: 监听中
==========================================
3. Ansible自动化
Ansible是强大的配置管理和自动化工具。学习交流加群风哥QQ113257174
# cat > /etc/ansible/hosts << 'EOF' [web_servers] fgedu-web01 ansible_host=192.168.1.10 fgedu-web02 ansible_host=192.168.1.11 fgedu-web03 ansible_host=192.168.1.12 [app_servers] fgedu-app01 ansible_host=192.168.1.20 fgedu-app02 ansible_host=192.168.1.21 [db_servers] fgedu-db01 ansible_host=192.168.1.30 fgedu-db02 ansible_host=192.168.1.31 [all:vars] ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa EOF # 测试连接 # ansible all -m ping fgedu-web01 | SUCCESS => {
“changed”: false,
“ping”: “pong”
}
fgedu-web02 | SUCCESS => {
“changed”: false,
“ping”: “pong”
}
fgedu-app01 | SUCCESS => {
“changed”: false,
“ping”: “pong”
}
# 批量执行命令
# ansible web_servers -a “uptime”
fgedu-web01 | CHANGED | rc=0 >>
10:00:00 up 30 days, 4:05, 2 users, load average: 0.52, 0.58, 0.59
fgedu-web02 | CHANGED | rc=0 >>
10:00:00 up 30 days, 4:05, 2 users, load average: 0.48, 0.55, 0.57
# 使用Playbook
# cat > /opt/ansible/playbooks/system_init.yml << 'EOF'
---
- name: 系统初始化配置
hosts: all
become: yes
tasks:
- name: 更新系统包
yum:
name: '*'
state: latest
update_cache: yes
- name: 安装常用工具
yum:
name:
- vim
- wget
- curl
- net-tools
- htop
- iotop
state: present
- name: 配置时间同步
yum:
name: chrony
state: present
- name: 启动chrony服务
service:
name: chronyd
state: started
enabled: yes
- name: 配置系统参数
sysctl:
name: "{{ item.name }}"
value: "{{ item.value }}"
state: present
with_items:
- { name: 'net.core.somaxconn', value: '65535' }
- { name: 'vm.swappiness', value: '10' }
- { name: 'fs.file-max', value: '655350' }
- name: 创建运维用户
user:
name: ops
groups: wheel
shell: /bin/bash
state: present
- name: 配置SSH
lineinfile:
path: /etc/ssh/sshd_config
regexp: "{{ item.regexp }}"
line: "{{ item.line }}"
with_items:
- { regexp: '^#?PermitRootLogin', line: 'PermitRootLogin no' }
- { regexp: '^#?PasswordAuthentication', line: 'PasswordAuthentication no' }
notify: restart sshd
handlers:
- name: restart sshd
service:
name: sshd
state: restarted
EOF
# 执行Playbook
# ansible-playbook /opt/ansible/playbooks/system_init.yml
PLAY [系统初始化配置] ***********************************************************
TASK [Gathering Facts] *********************************************************
ok: [fgedu-web01]
ok: [fgedu-web02]
TASK [更新系统包] ***************************************************************
changed: [fgedu-web01]
changed: [fgedu-web02]
TASK [安装常用工具] *************************************************************
changed: [fgedu-web01]
changed: [fgedu-web02]
TASK [配置时间同步] *************************************************************
ok: [fgedu-web01]
ok: [fgedu-web02]
TASK [启动chrony服务] **********************************************************
ok: [fgedu-web01]
ok: [fgedu-web02]
PLAY RECAP *********************************************************************
fgedu-web01 : ok=6 changed=2 unreachable=0 failed=0
fgedu-web02 : ok=6 changed=2 unreachable=0 failed=0
4. 定时任务调度
定时任务实现周期性自动化操作。更多学习教程公众号风哥教程itpux_com
# crontab -e
# 系统维护任务
0 2 * * * /opt/scripts/backup.sh >> /var/log/backup.log 2>&1
0 3 * * * /opt/scripts/log_clean.sh >> /var/log/log_clean.log 2>&1
0 4 * * 0 /opt/scripts/system_update.sh >> /var/log/update.log 2>&1
# 监控任务
*/5 * * * * /opt/scripts/health_check.sh >> /var/log/health.log 2>&1
*/10 * * * * /opt/scripts/service_monitor.sh >> /var/log/service.log 2>&1
0 */1 * * * /opt/scripts/resource_monitor.sh >> /var/log/resource.log 2>&1
# 报告任务
0 8 * * * /opt/scripts/daily_report.sh | mail -s “Daily Report” admin@fgedu.net.cn
0 9 * * 1 /opt/scripts/weekly_report.sh | mail -s “Weekly Report” admin@fgedu.net.cn
# 查看定时任务
# crontab -l
0 2 * * * /opt/scripts/backup.sh >> /var/log/backup.log 2>&1
0 3 * * * /opt/scripts/log_clean.sh >> /var/log/log_clean.log 2>&1
*/5 * * * * /opt/scripts/health_check.sh >> /var/log/health.log 2>&1
# 日志清理脚本
# cat > /opt/scripts/log_clean.sh << 'EOF'
#!/bin/bash
LOG_DIR="/var/log"
RETENTION_DAYS=30
echo "开始清理日志文件..."
echo "保留天数: $RETENTION_DAYS"
# 清理旧日志文件
find $LOG_DIR -name "*.log" -mtime +$RETENTION_DAYS -exec rm -f {} \;
find $LOG_DIR -name "*.log.*" -mtime +$RETENTION_DAYS -exec rm -f {} \;
# 清理journal日志
journalctl --vacuum-time=${RETENTION_DAYS}d
# 统计清理结果
CLEANED=$(find $LOG_DIR -name "*.log" -mtime +$RETENTION_DAYS 2>/dev/null | wc -l)
echo “清理完成,共删除 $CLEANED 个文件”
# 查看磁盘空间
df -h $LOG_DIR
EOF
# chmod +x /opt/scripts/log_clean.sh
# 系统更新脚本
# cat > /opt/scripts/system_update.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/system_update_$(date +%Y%m%d).log"
echo "==========================================" >> $LOG_FILE
echo “系统更新开始: $(date)” >> $LOG_FILE
echo “==========================================” >> $LOG_FILE
# 检查可用更新
echo “检查可用更新…” >> $LOG_FILE
yum check-update >> $LOG_FILE 2>&1
# 执行更新
echo “执行系统更新…” >> $LOG_FILE
yum update -y >> $LOG_FILE 2>&1
# 清理缓存
echo “清理yum缓存…” >> $LOG_FILE
yum clean all >> $LOG_FILE 2>&1
# 检查是否需要重启
if [ -f /var/run/reboot-required ]; then
echo “系统需要重启” >> $LOG_FILE
echo “系统需要重启” | mail -s “系统更新通知” admin@fgedu.net.cn
fi
echo “==========================================” >> $LOG_FILE
echo “系统更新完成: $(date)” >> $LOG_FILE
echo “==========================================” >> $LOG_FILE
EOF
# chmod +x /opt/scripts/system_update.sh
5. 批量操作
批量操作提高多服务器管理效率。author:www.itpux.com
# cat > /opt/scripts/batch_exec.sh << 'EOF' #!/bin/bash HOSTS_FILE="/etc/ansible/hosts" COMMAND="$1" LOG_FILE="/var/log/batch_exec_$(date +%Y%m%d_%H%M%S).log" if [ -z "$COMMAND" ]; then echo "Usage: $0
exit 1
fi
echo “批量执行命令: $COMMAND”
echo “目标主机:”
grep -E “ansible_host=” $HOSTS_FILE | awk -F= ‘{print $2}’
echo “”
for host in $(grep -E “ansible_host=” $HOSTS_FILE | awk -F= ‘{print $2}’); do
echo “========================================” >> $LOG_FILE
echo “主机: $host” >> $LOG_FILE
echo “命令: $COMMAND” >> $LOG_FILE
echo “时间: $(date)” >> $LOG_FILE
echo “—————————————-” >> $LOG_FILE
ssh -o StrictHostKeyChecking=no $host “$COMMAND” >> $LOG_FILE 2>&1
echo “” >> $LOG_FILE
done
echo “执行完成,日志: $LOG_FILE”
EOF
# chmod +x /opt/scripts/batch_exec.sh
# 批量文件分发
# cat > /opt/scripts/batch_copy.sh << 'EOF'
#!/bin/bash
HOSTS_FILE="/etc/ansible/hosts"
SOURCE_FILE="$1"
DEST_DIR="$2"
if [ -z "$SOURCE_FILE" ] || [ -z "$DEST_DIR" ]; then
echo "Usage: $0
exit 1
fi
echo “批量分发文件: $SOURCE_FILE”
echo “目标目录: $DEST_DIR”
echo “”
for host in $(grep -E “ansible_host=” $HOSTS_FILE | awk -F= ‘{print $2}’); do
echo “分发到: $host”
scp -o StrictHostKeyChecking=no $SOURCE_FILE $host:$DEST_DIR/
done
echo “分发完成”
EOF
# chmod +x /opt/scripts/batch_copy.sh
# 批量服务管理
# cat > /opt/scripts/batch_service.sh << 'EOF'
#!/bin/bash
HOSTS_FILE="/etc/ansible/hosts"
SERVICE="$1"
ACTION="$2"
if [ -z "$SERVICE" ] || [ -z "$ACTION" ]; then
echo "Usage: $0
exit 1
fi
echo “批量管理服务: $SERVICE”
echo “操作: $ACTION”
echo “”
for host in $(grep -E “ansible_host=” $HOSTS_FILE | awk -F= ‘{print $2}’); do
echo “主机: $host”
ssh -o StrictHostKeyChecking=no $host “systemctl $ACTION $SERVICE”
echo “”
done
echo “操作完成”
EOF
# chmod +x /opt/scripts/batch_service.sh
# 执行示例
# /opt/scripts/batch_exec.sh “uptime”
批量执行命令: uptime
目标主机:
192.168.1.10
192.168.1.11
192.168.1.12
========================================
主机: 192.168.1.10
命令: uptime
时间: Fri Apr 3 10:00:00 CST 2026
—————————————-
10:00:00 up 30 days, 4:05, 2 users, load average: 0.52, 0.58, 0.59
========================================
主机: 192.168.1.11
命令: uptime
时间: Fri Apr 3 10:00:00 CST 2026
—————————————-
10:00:00 up 30 days, 4:05, 2 users, load average: 0.48, 0.55, 0.57
6. 自动化部署
自动化部署实现应用的快速发布。
# cat > /opt/scripts/deploy_app.sh << 'EOF' #!/bin/bash APP_NAME="fgedu-webapp" APP_VERSION="$1" DEPLOY_DIR="/opt/app" BACKUP_DIR="/opt/backup" LOG_FILE="/var/log/deploy_${APP_NAME}_$(date +%Y%m%d_%H%M%S).log" if [ -z "$APP_VERSION" ]; then echo "Usage: $0
exit 1
fi
log() {
echo “[$(date ‘+%Y-%m-%d %H:%M:%S’)] $1” | tee -a $LOG_FILE
}
log “==========================================”
log “开始部署应用: $APP_NAME”
log “版本: $APP_VERSION”
log “==========================================”
# 1. 备份当前版本
log “1. 备份当前版本…”
if [ -d “$DEPLOY_DIR/$APP_NAME” ]; then
BACKUP_NAME=”${APP_NAME}_backup_$(date +%Y%m%d_%H%M%S)”
cp -r $DEPLOY_DIR/$APP_NAME $BACKUP_DIR/$BACKUP_NAME
log “备份完成: $BACKUP_DIR/$BACKUP_NAME”
fi
# 2. 下载新版本
log “2. 下载新版本…”
wget -q http://nexus.fgedu.net.cn/releases/${APP_NAME}-${APP_VERSION}.tar.gz -O /tmp/${APP_NAME}.tar.gz
if [ $? -ne 0 ]; then
log “错误: 下载失败”
exit 1
fi
log “下载完成”
# 3. 停止服务
log “3. 停止服务…”
systemctl stop ${APP_NAME}
sleep 5
log “服务已停止”
# 4. 部署新版本
log “4. 部署新版本…”
rm -rf $DEPLOY_DIR/$APP_NAME
mkdir -p $DEPLOY_DIR/$APP_NAME
tar -xzf /tmp/${APP_NAME}.tar.gz -C $DEPLOY_DIR/$APP_NAME
log “部署完成”
# 5. 更新配置
log “5. 更新配置…”
cp /opt/config/${APP_NAME}/* $DEPLOY_DIR/$APP_NAME/config/
log “配置更新完成”
# 6. 设置权限
log “6. 设置权限…”
chown -R app:app $DEPLOY_DIR/$APP_NAME
chmod -R 755 $DEPLOY_DIR/$APP_NAME/bin
log “权限设置完成”
# 7. 启动服务
log “7. 启动服务…”
systemctl start ${APP_NAME}
sleep 10
# 8. 健康检查
log “8. 健康检查…”
for i in {1..30}; do
if curl -s http://fgedudb:8080/health | grep -q “OK”; then
log “健康检查通过”
break
fi
if [ $i -eq 30 ]; then
log “错误: 健康检查失败”
exit 1
fi
sleep 2
done
# 9. 清理临时文件
log “9. 清理临时文件…”
rm -f /tmp/${APP_NAME}.tar.gz
log “清理完成”
log “==========================================”
log “部署成功完成”
log “==========================================”
# 发送通知
echo “应用 $APP_NAME 版本 $APP_VERSION 部署成功” | mail -s “部署通知” admin@fgedu.net.cn
EOF
# chmod +x /opt/scripts/deploy_app.sh
# Jenkins Pipeline配置
# cat > /opt/jenkins/pipelines/fgedu-webapp/Jenkinsfile << 'EOF'
pipeline {
agent any
environment {
APP_NAME = 'fgedu-webapp'
DEPLOY_SERVERS = 'fgedu-web01,fgedu-web02,fgedu-web03'
}
stages {
stage('Checkout') {
steps {
git branch: 'main', url: 'http://git.fgedu.net.cn/fgedu-webapp.git'
}
}
stage('Build') {
steps {
sh 'mvn clean package -DskipTests'
}
}
stage('Test') {
steps {
sh 'mvn test'
}
}
stage('Package') {
steps {
sh 'tar -czf ${APP_NAME}-${BUILD_NUMBER}.tar.gz target/*.jar config/'
archiveArtifacts artifacts: '*.tar.gz', fingerprint: true
}
}
stage('Deploy') {
steps {
script {
def servers = env.DEPLOY_SERVERS.split(',')
for (server in servers) {
sh "scp ${APP_NAME}-${BUILD_NUMBER}.tar.gz ${server}:/tmp/"
sh "ssh ${server} '/opt/scripts/deploy_app.sh ${BUILD_NUMBER}'"
}
}
}
}
stage('Verify') {
steps {
script {
def servers = env.DEPLOY_SERVERS.split(',')
for (server in servers) {
sh "curl -s http://${server}:8080/health | grep 'OK'"
}
}
}
}
}
post {
success {
mail to: 'admin@fgedu.net.cn',
subject: "部署成功: ${APP_NAME}",
body: "应用 ${APP_NAME} 部署成功,版本: ${BUILD_NUMBER}"
}
failure {
mail to: 'admin@fgedu.net.cn',
subject: "部署失败: ${APP_NAME}",
body: "应用 ${APP_NAME} 部署失败,请检查日志"
}
}
}
EOF
7. 自动化监控
自动化监控持续跟踪系统状态。
# cat > /opt/scripts/service_monitor.sh << 'EOF' #!/bin/bash SERVICES="nginx java mysql redis" LOG_FILE="/var/log/service_monitor.log" ALERT_EMAIL="admin@fgedu.net.cn" for service in $SERVICES; do if ! systemctl is-active --quiet $service; then echo "[$(date)] 服务 $service 已停止,尝试重启..." >> $LOG_FILE
# 尝试重启服务
systemctl restart $service
# 等待服务启动
sleep 5
# 检查服务状态
if systemctl is-active –quiet $service; then
echo “[$(date)] 服务 $service 重启成功” >> $LOG_FILE
else
echo “[$(date)] 服务 $service 重启失败,发送告警” >> $LOG_FILE
echo “服务 $service 重启失败,请立即处理” | mail -s “服务告警” $ALERT_EMAIL
fi
fi
done
EOF
# chmod +x /opt/scripts/service_monitor.sh
# 资源监控脚本
# cat > /opt/scripts/resource_monitor.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/resource_monitor.log"
ALERT_EMAIL="admin@fgedu.net.cn"
# CPU告警阈值
CPU_THRESHOLD=80
# 内存告警阈值
MEM_THRESHOLD=85
# 磁盘告警阈值
DISK_THRESHOLD=80
# 检查CPU使用率
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print int($2)}')
if [ $CPU_USAGE -gt $CPU_THRESHOLD ]; then
echo "[$(date)] CPU使用率过高: ${CPU_USAGE}%" >> $LOG_FILE
echo “CPU使用率过高: ${CPU_USAGE}%” | mail -s “CPU告警” $ALERT_EMAIL
fi
# 检查内存使用率
MEM_USAGE=$(free | grep Mem | awk ‘{printf “%.0f”, $3/$2 * 100}’)
if [ ${MEM_USAGE%.*} -gt $MEM_THRESHOLD ]; then
echo “[$(date)] 内存使用率过高: ${MEM_USAGE}%” >> $LOG_FILE
echo “内存使用率过高: ${MEM_USAGE}%” | mail -s “内存告警” $ALERT_EMAIL
fi
# 检查磁盘使用率
for mount in / /data; do
DISK_USAGE=$(df -h $mount | tail -1 | awk ‘{print $5}’ | tr -d ‘%’)
if [ $DISK_USAGE -gt $DISK_THRESHOLD ]; then
echo “[$(date)] 磁盘 $mount 使用率过高: ${DISK_USAGE}%” >> $LOG_FILE
echo “磁盘 $mount 使用率过高: ${DISK_USAGE}%” | mail -s “磁盘告警” $ALERT_EMAIL
fi
done
EOF
# chmod +x /opt/scripts/resource_monitor.sh
# 日志监控脚本
# cat > /opt/scripts/log_monitor.sh << 'EOF'
#!/bin/bash
LOG_FILES="/var/log/nginx/error.log /var/log/application/error.log"
KEYWORDS="ERROR FATAL Exception OutOfMemory"
LOG_FILE="/var/log/log_monitor.log"
ALERT_EMAIL="admin@fgedu.net.cn"
for log_file in $LOG_FILES; do
if [ -f "$log_file" ]; then
for keyword in $KEYWORDS; do
COUNT=$(grep -c "$keyword" $log_file 2>/dev/null || echo 0)
if [ $COUNT -gt 10 ]; then
echo “[$(date)] $log_file 中发现 $COUNT 个 $keyword” >> $LOG_FILE
tail -100 $log_file | grep “$keyword” | mail -s “日志告警: $keyword” $ALERT_EMAIL
fi
done
fi
done
EOF
# chmod +x /opt/scripts/log_monitor.sh
8. 自动化备份
自动化备份确保数据安全。
# cat > /opt/scripts/backup.sh << 'EOF' #!/bin/bash BACKUP_DIR="/backup" DATE=$(date +%Y%m%d) LOG_FILE="/var/log/backup_${DATE}.log" RETENTION_DAYS=30 log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a $LOG_FILE } log "==========================================" log "开始系统备份" log "==========================================" # 1. 备份配置文件 log "1. 备份配置文件..." tar -czf $BACKUP_DIR/config_${DATE}.tar.gz \ /etc/nginx \ /etc/systemd/system \ /opt/app/config \ 2>/dev/null
log “配置文件备份完成”
# 2. 备份应用数据
log “2. 备份应用数据…”
tar -czf $BACKUP_DIR/app_data_${DATE}.tar.gz \
/opt/app/data \
/opt/app/logs \
2>/dev/null
log “应用数据备份完成”
# 3. 备份数据库
log “3. 备份数据库…”
mysqldump -u backup -p’Fgedu@Backup123′ –all-databases > $BACKUP_DIR/mysql_${DATE}.sql
gzip $BACKUP_DIR/mysql_${DATE}.sql
log “数据库备份完成”
# 4. 备份到远程存储
log “4. 同步到远程存储…”
rsync -avz –delete $BACKUP_DIR/ backup-server.fgedu.net.cn:/backup/fgedu-app01/
log “远程同步完成”
# 5. 清理旧备份
log “5. 清理旧备份…”
find $BACKUP_DIR -name “*.tar.gz” -mtime +$RETENTION_DAYS -delete
find $BACKUP_DIR -name “*.sql.gz” -mtime +$RETENTION_DAYS -delete
log “旧备份清理完成”
# 6. 备份报告
log “6. 生成备份报告…”
BACKUP_SIZE=$(du -sh $BACKUP_DIR | awk ‘{print $1}’)
log “备份总大小: $BACKUP_SIZE”
log “==========================================”
log “备份完成”
log “==========================================”
# 发送通知
echo “系统备份完成,大小: $BACKUP_SIZE” | mail -s “备份通知” admin@fgedu.net.cn
EOF
# chmod +x /opt/scripts/backup.sh
# 数据库备份脚本
# cat > /opt/scripts/db_backup.sh << 'EOF'
#!/bin/bash
DB_HOST="192.168.1.30"
DB_USER="backup"
DB_PASS="Fgedu@Backup123"
BACKUP_DIR="/backup/mysql"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=7
mkdir -p $BACKUP_DIR
echo "开始数据库备份..."
# 获取所有数据库
DATABASES=$(mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -e "SHOW DATABASES;" | grep -Ev "Database|information_schema|performance_schema")
for db in $DATABASES; do
echo "备份数据库: $db"
mysqldump -h $DB_HOST -u $DB_USER -p$DB_PASS \
--single-transaction \
--routines \
--triggers \
--events \
$db | gzip > $BACKUP_DIR/${db}_${DATE}.sql.gz
done
echo “清理旧备份…”
find $BACKUP_DIR -name “*.sql.gz” -mtime +$RETENTION_DAYS -delete
echo “数据库备份完成”
EOF
# chmod +x /opt/scripts/db_backup.sh
9. 自动化告警
自动化告警及时通知运维人员。
# cat > /opt/scripts/alert_notify.sh << 'EOF' #!/bin/bash ALERT_TYPE="$1" ALERT_MESSAGE="$2" ALERT_LEVEL="$3" EMAIL_RECIPIENTS="admin@fgedu.net.cn ops@fgedu.net.cn" WEBHOOK_URL="https://hooks.slack.com/services/xxx" send_email() { local subject="[$ALERT_LEVEL] $ALERT_TYPE 告警" local body=" 告警类型: $ALERT_TYPE 告警级别: $ALERT_LEVEL 告警时间: $(date) 告警内容: $ALERT_MESSAGE fgedu.net.cn: $(hostname) IP地址: $(hostname -I | awk '{print $1}') " echo "$body" | mail -s "$subject" $EMAIL_RECIPIENTS } send_webhook() { local payload='{ "text": "['"$ALERT_LEVEL"'] '"$ALERT_TYPE"' 告警", "attachments": [{ "color": "danger", "fields": [ {"title": "告警内容", "value": "'"$ALERT_MESSAGE"'", "short": false}, {"title": "主机", "value": "'"$(hostname)"'", "short": true}, {"title": "时间", "value": "'"$(date)"'", "short": true} ] }] }' curl -s -X POST -H 'Content-type: application/json' --data "$payload" $WEBHOOK_URL } # 发送告警 send_email send_webhook echo "告警已发送: $ALERT_TYPE - $ALERT_MESSAGE" EOF # chmod +x /opt/scripts/alert_notify.sh # 告警聚合脚本 # cat > /opt/scripts/alert_aggregate.sh << 'EOF' #!/bin/bash ALERT_LOG="/var/log/alerts.log" ALERT_CACHE="/tmp/alert_cache" COOLDOWN=300 process_alert() { local alert_key="$1" local alert_message="$2" local now=$(date +%s) local last_alert=$(cat $ALERT_CACHE/$alert_key 2>/dev/null || echo 0)
local diff=$((now – last_alert))
if [ $diff -gt $COOLDOWN ]; then
echo $now > $ALERT_CACHE/$alert_key
/opt/scripts/alert_notify.sh “System” “$alert_message” “WARNING”
echo “[$(date)] 发送告警: $alert_message” >> $ALERT_LOG
else
echo “[$(date)] 告警冷却中,跳过: $alert_message” >> $ALERT_LOG
fi
}
mkdir -p $ALERT_CACHE
# 检查系统指标并触发告警
CPU_USAGE=$(top -bn1 | grep “Cpu(s)” | awk ‘{print int($2)}’)
if [ $CPU_USAGE -gt 80 ]; then
process_alert “cpu_high” “CPU使用率过高: ${CPU_USAGE}%”
fi
MEM_USAGE=$(free | grep Mem | awk ‘{printf “%.0f”, $3/$2 * 100}’)
if [ ${MEM_USAGE%.*} -gt 85 ]; then
process_alert “mem_high” “内存使用率过高: ${MEM_USAGE}%”
fi
EOF
# chmod +x /opt/scripts/alert_aggregate.sh
10. 最佳实践
运维自动化最佳实践确保系统稳定可靠。
# cat > /opt/docs/automation_best_practices.md << 'EOF' # 运维自动化最佳实践 ## 1. 脚本管理 - 所有脚本使用版本控制(Git) - 脚本添加详细注释 - 使用统一的编码规范 - 定期审查和优化脚本 ## 2. 安全管理 - 敏感信息使用加密存储 - 使用SSH密钥认证 - 限制脚本执行权限 - 记录操作审计日志 ## 3. 错误处理 - 完善的错误检查机制 - 失败时自动回滚 - 详细的错误日志 - 告警通知机制 ## 4. 测试验证 - 在测试环境验证脚本 - 编写自动化测试用例 - 灰度发布机制 - 回滚预案 ## 5. 文档管理 - 维护操作手册 - 记录变更历史 - 编写故障处理文档 - 定期更新文档 ## 6. 监控告警 - 全面的监控覆盖 - 合理的告警阈值 - 告警分级处理 - 告警静默机制 ## 7. 备份恢复 - 定期备份验证 - 异地备份存储 - 恢复演练 - 备份加密 ## 8. 持续改进 - 定期回顾优化 - 引入新技术工具 - 自动化覆盖率提升 - 团队技能提升 EOF # 自动化健康检查脚本 # cat > /opt/scripts/automation_health_check.sh << 'EOF' #!/bin/bash echo "==========================================" echo "自动化系统健康检查" echo "==========================================" # 1. 检查定时任务 echo "" echo "1. 定时任务状态" crontab -l | grep -v "^#" | grep -v "^$" # 2. 检查脚本权限 echo "" echo "2. 脚本权限检查" find /opt/scripts -name "*.sh" -exec ls -la {} \; | awk '{print $1, $9}' # 3. 检查日志文件 echo "" echo "3. 最近日志文件" ls -lt /var/log/*.log 2>/dev/null | head -5
# 4. 检查Ansible连接
echo “”
echo “4. Ansible连接状态”
ansible all -m ping -f 5 2>/dev/null | grep -E “SUCCESS|FAILED”
# 5. 检查备份状态
echo “”
echo “5. 备份文件状态”
ls -lh /backup/*.tar.gz 2>/dev/null | tail -5
# 6. 检查监控服务
echo “”
echo “6. 监控服务状态”
systemctl is-active prometheus grafana alertmanager 2>/dev/null
echo “”
echo “==========================================”
echo “健康检查完成”
echo “==========================================”
EOF
# chmod +x /opt/scripts/automation_health_check.sh
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
