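Summary: This Fengge tutorial draws on the official Linux, Red Hat Enterprise Linux, Ansible Automation Platform, Docker, Kubernetes, and Podman documentation to explain how to configure and use the technologies covered.
Fengge's note:
This document describes how to quickly recover a crashed Linux system service and the steps for troubleshooting the failure.
Part01-Service Status Checks
1.1 Checking service status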
[root@fgedu-server ~]# systemctl status nginx
● nginx.service - The nginx HTTP and reverse proxy server
Loaded: loaded (/usr/lib/systemd/system/nginx.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sat 2026-01-15 10:00:00 CST; 5min ago
Process: 1234 ExecStart=/usr/sbin/nginx (code=exited, status=1/FAILURE)
Main PID: 1234 (code=exited, status=1/FAILURE)
Jan 15 10:00:00 fgedu-server systemd[1]: Starting The nginx HTTP and reverse proxy server...
Jan 15 10:00:00 fgedu-server nginx[1234]: nginx: [emerg] invalid number of arguments in "listen" directive in /etc/nginx/nginx.conf:10
Jan 15 10:00:00 fgedu-server systemd[1]: nginx.service: Main process exited, code=exited, status=1/FAILURE
Jan 15 10:00:00 fgedu-server systemd[1]: nginx.service: Failed with result 'exit-code'.
Jan 15 10:00:00 fgedu-server systemd[1]: Failed to start The nginx HTTP and reverse proxy server.
# Check the service logs
[root@fgedu-server ~]# journalctl -u nginx
-- Logs begin at Sat 2026-01-15 09:00:00 CST, end at Sat 2026-01-15 10:05:00 CST. --
Jan 15 10:00:00 fgedu-server systemd[1]: Starting The nginx HTTP and reverse proxy server...
Jan 15 10:00:00 fgedu-server nginx[1234]: nginx: [emerg] invalid number of arguments in "listen" directive in /etc/nginx/nginx.conf:10
Jan 15 10:00:00 fgedu-server systemd[1]: nginx.service: Main process exited, code=exited, status=1/FAILURE
Jan 15 10:00:00 fgedu-server systemd[1]: nginx.service: Failed with result 'exit-code'.
Jan 15 10:00:00 fgedu-server systemd[1]: Failed to start The nginx HTTP and reverse proxy server.
# Check the system log
[root@fgedu-server ~]# tail -50 /var/log/messages | grep nginx
Jan 15 10:00:00 fgedu-server nginx[1234]: nginx: [emerg] invalid number of arguments in "listen" directive in /etc/nginx/nginx.conf:10
# Check the service configuration file
[root@fgedu-server ~]# nginx -t
nginx: [emerg] invalid number of arguments in "listen" directive in /etc/nginx/nginx.conf:10
nginx: configuration file /etc/nginx/nginx.conf test failed
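When the journal is long, the failing file and line number can be pulled out mechanically instead of read by eye. The following is a minimal sketch, assuming the `[emerg]` message format shown in the log above; the function name `nginx_error_locations` is hypothetical and is not part of nginx or systemd.

```shell
#!/bin/bash
# Hypothetical helper (not part of nginx): read log lines on stdin and print
# "<config file> <line number>" for every nginx [emerg] message.
# Assumes the message ends with " in /path/to/file:LINE", as in the log above.
nginx_error_locations() {
    grep -F '[emerg]' \
        | sed -n 's|.* in \(/[^ :]*\):\([0-9][0-9]*\)$|\1 \2|p' \
        | sort -u
}

# Example usage on a systemd host (not run here):
#   journalctl -u nginx -b --no-pager | nginx_error_locations
```

Feeding the result to an editor or to `sed -n '<line>p' <file>` then jumps straight to the suspect directive.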
Part02-Service Troubleshooting
2.1 Checking the configuration file
[root@fgedu-server ~]# cat -n /etc/nginx/nginx.conf | grep -A 5 -B 5 "listen"
5 server {
6 listen 80 default_server
7 server_name localhost;
8 root /usr/share/nginx/html;
9 index index.html index.htm;
10 listen 8080
11 location / {
12 try_files $uri $uri/ =404;
13 }
14 }
# Fix the configuration file errors (add the missing semicolons)
[root@fgedu-server ~]# sed -i 's/listen 80 default_server/listen 80 default_server;/' /etc/nginx/nginx.conf
[root@fgedu-server ~]# sed -i 's/listen 8080/listen 8080;/' /etc/nginx/nginx.conf
# Validate the configuration file
[root@fgedu-server ~]# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
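An in-place `sed -i` edit has no undo, so it can help to pair the edit with the validator and roll back automatically when the check fails. The sketch below is one way to do that; `safe_sed` is a hypothetical helper name, and the backup file suffix is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical helper: safe_sed FILE SED_EXPR VALIDATOR [ARGS...]
# Copies FILE aside, applies SED_EXPR in place, then runs the validator command.
# If sed or the validator fails, the original file is restored.
safe_sed() {
    local file=$1 expr=$2
    shift 2
    local backup="${file}.bak.$$"
    cp -p "$file" "$backup" || return 1
    if sed -i -e "$expr" "$file" && "$@"; then
        rm -f "$backup"
        return 0
    fi
    mv -f "$backup" "$file"
    return 1
}

# Example usage (not run here); the trailing "$" in the pattern keeps a
# second run from appending another semicolon:
#   safe_sed /etc/nginx/nginx.conf 's/listen 8080$/listen 8080;/' nginx -t
```

The validator is passed as an ordinary command, so the same helper could wrap `sshd -t` or `postfix check` for other services.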
2.2 Checking resources
[root@fgedu-server ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           8.0G        6.0G        500M        100M        1.5G        1.2G
Swap:          4.0G        2.0G        2.0G
# Check disk space
[root@fgedu-server ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   48G  1.2G  98% /
/dev/sda2       200G   80G  110G  43% /data
/dev/sdb1       500G  200G  280G  42% /backup
# Check file permissions
[root@fgedu-server ~]# ls -la /etc/nginx/
total 64
drwxr-xr-x. 4 root root 4096 Jan 15 09:00 .
drwxr-xr-x. 78 root root 4096 Jan 15 09:00 ..
drwxr-xr-x. 2 root root 4096 Jan 15 09:00 conf.d
drwxr-xr-x. 2 root root 4096 Jan 15 09:00 default.d
-rw-r--r--. 1 root root 1000 Jan 15 10:00 nginx.conf
-rw-r--r--. 1 root root 2853 Jan 15 09:00 mime.types
# Check port usage
[root@fgedu-server ~]# netstat -tuln | grep 80
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp6 0 0 :::80 :::* LISTEN
# Check process status
[root@fgedu-server ~]# ps aux | grep nginx
root 5678 0.0 0.0 112812 980 pts/0 S+ 10:05 0:00 grep --color=auto nginx
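In the `df -h` output above the root filesystem is at 98%, and a nearly full filesystem is a common reason a service cannot write its PID or log files. A check like that can be scripted rather than eyeballed. This is a minimal sketch, assuming POSIX-style `df -P` output; `filesystems_over` is a hypothetical name and the 90% default is an arbitrary threshold.

```shell
#!/bin/bash
# Hypothetical helper: read `df -P` style output on stdin and print
# "<mount point> <use%>" for filesystems at or above LIMIT percent (default 90).
# Assumes mount points contain no spaces.
filesystems_over() {
    local limit=${1:-90}
    awk -v limit="$limit" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)        # "98%" -> "98"
            if (use + 0 >= limit) print $6, $5
        }'
}

# Example usage (not run here):
#   df -P | filesystems_over 90
```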
Part03-Service Recovery
3.1 Restarting the service
[root@fgedu-server ~]# systemctl restart nginx
# Check the service status
[root@fgedu-server ~]# systemctl status nginx
● nginx.service - The nginx HTTP and reverse proxy server
Loaded: loaded (/usr/lib/systemd/system/nginx.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2026-01-15 10:06:00 CST; 5s ago
Main PID: 7890 (nginx)
Tasks: 2
Memory: 2.3M
CPU: 25ms
CGroup: /system.slice/nginx.service
├─7890 nginx: master process /usr/sbin/nginx
└─7891 nginx: worker process
Jan 15 10:06:00 fgedu-server systemd[1]: Starting The nginx HTTP and reverse proxy server...
Jan 15 10:06:00 fgedu-server systemd[1]: Started The nginx HTTP and reverse proxy server.
# Verify that the service is serving requests
[root@fgedu-server ~]# curl -I http://localhost
HTTP/1.1 200 OK
Server: nginx/1.20.1
Date: Sat, 15 Jan 2026 02:06:30 GMT
Content-Type: text/html
Content-Length: 4833
Last-Modified: Sat, 15 Jan 2026 01:00:00 GMT
Connection: keep-alive
ETag: "60000000-12e1"
Accept-Ranges: bytes
# Check the service logs
[root@fgedu-server ~]# journalctl -u nginx -n 10
-- Logs begin at Sat 2026-01-15 09:00:00 CST, end at Sat 2026-01-15 10:06:30 CST. --
Jan 15 10:06:00 fgedu-server systemd[1]: Starting The nginx HTTP and reverse proxy server...
Jan 15 10:06:00 fgedu-server systemd[1]: Started The nginx HTTP and reverse proxy server.
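A service that has just been restarted may need a few seconds before it answers requests, so a single check right after `systemctl restart` can report a false failure. A small retry wrapper is one way to handle this; the sketch below is illustrative, and `retry` is a hypothetical helper rather than a standard command.

```shell
#!/bin/bash
# Hypothetical helper: retry ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed,
# sleeping DELAY seconds between tries.
retry() {
    local attempts=$1 delay=$2 i
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        if "$@"; then
            return 0
        fi
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Example usage (not run here): poll the local web server up to 5 times,
# 2 seconds apart; curl -f makes HTTP error statuses count as failures.
#   retry 5 2 curl -fsS -o /dev/null http://localhost
```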
3.2 Handling failures by service type
# Web service recovery (Nginx)
[root@fgedu-server ~]# systemctl stop nginx
[root@fgedu-server ~]# nginx -t
[root@fgedu-server ~]# systemctl start nginx
[root@fgedu-server ~]# systemctl enable nginx
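# Database service recovery (MySQL)
[root@fgedu-server ~]# systemctl stop mysqld
[root@fgedu-server ~]# systemctl start mysqld
# mysqlcheck is a client tool and needs a running server; add -u/-p if credentials are required
[root@fgedu-server ~]# mysqlcheck --check --all-databases
[root@fgedu-server ~]# systemctl enable mysqld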
# Application service recovery (Tomcat)
[root@fgedu-server ~]# systemctl stop tomcat
[root@fgedu-server ~]# ps aux | grep tomcat | grep -v grep | awk '{print $2}' | xargs -r kill -9
[root@fgedu-server ~]# systemctl start tomcat
[root@fgedu-server ~]# systemctl enable tomcat
# Network service recovery (SSH)
[root@fgedu-server ~]# systemctl stop sshd
[root@fgedu-server ~]# systemctl start sshd
[root@fgedu-server ~]# systemctl enable sshd
# Mail service recovery (Postfix)
[root@fgedu-server ~]# systemctl stop postfix
[root@fgedu-server ~]# systemctl start postfix
[root@fgedu-server ~]# systemctl enable postfix
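The Nginx sequence above runs `nginx -t` before starting the service, and several of the other daemons ship a comparable syntax check. One way to make that habit uniform is a small lookup that returns the check command for a service and refuses to restart when it fails. This is a sketch only: `validator_for` and `safe_restart` are hypothetical names, and the table covers just the services whose check commands are well known (`nginx -t`, `sshd -t`, `httpd -t`, `postfix check`).

```shell
#!/bin/bash
# Hypothetical helper: print the config-check command for a service,
# or return 1 if no check is known for it.
validator_for() {
    case "$1" in
        nginx)   echo "nginx -t" ;;
        sshd)    echo "sshd -t" ;;
        httpd)   echo "httpd -t" ;;
        postfix) echo "postfix check" ;;
        *)       return 1 ;;
    esac
}

# Hypothetical helper: restart a service only if its config check passes.
# Services without a known check are restarted directly.
safe_restart() {
    local svc=$1 check
    if check=$(validator_for "$svc"); then
        $check || { echo "config check failed for $svc; not restarting" >&2; return 1; }
    fi
    systemctl restart "$svc"
}

# Example usage (not run here):
#   safe_restart nginx
```

Checking first matters most for `sshd`, where a restart with a broken configuration can lock out remote access.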
Part04-Preventing Service Crashes
4.1 Service monitoring
[root@fgedu-server ~]# cat > /usr/local/bin/service-monitor.sh << 'EOF'
#!/bin/bash
# service-monitor.sh
ALERT_EMAIL="admin@fgedu.net.cn"
SERVICES=(
    "nginx"
    "mysqld"
    "httpd"
    "sshd"
    "postfix"
    "tomcat"
)
for service in "${SERVICES[@]}"; do
    echo "Checking service: $service"
    # Check the service status
    systemctl is-active "$service" > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        echo "Alert: service $service is not running"
        echo "Service $service is not running, attempting a restart..."
        # Try to restart the service
        systemctl restart "$service"
        # Check whether the restart succeeded
        sleep 3
        systemctl is-active "$service" > /dev/null 2>&1
        if [ $? -ne 0 ]; then
            echo "Alert: service $service failed to restart"
            echo "Service $service failed to restart, please check it manually" | mail -s "Alert: service $service crashed" "$ALERT_EMAIL"
        else
            echo "Service $service restarted successfully"
            echo "Service $service was restarted successfully" | mail -s "Notice: service $service restarted" "$ALERT_EMAIL"
        fi
    else
        echo "Service $service is running normally"
    fi
done
echo "Service monitoring finished: $(date)"
EOF
[root@fgedu-server ~]# chmod +x /usr/local/bin/service-monitor.sh
# Configure scheduled monitoring (runs every minute)
[root@fgedu-server ~]# cat > /etc/cron.d/service-monitor << 'EOF'
# Service monitoring
* * * * * root /usr/local/bin/service-monitor.sh
EOF
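Polling from cron reacts at most once a minute. systemd can also restart a failed unit on its own through the `Restart=` setting, which may complement the script. The sketch below writes a drop-in file for a unit; `write_restart_dropin` is a hypothetical helper, the file name `restart.conf` and the numeric values are arbitrary, and it assumes a reasonably recent systemd in which `StartLimitIntervalSec=` belongs in the `[Unit]` section (older releases use different names, so check `systemd.unit(5)` on the target host).

```shell
#!/bin/bash
# Hypothetical helper: write_restart_dropin UNIT [ROOT]
# Creates ROOT/UNIT.service.d/restart.conf (ROOT defaults to /etc/systemd/system)
# so that systemd restarts the unit after a failure, at most 3 times per 60 s.
write_restart_dropin() {
    local unit=$1 root=${2:-/etc/systemd/system}
    local dir="$root/$unit.service.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/restart.conf" << 'EOF'
[Unit]
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
Restart=on-failure
RestartSec=5s
EOF
}

# Example usage (not run here); daemon-reload makes systemd pick up the drop-in:
#   write_restart_dropin nginx && systemctl daemon-reload
```

The start limit keeps a service with a broken configuration from being restarted in a tight loop.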
# Create an automatic service recovery script
[root@fgedu-server ~]# cat > /usr/local/bin/service-recover.sh << 'EOF'
#!/bin/bash
# service-recover.sh
SERVICES=(
    "nginx"
    "mysqld"
    "httpd"
    "sshd"
    "postfix"
    "tomcat"
)
echo "Starting service recovery..."
for service in "${SERVICES[@]}"; do
    echo "Processing service: $service"
    # Stop the service
    systemctl stop "$service" > /dev/null 2>&1
    # Clean up leftover processes
    # Caution: this name match is broad; for sshd it also matches active SSH sessions
    pids=$(ps aux | grep "$service" | grep -v grep | awk '{print $2}')
    if [ -n "$pids" ]; then
        echo "Cleaning up leftover processes: $pids"
        kill -9 $pids > /dev/null 2>&1
    fi
    # Start the service
    systemctl start "$service"
    # Check the service status
    systemctl is-active "$service" > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        echo "Service $service recovered successfully"
    else
        echo "Service $service failed to recover"
    fi
done
echo "Service recovery finished: $(date)"
EOF
[root@fgedu-server ~]# chmod +x /usr/local/bin/service-recover.sh
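The `ps aux | grep` pattern in the recovery script matches any command line that merely contains the service name, which can catch unrelated processes. Matching on the exact command name is narrower. The following is a minimal sketch of that idea; `pids_by_comm` is a hypothetical helper that filters `ps -eo pid=,comm=` output.

```shell
#!/bin/bash
# Hypothetical helper: read "PID COMMAND" pairs on stdin (as printed by
# `ps -eo pid=,comm=`) and print the PIDs whose command name equals NAME exactly.
pids_by_comm() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Example usage (not run here):
#   ps -eo pid=,comm= | pids_by_comm nginx
```

Where procps is installed, `pgrep -x nginx` gives an equivalent exact-name match. Neither approach helps for Tomcat, whose process name is normally `java`, so that case still needs a command-line match.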
Part05-Service Configuration Best Practices
5.1 Service configuration tuning
[root@fgedu-server ~]# cat > /etc/nginx/nginx.conf << 'EOF'
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
    worker_connections 1024;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    gzip on;
    gzip_comp_level 6;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
    include /etc/nginx/conf.d/*.conf;
}
EOF
# MySQL configuration tuning
[root@fgedu-server ~]# cat > /etc/my.cnf.d/custom.cnf << 'EOF'
[mysqld]
max_connections = 1000
innodb_buffer_pool_size = 1G
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0
# The two query_cache_* settings are for MySQL 5.7 and earlier only. The query cache
# was removed in MySQL 8.0, where these lines prevent the server from starting.
query_cache_size = 0
query_cache_type = 0
EOF
# SSH configuration tuning
[root@fgedu-server ~]# cat > /etc/ssh/sshd_config.d/custom.conf << 'EOF'
Port 22
Protocol 2
PermitRootLogin no
MaxAuthTries 3
MaxSessions 10
ClientAliveInterval 300
ClientAliveCountMax 3
UseDNS no
GSSAPIAuthentication no
EOF
# Restart the services to apply the configuration
[root@fgedu-server ~]# systemctl restart nginx
[root@fgedu-server ~]# systemctl restart mysqld
[root@fgedu-server ~]# systemctl restart sshd
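Writing a configuration file with `cat >` overwrites whatever was there, so taking a timestamped copy first leaves a known-good version to fall back on. The sketch below shows one way to do that; `backup_config` is a hypothetical helper and the default directory `/var/backups/config` is an assumed location, not a system standard.

```shell
#!/bin/bash
# Hypothetical helper: backup_config FILE [DEST_DIR]
# Copies FILE to DEST_DIR/<basename>.<YYYYmmdd-HHMMSS> and prints the copy's path.
# DEST_DIR defaults to /var/backups/config (an assumed location).
backup_config() {
    local src=$1 dest_dir=${2:-/var/backups/config}
    local stamp target
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$dest_dir" || return 1
    target="$dest_dir/$(basename "$src").$stamp"
    cp -p "$src" "$target" && echo "$target"
}

# Example usage (not run here):
#   backup_config /etc/nginx/nginx.conf
#   backup_config /etc/my.cnf.d/custom.cnf
```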
Part06-Emergency Response Procedure
6.1 Emergency response to a service crash
[root@fgedu-server ~]# cat > /usr/local/bin/emergency-response.sh << 'EOF'
#!/bin/bash
# emergency-response.sh
ALERT_EMAIL="admin@fgedu.net.cn"
# Record the start time
START_TIME=$(date +"%Y-%m-%d %H:%M:%S")
echo "Emergency response started: $START_TIME"
# 1. Collect system information
echo "1. Collecting system information..."
mkdir -p /var/log/emergency/$(date +%Y%m%d)
SYSTEM_LOG=/var/log/emergency/$(date +%Y%m%d)/system-info.txt
echo "System information" > "$SYSTEM_LOG"
echo "============" >> "$SYSTEM_LOG"
uname -a >> "$SYSTEM_LOG"
echo "" >> "$SYSTEM_LOG"
uptime >> "$SYSTEM_LOG"
echo "" >> "$SYSTEM_LOG"
free -h >> "$SYSTEM_LOG"
echo "" >> "$SYSTEM_LOG"
df -h >> "$SYSTEM_LOG"
echo "" >> "$SYSTEM_LOG"
ps aux --sort=-%cpu | head -20 >> "$SYSTEM_LOG"
echo "" >> "$SYSTEM_LOG"
ps aux --sort=-%mem | head -20 >> "$SYSTEM_LOG"
# 2. Check service status
echo "2. Checking service status..."
SERVICES=(
    "nginx"
    "mysqld"
    "httpd"
    "sshd"
    "postfix"
    "tomcat"
)
for service in "${SERVICES[@]}"; do
    echo "Checking service: $service"
    systemctl status "$service" >> "$SYSTEM_LOG"
    echo "" >> "$SYSTEM_LOG"
done
# 3. Try to recover the services
echo "3. Trying to recover services..."
for service in "${SERVICES[@]}"; do
    systemctl is-active "$service" > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        echo "Recovering service: $service"
        systemctl restart "$service"
        sleep 2
        # --no-pager keeps the script from blocking in a pager on a terminal
        systemctl status "$service" --no-pager
    fi
done
# 4. Verify service recovery
echo "4. Verifying service recovery..."
for service in "${SERVICES[@]}"; do
    systemctl is-active "$service" > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        echo "Service $service recovered successfully"
    else
        echo "Service $service failed to recover"
    fi
done
# 5. Send a notification
END_TIME=$(date +"%Y-%m-%d %H:%M:%S")
echo "Emergency response finished: $END_TIME"
echo "Emergency response to the service crash is complete. See the detailed log: $SYSTEM_LOG" | mail -s "Notice: service crash emergency response complete" "$ALERT_EMAIL"
EOF
[root@fgedu-server ~]# chmod +x /usr/local/bin/emergency-response.sh
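The emergency script walks a fixed list of services. systemd can also report which units are actually in the failed state, which may surface services missing from that list. The sketch below parses that report; `failed_units` is a hypothetical helper, and it assumes the column layout printed by `systemctl list-units --plain --no-legend` (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION).

```shell
#!/bin/bash
# Hypothetical helper: read `systemctl list-units --plain --no-legend` output
# on stdin and print the names of units whose ACTIVE column is "failed".
failed_units() {
    awk '$3 == "failed" { print $1 }'
}

# Example usage (not run here): show the last 50 journal lines of each failed service.
#   systemctl list-units --type=service --state=failed --plain --no-legend \
#       | failed_units \
#       | while read -r unit; do
#             journalctl -u "$unit" -n 50 --no-pager
#         done
```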
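- Stay calm and troubleshoot step by step
- Check the service logs and error messages first
- Verify that the configuration files are correct
- Check system resource usage
- Try restarting the service
- Set up service monitoring and automatic recovery
- Back up service configuration regularly
- Prepare an emergency plan for service crashes
Compiled and published by Fengge Tutorials for learning and testing purposes only. When reposting, please credit the source: http://www.fgedu.net.cn/10327.html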
