
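Linux Tutorial FG031: Troubleshooting and Rollback Commands for the Upgrade Process

1. Troubleshooting Overview

An upgrade from RHEL 9 to RHEL 10 can fail in several ways. This tutorial shows how to troubleshoot and resolve those failures systematically, and how to roll back with LVM snapshots and GRUB.

Reference: the System administration chapter of the official Red Hat Enterprise Linux 10 documentation.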

# Upgrade failure categories
$ cat > /backup/upgrade_failures.txt << 'EOF'
Upgrade failure categories:
1. Pre-upgrade failures
   - preupgrade check fails
   - Insufficient disk space
   - Dependency conflicts
   - Incompatible packages
2. Failures during the upgrade
   - Network connection interrupted
   - Package download fails
   - Installation interrupted
   - Configuration migration fails
3. Post-upgrade failures
   - System does not boot
   - Kernel fails to start
   - Services fail to start
   - Network connection fails
   - Application malfunctions
4. Data failures
   - Data loss
   - Data corruption
   - Missing configuration files
   - Permission problems
EOF
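Troubleshooting principles: 1. Read the logs first to understand the failure in detail; 2. Work from simple causes to complex ones, step by step; 3. Record every action and its result; 4. Have a rollback plan ready; 5. Ask for technical support when you need it.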
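
The principle of recording every action and its result can be sketched as a small wrapper. This is a minimal illustration, not part of the original procedure: the function name `run_logged` and the default log path are assumptions, and the command's output goes to the log file rather than the terminal.

```shell
# Hypothetical helper: run a command and append the command line, its
# output, and its exit status to a log file. The default path is an assumption.
UPGRADE_LOG="${UPGRADE_LOG:-/tmp/upgrade-troubleshooting.log}"

run_logged() {
    # Record the timestamp and the exact command line
    printf '%s CMD: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$UPGRADE_LOG"
    # Run the command; stdout and stderr are captured in the log
    "$@" >> "$UPGRADE_LOG" 2>&1
    _rl_status=$?
    # Record the exit status so failures are easy to find later
    printf '%s EXIT: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$_rl_status" >> "$UPGRADE_LOG"
    return "$_rl_status"
}
```

For example, `run_logged sudo dnf clean all` would leave a CMD line, the command output, and an EXIT line in the log.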

2. Log Analysis Techniques

This section shows how to analyze the relevant log files to locate an upgrade failure.

# View the leapp upgrade log
$ sudo tail -100 /var/log/leapp/leapp-upgrade.log
2026-04-02 10:00:00 INFO: Starting upgrade process
2026-04-02 10:00:01 INFO: Source version: RHEL 9.5
2026-04-02 10:00:02 INFO: Target version: RHEL 10.0
2026-04-02 10:00:05 INFO: Phase 1: Preparation
2026-04-02 10:00:10 INFO: Downloading upgrade packages
2026-04-02 10:05:00 ERROR: Failed to download package: kernel-6.5.0-0.rc0.20260401git1234567.el10.x86_64
2026-04-02 10:05:01 ERROR: Error: Cannot download repomd.xml: Cannot download repodata/repomd.xml: error was
2026-04-02 10:05:02 ERROR: Upgrade process failed

# View the system journal
$ sudo journalctl -xe
-- Logs begin at Mon 2026-04-01 10:00:00 CST, end at Wed 2026-04-02 10:05:02 CST. --
Apr 02 10:05:00 rhel9-server leapp[1234]: ERROR: Failed to download package: kernel-6.5.0-0.rc0.20260401git1234567.el10.x86_64
Apr 02 10:05:01 rhel9-server leapp[1234]: ERROR: Error: Cannot download repomd.xml: Cannot download repodata/repomd.xml: error was
Apr 02 10:05:02 rhel9-server systemd[1]: leapp-upgrade.service: Main process exited, code=exited, status=1/FAILURE
Apr 02 10:05:02 rhel9-server systemd[1]: leapp-upgrade.service: Failed with result 'exit-code'

# View kernel messages
$ sudo dmesg | tail -50
[12345.678901] leapp[1234]: ERROR: Failed to download package
[12345.678902] leapp[1234]: Upgrade process failed
[12345.678903] systemd[1]: leapp-upgrade.service: Main process exited, code=exited, status=1/FAILURE
[12345.678904] systemd[1]: leapp-upgrade.service: Failed with result 'exit-code'

# View the DNF log
$ sudo tail -50 /var/log/dnf.log
2026-04-02T10:00:00+0800 DEBUG Installed: leapp-0.18.0-1.el9.x86_64
2026-04-02T10:05:00+0800 ERROR Failed to download package: kernel-6.5.0-0.rc0.20260401git1234567.el10.x86_64
2026-04-02T10:05:01+0800 ERROR Error: Cannot download repomd.xml: Cannot download repodata/repomd.xml: error was

# View the upgrade report
$ sudo cat /var/log/leapp/leapp-report.txt
Leapp Upgrade Report
====================

Upgrade Date: Wed Apr 2 10:00:00 CST 2026
Source Version: RHEL 9.5
Target Version: RHEL 10.0
Status: Failed

Error Details:
Error Type: Package Download Failure
Error Message: Failed to download package: kernel-6.5.0-0.rc0.20260401git1234567.el10.x86_64
Error Code: 1

Recommended Actions:
1. Check network connection
2. Verify repository availability
3. Clear DNF cache
4. Retry upgrade process

# View the report in JSON format
$ sudo cat /var/log/leapp/leapp-report.json | python3 -m json.tool
{
    "upgrade_date": "2026-04-02T10:00:00+08:00",
    "source_version": "RHEL 9.5",
    "target_version": "RHEL 10.0",
    "status": "failed",
    "error": {
        "type": "Package Download Failure",
        "message": "Failed to download package: kernel-6.5.0-0.rc0.20260401git1234567.el10.x86_64",
        "code": 1
    },
    "recommended_actions": [
        "Check network connection",
        "Verify repository availability",
        "Clear DNF cache",
        "Retry upgrade process"
    ]
}

Fengge's tip: Log analysis is the first step in troubleshooting. Reading the logs above usually pinpoints where the problem lies.
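
When a log is long, counting and de-duplicating its ERROR lines gives a quick overview. The following is a minimal sketch under the assumption that errors are marked with the literal string `ERROR`, as in the sample logs in this section; the function name `summarize_errors` is hypothetical.

```shell
# Hypothetical helper: print the number of ERROR lines in a log file,
# followed by the distinct error messages and how often each occurs.
summarize_errors() {
    logfile=$1
    # Count the lines that contain the ERROR level marker
    printf 'error lines: %s\n' "$(grep -c 'ERROR' "$logfile")"
    # Strip everything up to the marker (timestamps, host, PID), then de-duplicate
    grep 'ERROR' "$logfile" | sed 's/^.*ERROR:\{0,1\} *//' | sort | uniq -c | sort -rn
}
```

A typical call would be `sudo sh -c '. ./helpers.sh; summarize_errors /var/log/leapp/leapp-upgrade.log'`, where `helpers.sh` is whatever file holds the function.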

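3. Common Upgrade Failure Scenarios

This section analyzes failure scenarios that commonly occur during the upgrade and how to resolve them.

# Scenario 1: network connection interrupted
# Check network connectivity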
$ ping -c 4 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
--- 8.8.8.8 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3005ms

# Check the network interface
$ ip addr show ens33
2: ens33: mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
link/ether 00:0c:29:12:34:56 brd ff:ff:ff:ff:ff:ff

# Restart the network service
$ sudo systemctl restart NetworkManager

# Verify network connectivity
$ ping -c 4 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=12.3 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=119 time=11.8 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=119 time=12.1 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=119 time=11.9 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3005ms
rtt min/avg/max/mdev = 11.832/12.058/12.317/0.198 ms

# Scenario 2: insufficient disk space
# Check disk space
$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 48G 2G 96% /

# Clear the DNF cache
$ sudo dnf clean all
0 files removed

# Trim the journal
$ sudo journalctl --vacuum-size=500M
Vacuuming done, freed 1.2G of archived journals from /var/log/journal.

# Delete rotated log files (archive them first if they may be needed later)
$ sudo find /var/log -name "*.log.*" -delete
$ sudo find /var/log -name "*.gz" -delete

# Delete temporary files (running programs may still be using files here)
$ sudo rm -rf /tmp/*
$ sudo rm -rf /var/tmp/*

# Verify disk space
$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 42G 8G 84% /

# Scenario 3: package dependency conflict
# Show the conflicting packages
$ sudo dnf update --assumeno
Updating Subscription Management repositories.
Last metadata expiration check: 0:00:00 ago on Wed 02 Apr 2026 10:00:00 AM CST.
Dependencies resolved.
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
kernel x86_64 6.5.0-0.rc0.20260401git1234567.el10 rhel-10-baseos 150 M

Problem: problem with installed package custom-app-1.0.0-1.el9.x86_64
 - package custom-app-1.0.0-1.el9.x86_64 requires libpython2.7.so.1.0()(64bit), but none of the providers can be installed
 - cannot install both python2-2.7.18-18.el9.x86_64 and python3-3.9.16-1.el10.x86_64
 - package custom-app-1.0.0-1.el9.x86_64 requires python2, but none of the providers can be installed

# Remove the conflicting package
$ sudo dnf remove custom-app
Dependencies resolved.
================================================================================
Package Arch Version Repository Size
================================================================================
Removing:
custom-app x86_64 1.0.0-1.el9 @local 50 M

Transaction Summary
================================================================================
Remove 1 Packages

Installed size: 50 M

Is this ok [y/N]: y
Running transaction
Preparing : 1/1
Erasing : custom-app-1.0.0-1.el9.x86_64 1/1
Running scriptlet: custom-app-1.0.0-1.el9.x86_64 1/1
Verifying : custom-app-1.0.0-1.el9.x86_64 1/1

Removed:
custom-app-1.0.0-1.el9.x86_64

Complete!

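# Scenario 4: configuration file migration failed
# List the saved configuration backups
# (RPM normally writes .rpmsave and .rpmnew files next to the original file, so also check /etc/ssh)
$ sudo ls -la /var/log/leapp/*.rpmsave
-rw-r--r-- 1 root root 1234 Apr 2 10:00:00 /var/log/leapp/ssh_config.rpmsave
-rw-r--r-- 1 root root 5678 Apr 2 10:00:00 /var/log/leapp/sshd_config.rpmsave

# Compare the current configuration with the backup
# (PermitRootLogin is an sshd_config directive, so compare the server configuration)
$ sudo diff /etc/ssh/sshd_config /var/log/leapp/sshd_config.rpmsave
123c123
< PermitRootLogin yes
---
> PermitRootLogin no

# Restore the configuration file manually
$ sudo cp /var/log/leapp/sshd_config.rpmsave /etc/ssh/sshd_config

# Restart the affected service
$ sudo systemctl restart sshd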

# Verify the service status
$ sudo systemctl status sshd
● sshd.service - OpenSSH server daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2026-04-02 10:00:00 CST; 10s ago
Docs: man:sshd(8) man:sshd_config(5)
Main PID: 1234 (sshd)
Tasks: 1 (limit: 4915)
Memory: 5.2M
CGroup: /system.slice/sshd.service
└─1234 /usr/sbin/sshd -D -oCiphers=aes256-gcm@openssh.com,chacha20-poly1305@openssh.com,aes256-ctr,aes128-gcm@openssh.com,aes128-ctr

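Handling the common failure scenarios: 1. Network connection interrupted: check the network configuration and restart the network service; 2. Insufficient disk space: clear caches and log files; 3. Package dependency conflict: remove the conflicting package; 4. Configuration file migration failed: restore the configuration file manually.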
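
The disk space check from scenario 2 can be turned into a yes/no test that is easy to run before retrying the upgrade. This is a minimal sketch; the function name `has_free_space` is hypothetical, and any threshold you pass is your own assumption rather than a documented requirement.

```shell
# Hypothetical helper: succeed if the file system that holds PATH has at
# least MIN_MB megabytes available.
has_free_space() {
    path=$1
    min_mb=$2
    # df -P prints stable POSIX columns; -k reports sizes in 1024-byte blocks.
    # Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_mb * 1024)) ]
}
```

For example, `has_free_space / 5120 || echo "less than 5 GiB free on /"` prints a warning when the root file system is short on space (the 5 GiB figure is only illustrative).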

4. Handling Boot Failures

This section covers the different ways the system can fail to boot.

# Scenario 1: kernel fails to boot
# In the GRUB menu, select "Advanced options"
# Boot the old kernel

# List the available kernels
$ sudo grubby --info=ALL | grep title
title=Red Hat Enterprise Linux (6.5.0-0.rc0.20260401git1234567.el10.x86_64) 10.0 (Coughlan)
title=Red Hat Enterprise Linux (5.14.0-427.28.1.el9_5.x86_64) 9.5 (Plow)

# Set the old kernel as the default boot entry
$ sudo grubby --set-default-index=1

# Verify the default boot entry
$ sudo grubby --default-index
1

# Reboot the system
$ sudo reboot

# After the system boots, verify the kernel version
$ uname -r
5.14.0-427.28.1.el9_5.x86_64

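# Scenario 2: GRUB configuration is corrupted
# Boot into rescue mode
# Select "Rescue a Red Hat Enterprise Linux system"

# Reinstall GRUB (BIOS systems; on UEFI systems reinstall the grub2-efi and shim packages instead)
$ sudo grub2-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.

# Regenerate the GRUB configuration
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file ...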
Found linux image: /boot/vmlinuz-6.5.0-0.rc0.20260401git1234567.el10.x86_64
Found initrd image: /boot/initramfs-6.5.0-0.rc0.20260401git1234567.el10.x86_64.img
Found linux image: /boot/vmlinuz-5.14.0-427.28.1.el9_5.x86_64
Found initrd image: /boot/initramfs-5.14.0-427.28.1.el9_5.x86_64.img
done

# Reboot the system
$ sudo reboot

# Scenario 3: initramfs is corrupted
# Regenerate the initramfs images
$ sudo dracut --regenerate-all --force
dracut: *** Including module: bash ***
dracut: *** Including module: systemd ***
dracut: *** Including module: systemd-initrd ***
dracut: *** Including module: i18n ***
dracut: *** Including module: network ***
dracut: *** Including module: ifcfg ***
dracut: *** Including module: drm ***
dracut: *** Including module: plymouth ***
dracut: *** Including module: kernel-modules ***
dracut: *** Including module: kernel-modules-extra ***
dracut: *** Including module: resume ***
dracut: *** Including module: rootfs-block ***
dracut: *** Including module: terminfo ***
dracut: *** Including module: udev-rules ***
dracut: *** Including module: base ***
dracut: *** Including module: fs-lib ***
dracut: *** Including module: shutdown ***
dracut: *** Including module: usrmount ***
dracut: *** Including module: emergency ***
dracut: *** Including module: 99base ***
dracut: *** Including module: 99shutdown ***
dracut: *** Including module: 99emergency ***
dracut: *** Creating initramfs image file '/boot/initramfs-6.5.0-0.rc0.20260401git1234567.el10.x86_64.img' ***
dracut: *** Creating initramfs image file '/boot/initramfs-5.14.0-427.28.1.el9_5.x86_64.img' ***

# Reboot the system
$ sudo reboot

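# Scenario 4: file system is corrupted
# Check the file system (run this from rescue mode with the file system unmounted; for XFS use xfs_repair instead)
$ sudo fsck -y /dev/sda1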
fsck from util-linux 2.37.4
e2fsck 1.46.5 (30-Dec-2021)
/dev/sda1: clean, 123456/6553600 files, 1234567/26214400 blocks

# If the file system is damaged, repair it
$ sudo fsck -y -f /dev/sda1
fsck from util-linux 2.37.4
e2fsck 1.46.5 (30-Dec-2021)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary
/dev/sda1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda1: 123456/6553600 files (0.2% non-contiguous), 1234567/26214400 blocks

# Reboot the system
$ sudo reboot

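Fengge's tip: When the system fails to boot, try booting the old kernel first. If that does not work, move on to repairing GRUB or the file system.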
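
Picking the boot entry index by eye is error-prone when several kernels are installed. The sketch below reads `grubby --info=ALL` output on standard input and prints the index of the first entry whose kernel path contains a given string. The function name `find_entry_index` is hypothetical, and the sketch assumes each entry prints an `index=` line before its `kernel=` line.

```shell
# Hypothetical helper: read `grubby --info=ALL` output on stdin and print
# the index of the first boot entry whose kernel path contains PATTERN.
# Exits non-zero if no entry matches.
find_entry_index() {
    awk -v pat="$1" '
        /^index=/  { idx = substr($0, 7) }      # remember the current entry index
        /^kernel=/ && index($0, pat) > 0 {      # plain substring match, not a regex
            print idx
            found = 1
            exit
        }
        END { exit (found ? 0 : 1) }
    '
}
```

A possible use is `idx=$(sudo grubby --info=ALL | find_entry_index el9) && sudo grubby --set-default-index="$idx"`.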

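5. Handling Network Failures

This section covers network connectivity problems after the upgrade.

# Scenario 1: network interface name changed
# List the network interfaces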
$ ip link show
1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
2: enp0s3: mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 00:0c:29:12:34:56 brd ff:ff:ff:ff:ff:ff

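# List the old network configuration files
# (applies only to profiles still stored in ifcfg format; keyfile profiles live under /etc/NetworkManager/system-connections/ instead)
$ ls -la /etc/sysconfig/network-scripts/
total 12
-rw-r--r-- 1 root root 356 Apr 2 10:00:00 ifcfg-ens33
-rw-r--r-- 1 root root 254 Apr 2 10:00:00 ifcfg-lo

# Rename the network configuration file
$ sudo mv /etc/sysconfig/network-scripts/ifcfg-ens33 /etc/sysconfig/network-scripts/ifcfg-enp0s3

# Update the device name inside the configuration file
$ sudo sed -i 's/NAME="ens33"/NAME="enp0s3"/' /etc/sysconfig/network-scripts/ifcfg-enp0s3
$ sudo sed -i 's/DEVICE="ens33"/DEVICE="enp0s3"/' /etc/sysconfig/network-scripts/ifcfg-enp0s3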

# Restart the network service
$ sudo systemctl restart NetworkManager

# Verify network connectivity
$ ip addr show enp0s3
2: enp0s3: mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 00:0c:29:12:34:56 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.100/24 brd 192.168.1.255 scope global dynamic noprefixroute enp0s3
valid_lft 86399sec preferred_lft 86399sec

# Scenario 2: NetworkManager configuration problem
# Check the NetworkManager status
$ sudo systemctl status NetworkManager
● NetworkManager.service - Network Manager
Loaded: loaded (/usr/lib/systemd/system/NetworkManager.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2026-04-02 10:00:00 CST; 10s ago
Docs: man:NetworkManager(8)
Main PID: 1234 (NetworkManager)
Tasks: 3 (limit: 4915)
Memory: 8.5M
CGroup: /system.slice/NetworkManager.service
├─1234 /usr/sbin/NetworkManager --no-daemon
├─1235 /usr/sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-enp0s3.pid -lf /var/lib/NetworkManager/dhclient-5fb06bd0-af35-11eb-95e7-54e1adxxxxxx-enp0s3.lease -cf /var/lib/NetworkManager/dhclient-enp0s3.conf enp0s3
└─1236 /usr/sbin/dnsmasq --no-resolv --keep-in-foreground --no-hosts --bind-interfaces --pid-file=/var/run/NetworkManager/dnsmasq-enp0s3.pid --listen-address=127.0.0.1 --cache-size=400 --dhcp-range=192.168.1.100,192.168.1.200,12h --dhcp-option=option:router,192.168.1.1 --dhcp-option=option:dns-server,8.8.8.8 --conf-file=/var/lib/NetworkManager/dnsmasq-enp0s3.conf

# List the network connections
$ sudo nmcli connection show
NAME UUID TYPE DEVICE
ens33 abc12345-def67-8901-2345-678901234567 ethernet enp0s3

# Reload the connection profiles
$ sudo nmcli connection reload

# Bring the connection up again
$ sudo nmcli connection up ens33
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/1)

# Verify network connectivity
$ ping -c 4 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=12.3 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=119 time=11.8 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=119 time=12.1 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=119 time=11.9 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3005ms
rtt min/avg/max/mdev = 11.832/12.058/12.317/0.198 ms

# Scenario 3: DNS configuration problem
# Check the DNS configuration
$ cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 127.0.0.1

# Show the DNS settings of the NetworkManager connection
$ sudo nmcli connection show ens33 | grep dns
ipv4.dns: 8.8.8.8,8.8.4.4
ipv4.dns-search: --
ipv4.dns-options: --
ipv6.dns: --
ipv6.dns-search: --
ipv6.dns-options: --

# Reconfigure DNS
$ sudo nmcli connection mod ens33 ipv4.dns "8.8.8.8 8.8.4.4"
$ sudo nmcli connection up ens33
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/1)

# Verify DNS resolution
$ nslookup www.baidu.com
Server: 8.8.8.8
Address: 8.8.8.8#53

Non-authoritative answer:
www.baidu.com canonical name = www.a.shifen.com
Name: www.a.shifen.com
Address: 14.215.177.38
Name: www.a.shifen.com
Address: 14.215.177.39

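Key points for network failures: 1. Check whether the network interface name changed; 2. Check the NetworkManager configuration; 3. Check the DNS configuration; 4. Check the firewall rules; 5. Test network connectivity.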
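
The first check, whether an expected interface name still exists, can be scripted. This is a minimal sketch: the function name `missing_interfaces` and the `NET_DIR` override are hypothetical, and it assumes the kernel exposes one entry per interface under `/sys/class/net`, as Linux does.

```shell
# Hypothetical helper: print every expected interface name that the kernel
# does not currently expose. NET_DIR defaults to /sys/class/net and can be
# overridden, which is mainly useful for testing.
missing_interfaces() {
    net_dir=${NET_DIR:-/sys/class/net}
    for name in "$@"; do
        # Each network device appears as an entry named after the interface
        [ -e "$net_dir/$name" ] || printf '%s\n' "$name"
    done
}
```

Running `missing_interfaces ens33` after the upgrade prints `ens33` if the interface was renamed, which is the cue to update the connection profile.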

6. Handling Service Failures

This section covers services that fail to start after the upgrade.

# Scenario 1: service fails to start
# Check the service status
$ sudo systemctl status httpd
● httpd.service - The Apache HTTP Server
Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2026-04-02 10:00:00 CST; 10s ago
Docs: man:httpd.service(8)
Process: 1234 ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND (code=exited, status=1/FAILURE)
Main PID: 1234 (code=exited, status=1/FAILURE)

Apr 02 10:00:00 rhel10-server systemd[1]: Starting The Apache HTTP Server...
Apr 02 10:00:00 rhel10-server httpd[1234]: AH00526: Syntax error on line 123 of /etc/httpd/conf/httpd.conf:
Apr 02 10:00:00 rhel10-server httpd[1234]: Invalid command 'Require', perhaps misspelled or defined by a module not included in the server configuration
Apr 02 10:00:00 rhel10-server systemd[1]: httpd.service: Main process exited, code=exited, status=1/FAILURE
Apr 02 10:00:00 rhel10-server systemd[1]: httpd.service: Failed with result 'exit-code'
Apr 02 10:00:00 rhel10-server systemd[1]: Failed to start The Apache HTTP Server.

# Inspect the configuration file
$ sudo sed -n '120,125p' /etc/httpd/conf/httpd.conf
# Require all granted

# Check the loaded Apache modules
$ sudo httpd -M | grep auth
authz_core_module (shared)
authz_host_module (shared)

# Enable the required module
$ sudo sed -i 's/#LoadModule authz_core_module/LoadModule authz_core_module/' /etc/httpd/conf.modules.d/00-base.conf

# Restart the service
$ sudo systemctl restart httpd

# Verify the service status
$ sudo systemctl status httpd
● httpd.service - The Apache HTTP Server
Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2026-04-02 10:00:00 CST; 10s ago
Docs: man:httpd.service(8)
Main PID: 1235 (httpd)
Tasks: 175 (limit: 4915)
Memory: 15.2M
CGroup: /system.slice/httpd.service
├─1235 /usr/sbin/httpd -DFOREGROUND
├─1236 /usr/sbin/httpd -DFOREGROUND
├─1237 /usr/sbin/httpd -DFOREGROUND
├─1238 /usr/sbin/httpd -DFOREGROUND
└─1239 /usr/sbin/httpd -DFOREGROUND

# Scenario 2: database service fails to start
# Check the service status
$ sudo systemctl status mariadb
● mariadb.service - MariaDB 10.11 database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2026-04-02 10:00:00 CST; 10s ago
Docs: man:mariadbd(8)
Process: 1234 ExecStartPre=/usr/libexec/mariadb-check-socket (code=exited, status=0/SUCCESS)
Process: 1235 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir mariadb.service (code=exited, status=1/FAILURE)
Main PID: 1235 (code=exited, status=1/FAILURE)

Apr 02 10:00:00 rhel10-server systemd[1]: Starting MariaDB 10.11 database server...
Apr 02 10:00:00 rhel10-server mariadb-prepare-db-dir[1235]: File '/var/lib/mysql/ibdata1' size is 0 bytes
Apr 02 10:00:00 rhel10-server mariadb-prepare-db-dir[1235]: InnoDB: Error: log file ./ib_logfile0 is of different size 0 bytes than InnoDB: specified 5242880 bytes!
Apr 02 10:00:00 rhel10-server systemd[1]: mariadb.service: Control process exited, code=exited, status=1/FAILURE
Apr 02 10:00:00 rhel10-server systemd[1]: mariadb.service: Failed with result 'exit-code'
Apr 02 10:00:00 rhel10-server systemd[1]: Failed to start MariaDB 10.11 database server.

# Inspect the database directory
$ sudo ls -la /var/lib/mysql/
total 123456
drwxr-xr-x 2 mysql mysql 4096 Apr 2 10:00:00 .
drwxr-xr-x 3 root root 4096 Apr 2 10:00:00 ..
-rw-r----- 1 mysql mysql 0 Apr 2 10:00:00 ibdata1
-rw-r----- 1 mysql mysql 5242880 Apr 2 10:00:00 ib_logfile0
-rw-r----- 1 mysql mysql 5242880 Apr 2 10:00:00 ib_logfile1

# Delete the damaged redo log files (back up /var/lib/mysql first)
$ sudo rm -f /var/lib/mysql/ib_logfile0 /var/lib/mysql/ib_logfile1

# Restart the service
$ sudo systemctl restart mariadb

# Verify the service status
$ sudo systemctl status mariadb
● mariadb.service - MariaDB 10.11 database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2026-04-02 10:00:00 CST; 10s ago
Docs: man:mariadbd(8)
Process: 1236 ExecStartPost=/usr/libexec/mysql-check-upgrade (code=exited, status=0/SUCCESS)
Main PID: 1237 (mariadbd)
Tasks: 10 (limit: 4915)
Memory: 150.2M
CGroup: /system.slice/mariadb.service
└─1237 /usr/libexec/mariadbd --basedir=/usr

# Scenario 3: SELinux blocks the service
# Check the SELinux context
$ sudo ls -Z /var/www/html
system_u:object_r:httpd_sys_content_t:s0 index.html

# Check the SELinux audit log
$ sudo ausearch -m avc -ts recent | tail -20
type=AVC msg=audit(1234567890.123:123): avc: denied { read } for pid=1234 comm="httpd" name="index.html" dev="sda1" ino=123456 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:user_home_t:s0 tclass=file permissive=0

# Restore the SELinux context
$ sudo restorecon -R -v /var/www/html
restorecon reset /var/www/html context system_u:object_r:user_home_t:s0->system_u:object_r:httpd_sys_content_t:s0

# Restart the service
$ sudo systemctl restart httpd

# Verify the service status
$ sudo systemctl status httpd
● httpd.service - The Apache HTTP Server
Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2026-04-02 10:00:00 CST; 10s ago
Docs: man:httpd.service(8)
Main PID: 1235 (httpd)
Tasks: 175 (limit: 4915)
Memory: 15.2M
CGroup: /system.slice/httpd.service
├─1235 /usr/sbin/httpd -DFOREGROUND
├─1236 /usr/sbin/httpd -DFOREGROUND
├─1237 /usr/sbin/httpd -DFOREGROUND
├─1238 /usr/sbin/httpd -DFOREGROUND
└─1239 /usr/sbin/httpd -DFOREGROUND

Fengge's tip: When a service fails to start, first check its status and logs to find the cause, then fix that specific cause.
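
AVC denial records are long, and usually only three fields matter at first: which command was blocked, on which file, and with which target context. The following is a minimal sketch that extracts those fields from audit lines on standard input; the function name `avc_summary` is hypothetical.

```shell
# Hypothetical helper: read audit AVC denial lines on stdin and print
# "command target-name target-context" for each denial.
avc_summary() {
    awk '
        /avc: *denied/ {
            comm = name = tctx = "-"            # placeholders for missing fields
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^comm=/)     { comm = $i; sub(/^comm=/, "", comm) }
                if ($i ~ /^name=/)     { name = $i; sub(/^name=/, "", name) }
                if ($i ~ /^tcontext=/) { tctx = $i; sub(/^tcontext=/, "", tctx) }
            }
            gsub(/"/, "", comm)                 # drop the quotes around values
            gsub(/"/, "", name)
            print comm, name, tctx
        }
    '
}
```

It could be used as `sudo ausearch -m avc -ts recent | avc_summary`. A target context such as `user_home_t` on a file under `/var/www/html` points to a mislabeled file, which `restorecon` fixes.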

7. LVM Snapshot Rollback

This section uses an LVM snapshot to roll the system back.

# List the LVM volumes and snapshots
$ sudo lvs
  LV                      VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root                    rhel owi-aotz-- 50.00g
  rhel10-upgrade-snapshot rhel Vri---tz-k 50.00g      root   0.00
  home                    rhel -wi-ao---- 20.00g
  var                     rhel -wi-ao---- 20.00g
  tmp                     rhel -wi-ao---- 10.00g
  swap                    rhel -wi-ao----  4.00g

# Check the snapshot status
$ sudo lvdisplay /dev/rhel/rhel10-upgrade-snapshot
--- Logical volume ---
LV Path /dev/rhel/rhel10-upgrade-snapshot
LV Name rhel10-upgrade-snapshot
VG Name rhel
LV UUID abc123-def456-7890-1234-567890123456
LV Write Access read/write
LV Creation host, time rhel9-server, 2026-04-02 10:00:00 +0800
LV Status available
# open 0
LV Size 50.00 GiB
Current LE 12800
COW-table size 4.00 GiB
COW-table LE 1024
Allocated to snapshot 0.00%
Snapshot chunk size 4.00 KiB
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:6

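# Prepare for the rollback: stop services
$ sudo systemctl stop httpd
$ sudo systemctl stop mariadb
$ sudo systemctl stop docker

# Unmount file systems
$ sudo umount /home
$ sudo umount /var
$ sudo umount /tmp

# Merge the LVM snapshot back into its origin
# (if the origin volume is in use, as a mounted root is, LVM defers the merge until the volume is next activated, typically at the next boot)
$ sudo lvconvert --merge /dev/rhel/rhel10-upgrade-snapshot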
Merging of volume rhel/rhel10-upgrade-snapshot into rhel/root started.
rhel/root: Merged: 0.00%
rhel/root: Merged: 25.00%
rhel/root: Merged: 50.00%
rhel/root: Merged: 75.00%
rhel/root: Merged: 100.00%

# Reboot to complete the merge
$ sudo reboot

# After the reboot, verify the rollback
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.5 (Plow)

$ uname -r
5.14.0-427.28.1.el9_5.x86_64

# Check the LVM state
$ sudo lvs
  LV   VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root rhel owi-aotz-- 50.00g
  home rhel -wi-ao---- 20.00g
  var  rhel -wi-ao---- 20.00g
  tmp  rhel -wi-ao---- 10.00g
  swap rhel -wi-ao----  4.00g

# The snapshot has been removed automatically

# Verify system functionality
$ sudo systemctl status sshd
● sshd.service - OpenSSH server daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2026-04-02 10:00:00 CST; 10s ago
Docs: man:sshd(8) man:sshd_config(5)
Main PID: 1234 (sshd)
Tasks: 1 (limit: 4915)
Memory: 5.2M
CGroup: /system.slice/sshd.service
└─1234 /usr/sbin/sshd -D -oCiphers=aes256-gcm@openssh.com,chacha20-poly1305@openssh.com,aes256-ctr,aes128-gcm@openssh.com,aes128-ctr

$ ping -c 4 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=12.3 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=119 time=11.8 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=119 time=12.1 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=119 time=11.9 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3005ms
rtt min/avg/max/mdev = 11.832/12.058/12.317/0.198 ms

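Key points for LVM snapshot rollback: 1. Stop all services; 2. Unmount file systems; 3. Merge the snapshot; 4. Reboot the system; 5. Verify system functionality.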
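
A classic LVM snapshot that fills its copy-on-write space becomes invalid and can no longer be merged, so it is worth checking usage before relying on it. The sketch below reads "name percent" pairs on standard input, for example from `lvs --noheadings -o lv_name,snap_percent`. The function name `snapshot_usable` and the limit you choose are assumptions.

```shell
# Hypothetical helper: read "name percent" pairs on stdin and succeed only
# if the named snapshot is listed and its usage is below LIMIT percent.
snapshot_usable() {
    awk -v snap="$1" -v limit="$2" '
        $1 == snap {
            seen = 1
            # Force numeric comparison; a missing percentage counts as 0
            if ($2 + 0 >= limit + 0) full = 1
        }
        END { exit (seen && !full) ? 0 : 1 }
    '
}
```

For example: `sudo lvs --noheadings -o lv_name,snap_percent rhel | snapshot_usable rhel10-upgrade-snapshot 90 || echo "snapshot missing or nearly full"`.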

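8. GRUB Boot Rollback

This section uses the GRUB boot menu to roll the system back.

# List the GRUB menu entries
$ sudo grubby --info=ALL | grep -A 2 "title"
title=Red Hat Enterprise Linux (6.5.0-0.rc0.20260401git1234567.el10.x86_64) 10.0 (Coughlan)
title=Red Hat Enterprise Linux (5.14.0-427.28.1.el9_5.x86_64) 9.5 (Plow)

# Show the current default boot entry
$ sudo grubby --default-index
0

# Set the old kernel as the default boot entry
$ sudo grubby --set-default-index=1

# Verify the default boot entry
$ sudo grubby --default-index
1

# Reboot the system
$ sudo reboot

# After the system boots, verify the kernel version
$ uname -r
5.14.0-427.28.1.el9_5.x86_64

# If the GRUB menu is damaged, use rescue mode
# 1. Reboot and press 'e' when the GRUB menu appears
# 2. Find the line that starts with 'linux16' or 'linux'
# 3. Append 'rd.break' to the end of that line
# 4. Press 'Ctrl+x' to boot

# In the rescue shell, remount the root file system read-write
switch_root:/# mount -o remount,rw /sysroot

# Change root into the installed system
switch_root:/# chroot /sysroot

# Reinstall GRUB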
bash-4.4# grub2-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.

# Regenerate the GRUB configuration
bash-4.4# grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.5.0-0.rc0.20260401git1234567.el10.x86_64
Found initrd image: /boot/initramfs-6.5.0-0.rc0.20260401git1234567.el10.x86_64.img
Found linux image: /boot/vmlinuz-5.14.0-427.28.1.el9_5.x86_64
Found initrd image: /boot/initramfs-5.14.0-427.28.1.el9_5.x86_64.img
done

# Leave the chroot environment
bash-4.4# exit

# Reboot the system
switch_root:/# reboot

# If a damaged kernel needs to be removed
# List the installed kernels
$ rpm -qa | grep kernel
kernel-6.5.0-0.rc0.20260401git1234567.el10.x86_64
kernel-core-6.5.0-0.rc0.20260401git1234567.el10.x86_64
kernel-modules-6.5.0-0.rc0.20260401git1234567.el10.x86_64
kernel-5.14.0-427.28.1.el9_5.x86_64
kernel-core-5.14.0-427.28.1.el9_5.x86_64
kernel-modules-5.14.0-427.28.1.el9_5.x86_64

# Remove the damaged kernel
$ sudo dnf remove kernel-6.5.0-0.rc0.20260401git1234567.el10.x86_64
Dependencies resolved.
================================================================================
Package Arch Version Repository Size
================================================================================
Removing:
kernel x86_64 6.5.0-0.rc0.20260401git1234567.el10 rhel-10-baseos 150 M
kernel-core x86_64 6.5.0-0.rc0.20260401git1234567.el10 rhel-10-baseos 100 M
kernel-modules x86_64 6.5.0-0.rc0.20260401git1234567.el10 rhel-10-baseos 50 M

Transaction Summary
================================================================================
Remove 3 Packages

Installed size: 300 M

Is this ok [y/N]: y
Running transaction
Preparing : 1/1
Erasing : kernel-6.5.0-0.rc0.20260401git1234567.el10.x86_64 1/3
Erasing : kernel-core-6.5.0-0.rc0.20260401git1234567.el10.x86_64 2/3
Erasing : kernel-modules-6.5.0-0.rc0.20260401git1234567.el10.x86_64 3/3
Running scriptlet: kernel-6.5.0-0.rc0.20260401git1234567.el10.x86_64 3/3
Verifying : kernel-6.5.0-0.rc0.20260401git1234567.el10.x86_64 1/3
Verifying : kernel-core-6.5.0-0.rc0.20260401git1234567.el10.x86_64 2/3
Verifying : kernel-modules-6.5.0-0.rc0.20260401git1234567.el10.x86_64 3/3

Removed:
kernel-6.5.0-0.rc0.20260401git1234567.el10.x86_64
kernel-core-6.5.0-0.rc0.20260401git1234567.el10.x86_64
kernel-modules-6.5.0-0.rc0.20260401git1234567.el10.x86_64

Complete!

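Fengge's tip: Booting the old kernel from GRUB is the simplest rollback method, because you only need to select the old kernel at boot. Note that this switches only the kernel and does not revert upgraded packages or configuration; for that, use the LVM snapshot or a backup. If the GRUB menu is damaged, repair it from rescue mode.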
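
After rebooting, a script can confirm which release the running kernel belongs to by extracting the distribution tag from the kernel release string. This is a minimal sketch; the function name `kernel_el_tag` is hypothetical, and it assumes the `.elN` naming used by the kernel versions shown above.

```shell
# Hypothetical helper: print the distribution tag (for example "el9_5")
# embedded in a kernel release string such as the output of `uname -r`.
# Prints nothing if the string contains no such tag.
kernel_el_tag() {
    printf '%s\n' "$1" | sed -n 's/.*\.\(el[0-9][0-9_]*\)\..*/\1/p'
}
```

For example, `kernel_el_tag "$(uname -r)"` reports the tag of the running kernel. It says nothing about the installed packages, so check `/etc/redhat-release` as well.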

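9. Emergency Recovery Options

This section describes emergency recovery options for severe failures.

# Option 1: recover with a Live CD
# 1. Boot the system from the Live CD
# 2. Mount the root file system
$ sudo mkdir -p /mnt/rhel
$ sudo mount /dev/sda1 /mnt/rhel

# 3. Mount the other file systems
$ sudo mount /dev/sda2 /mnt/rhel/boot
$ sudo mount /dev/rhel/home /mnt/rhel/home
$ sudo mount /dev/rhel/var /mnt/rhel/var

# 4. Bind-mount the virtual file systems, then chroot into the system
$ for fs in dev proc sys; do sudo mount --bind /$fs /mnt/rhel/$fs; done
$ sudo chroot /mnt/rhel

# 5. Perform the recovery
# For example: reinstall GRUB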
bash-4.4# grub2-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.

# 6. Leave the chroot and reboot
bash-4.4# exit
$ sudo reboot

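# Option 2: restore from backup
# 1. Boot the system from the Live CD
# 2. Mount the backup disk
$ sudo mkdir -p /mnt/backup
$ sudo mount /dev/sdb1 /mnt/backup

# 3. Restore the system configuration files
# (-C / extracts onto the running root; when working from a Live CD, point -C at the mounted target root, e.g. /mnt/rhel from option 1)
$ sudo tar -xzf /mnt/backup/system-config-20260402.tar.gz -C /

# 4. Restore important data
$ sudo tar -xzf /mnt/backup/important-data-20260402.tar.gz -C /

# 5. Restore the database (requires a running MariaDB server)
$ sudo mysql < /mnt/backup/mysql-backup-20260402.sql

# 6. Reboot the system
$ sudo reboot

# Option 3: reinstall the system
# 1. Boot from the installation media
# 2. Select "Install RHEL 10"
# 3. Choose "Reinstall" during installation
# 4. Keep the data partitions
# 5. Restore data after the installation completes

# Restore data
$ sudo tar -xzf /backup/important-data-20260402.tar.gz -C /

# Restore the database
$ sudo mysql < /backup/mysql-backup-20260402.sql

# Restore the system configuration
$ sudo tar -xzf /backup/system-config-20260402.tar.gz -C /

# Restart services
$ sudo systemctl start httpd
$ sudo systemctl start mariadb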

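Emergency recovery options: 1. Recover with a Live CD; 2. Restore from backup; 3. Reinstall the system. Which option to choose depends on how severe the failure is and how complete the backups are.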
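
Because the choice of option depends on backup completeness, it helps to confirm that an archive is readable before extracting it over a system. The following is a minimal sketch of such a check; the function name `verify_backup` is hypothetical, and listing the archive only proves it can be read end to end, not that its contents are what you expect.

```shell
# Hypothetical helper: succeed only if ARCHIVE exists, is non-empty, and
# can be listed completely by tar (a basic integrity check).
verify_backup() {
    archive=$1
    [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; return 1; }
    # Listing forces tar to decompress and walk the whole archive
    tar -tzf "$archive" > /dev/null 2>&1 || { echo "unreadable archive: $archive" >&2; return 1; }
}
```

For example: `verify_backup /mnt/backup/system-config-20260402.tar.gz && sudo tar -xzf /mnt/backup/system-config-20260402.tar.gz -C /mnt/rhel`, with the target directory adjusted to your own layout.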

10. Fengge's Lessons Learned

Lessons from troubleshooting and rolling back upgrades in production environments.

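# Lesson 1: establish a sound backup strategy
$ cat > /backup/troubleshooting_experience1.txt << 'EOF'
Lesson 1: establish a sound backup strategy
1. Back up completely
   - System configuration files
   - Important data
   - Databases
   - Package list
   - System information
2. Back up regularly
   - Daily incremental backups
   - Weekly full backups
   - An extra backup before the upgrade
   - Verify every backup
3. Back up securely
   - Store copies in multiple locations
   - Encrypt sensitive data
   - Test restores regularly
   - Record backup details
EOF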
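# Lesson 2: establish a sound monitoring system
$ cat > /backup/troubleshooting_experience2.txt << 'EOF'
Lesson 2: establish a sound monitoring system
1. System monitoring
   - CPU usage
   - Memory usage
   - Disk usage
   - Network traffic
2. Service monitoring
   - Service status
   - Service response time
   - Service error rate
   - Service logs
3. Application monitoring
   - Application functionality
   - Application performance
   - Application errors
   - Application logs
EOF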
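# Lesson 3: establish a sound emergency plan
$ cat > /backup/troubleshooting_experience3.txt << 'EOF'
Lesson 3: establish a sound emergency plan
1. Emergency response process
   - Problem detection
   - Problem assessment
   - Problem handling
   - Problem verification
2. Emergency contacts
   - Technical lead
   - System administrator
   - Application owner
   - Technical support
3. Emergency resources
   - Backup servers
   - Spare hardware
   - Emergency documentation
   - Recovery tools
EOF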
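Fengge's tip: Troubleshooting and rolling back upgrades is a core system administration skill. Sound backups, monitoring, and emergency plans minimize the impact of a failure on the business.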
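
The backup advice in lesson 1 (back up before the upgrade, and verify every backup) can be sketched as a small function that writes a dated archive together with a checksum. This is a minimal illustration under stated assumptions: the function name `backup_dir`, the naming scheme, and the use of `sha256sum` are choices made for this example, and the destination must not sit inside the source directory.

```shell
# Hypothetical helper: archive SRC_DIR into DEST_DIR as LABEL-YYYYMMDD.tar.gz
# and write a SHA-256 checksum file next to it. Prints the archive path.
backup_dir() {
    src_dir=$1
    dest_dir=$2
    label=$3
    archive="$dest_dir/$label-$(date +%Y%m%d).tar.gz"
    mkdir -p "$dest_dir" || return 1
    # -C keeps the paths inside the archive relative to the source directory
    tar -czf "$archive" -C "$src_dir" . || return 1
    # Store the checksum so the archive can be verified before a restore
    ( cd "$dest_dir" && sha256sum "$(basename "$archive")" > "$(basename "$archive").sha256" ) || return 1
    printf '%s\n' "$archive"
}
```

For example, `sudo sh -c '. ./helpers.sh; backup_dir /etc /backup system-config'` would create the archive and its checksum under `/backup`; running `sha256sum -c` on the `.sha256` file in that directory verifies it later.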

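Summary: An upgrade can fail in many ways, but a systematic troubleshooting method and a solid rollback plan make it possible to restore the system quickly. In production, establish a sound backup strategy, monitoring system, and emergency plan so that you are prepared for unexpected situations.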

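This article was compiled and published by 风哥教程 (Fengge Tutorials) for study and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html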
