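1. Fault diagnosis overview
Diagnosing server hardware faults is a core part of operations work: a systematic approach lets you locate and resolve hardware problems quickly.
Hardware fault types: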
┌─────────────────────────────────────────────────────────┐
│                Hardware fault categories                │
└────────────────────────────┬────────────────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         v                   v                   v
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Compute faults  │ │ Storage faults  │ │ Network faults  │
│  CPU / memory   │ │   Disk / RAID   │ │  NIC / cabling  │
└─────────────────┘ └─────────────────┘ └─────────────────┘
         │                   │                   │
         v                   v                   v
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│  Power faults   │ │ Cooling faults  │ │  Board faults   │
│  PSU / battery  │ │  Fan / thermal  │ │   BIOS / BMC    │
└─────────────────┘ └─────────────────┘ └─────────────────┘
# View system hardware information
# dmidecode -t system | head -20
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.
Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R740
Version: Not Specified
Serial Number: ABCD1234
UUID: 12345678-1234-1234-1234-123456789012
Wake-up Type: Power Switch
SKU Number: 0869
Family: PowerEdge
# View BMC information
# ipmitool mc info
Device ID : 32
Device Revision : 1
Firmware Revision : 5.20.00
IPMI Version : 2.0
Manufacturer ID : 674
Manufacturer Name : Dell Inc.
Product ID : 256 (0x0100)
Product Name : Unknown (0x100)
Device Available : yes
Provides Device SDRs : yes
Additional Device Support :
Sensor Device
SDR Repository Device
SEL Device
FRU Inventory Device
# View hardware-related entries in the kernel log
# journalctl -k | grep -i "hardware\|error\|fail" | tail -20
Apr 03 10:00:00 fgedu-server kernel: EDAC MC0: 1 CE memory read error on CPU0_DIMM_A1
Apr 03 10:05:00 fgedu-server kernel: mce: CPU0: Machine Check Exception: Bank 0
Apr 03 10:10:00 fgedu-server kernel: sd 0:0:1:0: [sdb] Medium error
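Each of the three sample lines above points at a different subsystem (memory, CPU, disk). As a rough sketch of how that first triage step could be automated, the hypothetical helper below tags kernel log lines by category. The function name and the match patterns are assumptions derived only from the sample lines, so treat them as a starting point rather than a complete rule set.

```shell
#!/bin/bash
# classify_hw_log (hypothetical helper): read kernel log lines on stdin and
# prefix each one with a guessed hardware category. The patterns are
# assumptions based on the sample lines, not an exhaustive list.
classify_hw_log() {
    while IFS= read -r line; do
        case "$line" in
            *EDAC*|*"memory read error"*)   echo "memory: $line" ;;
            *"Machine Check"*|*mce:*)       echo "cpu: $line" ;;
            *"Medium error"*|*"I/O error"*) echo "disk: $line" ;;
            *)                              echo "other: $line" ;;
        esac
    done
}

# Try it on one of the sample lines
echo "kernel: sd 0:0:1:0: [sdb] Medium error" | classify_hw_log
```

On a live system, something like `journalctl -k | classify_hw_log | cut -d: -f1 | sort | uniq -c` would give a per-category count of kernel messages.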
2. CPU fault diagnosis
CPU faults can cause system crashes or degraded performance.
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3000.000
BogoMIPS: 6000.00
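# View raw MCE records (many kernels expose these only through /dev/mcelog or rasdaemon, so this path may be absent)
# cat /proc/mce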
CPU 0 BANK 0 TSC 123456789012345
STATUS 0x9c00000000000100 MCGSTATUS 0x0
MCGCAP 0x1719 ADDR 0x0 MISC 0x0
PROCESSOR 0:6 TIME 1712120400 SOCKET 0 APIC 0
# Install the mcelog tool
# yum install -y mcelog
# Query the MCE log from the running mcelog daemon
# mcelog --client
CPU 0: Machine Check Exception
BANK 0: Correctable error
STATUS: 0x9c00000000000100
MCGSTATUS: 0x0
TIME: Fri Apr 3 10:00:00 2026
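The STATUS value in the output above encodes whether the error was corrected. A minimal sketch of decoding it in bash follows; the function name is hypothetical, and the bit positions used (63 VAL, 62 OVER, 61 UC, 57 PCC) are those of the x86 machine-check MCi_STATUS register. Bank-specific error codes in the low bits are not interpreted here.

```shell
#!/bin/bash
# decode_mce_status (hypothetical helper): print the main flag bits of an
# MCi_STATUS value given as a hex string.
# Bit positions: 63 VAL, 62 OVER, 61 UC, 57 PCC.
decode_mce_status() {
    # Bash arithmetic is 64-bit signed; masking after the shift keeps
    # the result correct even when bit 63 is set.
    local status=$(( $1 ))
    local val=$((  (status >> 63) & 1 ))
    local over=$(( (status >> 62) & 1 ))
    local uc=$((   (status >> 61) & 1 ))
    local pcc=$((  (status >> 57) & 1 ))
    echo "VAL=$val OVER=$over UC=$uc PCC=$pcc"
    if [ "$val" -eq 0 ]; then
        echo "no valid error logged"
    elif [ "$uc" -eq 1 ]; then
        echo "uncorrected error"
    else
        echo "corrected error"
    fi
}

# The STATUS value from the sample output
decode_mce_status 0x9c00000000000100
```

For the sample value, VAL is set and UC is clear, which agrees with the "Correctable error" line that mcelog printed.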
# CPU stress test
# cat > /opt/scripts/cpu_stress_test.sh << 'EOF'
#!/bin/bash
echo "Starting CPU stress test..."
echo "CPU cores: $(nproc)"
# Install the stress tool if it is missing
if ! command -v stress &> /dev/null; then
    yum install -y stress
fi
# Load every core for 60 seconds and branch on the exit status directly
if stress --cpu "$(nproc)" --timeout 60s --verbose; then
    echo "CPU stress test passed"
else
    echo "CPU stress test failed"
fi
# Show temperatures
echo ""
echo "CPU temperature:"
sensors | grep -i "core\|package"
EOF
# chmod +x /opt/scripts/cpu_stress_test.sh
# View CPU temperature
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +45.0°C (high = +85.0°C, crit = +95.0°C)
Core 0: +42.0°C (high = +85.0°C, crit = +95.0°C)
Core 1: +43.0°C (high = +85.0°C, crit = +95.0°C)
Core 2: +44.0°C (high = +85.0°C, crit = +95.0°C)
Core 3: +43.0°C (high = +85.0°C, crit = +95.0°C)
# View CPU status through IPMI
# ipmitool sensor | grep -i cpu
CPU1 Temp | 45.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 95.000 | na
CPU2 Temp | 43.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 95.000 | na
CPU1 Status | 0x0 | discrete | 0x0000| na | na | na | na | na | na
CPU2 Status | 0x0 | discrete | 0x0000| na | na | na | na | na | na
3. Memory fault diagnosis
Memory faults are a common hardware problem and can cause data corruption and system crashes.
# dmidecode -t memory | head -40
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.
Handle 0x0034, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 3 TB
Error Information Handle: Not Provided
Number Of Devices: 24
Handle 0x0035, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0034
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM_A1
Bank Locator: Not Specified
Type: DDR4
Type Detail: Synchronous
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 12345678
Asset Tag: Not Specified
Part Number: M393A4K40CB2-CTD
# View memory usage
# free -h
total used free shared buff/cache available
Mem: 125Gi 2.1Gi 122Gi 128Mi 1.2Gi 122Gi
Swap: 8.0Gi 0B 8.0Gi
# View EDAC errors
# edac-util -v
mc0: 0 CE, 0 UE
mc0/csrow0: 0 CE, 0 UE
mc0/csrow1: 0 CE, 0 UE
mc0/csrow2: 0 CE, 0 UE
mc0/csrow3: 0 CE, 0 UE
# View memory error counters
# cat /sys/devices/system/edac/mc/mc0/ce_count
0
# cat /sys/devices/system/edac/mc/mc0/ue_count
0
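The counters above are per memory controller, so a multi-socket machine has several `mc*` directories to read. As a sketch (the function name is hypothetical, and the base directory is a parameter only so the function can be tried against a test copy), the totals could be collected like this:

```shell
#!/bin/bash
# edac_totals (hypothetical helper): sum ce_count and ue_count over all
# mc* directories under the EDAC sysfs tree.
edac_totals() {
    local base="${1:-/sys/devices/system/edac/mc}"
    local ce=0 ue=0 mc
    for mc in "$base"/mc*; do
        # Skip controllers whose counters are missing or unreadable
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$(( ce + $(cat "$mc/ce_count") ))
        ue=$(( ue + $(cat "$mc/ue_count") ))
    done
    echo "CE=$ce UE=$ue"
    # Return non-zero when any uncorrectable error has been counted
    [ "$ue" -eq 0 ]
}

edac_totals || echo "uncorrectable errors present"
```

A steadily rising CE total usually deserves attention before any UE appears, so it can be worth recording this output periodically and comparing runs.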
# Memory test script
# cat > /opt/scripts/mem_test.sh << 'EOF'
#!/bin/bash
echo "Starting memory test..."
echo "Total memory: $(free -h | awk '/^Mem:/ {print $2}')"
# Install memtester if it is missing
if ! command -v memtester &> /dev/null; then
    yum install -y memtester
fi
# Test 1 GB of memory for one pass
echo ""
echo "Running memory test (1 GB)..."
if memtester 1G 1; then
    echo "Memory test passed"
else
    echo "Memory test failed; a hardware fault is possible"
fi
EOF
# chmod +x /opt/scripts/mem_test.sh
# View memory status through IPMI
# ipmitool sensor | grep -i dimm
DIMM_A1 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
DIMM_A2 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
DIMM_B1 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
DIMM_B2 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
# View memory errors in the SEL
# ipmitool sel list | grep -i memory
1 | 04/03/2026 | 10:00:00 | Memory #0x52 | Correctable ECC | Asserted
2 | 04/03/2026 | 10:05:00 | Memory #0x53 | Uncorrectable ECC | Asserted
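When the SEL holds many entries, a per-sensor tally of correctable versus uncorrectable ECC events is easier to read than the raw list. The sketch below assumes the pipe-separated layout shown above (sensor in field 4, event in field 5); the function name is hypothetical, and other BMCs may format these fields differently.

```shell
#!/bin/bash
# sel_ecc_summary (hypothetical helper): count correctable and uncorrectable
# ECC events per sensor from `ipmitool sel list` text on stdin.
sel_ecc_summary() {
    awk -F'|' '
        /ECC/ {
            gsub(/^ +| +$/, "", $4)            # sensor name, e.g. "Memory #0x52"
            if ($5 ~ /Uncorrectable/)      ue[$4]++
            else if ($5 ~ /Correctable/)   ce[$4]++
            seen[$4] = 1
        }
        END {
            for (s in seen) printf "%s CE=%d UE=%d\n", s, ce[s], ue[s]
        }' | sort
}

# Try it on the two sample entries
printf '%s\n' \
    ' 1 | 04/03/2026 | 10:00:00 | Memory #0x52 | Correctable ECC | Asserted' \
    ' 2 | 04/03/2026 | 10:05:00 | Memory #0x53 | Uncorrectable ECC | Asserted' \
    | sel_ecc_summary
```

On a real host the input would come from `ipmitool sel list | sel_ecc_summary`.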
# Locate the faulty DIMM
# ipmitool sel list | grep -i "memory\|dimm"
1 | 04/03/2026 | 10:00:00 | Memory DIMM_A1 | Correctable ECC | Asserted
4. Disk fault diagnosis
Disk faults can cause data loss, so they need to be detected and handled promptly.
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 447.1G 0 disk
├─sda1 8:1 0 1G 0 part /boot
├─sda2 8:2 0 100G 0 part /
├─sda3 8:3 0 200G 0 part /data
└─sda4 8:4 0 146.1G 0 part
sdb 8:16 0 447.1G 0 disk
└─sdb1 8:17 0 447.1G 0 part
# View SMART information
# smartctl -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-3.10.0-1160.el7.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Dell
Device Model: DELLBOSS VD
Serial Number: ABCD1234
Firmware Version: 5.0.1-0000
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 8760
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
177 Wear_Leveling_Count 0x0013 100 100 000 Pre-fail Always - 0
183 Runtime_Bad_Block 0x0032 100 100 010 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 065 000 Old_age Always - 35 (Min/Max 20/55)
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
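In the attribute table above, the raw value is the tenth whitespace-separated column, and it may be followed by extra text (as on the temperature row). A small sketch of extracting it reliably is shown below. Both function names are hypothetical, and the watch list is an assumption taken from the attributes in the sample table; drives from other vendors report different attribute names.

```shell
#!/bin/bash
# smart_raw (hypothetical helper): print the RAW_VALUE (column 10) of one
# named attribute from `smartctl -A` text on stdin.
smart_raw() {
    awk -v name="$1" '$2 == name {print $10; exit}'
}

# smart_flag_nonzero (hypothetical helper): list watched attributes whose
# raw value is above zero. The watch list is an assumption.
smart_flag_nonzero() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Runtime_Bad_Block|Reported_Uncorrect|UDMA_CRC_Error_Count)$/ \
         && $10 + 0 > 0 {print $2 "=" $10}'
}

# Try it on a row shaped like the sample output
echo "194 Temperature_Celsius 0x0022 065 065 000 Old_age Always - 35 (Min/Max 20/55)" \
    | smart_raw Temperature_Celsius
```

On a real disk the input would be `smartctl -A /dev/sda | smart_flag_nonzero`; empty output means none of the watched counters has moved.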
# Disk health check script
# cat > /opt/scripts/disk_health_check.sh << 'EOF'
#!/bin/bash
echo "Disk health check"
echo "=========================================="
for disk in /dev/sd?; do
    [ -b "$disk" ] || continue
    echo ""
    echo "Checking disk: $disk"
    echo "----------------------------------------"
    # Overall SMART health verdict
    HEALTH=$(smartctl -H "$disk" 2>/dev/null | grep "SMART overall-health" | awk '{print $NF}')
    if [ "$HEALTH" == "PASSED" ]; then
        echo "SMART status: healthy"
    else
        echo "SMART status: warning - $HEALTH"
    fi
    # Read the attribute table once; RAW_VALUE is column 10
    ATTRS=$(smartctl -A "$disk" 2>/dev/null)
    # Reallocated sectors
    REALLOC=$(echo "$ATTRS" | awk '/Reallocated_Sector_Ct/ {print $10}')
    echo "Reallocated sectors: $REALLOC"
    # Runtime bad blocks
    BAD_BLOCKS=$(echo "$ATTRS" | awk '/Runtime_Bad_Block/ {print $10}')
    echo "Runtime bad blocks: $BAD_BLOCKS"
    # Temperature
    TEMP=$(echo "$ATTRS" | awk '/Temperature_Celsius/ {print $10}')
    echo "Current temperature: ${TEMP}°C"
    # Power-on time
    HOURS=$(echo "$ATTRS" | awk '/Power_On_Hours/ {print $10}')
    echo "Power-on time: ${HOURS} hours"
done
echo ""
echo "=========================================="
EOF
# chmod +x /opt/scripts/disk_health_check.sh
# View the disk I/O error count
# cat /sys/block/sda/device/ioerr_cnt
0
# View disk errors in dmesg
# dmesg | grep -i "sda\|error\|fail" | tail -10
[ 2.123456] sd 0:0:0:0: [sda] 937703088 512-byte logical blocks: (480 GB/447 GiB)
[ 2.123457] sd 0:0:0:0: [sda] Write Protect is off
# Check RAID status
# storcli /c0 /vall show
Controller = 0
Status = Success
Description = None
Virtual Drives:
==============
DG/VD TYPE State Access Consist Cache Cac sCC Size Name
0/0 RAID1 Optl RW Yes RWBD - ON 446.625 GB OS_VD
1/1 RAID5 Optl RW Yes RWBD - ON 1.818 TB DATA_VD
5. NIC fault diagnosis
NIC faults affect network connectivity and data transfer.
# ip link show
1: lo:
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0:
link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
3: eth1:
link/ether 00:11:22:33:44:56 brd ff:ff:ff:ff:ff:ff
# View NIC statistics
# ethtool -S eth0 | head -30
NIC statistics:
rx_packets: 123456789
tx_packets: 98765432
rx_bytes: 123456789012
tx_bytes: 98765432109
rx_broadcast: 123456
tx_broadcast: 65432
rx_multicast: 78901
tx_multicast: 34567
rx_errors: 0
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
multicast: 78901
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 0
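Driver statistics often run to hundreds of lines, so it helps to show only the error-type counters that are non-zero. The sketch below filters `ethtool -S` text for that. The function name is hypothetical, and the name patterns (`err`, `drop`, `collision`) are an assumption based on the counters listed above, since counter names differ between drivers.

```shell
#!/bin/bash
# nic_error_counters (hypothetical helper): from `ethtool -S` text on stdin,
# print non-zero counters whose name suggests an error condition.
nic_error_counters() {
    awk -F: '
        {
            name = $1; value = $2
            gsub(/[ \t]/, "", name)
            gsub(/[ \t]/, "", value)
            if (name ~ /err|drop|collision/ && value + 0 > 0)
                print name "=" value
        }'
}

# Try it on a few lines shaped like the sample output
printf '%s\n' 'NIC statistics:' '     rx_packets: 123456789' '     rx_crc_errors: 7' \
    | nic_error_counters
```

Because these counters are cumulative since driver load, comparing two runs a few minutes apart shows whether errors are still occurring; rising `rx_crc_errors` typically points at cabling or optics rather than the NIC itself.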
# View NIC link settings
# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
10000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
10000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
MDI-X: Unknown
Link detected: yes
# NIC diagnostic script
# cat > /opt/scripts/nic_diag.sh << 'EOF'
#!/bin/bash
echo "NIC diagnostics"
echo "=========================================="
# -x matches the whole name, so only the loopback interface is skipped
for nic in $(ls /sys/class/net/ | grep -vx lo); do
    echo ""
    echo "Interface: $nic"
    echo "----------------------------------------"
    # Link state
    LINK=$(cat /sys/class/net/$nic/carrier 2>/dev/null || echo "unknown")
    if [ "$LINK" == "1" ]; then
        echo "Link state: up"
    else
        echo "Link state: down"
    fi
    # Speed
    SPEED=$(cat /sys/class/net/$nic/speed 2>/dev/null || echo "unknown")
    echo "Speed: ${SPEED} Mb/s"
    # Duplex mode
    DUPLEX=$(cat /sys/class/net/$nic/duplex 2>/dev/null || echo "unknown")
    echo "Duplex: $DUPLEX"
    # MTU
    MTU=$(cat /sys/class/net/$nic/mtu)
    echo "MTU: $MTU"
    # Error counters
    RX_ERR=$(cat /sys/class/net/$nic/statistics/rx_errors)
    TX_ERR=$(cat /sys/class/net/$nic/statistics/tx_errors)
    RX_DROP=$(cat /sys/class/net/$nic/statistics/rx_dropped)
    TX_DROP=$(cat /sys/class/net/$nic/statistics/tx_dropped)
    echo "Receive errors: $RX_ERR"
    echo "Transmit errors: $TX_ERR"
    echo "Receive drops: $RX_DROP"
    echo "Transmit drops: $TX_DROP"
    # Alert check
    if [ "$RX_ERR" -gt 100 ] || [ "$TX_ERR" -gt 100 ]; then
        echo "Warning: error count is too high"
    fi
done
echo ""
echo "=========================================="
EOF
# chmod +x /opt/scripts/nic_diag.sh
# Network connectivity test
# ping -c 4 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.123 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=0.145 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=0.134 ms
64 bytes from 192.168.1.1: icmp_seq=4 ttl=64 time=0.156 ms
--- 192.168.1.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3000ms
rtt min/avg/max/mdev = 0.123/0.139/0.156/0.015 ms
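For scripted checks, the packet-loss percentage in the summary line is usually the figure that matters. A minimal sketch of pulling it out follows; the function name is hypothetical, and it assumes the summary wording shown above, which can differ between ping implementations.

```shell
#!/bin/bash
# ping_loss (hypothetical helper): print the packet-loss percentage from
# ping output on stdin, without the percent sign.
ping_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Try it on the sample summary line
echo "4 packets transmitted, 4 received, 0% packet loss, time 3000ms" | ping_loss
```

A check such as `[ "$(ping -c 4 192.168.1.1 | ping_loss)" = "0" ]` could then gate later diagnostic steps.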
6. Power supply fault diagnosis
Power supply faults can cause the server to shut down unexpectedly.
# ipmitool sensor | grep -i power
Power Supply 1 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
Power Supply 2 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
Power Units | 0x0 | discrete | 0x0000| na | na | na | na | na | na
Pwr Consumption | 450 | Watts | ok | na | na | na | na | 800 | na
# View detailed power supply information
# ipmitool fru | grep -A 20 "Power Supply"
Board Mfg : Dell Inc.
Board Product : Power Supply 1100W
Board Serial : CN-ABCD123-12345-678-ABCD
Board Part Number : 0H7H1MA00
# View power supply redundancy status
# ipmitool sdr type 'Power Supply'
Power Supply 1 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
Power Supply 2 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
# Power supply diagnostic script
# cat > /opt/scripts/psu_diag.sh << 'EOF'
#!/bin/bash
echo "Power supply diagnostics"
echo "=========================================="
# Check PSU status
echo "PSU status:"
ipmitool sensor | grep "Power Supply" | while IFS= read -r line; do
    # Field 2 is the discrete reading; field 3 is only the literal word "discrete"
    STATUS=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    # Trim only the outer spaces so "Power Supply 1" keeps its inner spaces
    NAME=$(echo "$line" | awk -F'|' '{gsub(/^ +| +$/, "", $1); print $1}')
    if [ "$STATUS" == "0x0" ]; then
        echo " $NAME: OK"
    else
        echo " $NAME: abnormal ($STATUS)"
    fi
done
# Check power draw
echo ""
echo "Power consumption:"
PWR=$(ipmitool sensor | grep "Pwr Consumption" | awk -F'|' '{print $2}' | tr -d ' ')
echo " Current draw: ${PWR}W"
# Check PSU redundancy
echo ""
echo "PSU redundancy:"
# "ipmitool sdr type" prints name | sensor ID | status | entity | reading, so field 3 is the status
REDUNDANCY=$(ipmitool sdr type 'Power Unit' | grep "Power Units" | awk -F'|' '{print $3}' | tr -d ' ')
if [ "$REDUNDANCY" == "ok" ]; then
    echo " Redundancy: OK"
else
    echo " Redundancy: degraded"
fi
# Check the SEL
echo ""
echo "Power-related events:"
ipmitool sel list | grep -i "power" | tail -5
echo ""
echo "=========================================="
EOF
# chmod +x /opt/scripts/psu_diag.sh
# View chassis power status
# ipmitool chassis status
System Power : on
Power Overload : false
Power Interlock : inactive
Main Power Fault : false
Power Control Fault : false
Power Restore Policy : previous
Last Power Event : command
Chassis Intrusion : inactive
Front Panel Lockout : inactive
Drive Fault : false
Cooling/Fan Fault : false
7. Fan fault diagnosis
Fan faults can cause the server to overheat.
# ipmitool sensor | grep -i fan
Fan1 | 4560 | RPM | ok | na | 600.000| na | na | na | na
Fan2 | 4560 | RPM | ok | na | 600.000| na | na | na | na
Fan3 | 4560 | RPM | ok | na | 600.000| na | na | na | na
Fan4 | 4560 | RPM | ok | na | 600.000| na | na | na | na
Fan5 | 4560 | RPM | ok | na | 600.000| na | na | na | na
Fan6 | 4560 | RPM | ok | na | 600.000| na | na | na | na
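In the `ipmitool sensor` output above, the columns after the status are the thresholds in the order lower non-recoverable, lower critical, lower non-critical, upper non-critical, upper critical, upper non-recoverable, so the 600 RPM figure is the lower critical limit. The sketch below (hypothetical function name) lists fans whose reading has fallen to or below that limit, skipping fans with no reading or no threshold.

```shell
#!/bin/bash
# fan_below_threshold (hypothetical helper): from `ipmitool sensor` text on
# stdin, print fans whose reading is at or below the lower-critical
# threshold (6th field).
fan_below_threshold() {
    awk -F'|' '
        $3 ~ /RPM/ {
            for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)   # trim every field
            if ($2 == "na" || $6 == "na") next                  # no reading or no threshold
            if ($2 + 0 <= $6 + 0)
                print $1 ": " $2 " RPM (lower critical " $6 ")"
        }'
}

# Try it with one healthy fan and one slow fan (made-up readings)
printf '%s\n' \
    'Fan1 | 4560 | RPM | ok | na | 600.000| na | na | na | na' \
    'Fan2 | 480.000 | RPM | cr | na | 600.000| na | na | na | na' \
    | fan_below_threshold
```

On a real host the input would be `ipmitool sensor | fan_below_threshold`; empty output means every fan is above its limit.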
# Fan diagnostic script
# cat > /opt/scripts/fan_diag.sh << 'EOF'
#!/bin/bash
echo "Fan diagnostics"
echo "=========================================="
# Check fan speeds
echo "Fan status:"
ipmitool sensor | grep "Fan" | while IFS= read -r line; do
    NAME=$(echo "$line" | awk -F'|' '{gsub(/^ +| +$/, "", $1); print $1}')
    SPEED=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    UNIT=$(echo "$line" | awk -F'|' '{print $3}' | tr -d ' ')
    STATUS=$(echo "$line" | awk -F'|' '{print $4}' | tr -d ' ')
    # Field 6 is the lower critical threshold
    THRESHOLD=$(echo "$line" | awk -F'|' '{print $6}' | tr -d ' ')
    echo " $NAME: ${SPEED} ${UNIT} (threshold: ${THRESHOLD} RPM)"
    if [ "$STATUS" != "ok" ]; then
        echo " Warning: abnormal fan status"
    fi
done
# Check fan redundancy
echo ""
echo "Fan redundancy:"
# "ipmitool sdr type" prints name | sensor ID | status | entity | reading, so field 3 is the status
REDUNDANCY=$(ipmitool sdr type 'Fan' | grep -i "redundancy" | awk -F'|' '{print $3}' | tr -d ' ')
if [ -n "$REDUNDANCY" ]; then
    if [ "$REDUNDANCY" == "ok" ]; then
        echo " Redundancy: OK"
    else
        echo " Redundancy: degraded"
    fi
fi
# Check the SEL
echo ""
echo "Fan-related events:"
ipmitool sel list | grep -i "fan" | tail -5
echo ""
echo "=========================================="
EOF
# chmod +x /opt/scripts/fan_diag.sh
# View the fan mode
# Note: raw commands are OEM- and firmware-specific; verify these byte sequences against your vendor's documentation before running them
# ipmitool raw 0x30 0x45 0x01
01
# Set the fan mode
# Automatic mode
# ipmitool raw 0x30 0x45 0x01 0x00
# Manual mode
# ipmitool raw 0x30 0x45 0x01 0x01
# Set the fan speed (in manual mode)
# ipmitool raw 0x30 0x45 0x01 0x1f
8. Temperature fault diagnosis
Abnormal temperatures can damage hardware or degrade performance.
# ipmitool sensor | grep -i temp
Inlet Temp | 25.000 | degrees C | ok | na | 5.000 | 10.000 | 42.000 | 47.000 | na
CPU1 Temp | 45.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 95.000 | na
CPU2 Temp | 43.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 95.000 | na
Board Temp | 35.000 | degrees C | ok | na | 5.000 | 10.000 | 75.000 | 85.000 | na
DIMM Temp | 38.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 95.000 | na
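A reading is easier to judge as a fraction of its upper critical threshold, which is the ninth field in the output above (85.000 is upper non-critical, 95.000 is upper critical for the CPU sensors). The sketch below prints that percentage for every temperature sensor and marks anything above 80%. The function name is hypothetical, and the 80% cut-off is an assumption to be tuned per site.

```shell
#!/bin/bash
# temp_headroom (hypothetical helper): from `ipmitool sensor` text on stdin,
# print each temperature as a percentage of its upper-critical threshold
# (9th field) and flag readings above 80%.
temp_headroom() {
    awk -F'|' '
        $3 ~ /degrees C/ {
            for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)   # trim every field
            if ($2 == "na" || $9 == "na" || $9 + 0 == 0) next   # nothing to compare
            pct = $2 / $9 * 100
            printf "%s: %.1f%% of upper critical%s\n", $1, pct, (pct > 80 ? " WARNING" : "")
        }'
}

# Try it on one of the sample lines
echo 'CPU1 Temp | 45.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 95.000 | na' \
    | temp_headroom
```

On a real host the input would be `ipmitool sensor | temp_headroom`.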
# Temperature diagnostic script
# cat > /opt/scripts/temp_diag.sh << 'EOF'
#!/bin/bash
echo "Temperature diagnostics"
echo "=========================================="
# ipmitool sensor columns: name | reading | unit | status | lnr | lcr | lnc | unc | ucr | unr
SENSORS=$(ipmitool sensor)
# Check each temperature sensor
echo "Temperature status:"
echo "$SENSORS" | grep "Temp" | while IFS= read -r line; do
    NAME=$(echo "$line" | awk -F'|' '{gsub(/^ +| +$/, "", $1); print $1}')
    TEMP=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    STATUS=$(echo "$line" | awk -F'|' '{print $4}' | tr -d ' ')
    WARNING=$(echo "$line" | awk -F'|' '{print $8}' | tr -d ' ')   # upper non-critical
    CRITICAL=$(echo "$line" | awk -F'|' '{print $9}' | tr -d ' ')  # upper critical
    echo " $NAME: ${TEMP}°C (warning: ${WARNING}°C, critical: ${CRITICAL}°C)"
    if [ "$STATUS" != "ok" ]; then
        echo " Warning: abnormal temperature"
    fi
done
# Compare current readings with the upper critical threshold
echo ""
echo "Temperature relative to critical threshold:"
for sensor in "CPU1 Temp" "CPU2 Temp" "Inlet Temp"; do
    CURRENT=$(echo "$SENSORS" | grep "$sensor" | awk -F'|' '{print $2}' | tr -d ' ')
    CRITICAL=$(echo "$SENSORS" | grep "$sensor" | awk -F'|' '{print $9}' | tr -d ' ')
    if [ -n "$CURRENT" ] && [ -n "$CRITICAL" ] && [ "$CURRENT" != "na" ] && [ "$CRITICAL" != "na" ]; then
        RATIO=$(echo "scale=2; $CURRENT * 100 / $CRITICAL" | bc)
        echo " $sensor: ${RATIO}% (current/critical)"
        if [ "$(echo "$RATIO > 80" | bc)" -eq 1 ]; then
            echo " Warning: temperature is approaching the critical threshold"
        fi
    fi
done
# Check the SEL
echo ""
echo "Temperature-related events:"
ipmitool sel list | grep -i "temp\|thermal" | tail -5
echo ""
echo "=========================================="
EOF
# chmod +x /opt/scripts/temp_diag.sh
# View temperatures with lm_sensors
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +45.0°C (high = +85.0°C, crit = +95.0°C)
Core 0: +42.0°C (high = +85.0°C, crit = +95.0°C)
Core 1: +43.0°C (high = +85.0°C, crit = +95.0°C)
dell_smm-isa-0000
Adapter: ISA adapter
Processor Fan: 4560 RPM
CPU: +45.0°C
Ambient: +25.0°C
9. BMC fault diagnosis
BMC faults affect the server's remote management functions.
# ipmitool mc info
Device ID : 32
Device Revision : 1
Firmware Revision : 5.20.00
IPMI Version : 2.0
Manufacturer ID : 674
Manufacturer Name : Dell Inc.
Product ID : 256 (0x0100)
Device Available : yes
Provides Device SDRs : yes
# Check the BMC network configuration
# ipmitool lan print 1
Set in Progress : Set Complete
Auth Type Support : NONE MD2 MD5 PASSWORD
Auth Type Enable : Callback : MD2 MD5 PASSWORD
: User : MD2 MD5 PASSWORD
: Operator : MD2 MD5 PASSWORD
: Admin : MD2 MD5 PASSWORD
: OEM : MD2 MD5 PASSWORD
IP Address Source : Static Address
IP Address : 192.168.1.100
Subnet Mask : 255.255.255.0
MAC Address : 00:11:22:33:44:55
Default Gateway IP : 192.168.1.1
Cipher Suite Priv Max : aaaaaaaaaaaaaaa
# Check BMC users
# ipmitool user list 1
ID Name Callin Link Auth IPMI Msg Channel Priv Limit
1 true false false NO ACCESS
2 root true false false ADMINISTRATOR
3 admin true false false ADMINISTRATOR
# BMC diagnostic script
# cat > /opt/scripts/bmc_diag.sh << 'EOF'
#!/bin/bash
echo "BMC diagnostics"
echo "=========================================="
# Check that the BMC responds
echo "1. BMC connectivity"
if ipmitool mc info &>/dev/null; then
    echo " BMC connection: OK"
else
    echo " BMC connection: failed"
    exit 1
fi
# Check the BMC version
echo ""
echo "2. BMC information"
ipmitool mc info | grep -E "Firmware|IPMI Version|Manufacturer"
# Check the BMC network
echo ""
echo "3. BMC network configuration"
ipmitool lan print 1 | grep -E "IP Address|Subnet Mask|MAC Address|Gateway"
# Check SEL usage
echo ""
echo "4. SEL status"
SEL_INFO=$(ipmitool sel info)
echo "$SEL_INFO" | grep -E "Entries|Free Space"
# Check the SDR repository
echo ""
echo "5. SDR status"
SDR_COUNT=$(ipmitool sdr list | wc -l)
echo " Sensor count: $SDR_COUNT"
# Compare the BMC clock with the system clock
echo ""
echo "6. BMC time"
BMC_TIME=$(ipmitool sel time get)
echo " BMC time: $BMC_TIME"
echo " System time: $(date)"
# Check BMC health
echo ""
echo "7. BMC health"
# Accepted type names vary by ipmitool build; "ipmitool sdr type list" prints them
ipmitool sdr type 'Watchdog' | while IFS= read -r line; do
    echo " $line"
done
echo ""
echo "=========================================="
EOF
# chmod +x /opt/scripts/bmc_diag.sh
# Restart the BMC
# ipmitool mc reset cold
Sent cold reset command to MC
# Clear the SEL (this permanently erases the events, so export them first if they may be needed)
# ipmitool sel clear
Clearing SEL. Please wait...
SEL cleared.
10. Using diagnostic tools
A combined diagnostic script makes fault location more efficient.
# cat > /opt/scripts/hardware_diag.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/hardware_diag_$(date +%Y%m%d_%H%M%S).log"
# Everything inside the braces is printed and also saved to $LOG_FILE
{
echo "=========================================="
echo "Server hardware diagnostics"
echo "Host: $(hostname)"
echo "Time: $(date)"
echo "=========================================="
# 1. System information
echo ""
echo "[System information]"
dmidecode -t system | grep -E "Manufacturer|Product Name|Serial Number"
# 2. CPU diagnostics
echo ""
echo "[CPU diagnostics]"
echo "CPU model: $(lscpu | awk -F: '/Model name/ {sub(/^ +/, "", $2); print $2}')"
echo "CPU cores: $(nproc)"
echo "CPU temperature: $(sensors | grep "Package" | awk '{print $4}')"
ipmitool sensor | grep "CPU" | grep "Temp"
# 3. Memory diagnostics
echo ""
echo "[Memory diagnostics]"
echo "Total memory: $(free -h | awk '/^Mem:/ {print $2}')"
echo "Memory used: $(free -h | awk '/^Mem:/ {print $3}')"
# A failing edac-util does not mean "no errors", so report it as unavailable
echo "ECC errors: $(edac-util -v 2>/dev/null || echo 'edac-util unavailable')"
# 4. Disk diagnostics
echo ""
echo "[Disk diagnostics]"
for disk in /dev/sd?; do
    [ -b "$disk" ] || continue
    HEALTH=$(smartctl -H "$disk" 2>/dev/null | grep "SMART overall-health" | awk '{print $NF}')
    echo "$disk: $HEALTH"
done
# 5. Network diagnostics
echo ""
echo "[Network diagnostics]"
for nic in $(ls /sys/class/net/ | grep -vx lo); do
    LINK=$(cat /sys/class/net/$nic/carrier 2>/dev/null || echo "unknown")
    SPEED=$(cat /sys/class/net/$nic/speed 2>/dev/null || echo "unknown")
    echo "$nic: link=$LINK, speed=${SPEED}Mb/s"
done
# 6. Power supply diagnostics
echo ""
echo "[Power supply diagnostics]"
ipmitool sensor | grep "Power Supply" | while IFS= read -r line; do
    NAME=$(echo "$line" | awk -F'|' '{gsub(/^ +| +$/, "", $1); print $1}')
    # Field 2 is the discrete reading; field 3 is only the literal word "discrete"
    STATUS=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    [ "$STATUS" == "0x0" ] && echo "$NAME: OK" || echo "$NAME: abnormal"
done
# 7. Fan diagnostics
echo ""
echo "[Fan diagnostics]"
ipmitool sensor | grep "Fan" | head -6 | while IFS= read -r line; do
    NAME=$(echo "$line" | awk -F'|' '{gsub(/^ +| +$/, "", $1); print $1}')
    SPEED=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    echo "$NAME: ${SPEED} RPM"
done
# 8. Temperature diagnostics
echo ""
echo "[Temperature diagnostics]"
ipmitool sensor | grep "Temp" | while IFS= read -r line; do
    NAME=$(echo "$line" | awk -F'|' '{gsub(/^ +| +$/, "", $1); print $1}')
    TEMP=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    echo "$NAME: ${TEMP}°C"
done
# 9. SEL events
echo ""
echo "[Recent SEL events]"
ipmitool sel list | tail -5
# 10. Health summary
echo ""
echo "[Health summary]"
HEALTH_STATUS="OK"
# Flag any threshold sensor whose status is not "ok"
if ipmitool sensor | grep -v "ok" | grep -q "degrees\|RPM\|Watts"; then
    HEALTH_STATUS="abnormal"
fi
# Flag kernel messages that mention an error (a broad match, so expect false positives)
if dmesg | grep -qi "error\|fail\|critical"; then
    HEALTH_STATUS="abnormal"
fi
echo "Overall health: $HEALTH_STATUS"
echo ""
echo "=========================================="
echo "Diagnostics complete"
echo "=========================================="
} 2>&1 | tee "$LOG_FILE"
EOF
# chmod +x /opt/scripts/hardware_diag.sh
# Run the diagnostics
# /opt/scripts/hardware_diag.sh
==========================================
Server hardware diagnostics
Host: fgedu-server01
Time: Fri Apr 3 10:00:00 CST 2026
==========================================
[System information]
Manufacturer: Dell Inc.
Product Name: PowerEdge R740
Serial Number: ABCD1234
[CPU diagnostics]
CPU model: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
CPU cores: 32
CPU temperature: +45.0°C
CPU1 Temp | 45.000 | degrees C | ok
[Memory diagnostics]
Total memory: 125Gi
Memory used: 2.1Gi
ECC errors: mc0: 0 CE, 0 UE
[Disk diagnostics]
/dev/sda: PASSED
/dev/sdb: PASSED
[Network diagnostics]
eth0: link=1, speed=10000Mb/s
eth1: link=1, speed=10000Mb/s
[Power supply diagnostics]
Power Supply 1: OK
Power Supply 2: OK
[Fan diagnostics]
Fan1: 4560 RPM
Fan2: 4560 RPM
Fan3: 4560 RPM
[Temperature diagnostics]
Inlet Temp: 25.000°C
CPU1 Temp: 45.000°C
CPU2 Temp: 43.000°C
[Recent SEL events]
1 | 04/03/2026 | 10:00:00 | System Board | System Event | Asserted
[Health summary]
Overall health: OK
==========================================
Diagnostics complete
==========================================
This article was compiled and published by 风哥教程 for study and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html
