
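IT Tutorial FG331: Server Hardware Fault Diagnosis

1. Fault Diagnosis Overview

Server hardware fault diagnosis is a core part of operations work: a systematic approach lets you locate and resolve hardware problems quickly.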

# Hardware fault classification
Hardware fault types:
┌─────────────────────────────────────────────────────┐
│              Hardware fault categories              │
└──────────────────────────┬──────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        v                  v                  v
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Compute       │  │ Storage       │  │ Network       │
│ CPU / memory  │  │ Disk / RAID   │  │ NIC / cabling │
└───────────────┘  └───────────────┘  └───────────────┘
        │                  │                  │
        v                  v                  v
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Power         │  │ Cooling       │  │ Mainboard     │
│ PSU / battery │  │ Fan / temp    │  │ BIOS / BMC    │
└───────────────┘  └───────────────┘  └───────────────┘

# View system hardware information
# dmidecode -t system | head -20
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.

Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R740
Version: Not Specified
Serial Number: ABCD1234
UUID: 12345678-1234-1234-1234-123456789012
Wake-up Type: Power Switch
SKU Number: 0869
Family: PowerEdge

# View BMC information
# ipmitool mc info
Device ID : 32
Device Revision : 1
Firmware Revision : 5.20.00
IPMI Version : 2.0
Manufacturer ID : 674
Manufacturer Name : Dell Inc.
Product ID : 256 (0x0100)
Product Name : Unknown (0x100)
Device Available : yes
Provides Device SDRs : yes
Additional Device Support :
Sensor Device
SDR Repository Device
SEL Device
FRU Inventory Device

# View the kernel log
# journalctl -k | grep -i "hardware\|error\|fail" | tail -20
Apr 03 10:00:00 fgedu-server kernel: EDAC MC0: 1 CE memory read error on CPU0_DIMM_A1
Apr 03 10:05:00 fgedu-server kernel: mce: CPU0: Machine Check Exception: Bank 0
Apr 03 10:10:00 fgedu-server kernel: sd 0:0:1:0: [sdb] Medium error
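
The log lines above map naturally onto the fault categories from the overview. A minimal sketch of that mapping follows; the function names `classify_fault` and `classify_stream` and the match patterns are illustrative assumptions, not an exhaustive rule set.

```shell
#!/bin/bash
# Hypothetical helper: map one kernel log line to a fault category.
# The patterns below are examples only; extend them for your own drivers.
classify_fault() {
    local line="$1"
    case "$line" in
        *EDAC*|*"Machine Check"*|*mce:*) echo "compute" ;;
        *"Medium error"*|*"I/O error"*)  echo "storage" ;;
        *"Link is Down"*)                echo "network" ;;
        *)                               echo "unknown" ;;
    esac
}

# Read log lines on stdin and prefix each one with its category.
classify_stream() {
    local line
    while IFS= read -r line; do
        printf '%s\t%s\n' "$(classify_fault "$line")" "$line"
    done
}
```

One possible use is `journalctl -k | classify_stream | sort | less`, which groups the kernel messages by category before you read them.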

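Fengge's advice for production environments: build a complete hardware monitoring system, run regular hardware inspections, configure alert thresholds, keep spare parts in stock, and record how each fault was handled.

2. CPU Fault Diagnosis

CPU faults can cause system crashes or degraded performance.

# View CPU information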
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3000.000
BogoMIPS: 6000.00

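# View the raw MCE log
# Note: mainline kernels expose MCE records through /dev/mcelog; /proc/mce may not exist on your system
# cat /proc/mce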
CPU 0 BANK 0 TSC 123456789012345
STATUS 0x9c00000000000100 MCGSTATUS 0x0
MCGCAP 0x1719 ADDR 0x0 MISC 0x0
PROCESSOR 0:6 TIME 1712120400 SOCKET 0 APIC 0

# Install the mcelog tool
# yum install -y mcelog

# View the decoded MCE log
# mcelog --client
CPU 0: Machine Check Exception
BANK 0: Correctable error
STATUS: 0x9c00000000000100
MCGSTATUS: 0x0
TIME: Fri Apr 3 10:00:00 2026
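
The STATUS value carries the flags that distinguish a corrected error from an uncorrected one. The sketch below decodes a few of them, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 57 PCC). The function name `decode_mce_status` is hypothetical, and only the top byte is inspected so that the arithmetic stays within signed 64-bit shell integers.

```shell
#!/bin/bash
# Hypothetical helper: decode selected flag bits of an MCi_STATUS value.
# Assumes the Intel SDM layout: 63=VAL, 62=OVER, 61=UC, 60=EN, 57=PCC.
decode_mce_status() {
    local hex="${1#0x}"
    hex="${hex#0X}"
    # Left-pad to 16 hex digits so the first two digits are bits 63..56.
    while (( ${#hex} < 16 )); do hex="0$hex"; done
    local top=$(( 16#${hex:0:2} ))
    local val=$((  (top >> 7) & 1 ))
    local over=$(( (top >> 6) & 1 ))
    local uc=$((   (top >> 5) & 1 ))
    local en=$((   (top >> 4) & 1 ))
    local pcc=$((  (top >> 1) & 1 ))
    local verdict
    if (( val == 0 )); then
        verdict="invalid"
    elif (( uc == 1 )); then
        verdict="uncorrected"
    else
        verdict="corrected"
    fi
    echo "VAL=$val OVER=$over UC=$uc EN=$en PCC=$pcc => $verdict"
}
```

For the sample value, `decode_mce_status 0x9c00000000000100` reports UC=0, which agrees with the "Correctable error" line in the mcelog output.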

# CPU stress test
# cat > /opt/scripts/cpu_stress_test.sh << 'EOF'
#!/bin/bash
echo "CPU stress test starting..."
echo "CPU cores: $(nproc)"

# Install the stress tool if it is missing
if ! command -v stress &> /dev/null; then
    yum install -y stress
fi

# Run the stress test on every core for 60 seconds
stress --cpu "$(nproc)" --timeout 60s --verbose

# Check the result
if [ $? -eq 0 ]; then
    echo "CPU stress test passed"
else
    echo "CPU stress test failed"
fi

# Show temperatures
echo ""
echo "CPU temperature:"
sensors | grep -i "core\|package"
EOF

# chmod +x /opt/scripts/cpu_stress_test.sh

# View CPU temperature
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +45.0°C (high = +85.0°C, crit = +95.0°C)
Core 0: +42.0°C (high = +85.0°C, crit = +95.0°C)
Core 1: +43.0°C (high = +85.0°C, crit = +95.0°C)
Core 2: +44.0°C (high = +85.0°C, crit = +95.0°C)
Core 3: +43.0°C (high = +85.0°C, crit = +95.0°C)

# View CPU status via IPMI
# ipmitool sensor | grep -i cpu
CPU1 Temp | 45.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 95.000 | na
CPU2 Temp | 43.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 95.000 | na
CPU1 Status | 0x0 | discrete | 0x0000| na | na | na | na | na | na
CPU2 Status | 0x0 | discrete | 0x0000| na | na | na | na | na | na

3. Memory Fault Diagnosis

Memory faults are a common hardware problem and can cause data corruption and system crashes.

# View memory information
# dmidecode -t memory | head -40
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.

Handle 0x0034, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 3 TB
Error Information Handle: Not Provided
Number Of Devices: 24

Handle 0x0035, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0034
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM_A1
Bank Locator: Not Specified
Type: DDR4
Type Detail: Synchronous
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 12345678
Asset Tag: Not Specified
Part Number: M393A4K40CB2-CTD

# View memory usage
# free -h
total used free shared buff/cache available
Mem: 125Gi 2.1Gi 122Gi 128Mi 1.2Gi 122Gi
Swap: 8.0Gi 0B 8.0Gi

# View EDAC errors
# edac-util -v
mc0: 0 CE, 0 UE
mc0/csrow0: 0 CE, 0 UE
mc0/csrow1: 0 CE, 0 UE
mc0/csrow2: 0 CE, 0 UE
mc0/csrow3: 0 CE, 0 UE

# View memory error counters (CE = corrected, UE = uncorrected)
# cat /sys/devices/system/edac/mc/mc0/ce_count
0
# cat /sys/devices/system/edac/mc/mc0/ue_count
0
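
On servers with several memory controllers, the per-controller counters can be summed into one figure. A minimal sketch, assuming the sysfs layout shown above; the function name `edac_totals` and its optional root-directory argument are hypothetical, the latter added so the function can be pointed at a test tree.

```shell
#!/bin/bash
# Hypothetical helper: sum CE and UE counts across all EDAC memory controllers.
# $1 optionally overrides the sysfs root (useful for testing).
edac_totals() {
    local root="${1:-/sys/devices/system/edac/mc}"
    local ce=0 ue=0 mc n
    for mc in "$root"/mc*; do
        [ -d "$mc" ] || continue
        if [ -r "$mc/ce_count" ]; then
            n=$(cat "$mc/ce_count")
            ce=$((ce + n))
        fi
        if [ -r "$mc/ue_count" ]; then
            n=$(cat "$mc/ue_count")
            ue=$((ue + n))
        fi
    done
    echo "CE=$ce UE=$ue"
}
```

Calling `edac_totals` with no argument reads the live counters. Any nonzero UE total deserves immediate attention, while a rising CE total is a reason to watch the affected DIMM.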

# Memory test script
# cat > /opt/scripts/mem_test.sh << 'EOF'
#!/bin/bash
echo "Memory test starting..."
echo "Total memory: $(free -h | grep Mem | awk '{print $2}')"

# Install memtester if it is missing
if ! command -v memtester &> /dev/null; then
    yum install -y memtester
fi

# Test 1 GB of memory for one pass
echo ""
echo "Running memory test (1 GB)..."
memtester 1G 1

if [ $? -eq 0 ]; then
    echo "Memory test passed"
else
    echo "Memory test failed; a hardware problem may exist"
fi
EOF

# chmod +x /opt/scripts/mem_test.sh

# View memory status via IPMI
# ipmitool sensor | grep -i dimm
DIMM_A1 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
DIMM_A2 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
DIMM_B1 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
DIMM_B2 | 0x0 | discrete | 0x0000| na | na | na | na | na | na

# View memory errors in the SEL (System Event Log)
# ipmitool sel list | grep -i memory
1 | 04/03/2026 | 10:00:00 | Memory #0x52 | Correctable ECC | Asserted
2 | 04/03/2026 | 10:05:00 | Memory #0x53 | Uncorrectable ECC | Asserted

# Locate the faulty DIMM
# ipmitool sel list | grep -i "memory\|dimm"
1 | 04/03/2026 | 10:00:00 | Memory DIMM_A1 | Correctable ECC | Asserted

4. Disk Fault Diagnosis

Disk faults can cause data loss, so they need to be detected and handled promptly.

# View disk information
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 447.1G 0 disk
├─sda1 8:1 0 1G 0 part /boot
├─sda2 8:2 0 100G 0 part /
├─sda3 8:3 0 200G 0 part /data
└─sda4 8:4 0 146.1G 0 part
sdb 8:16 0 447.1G 0 disk
└─sdb1 8:17 0 447.1G 0 part

# View SMART information
# smartctl -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-3.10.0-1160.el7.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Dell
Device Model: DELLBOSS VD
Serial Number: ABCD1234
Firmware Version: 5.0.1-0000
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 8760
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
177 Wear_Leveling_Count 0x0013 100 100 000 Pre-fail Always - 0
183 Runtime_Bad_Block 0x0032 100 100 010 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 065 065 000 Old_age Always - 35 (Min/Max 20/55)
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
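
In the attribute table, RAW_VALUE is the tenth whitespace-separated field, so a single awk filter can pull it out by attribute name. A minimal sketch; the function name `smart_raw_value` is hypothetical, and it assumes the classic ATA attribute layout shown above (NVMe and SAS devices print a different format).

```shell
#!/bin/bash
# Hypothetical helper: print the RAW_VALUE of one SMART attribute.
# $1 = attribute name; "smartctl -A" output is read from stdin.
# Field 10 is the first token of RAW_VALUE, so trailing notes such as
# "(Min/Max 20/55)" are ignored.
smart_raw_value() {
    awk -v attr="$1" '$2 == attr { print $10; exit }'
}
```

For example, `smartctl -A /dev/sda | smart_raw_value Reallocated_Sector_Ct` prints the reallocated-sector count, and a value above zero is usually worth tracking over time.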

# Disk health check script
# cat > /opt/scripts/disk_health_check.sh << 'EOF'
#!/bin/bash
echo "Disk health check"
echo "=========================================="

for disk in /dev/sd?; do
    # Skip the unexpanded pattern when no disks match
    [ -b "$disk" ] || continue
    echo ""
    echo "Checking disk: $disk"
    echo "----------------------------------------"

    # Check overall SMART health
    HEALTH=$(smartctl -H "$disk" 2>/dev/null | grep "SMART overall-health" | awk '{print $NF}')

    if [ "$HEALTH" == "PASSED" ]; then
        echo "SMART status: healthy"
    else
        echo "SMART status: warning - $HEALTH"
    fi

    # Reallocated sectors (RAW_VALUE is field 10 of the attribute table)
    REALLOC=$(smartctl -A "$disk" 2>/dev/null | grep "Reallocated_Sector_Ct" | awk '{print $10}')
    echo "Reallocated sectors: $REALLOC"

    # Runtime bad blocks
    BAD_BLOCKS=$(smartctl -A "$disk" 2>/dev/null | grep "Runtime_Bad_Block" | awk '{print $10}')
    echo "Runtime bad blocks: $BAD_BLOCKS"

    # Temperature
    TEMP=$(smartctl -A "$disk" 2>/dev/null | grep "Temperature_Celsius" | awk '{print $10}')
    echo "Current temperature: ${TEMP}°C"

    # Power-on hours
    HOURS=$(smartctl -A "$disk" 2>/dev/null | grep "Power_On_Hours" | awk '{print $10}')
    echo "Power-on time: ${HOURS} hours"
done

echo ""
echo "=========================================="
EOF

# chmod +x /opt/scripts/disk_health_check.sh

# View the disk I/O error count
# cat /sys/block/sda/device/ioerr_cnt
0

# View disk errors in dmesg
# dmesg | grep -i "sda\|error\|fail" | tail -10
[ 2.123456] sd 0:0:0:0: [sda] 937703088 512-byte logical blocks: (480 GB/447 GiB)
[ 2.123457] sd 0:0:0:0: [sda] Write Protect is off

# Check RAID status
# storcli /c0 /vall show
Controller = 0
Status = Success
Description = None

Virtual Drives:
==============
DG/VD TYPE State Access Consist Cache Cac sCC Size Name
0/0 RAID1 Optl RW Yes RWBD - ON 446.625 GB OS_VD
1/1 RAID5 Optl RW Yes RWBD - ON 1.818 TB DATA_VD

5. NIC Fault Diagnosis

NIC faults affect network connectivity and data transfer.

# View NIC information
# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 00:11:22:33:44:56 brd ff:ff:ff:ff:ff:ff

# View NIC statistics
# ethtool -S eth0 | head -30
NIC statistics:
rx_packets: 123456789
tx_packets: 98765432
rx_bytes: 123456789012
tx_bytes: 98765432109
rx_broadcast: 123456
tx_broadcast: 65432
rx_multicast: 78901
tx_multicast: 34567
rx_errors: 0
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
multicast: 78901
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 0
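
Most of the counters in this output are traffic totals; the ones that matter for fault diagnosis are the error and drop counters that are not zero. A minimal sketch of such a filter follows. The function name `nic_error_counters` is hypothetical, and the name patterns are assumptions because counter names vary by driver.

```shell
#!/bin/bash
# Hypothetical helper: print only the nonzero error-type counters.
# "ethtool -S <nic>" output is read from stdin; names vary by driver.
nic_error_counters() {
    awk -F': *' '
        /(err|drop|collision|missed)/ {
            gsub(/^[ \t]+/, "", $1)          # strip the leading indent
            if ($2 + 0 > 0) print $1 "=" $2
        }'
}
```

One possible use is `ethtool -S eth0 | nic_error_counters`. Empty output means every matched counter is zero, and a growing `rx_crc_errors` value often points at cabling or optics.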

# View NIC link settings
# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
10000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
10000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
MDI-X: Unknown
Link detected: yes

# NIC diagnostic script
# cat > /opt/scripts/nic_diag.sh << 'EOF'
#!/bin/bash
echo "NIC diagnostics"
echo "=========================================="

# -x matches the whole name, so only the loopback interface is skipped
for nic in $(ls /sys/class/net/ | grep -vx lo); do
    echo ""
    echo "NIC: $nic"
    echo "----------------------------------------"

    # Check link state
    LINK=$(cat /sys/class/net/$nic/carrier 2>/dev/null || echo "unknown")
    if [ "$LINK" == "1" ]; then
        echo "Link state: up"
    else
        echo "Link state: down"
    fi

    # Check speed
    SPEED=$(cat /sys/class/net/$nic/speed 2>/dev/null || echo "unknown")
    echo "Speed: ${SPEED} Mb/s"

    # Check duplex mode
    DUPLEX=$(cat /sys/class/net/$nic/duplex 2>/dev/null || echo "unknown")
    echo "Duplex: $DUPLEX"

    # Check MTU
    MTU=$(cat /sys/class/net/$nic/mtu)
    echo "MTU: $MTU"

    # Check error counters
    RX_ERR=$(cat /sys/class/net/$nic/statistics/rx_errors)
    TX_ERR=$(cat /sys/class/net/$nic/statistics/tx_errors)
    RX_DROP=$(cat /sys/class/net/$nic/statistics/rx_dropped)
    TX_DROP=$(cat /sys/class/net/$nic/statistics/tx_dropped)

    echo "RX errors: $RX_ERR"
    echo "TX errors: $TX_ERR"
    echo "RX dropped: $RX_DROP"
    echo "TX dropped: $TX_DROP"

    # Alert check
    if [ "$RX_ERR" -gt 100 ] || [ "$TX_ERR" -gt 100 ]; then
        echo "Warning: error count is too high"
    fi
done

echo ""
echo "=========================================="
EOF

# chmod +x /opt/scripts/nic_diag.sh

# Network connectivity test
# ping -c 4 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.123 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=0.145 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=0.134 ms
64 bytes from 192.168.1.1: icmp_seq=4 ttl=64 time=0.156 ms

--- 192.168.1.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3000ms
rtt min/avg/max/mdev = 0.123/0.139/0.156/0.015 ms

6. Power Supply Fault Diagnosis

Power supply faults can cause the server to shut down unexpectedly.

# View power supply status
# ipmitool sensor | grep -i power
Power Supply 1 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
Power Supply 2 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
Power Units | 0x0 | discrete | 0x0000| na | na | na | na | na | na
Pwr Consumption | 450 | Watts | ok | na | na | na | na | 800 | na

# View power supply details
# ipmitool fru | grep -A 20 "Power Supply"
Board Mfg : Dell Inc.
Board Product : Power Supply 1100W
Board Serial : CN-ABCD123-12345-678-ABCD
Board Part Number : 0H7H1MA00

# View power supply redundancy status
# ipmitool sdr type 'Power Supply'
Power Supply 1 | 0x0 | discrete | 0x0000| na | na | na | na | na | na
Power Supply 2 | 0x0 | discrete | 0x0000| na | na | na | na | na | na

# PSU diagnostic script
# cat > /opt/scripts/psu_diag.sh << 'EOF'
#!/bin/bash
echo "PSU diagnostics"
echo "=========================================="

# Check PSU status (field 2 of "ipmitool sensor" holds the discrete reading)
echo "PSU status:"
ipmitool sensor | grep "Power Supply" | while read -r line; do
    NAME=$(echo "$line" | awk -F'|' '{print $1}' | sed 's/ *$//')
    STATUS=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    if [ "$STATUS" == "0x0" ]; then
        echo "  $NAME: OK"
    else
        echo "  $NAME: abnormal ($STATUS)"
    fi
done

# Check power consumption
echo ""
echo "Power consumption:"
PWR=$(ipmitool sensor | grep "Pwr Consumption" | awk -F'|' '{print $2}' | tr -d ' ')
echo "  Current draw: ${PWR}W"

# Check PSU redundancy
echo ""
echo "PSU redundancy:"
REDUNDANCY=$(ipmitool sdr type 'Power Unit' | grep "Power Units" | awk -F'|' '{print $3}' | tr -d ' ')
if [ "$REDUNDANCY" == "0x0" ]; then
    echo "  Redundancy: OK"
else
    echo "  Redundancy: degraded"
fi

# Check the SEL
echo ""
echo "Power-related events:"
ipmitool sel list | grep -i "power" | tail -5

echo ""
echo "=========================================="
EOF

# chmod +x /opt/scripts/psu_diag.sh

# View chassis power status
# ipmitool chassis status
System Power : on
Power Overload : false
Power Interlock : inactive
Main Power Fault : false
Power Control Fault : false
Power Restore Policy : previous
Last Power Event : command
Chassis Intrusion : inactive
Front Panel Lockout : inactive
Drive Fault : false
Cooling/Fan Fault : false

7. Fan Fault Diagnosis

Fan faults can cause the server to overheat.

# View fan status
# ipmitool sensor | grep -i fan
Fan1 | 4560 | RPM | ok | na | 600.000| na | na | na | na
Fan2 | 4560 | RPM | ok | na | 600.000| na | na | na | na
Fan3 | 4560 | RPM | ok | na | 600.000| na | na | na | na
Fan4 | 4560 | RPM | ok | na | 600.000| na | na | na | na
Fan5 | 4560 | RPM | ok | na | 600.000| na | na | na | na
Fan6 | 4560 | RPM | ok | na | 600.000| na | na | na | na
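
In the `ipmitool sensor` table the sixth column is the lower critical threshold (600 RPM in this sample), so a fan at or below it can be flagged directly. A minimal sketch, assuming the ten-column layout shown above; the function name `fans_at_or_below_lc` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list fans whose reading is at or below the
# lower critical threshold. "ipmitool sensor" output is read from stdin.
# Columns: name | reading | unit | status | lnr | lc | lnc | unc | uc | unr
fans_at_or_below_lc() {
    awk -F'|' '
        $3 ~ /RPM/ {
            name = $1; rpm = $2; lc = $6
            gsub(/^ +| +$/, "", name)
            gsub(/ /, "", rpm)
            gsub(/ /, "", lc)
            # Skip rows whose reading or threshold is "na"
            if (rpm ~ /^[0-9.]+$/ && lc ~ /^[0-9.]+$/ && rpm + 0 <= lc + 0)
                print name ": " rpm " RPM at or below lower-critical " lc
        }'
}
```

One possible use is `ipmitool sensor | fans_at_or_below_lc`. With the healthy sample readings above it prints nothing.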

# Fan diagnostic script
# cat > /opt/scripts/fan_diag.sh << 'EOF'
#!/bin/bash
echo "Fan diagnostics"
echo "=========================================="

# Check fan speed
echo "Fan status:"
ipmitool sensor | grep "Fan" | while read -r line; do
    NAME=$(echo "$line" | awk -F'|' '{print $1}' | sed 's/ *$//')
    SPEED=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    UNIT=$(echo "$line" | awk -F'|' '{print $3}' | tr -d ' ')
    STATUS=$(echo "$line" | awk -F'|' '{print $4}' | tr -d ' ')
    # Column 6 is the lower critical threshold
    THRESHOLD=$(echo "$line" | awk -F'|' '{print $6}' | tr -d ' ')
    echo "  $NAME: ${SPEED} ${UNIT} (threshold: ${THRESHOLD} RPM)"
    if [ "$STATUS" != "ok" ]; then
        echo "    Warning: fan status is abnormal"
    fi
done

# Check fan redundancy
echo ""
echo "Fan redundancy:"
REDUNDANCY=$(ipmitool sdr type 'Fan' | grep -i "redundancy" | awk -F'|' '{print $3}' | tr -d ' ')
if [ -n "$REDUNDANCY" ]; then
    if [ "$REDUNDANCY" == "0x0" ]; then
        echo "  Redundancy: OK"
    else
        echo "  Redundancy: degraded"
    fi
fi

# Check the SEL
echo ""
echo "Fan-related events:"
ipmitool sel list | grep -i "fan" | tail -5

echo ""
echo "=========================================="
EOF

# chmod +x /opt/scripts/fan_diag.sh

# The raw commands below are OEM-specific; confirm the byte values against your BMC vendor's documentation before running them
# View the fan mode
# ipmitool raw 0x30 0x45 0x01 01

# Set the fan mode
# Automatic mode
# ipmitool raw 0x30 0x45 0x01 0x00

# Manual mode
# ipmitool raw 0x30 0x45 0x01 0x01

# Set the fan speed (in manual mode)
# ipmitool raw 0x30 0x45 0x01 0x1f

8. Temperature Fault Diagnosis

Abnormal temperatures can damage hardware or degrade performance.

# View temperature sensors
# ipmitool sensor | grep -i temp
Inlet Temp | 25.000 | degrees C | ok | na | 5.000 | 10.000 | 42.000 | 47.000 | na
CPU1 Temp | 45.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 95.000 | na
CPU2 Temp | 43.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 95.000 | na
Board Temp | 35.000 | degrees C | ok | na | 5.000 | 10.000 | 75.000 | 85.000 | na
DIMM Temp | 38.000 | degrees C | ok | na | 5.000 | 10.000 | 85.000 | 95.000 | na
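
Each row lists its thresholds in the order lower non-recoverable, lower critical, lower non-critical, upper non-critical, upper critical, upper non-recoverable, so the upper critical value is the ninth column. The sketch below reports how close each sensor is to that limit. The function name `temp_headroom` is hypothetical, and it assumes the ten-column layout shown above.

```shell
#!/bin/bash
# Hypothetical helper: report each temperature as a percentage of its
# upper critical threshold. "ipmitool sensor" output is read from stdin.
# Columns: name | reading | unit | status | lnr | lc | lnc | unc | uc | unr
temp_headroom() {
    awk -F'|' '
        $3 ~ /degrees C/ {
            name = $1; cur = $2; uc = $9
            gsub(/^ +| +$/, "", name)
            gsub(/ /, "", cur)
            gsub(/ /, "", uc)
            # Skip rows whose reading or threshold is "na"
            if (cur ~ /^[0-9.]+$/ && uc ~ /^[0-9.]+$/ && uc + 0 > 0)
                printf "%s: %.1f%% of upper-critical (%.1f C headroom)\n",
                       name, cur / uc * 100, uc - cur
        }'
}
```

One possible use is `ipmitool sensor | temp_headroom`. For the CPU1 row above, 45 °C against a 95 °C limit works out to roughly 47%.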

# Temperature diagnostic script
# cat > /opt/scripts/temp_diag.sh << 'EOF'
#!/bin/bash
echo "Temperature diagnostics"
echo "=========================================="

# Check the temperature of each component
echo "Temperature status:"
ipmitool sensor | grep "Temp" | while read -r line; do
    NAME=$(echo "$line" | awk -F'|' '{print $1}' | sed 's/ *$//')
    TEMP=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    STATUS=$(echo "$line" | awk -F'|' '{print $4}' | tr -d ' ')
    # Columns 8 and 9 are the upper non-critical and upper critical thresholds
    WARNING=$(echo "$line" | awk -F'|' '{print $8}' | tr -d ' ')
    CRITICAL=$(echo "$line" | awk -F'|' '{print $9}' | tr -d ' ')
    echo "  $NAME: ${TEMP}°C (warning: ${WARNING}°C, critical: ${CRITICAL}°C)"
    if [ "$STATUS" != "ok" ]; then
        echo "    Warning: temperature is abnormal"
    fi
done

# Compare current readings with the critical threshold
echo ""
echo "Temperature trend analysis:"
for sensor in "CPU1 Temp" "CPU2 Temp" "Inlet Temp"; do
    CURRENT=$(ipmitool sensor | grep "$sensor" | awk -F'|' '{print $2}' | tr -d ' ')
    CRITICAL=$(ipmitool sensor | grep "$sensor" | awk -F'|' '{print $9}' | tr -d ' ')
    if [ -n "$CURRENT" ] && [ -n "$CRITICAL" ]; then
        RATIO=$(echo "scale=2; $CURRENT / $CRITICAL * 100" | bc)
        echo "  $sensor: ${RATIO}% (current/critical)"
        if [ $(echo "$RATIO > 80" | bc) -eq 1 ]; then
            echo "    Warning: temperature is close to the critical value"
        fi
    fi
done

# Check the SEL
echo ""
echo "Temperature-related events:"
ipmitool sel list | grep -i "temp\|thermal" | tail -5

echo ""
echo "=========================================="
EOF

# chmod +x /opt/scripts/temp_diag.sh

# View temperatures with lm_sensors
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +45.0°C (high = +85.0°C, crit = +95.0°C)
Core 0: +42.0°C (high = +85.0°C, crit = +95.0°C)
Core 1: +43.0°C (high = +85.0°C, crit = +95.0°C)

dell_smm-isa-0000
Adapter: ISA adapter
Processor Fan: 4560 RPM
CPU: +45.0°C
Ambient: +25.0°C

9. BMC Fault Diagnosis

BMC faults affect the server's remote management functions.

# Check BMC status
# ipmitool mc info
Device ID : 32
Device Revision : 1
Firmware Revision : 5.20.00
IPMI Version : 2.0
Manufacturer ID : 674
Manufacturer Name : Dell Inc.
Product ID : 256 (0x0100)
Device Available : yes
Provides Device SDRs : yes

# Check BMC network configuration
# ipmitool lan print 1
Set in Progress : Set Complete
Auth Type Support : NONE MD2 MD5 PASSWORD
Auth Type Enable : Callback : MD2 MD5 PASSWORD
: User : MD2 MD5 PASSWORD
: Operator : MD2 MD5 PASSWORD
: Admin : MD2 MD5 PASSWORD
: OEM : MD2 MD5 PASSWORD
IP Address Source : Static Address
IP Address : 192.168.1.100
Subnet Mask : 255.255.255.0
MAC Address : 00:11:22:33:44:55
Default Gateway IP : 192.168.1.1
Cipher Suite Priv Max : aaaaaaaaaaaaaaa

# Check BMC users
# ipmitool user list 1
ID Name Callin Link Auth IPMI Msg Channel Priv Limit
1 true false false NO ACCESS
2 root true false false ADMINISTRATOR
3 admin true false false ADMINISTRATOR

# BMC diagnostic script
# cat > /opt/scripts/bmc_diag.sh << 'EOF'
#!/bin/bash
echo "BMC diagnostics"
echo "=========================================="

# Check that the BMC responds
echo "1. BMC connection status"
if ipmitool mc info &>/dev/null; then
    echo "  BMC connection: OK"
else
    echo "  BMC connection: failed"
    exit 1
fi

# Check BMC version
echo ""
echo "2. BMC information"
ipmitool mc info | grep -E "Firmware|IPMI Version|Manufacturer"

# Check BMC network
echo ""
echo "3. BMC network configuration"
ipmitool lan print 1 | grep -E "IP Address|Subnet Mask|MAC Address|Gateway"

# Check SEL status
echo ""
echo "4. SEL status"
SEL_INFO=$(ipmitool sel info)
echo "$SEL_INFO" | grep -E "Entries|Free Space"

# Check SDR status
echo ""
echo "5. SDR status"
SDR_COUNT=$(ipmitool sdr list | wc -l)
echo "  Sensor count: $SDR_COUNT"

# Compare BMC time with system time
echo ""
echo "6. BMC time"
BMC_TIME=$(ipmitool sel time get)
echo "  BMC time: $BMC_TIME"
echo "  System time: $(date)"

# Check BMC health
echo ""
echo "7. BMC health status"
ipmitool sdr type 'Watchdog' | while read -r line; do
    echo "  $line"
done

echo ""
echo "=========================================="
EOF

# chmod +x /opt/scripts/bmc_diag.sh

# Restart the BMC
# ipmitool mc reset cold
Sent cold reset command to MC

# Clear the SEL
# ipmitool sel clear
Clearing SEL. Please wait...
SEL cleared.

10. Using Diagnostic Tools

A comprehensive diagnostic tool makes fault location more efficient.

# Comprehensive diagnostic script
# cat > /opt/scripts/hardware_diag.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/hardware_diag_$(date +%Y%m%d_%H%M%S).log"
# Mirror everything printed below into the log file
exec > >(tee "$LOG_FILE") 2>&1

echo "=========================================="
echo "Server hardware comprehensive diagnostics"
echo "Host: $(hostname)"
echo "Time: $(date)"
echo "=========================================="

# 1. System information
echo ""
echo "[System information]"
dmidecode -t system | grep -E "Manufacturer|Product Name|Serial Number"

# 2. CPU diagnostics
echo ""
echo "[CPU diagnostics]"
echo "CPU model: $(lscpu | grep "Model name" | awk -F: '{print $2}')"
echo "CPU cores: $(nproc)"
echo "CPU temperature: $(sensors | grep "Package" | awk '{print $4}')"
ipmitool sensor | grep "CPU" | grep "Temp"

# 3. Memory diagnostics
echo ""
echo "[Memory diagnostics]"
echo "Total memory: $(free -h | grep Mem | awk '{print $2}')"
echo "Memory used: $(free -h | grep Mem | awk '{print $3}')"
echo "ECC errors: $(edac-util -v 2>/dev/null || echo 'edac-util unavailable')"

# 4. Disk diagnostics
echo ""
echo "[Disk diagnostics]"
for disk in /dev/sd?; do
    [ -b "$disk" ] || continue
    HEALTH=$(smartctl -H "$disk" 2>/dev/null | grep "SMART overall-health" | awk '{print $NF}')
    echo "$disk: $HEALTH"
done

# 5. Network diagnostics
echo ""
echo "[Network diagnostics]"
for nic in $(ls /sys/class/net/ | grep -vx lo); do
    LINK=$(cat /sys/class/net/$nic/carrier 2>/dev/null || echo "unknown")
    SPEED=$(cat /sys/class/net/$nic/speed 2>/dev/null || echo "unknown")
    echo "$nic: link=$LINK, speed=${SPEED}Mb/s"
done

# 6. Power supply diagnostics (field 2 holds the discrete reading)
echo ""
echo "[Power supply diagnostics]"
ipmitool sensor | grep "Power Supply" | while read -r line; do
    NAME=$(echo "$line" | awk -F'|' '{print $1}' | sed 's/ *$//')
    STATUS=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    [ "$STATUS" == "0x0" ] && echo "$NAME: OK" || echo "$NAME: abnormal"
done

# 7. Fan diagnostics
echo ""
echo "[Fan diagnostics]"
ipmitool sensor | grep "Fan" | head -6 | while read -r line; do
    NAME=$(echo "$line" | awk -F'|' '{print $1}' | sed 's/ *$//')
    SPEED=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    echo "$NAME: ${SPEED} RPM"
done

# 8. Temperature diagnostics
echo ""
echo "[Temperature diagnostics]"
ipmitool sensor | grep "Temp" | while read -r line; do
    NAME=$(echo "$line" | awk -F'|' '{print $1}' | sed 's/ *$//')
    TEMP=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    echo "$NAME: ${TEMP}°C"
done

# 9. SEL events
echo ""
echo "[Recent SEL events]"
ipmitool sel list | tail -5

# 10. Health summary
echo ""
echo "[Health summary]"
HEALTH_STATUS="OK"

# Flag any threshold sensor whose status is not "ok"
if ipmitool sensor | grep -v "ok" | grep -q "degrees\|RPM\|Watts"; then
    HEALTH_STATUS="abnormal"
fi

# Flag error keywords in the kernel ring buffer
if dmesg | grep -qi "error\|fail\|critical"; then
    HEALTH_STATUS="abnormal"
fi

echo "Overall health: $HEALTH_STATUS"

echo ""
echo "=========================================="
echo "Diagnostics complete"
echo "=========================================="
EOF

# chmod +x /opt/scripts/hardware_diag.sh

# Run the diagnostics
# /opt/scripts/hardware_diag.sh
==========================================
Server hardware comprehensive diagnostics
Host: fgedu-server01
Time: Fri Apr 3 10:00:00 CST 2026
==========================================

[System information]
Manufacturer: Dell Inc.
Product Name: PowerEdge R740
Serial Number: ABCD1234

[CPU diagnostics]
CPU model: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
CPU cores: 32
CPU temperature: +45.0°C
CPU1 Temp | 45.000 | degrees C | ok

[Memory diagnostics]
Total memory: 125Gi
Memory used: 2.1Gi
ECC errors: mc0: 0 CE, 0 UE

[Disk diagnostics]
/dev/sda: PASSED
/dev/sdb: PASSED

[Network diagnostics]
eth0: link=1, speed=10000Mb/s
eth1: link=1, speed=10000Mb/s

[Power supply diagnostics]
Power Supply 1: OK
Power Supply 2: OK

[Fan diagnostics]
Fan1: 4560 RPM
Fan2: 4560 RPM
Fan3: 4560 RPM

[Temperature diagnostics]
Inlet Temp: 25.000°C
CPU1 Temp: 45.000°C
CPU2 Temp: 43.000°C

[Recent SEL events]
1 | 04/03/2026 | 10:00:00 | System Board | System Event | Asserted

[Health summary]
Overall health: OK

==========================================
Diagnostics complete
==========================================

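Fengge's advice for production environments: build a complete hardware monitoring system, run regular hardware inspections, configure alert thresholds, keep spare parts in stock, record how each fault was handled, and build a hardware fault knowledge base.

This article was compiled and published by Fengge Tutorials (风哥教程) for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html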
