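IT Tutorial FG296: Hardware Server Fault Diagnosis and Maintenance in Practice

1. Hardware Fault Types

Hardware server faults fall mainly into six areas: CPU, memory, storage, network, power supply, and cooling.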

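# Common hardware fault types

## CPU faults
– Overheating that causes throttling or a system halt
– Bent or damaged pins that prevent the server from booting
– Internal defects that make the system unstable

## Memory faults
– A failed memory module that prevents booting
– Poorly seated modules that cause system crashes
– Insufficient memory capacity that degrades performance

## Storage faults
– Physical disk damage that causes data loss
– Logical disk faults that make data inaccessible
– RAID array failures that interrupt service

## Network faults
– A failed network adapter that drops the connection
– A loose network cable that makes the link unstable
– A failed network switch that cuts off the network

## Power faults
– A failed power supply unit (PSU) that brings the system down
– Unstable voltage that damages equipment
– Failure of one redundant PSU, which leaves the remaining unit as a single point of failure

## Cooling system faults
– Fan failure that causes overheating
– Dust buildup on heat sinks that impairs cooling
– A faulty temperature sensor that raises false alarms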

2. Fault Diagnosis Tools

Dedicated diagnostic tools make it possible to locate hardware faults quickly.

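# Hardware diagnostic tools

## Built-in vendor diagnostic tools
# Dell servers: read the System Event Log (SEL) through iDRAC
# racadm -r 192.168.1.100 -u root -p P@ssw0rd getsel

# HP servers: export the current iLO configuration to config.xml for review
# hponcfg -w config.xml

# IBM servers (the management controller also answers standard IPMI, e.g. ipmitool -I lanplus -H 192.168.1.100 -U USERID -P PASSW0RD sel list)
# ibmvmc -h 192.168.1.100 -u USERID -p PASSW0RD inventory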
## Operating system tools
# Show CPU information
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
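
The relationship between the `lscpu` fields above can be checked mechanically. The following Python sketch parses that output and tests whether `CPU(s)` equals threads × cores × sockets; a mismatch may hint at cores that are disabled in firmware or offline. The function names `parse_lscpu` and `topology_is_consistent` are illustrative assumptions, not part of `lscpu`.

```python
def parse_lscpu(text):
    """Turn 'Key: value' lines from lscpu output into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def topology_is_consistent(info):
    """Check that CPU(s) == threads per core * cores per socket * sockets."""
    expected = (int(info["Thread(s) per core"])
                * int(info["Core(s) per socket"])
                * int(info["Socket(s)"]))
    return int(info["CPU(s)"]) == expected


# Values copied from the sample output above.
SAMPLE = """\
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
"""

if __name__ == "__main__":
    print(topology_is_consistent(parse_lscpu(SAMPLE)))
```

In practice the text would come from running `lscpu` on the server, for example via `subprocess.run(["lscpu"], capture_output=True, text=True).stdout`.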

# Show memory information
# dmidecode -t memory
# free -h
              total        used        free      shared  buff/cache   available
Mem:            62G        2.1G         58G        8.5M        1.8G         59G
Swap:           32G          0B         32G

# Show disk information
# lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda             8:0    0  500G  0 disk
├─sda1          8:1    0    1G  0 part /boot
└─sda2          8:2    0  499G  0 part
  ├─ol-root   253:0    0   50G  0 lvm  /
  ├─ol-swap   253:1    0   32G  0 lvm  [SWAP]
  └─ol-home   253:2    0  417G  0 lvm  /home

# Show network information
# ifconfig
# ethtool eth0

# Show temperature information
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +35.0°C (high = +80.0°C, crit = +90.0°C)
Core 0: +32.0°C (high = +80.0°C, crit = +90.0°C)
Core 1: +33.0°C (high = +80.0°C, crit = +90.0°C)
Core 2: +34.0°C (high = +80.0°C, crit = +90.0°C)
Core 3: +32.0°C (high = +80.0°C, crit = +90.0°C)

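3. CPU Fault Diagnosis and Maintenance

The CPU is the server's core component; a CPU fault can keep the system from running normally at all.

# CPU fault diagnosis

## Check CPU temperature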
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +75.0°C (high = +80.0°C, crit = +90.0°C)
Core 0: +72.0°C (high = +80.0°C, crit = +90.0°C)
Core 1: +73.0°C (high = +80.0°C, crit = +90.0°C)
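
The readings above sit only a few degrees below the `high` threshold. As a minimal sketch, assuming the `sensors` text format shown here, the following Python code flags every sensor whose temperature is within a chosen margin of its `high` limit. The name `hot_sensors` and the `margin` parameter are hypothetical helpers introduced for illustration.

```python
import re

# Matches lines such as: "Core 0: +72.0°C (high = +80.0°C, crit = +90.0°C)"
LINE = re.compile(
    r"^(.+?):\s*([+-]?\d+(?:\.\d+)?)°C\s*\(high = ([+-]?\d+(?:\.\d+)?)°C"
)


def hot_sensors(sensors_text, margin=10.0):
    """Return labels of sensors whose reading is within `margin` °C of 'high'."""
    hot = []
    for line in sensors_text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # header lines such as "Adapter: ISA adapter"
        label, temp, high = m.group(1), float(m.group(2)), float(m.group(3))
        if temp >= high - margin:
            hot.append(label)
    return hot


# Sample copied from the output above.
SAMPLE = """\
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +75.0°C (high = +80.0°C, crit = +90.0°C)
Core 0: +72.0°C (high = +80.0°C, crit = +90.0°C)
Core 1: +73.0°C (high = +80.0°C, crit = +90.0°C)
"""

if __name__ == "__main__":
    print(hot_sensors(SAMPLE, margin=5.0))
```

A wider margin gives earlier warning at the cost of more alerts; the right value depends on the workload and the cooling headroom of the machine.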

## Check CPU utilization
# top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1234 root 20 0 100000 50000 20000 S 95.0 2.5 0:30.00 stress

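## Check the CPU error log
# The sample below shows only routine microcode messages; actual CPU hardware faults are logged as "mce: [Hardware Error]" lines.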
# dmesg | grep -i cpu
[ 0.000000] CPU0 microcode updated early to revision 0x200004d, date = 2020-01-23
[ 0.000000] CPU1 microcode updated early to revision 0x200004d, date = 2020-01-23
[ 0.000000] CPU2 microcode updated early to revision 0x200004d, date = 2020-01-23
[ 0.000000] CPU3 microcode updated early to revision 0x200004d, date = 2020-01-23
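
Because `grep -i cpu` also matches routine boot messages like the ones above, it can help to filter specifically for machine-check reports. The sketch below is one possible filter in Python; the function name `hardware_error_lines` is hypothetical, and the second line of the sample is an illustrative machine-check message rather than output taken from this server.

```python
import re

# Machine-check reports from the x86 kernel carry a "[Hardware Error]" tag.
HW_ERROR = re.compile(r"\[Hardware Error\]|machine check", re.IGNORECASE)


def hardware_error_lines(dmesg_text):
    """Return only the kernel log lines that report hardware errors."""
    return [line for line in dmesg_text.splitlines() if HW_ERROR.search(line)]


SAMPLE = """\
[    0.000000] CPU0 microcode updated early to revision 0x200004d, date = 2020-01-23
[ 1234.567890] mce: [Hardware Error]: Machine check events logged
"""

if __name__ == "__main__":
    for line in hardware_error_lines(SAMPLE):
        print(line)
```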

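## CPU maintenance
1. Clean dust from the CPU heat sink and fans regularly
2. Keep the server room at a suitable temperature (18-24°C)
3. Do not overclock the CPU; run it at its default frequency
4. Update the CPU microcode regularly
5. Make sure the server has a stable power supply

Fengge's tip: Overheating is a common cause of CPU faults. Check the cooling system regularly so the CPU stays within its normal operating temperature range.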

4. Memory Fault Diagnosis and Maintenance

Memory faults can cause serious problems such as system crashes and data loss.

# Memory fault diagnosis

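## Test memory with MemTest86
# Download the MemTest86 USB image
# wget https://www.memtest86.com/downloads/memtest86-usb.zip
# unzip memtest86-usb.zip
# Write it to a USB stick (this erases /dev/sdb; confirm the device name with lsblk first)
# dd if=memtest86-usb.img of=/dev/sdb bs=1M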

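## Check the memory error log
# The sample below is the boot-time memory summary; on ECC systems, corrected and uncorrected errors are reported by the EDAC driver (dmesg | grep -i edac).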
# dmesg | grep -i memory
[ 0.000000] Memory: 65972352K/67108864K available (10240K kernel code, 1433K rwdata, 4096K rodata, 1024K init, 8192K bss, 1136512K reserved, 0K cma-reserved)
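
On servers with ECC memory and a loaded EDAC driver, the kernel also exposes per-controller error counters under `/sys/devices/system/edac/mc/`. The Python sketch below reads the `ce_count` (corrected) and `ue_count` (uncorrected) files; it assumes that sysfs layout is present on the machine, and the function name `edac_error_counts` is hypothetical.

```python
from pathlib import Path


def edac_error_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text())
            uncorrected = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # controller directory without readable counters
        counts[mc.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in edac_error_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A steadily rising corrected count on one controller is usually a reason to schedule a module replacement before uncorrected errors appear.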

## Check memory usage
# free -h
              total        used        free      shared  buff/cache   available
Mem:            62G        2.1G         58G        8.5M        1.8G         59G
Swap:           32G          0B         32G

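## Memory maintenance
1. Clean dust from the memory slots regularly
2. Make sure memory modules are firmly seated
3. Use modules of the same brand, model, and capacity
4. Avoid mixing modules with different frequencies
5. Check memory usage regularly to avoid running short of memory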

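5. Storage Fault Diagnosis and Maintenance

Storage faults cause data loss and service interruption, so they need prompt diagnosis and handling.

# Storage fault diagnosis

## Check disk status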
# smartctl -a /dev/sda
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 183 170 021 Pre-fail Always - 6181
4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 1234
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 12345
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 567
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 123
193 Load_Cycle_Count 0x0032 198 198 000 Old_age Always - 4567
194 Temperature_Celsius 0x0022 118 100 000 Old_age Always - 32
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
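
In the attribute table above, the raw values of `Reallocated_Sector_Ct`, `Current_Pending_Sector`, and `Offline_Uncorrectable` are the ones most often watched as early signs of a failing disk; all three are 0 here. A minimal Python sketch for flagging non-zero values, assuming the ten-column table layout shown above, might look like this. The set `WATCHED` and the function `failing_attributes` are illustrative choices, and the sample rows contain made-up non-zero values.

```python
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def failing_attributes(smart_text, watched=WATCHED):
    """Return {attribute: raw_value} for watched attributes with a non-zero raw value."""
    problems = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in watched and raw.isdigit() and int(raw) > 0:
            problems[name] = int(raw)
    return problems


# Illustrative rows with non-zero counts (not taken from the output above).
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 8
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 12345
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
"""

if __name__ == "__main__":
    print(failing_attributes(SAMPLE))
```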

## Check RAID status
# megacli -LDInfo -Lall -aAll
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 499.0 GB
Sector Size : 512
Is VD emulated : No
Parity Size : 249.5 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
Is VD Cached: No
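
For monitoring, the key field in this output is `State`, which should read `Optimal`. The Python sketch below scans the same text format and lists every virtual drive in another state. The function name `degraded_virtual_drives` is hypothetical, and the sample includes an invented second drive in `Degraded` state for illustration.

```python
import re


def degraded_virtual_drives(ldinfo_text):
    """Return [(virtual_drive_number, state)] for drives whose State is not Optimal."""
    results = []
    current = None
    for line in ldinfo_text.splitlines():
        m = re.match(r"\s*Virtual Drive:\s*(\d+)", line)
        if m:
            current = int(m.group(1))
            continue
        m = re.match(r"\s*State\s*:\s*(\S.*)", line)
        if m and current is not None:
            state = m.group(1).strip()
            if state != "Optimal":
                results.append((current, state))
    return results


# First drive mirrors the output above; the second is an invented example.
SAMPLE = """\
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
State : Optimal
Virtual Drive: 1 (Target Id: 1)
State : Degraded
"""

if __name__ == "__main__":
    print(degraded_virtual_drives(SAMPLE))
```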

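## Storage maintenance
1. Back up important data regularly
2. Check disk health regularly
3. Monitor RAID array status
4. Keep storage devices from overheating
5. Clean dust from storage devices regularly
6. Use a UPS to keep power stable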

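6. Network Fault Diagnosis and Maintenance

Network faults make the server unreachable and disrupt business operations.

# Network fault diagnosis

## Check network interface status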
# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.100 netmask 255.255.255.0 broadcast 192.168.1.255
inet6 fe80::250:56ff:fe8b:1234 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:8b:12:34 txqueuelen 1000 (Ethernet)
RX packets 1000000 bytes 1000000000 (953.6 MiB)
TX packets 800000 bytes 800000000 (762.9 MiB)

## Check network connectivity
# ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.500 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=0.450 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=0.420 ms
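
Reply lines like these carry two useful signals: round-trip time and the `icmp_seq` counter, where gaps indicate lost packets. As a rough sketch, assuming the Linux `ping` reply format shown above, the following Python function summarizes both. The name `summarize_ping` and the shape of the returned dict are illustrative assumptions.

```python
import re

# Matches the tail of a reply line: "icmp_seq=1 ttl=64 time=0.500 ms"
REPLY = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")


def summarize_ping(text):
    """Summarize reply lines: reply count, average RTT in ms, missing sequence numbers."""
    replies = [(int(seq), float(ms)) for seq, ms in REPLY.findall(text)]
    if not replies:
        return {"replies": 0, "avg_ms": None, "missing_seq": []}
    seqs = [seq for seq, _ in replies]
    missing = sorted(set(range(min(seqs), max(seqs) + 1)) - set(seqs))
    avg = sum(ms for _, ms in replies) / len(replies)
    return {"replies": len(replies), "avg_ms": round(avg, 3), "missing_seq": missing}


# Sample copied from the output above.
SAMPLE = """\
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.500 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=0.450 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=0.420 ms
"""

if __name__ == "__main__":
    print(summarize_ping(SAMPLE))
```

Intermittent gaps in the sequence on a local segment often point to cabling or NIC problems rather than routing.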

## Check network routes
# ip route
default via 192.168.1.1 dev eth0 proto static metric 100
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.100 metric 100

## Check listening ports
# netstat -tuln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN

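## Network maintenance
1. Check network cable connections regularly
2. Make sure network equipment is properly grounded
3. Keep network equipment from overheating
4. Update network equipment firmware regularly
5. Set up network monitoring and alerting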

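7. Power Fault Diagnosis and Maintenance

Power faults can bring a server down without warning and may cause data loss. Author: www.itpux.com

# Power fault diagnosis

## Check power status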
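# Dell servers: getsensorinfo lists power supply, fan, and temperature sensors
# racadm -r 192.168.1.100 -u root -p P@ssw0rd getsensorinfo

# HP servers: hponcfg only exports the iLO configuration; with the hp-health package installed, hpasmcli reports PSU status
# hponcfg -w config.xml
# hpasmcli -s "show powersupply"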

## Check power-related log entries
# dmesg | grep -i power
[ 0.000000] ACPI: Power Button [PWRB]
[ 0.000000] ACPI: Sleep Button [SLPB]
[ 0.000000] ACPI: Power Resource [P0L1]
[ 0.000000] ACPI: Power Resource [P0L2]

## Check UPS status
# upsc ups@fgedudb
battery.charge: 100
battery.charge.low: 20
battery.charge.warning: 50
battery.mfr.date: 2026/01/01
battery.runtime: 3600
battery.type: PbAcid
battery.voltage: 24.0
battery.voltage.nominal: 24.0
device.mfr: APC
device.model: Smart-UPS 1500
device.type: ups
driver.name: usbhid-ups
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
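
The `upsc` output is a flat list of `key: value` pairs, which makes it easy to evaluate in a script. The Python sketch below classifies the battery using the `battery.charge.low` and `battery.charge.warning` thresholds reported by the UPS itself. The function names and the three status labels are assumptions made for illustration, as are the fallback thresholds of 20 and 50 used when the UPS does not report them.

```python
def parse_upsc(text):
    """Turn 'key: value' lines from upsc output into a dict of strings."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip()] = value.strip()
    return values


def battery_status(values):
    """Classify battery charge as 'low', 'warning', or 'ok'."""
    charge = float(values["battery.charge"])
    # Fallback thresholds are assumptions for UPS models that omit these keys.
    if charge <= float(values.get("battery.charge.low", 20)):
        return "low"
    if charge <= float(values.get("battery.charge.warning", 50)):
        return "warning"
    return "ok"


# Values copied from the sample output above.
SAMPLE = """\
battery.charge: 100
battery.charge.low: 20
battery.charge.warning: 50
battery.runtime: 3600
"""

if __name__ == "__main__":
    print(battery_status(parse_upsc(SAMPLE)))
```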

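## Power maintenance
1. Use a UPS to keep power stable
2. Check PSU fans and cooling regularly
3. Make sure the load is balanced across the server room's power circuits
4. Test PSU output voltage regularly
5. Use redundant power supplies for critical servers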

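8. Cooling System Fault Diagnosis and Maintenance

Cooling system faults cause the server to overheat, which degrades performance and can damage hardware.

# Cooling system fault diagnosis

## Check fan status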
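# Dell servers: getsensorinfo includes fan status and speed readings
# racadm -r 192.168.1.100 -u root -p P@ssw0rd getsensorinfo

# HP servers: hponcfg only exports the iLO configuration; with the hp-health package installed, hpasmcli reports fan status
# hponcfg -w config.xml
# hpasmcli -s "show fans"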

## Check temperature sensors
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +35.0°C (high = +80.0°C, crit = +90.0°C)
Core 0: +32.0°C (high = +80.0°C, crit = +90.0°C)
Core 1: +33.0°C (high = +80.0°C, crit = +90.0°C)
Core 2: +34.0°C (high = +80.0°C, crit = +90.0°C)
Core 3: +32.0°C (high = +80.0°C, crit = +90.0°C)

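## Cooling system maintenance
1. Clean dust from inside the server regularly
2. Make sure the server room is well ventilated
3. Check that fans are running properly on a regular basis
4. Leave enough clearance around the server for airflow
5. Check the accuracy of temperature sensors regularly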

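9. Preventive Maintenance Strategy

Preventive maintenance reduces hardware faults and extends server service life.

Fengge's recommendations for production environments:
– Draw up a regular maintenance plan covering monthly, quarterly, and annual tasks
– Back up important data and configuration regularly
– Check hardware health regularly
– Keep the server room clean and at a suitable temperature
– Update server firmware and drivers regularly
– Establish a sound fault response process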

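# Preventive maintenance plan

## Monthly maintenance
1. Check server temperature and fan status
2. Check disk health
3. Check RAID array status
4. Check network connection status
5. Check system logs for error messages

## Quarterly maintenance
1. Clean dust from inside the server
2. Check PSU status
3. Check that memory modules are firmly seated
4. Test the UPS battery
5. Update server firmware and drivers

## Annual maintenance
1. Inspect all server hardware thoroughly
2. Test backup and restore
3. Assess server performance and capacity
4. Draw up a hardware upgrade plan
5. Review and update the disaster recovery plan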
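
The monthly, quarterly, and annual cycles above can be tracked with very little code. The following Python sketch reports which cycles are overdue given the date each was last completed. The interval lengths of 30, 90, and 365 days and the names `INTERVALS` and `overdue` are assumptions chosen for illustration, not values prescribed by the plan.

```python
from datetime import date, timedelta

# Assumed cycle lengths; adjust to the site's own maintenance policy.
INTERVALS = {
    "monthly": timedelta(days=30),
    "quarterly": timedelta(days=90),
    "annual": timedelta(days=365),
}


def overdue(last_done, today):
    """Return the sorted names of cycles whose interval has elapsed since last_done."""
    return sorted(
        cycle for cycle, done in last_done.items()
        if today - done > INTERVALS[cycle]
    )


if __name__ == "__main__":
    last_done = {
        "monthly": date(2024, 1, 1),
        "quarterly": date(2024, 1, 1),
        "annual": date(2024, 1, 1),
    }
    print(overdue(last_done, today=date(2024, 3, 1)))
```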

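# Fault response process
1. Identify the fault: detect it through the monitoring system or user reports
2. Initial diagnosis: use diagnostic tools to determine the fault type
3. Plan the fix: choose a solution that matches the fault type
4. Apply the fix: carry out the repair
5. Verify the fix: confirm that the fault is resolved
6. Record and analyze: document the cause and the solution, and work out preventive measures

Fengge's tip: Diagnosing and maintaining server hardware takes specialist knowledge and tools. Regular preventive maintenance greatly reduces faults and keeps servers running stably.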

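Compiled and published by Fengge Tutorials (风哥教程) for study and testing purposes only. When reposting, please credit the source: http://www.fgedu.net.cn/10327.html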
