
IT Tutorial FG323: Server Hardware Monitoring and Management

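1. Hardware Monitoring Overview

Server hardware monitoring is essential to keeping servers running reliably. Tracking metrics such as CPU, memory, disk, network, and temperature lets you detect and handle hardware faults promptly.

# View server hardware information
# dmidecode -t system | head -20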
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.

Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R750
Version: Not Specified
Serial Number: ABCD1234
UUID: 12345678-1234-1234-1234-123456789012
Wake-up Type: Power Switch
SKU Number: 0917
Family: PowerEdge

# View CPU information
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Gold 6348R CPU @ 2.60GHz
Stepping: 6
CPU MHz: 2600.000
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5200.00
Virtualization: VT-x
L1d cache: 48K
L1i cache: 32K
L2 cache: 2048K
L3 cache: 49152K

# View memory information
# dmidecode -t memory | grep -E "Size|Type|Speed|Manufacturer" | head -20
Size: 32 GB
Type: DDR4
Speed: 3200 MT/s
Manufacturer: Samsung
Size: 32 GB
Type: DDR4
Speed: 3200 MT/s
Manufacturer: Samsung
Size: 32 GB
Type: DDR4
Speed: 3200 MT/s
Manufacturer: Samsung
Size: 32 GB
Type: DDR4
Speed: 3200 MT/s
Manufacturer: Samsung

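Fengge's recommendation for production environments: deploy a comprehensive hardware monitoring system, set sensible alert thresholds, check hardware health regularly, and keep a hardware maintenance plan.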

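2. IPMI Configuration and Management

IPMI (Intelligent Platform Management Interface) is a key interface for server hardware management. It lets you monitor and manage a server remotely.

# Install ipmitool
# yum install -y ipmitool

# Load the IPMI kernel modules
# modprobe ipmi_devintf
# modprobe ipmi_si

# View the IPMI LAN configuration
# ipmitool lan print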
Set in Progress : Set Complete
Auth Type Support : NONE MD2 MD5 PASSWORD
Auth Type Enable : Callback : MD2 MD5 PASSWORD
: User : MD2 MD5 PASSWORD
: Operator : MD2 MD5 PASSWORD
: Admin : MD2 MD5 PASSWORD
: OEM : MD2 MD5 PASSWORD
IP Address Source : Static Address
IP Address : 192.168.1.200
Subnet Mask : 255.255.255.0
MAC Address : 00:50:56:ab:cd:ef
Default Gateway IP : 192.168.1.1
VLAN ID : Disabled
Cipher Suite Priv Max : Not Available
Bad Password Threshold : Not Available

# Configure the IPMI network (channel 1)
# ipmitool lan set 1 ipaddr 192.168.1.200
Setting LAN IP Address to 192.168.1.200
# ipmitool lan set 1 netmask 255.255.255.0
Setting LAN Subnet Mask to 255.255.255.0
# ipmitool lan set 1 defgw ipaddr 192.168.1.1
Setting LAN Default Gateway IP to 192.168.1.1

# Configure IPMI users (list the existing users first)
# ipmitool user list 1
ID Name Callin Link Auth IPMI Msg Channel Priv Limit
1 false false true ADMINISTRATOR
2 root false false true ADMINISTRATOR

# Set the password for user ID 2
# ipmitool user set password 2 Fgedu@IPMI123
Set User Password command successful (user 2)

# View sensor data
# ipmitool sensor list | head -20
CPU1 Temp | 45.000 | degrees C | ok | na | 0.000 | 0.000 | 0.000 | 95.000 | 100.000 | 100.000 | na
CPU2 Temp | 43.000 | degrees C | ok | na | 0.000 | 0.000 | 0.000 | 95.000 | 100.000 | 100.000 | na
System Temp | 28.000 | degrees C | ok | na | -9.000 | -7.000 | -5.000 | 85.000 | 90.000 | 95.000 | na
Fan1 | 4560.000 | RPM | ok | na | 600.000 | 600.000 | 600.000 | na | na | na | na
Fan2 | 4560.000 | RPM | ok | na | 600.000 | 600.000 | 600.000 | na | na | na | na
Fan3 | 4560.000 | RPM | ok | na | 600.000 | 600.000 | 600.000 | na | na | na | na
12V | 12.156 | Volts | ok | na | 10.272 | 10.560 | 10.848 | 12.960 | 13.248 | 13.536 | na
5V | 5.064 | Volts | ok | na | 4.248 | 4.368 | 4.488 | 5.464 | 5.584 | 5.704 | na
3.3V | 3.312 | Volts | ok | na | 2.784 | 2.856 | 2.928 | 3.588 | 3.660 | 3.732 | na
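
The status column of this listing lends itself to a simple filter. The following is a minimal sketch, assuming the `|`-separated layout shown above and assuming that any status other than `ok` or `na` deserves attention; `flag_bad_sensors` is a hypothetical helper name, not part of ipmitool.

```shell
#!/bin/bash
# Print name, reading, unit and status for every sensor whose status
# column is neither "ok" nor "na".
# Expects `ipmitool sensor list` output on stdin.
flag_bad_sensors() {
    awk -F'|' '
        NF >= 4 {
            # Trim surrounding whitespace from the four columns we use.
            for (i = 1; i <= 4; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($4 != "ok" && $4 != "na")
                print $1 ": " $2 " " $3 " (" $4 ")"
        }'
}
```

It could be used as `ipmitool sensor list | flag_bad_sensors`; an empty result means every sensor that reports a status is `ok`.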

# View the system event log (SEL)
# ipmitool sel list | tail -20
1 | 04/03/2026 | 10:00:00 | Temperature #0x1 | Upper Critical going high | Asserted
2 | 04/03/2026 | 10:05:00 | Temperature #0x1 | Upper Critical going high | Deasserted
3 | 04/03/2026 | 10:10:00 | Fan #0x1 | Lower Critical going low | Asserted
4 | 04/03/2026 | 10:15:00 | Fan #0x1 | Lower Critical going low | Deasserted
5 | 04/03/2026 | 10:20:00 | Power Supply #0x1 | Power Supply AC lost | Asserted
6 | 04/03/2026 | 10:25:00 | Power Supply #0x1 | Power Supply AC lost | Deasserted
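
To see which sensors trigger most often, the event log can be summarized per sensor. This is a rough sketch that assumes the six-column `|`-separated layout shown above; `sel_asserted_counts` is a hypothetical helper name.

```shell
#!/bin/bash
# Count "Asserted" events per sensor and list the busiest sensors first.
# Expects `ipmitool sel list` output on stdin.
sel_asserted_counts() {
    awk -F' *[|] *' '
        $6 ~ /^Asserted/ { n[$4]++ }          # "Deasserted" does not match
        END { for (s in n) print n[s], s }' | sort -rn
}
```

For example, `ipmitool sel list | sel_asserted_counts` prints one line per sensor in the form `<count> <sensor>`.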

# View power status
# ipmitool power status
Chassis Power is on

# Power on, power off, or reset remotely
# ipmitool -H 192.168.1.200 -U root -P Fgedu@IPMI123 power on
Chassis Power Control: Up/On

# ipmitool -H 192.168.1.200 -U root -P Fgedu@IPMI123 power off
Chassis Power Control: Down/Off

# ipmitool -H 192.168.1.200 -U root -P Fgedu@IPMI123 power reset
Chassis Power Control: Reset

# View chassis status
# ipmitool chassis status
System Power : on
Power Overload : false
Power Interlock : inactive
Main Power Fault : false
Power Control Fault : false
Power Restore Policy : always-off
Last Power Event :
Chassis Intrusion : inactive
Front Panel Lockout : inactive
Drive Fault : false
Cooling/Fan Fault : false

3. CPU Monitoring

CPU monitoring covers utilization, temperature, frequency, and other metrics.

# View CPU utilization
# top -bn1 | head -20
top - 10:00:00 up 30 days, 5:30, 3 users, load average: 0.50, 0.30, 0.10
Tasks: 200 total, 1 running, 199 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.0 us, 2.0 sy, 0.0 ni, 92.5 id, 0.5 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 131072.0 total, 120000.0 free, 10000.0 used, 1072.0 buff/cache
MiB Swap: 32768.0 total, 32768.0 free, 0.0 used. 120000.0 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 169924 11256 8236 S 0.0 0.0 1:23.45 systemd
123 root 20 0 0 0 0 S 0.0 0.0 0:05.67 kworker/0:1
456 root 20 0 0 0 0 S 0.0 0.0 0:03.21 kworker/1:0

# View per-core CPU utilization
# mpstat -P ALL 1 3
Linux 5.10.0-136.65.0.122.oe2203sp3.x86_64 (fgedu-server01) 04/03/2026 _x86_64_ (64 CPU)

10:00:00 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:00:01 AM all 5.00 0.00 2.00 0.50 0.00 0.00 0.00 0.00 0.00 92.50
10:00:01 AM 0 4.00 0.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00 94.00
10:00:01 AM 1 5.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 93.00
10:00:01 AM 2 6.00 0.00 3.00 0.00 0.00 0.00 0.00 0.00 0.00 91.00
10:00:01 AM 3 4.00 0.00 2.00 1.00 0.00 0.00 0.00 0.00 0.00 93.00

Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: all 5.00 0.00 2.00 0.50 0.00 0.00 0.00 0.00 0.00 92.50
Average: 0 4.00 0.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00 94.00
Average: 1 5.00 0.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 93.00

# View CPU temperature (values are in millidegrees Celsius)
# cat /sys/class/thermal/thermal_zone*/temp
45000
43000
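
The sysfs thermal interface reports millidegrees Celsius, so 45000 means 45.0°C. A small sketch of the conversion follows; `millideg_to_celsius` is a hypothetical helper name.

```shell
#!/bin/bash
# Convert millidegree readings (one per line on stdin) to degrees Celsius.
millideg_to_celsius() {
    awk '{ printf "%.1f\n", $1 / 1000 }'
}
```

It could be used as `cat /sys/class/thermal/thermal_zone*/temp | millideg_to_celsius`.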

# View temperatures with sensors
# yum install -y lm_sensors
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +45.0°C (high = +95.0°C, crit = +100.0°C)
Core 0: +44.0°C (high = +95.0°C, crit = +100.0°C)
Core 1: +45.0°C (high = +95.0°C, crit = +100.0°C)
Core 2: +43.0°C (high = +95.0°C, crit = +100.0°C)
Core 3: +44.0°C (high = +95.0°C, crit = +100.0°C)

# View CPU frequency
# grep "cpu MHz" /proc/cpuinfo | head -10
cpu MHz : 2600.000
cpu MHz : 2600.000
cpu MHz : 2600.000
cpu MHz : 2600.000
cpu MHz : 2600.000
cpu MHz : 2600.000
cpu MHz : 2600.000
cpu MHz : 2600.000

# View CPU load
# cat /proc/loadavg
0.50 0.30 0.10 2/500 12345
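
The first three fields are the 1-, 5- and 15-minute load averages. A load average only means something relative to the number of CPUs, so one common rule of thumb is to treat a 1-minute load above the logical CPU count as saturation. The sketch below applies that assumed threshold; `load_exceeds_cores` is a hypothetical helper name.

```shell
#!/bin/bash
# Succeed (exit 0) when the 1-minute load average in a /proc/loadavg line
# is greater than the given number of logical CPUs.
# $1: the loadavg line, $2: CPU count
load_exceeds_cores() {
    printf '%s\n' "$1" | awk -v cores="$2" '{ exit !($1 > cores) }'
}
```

For example: `load_exceeds_cores "$(cat /proc/loadavg)" "$(nproc)" && echo "load is above the CPU count"`.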

4. Memory Monitoring

Memory monitoring covers utilization, errors, bandwidth, and other metrics.

# View memory usage
# free -h
total used free shared buff/cache available
Mem: 128Gi 10Gi 117Gi 1.0Gi 1.0Gi 117Gi
Swap: 32Gi 0B 32Gi

# View detailed memory information
# cat /proc/meminfo
MemTotal: 134217728 kB
MemFree: 123731968 kB
MemAvailable: 123731968 kB
Buffers: 102400 kB
Cached: 512000 kB
SwapCached: 0 kB
Active: 512000 kB
Inactive: 256000 kB
Active(anon): 256000 kB
Inactive(anon): 128000 kB
Active(file): 256000 kB
Inactive(file): 128000 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 33554432 kB
SwapFree: 33554432 kB
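
A usage percentage can be derived from these fields. The sketch below uses `MemAvailable` rather than `MemFree`, on the assumption that reclaimable cache should not count as used; `mem_used_percent` is a hypothetical helper name.

```shell
#!/bin/bash
# Print memory usage as a whole percentage, computed as
# (MemTotal - MemAvailable) / MemTotal * 100.
# Expects /proc/meminfo-style input on stdin.
mem_used_percent() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { if (total > 0) printf "%.0f\n", (total - avail) * 100 / total }'
}
```

With the values shown above, `mem_used_percent < /proc/meminfo` would print 8, since (134217728 - 123731968) / 134217728 is about 7.8%.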

# View memory usage trends
# vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 123731968 102400 512000 0 0 0 0 1000 2000 5 2 92 1 0
0 0 0 123731968 102400 512000 0 0 0 0 1000 2000 5 2 92 1 0
0 0 0 123731968 102400 512000 0 0 0 0 1000 2000 5 2 92 1 0
0 0 0 123731968 102400 512000 0 0 0 0 1000 2000 5 2 92 1 0
0 0 0 123731968 102400 512000 0 0 0 0 1000 2000 5 2 92 1 0

# View memory errors
# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: 0 Corrected Errors

# View ECC corrected-error counts
# cat /sys/devices/system/edac/mc/mc*/ce_count
0
0

# View NUMA memory distribution
# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 65536 MB
node 0 free: 60000 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 65536 MB
node 1 free: 62000 MB
node distances:
node 0 1
0: 10 21
1: 21 10

5. Disk Monitoring

Disk monitoring covers utilization, I/O performance, SMART status, and more. Author: www.itpux.com

# View disk usage
# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 63G 0 63G 0% /dev
tmpfs 63G 0 63G 0% /dev/shm
tmpfs 63G 8.5M 63G 1% /run
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/mapper/vg_data-lv_root 100G 10G 90G 10% /
/dev/sda1 1014M 150M 865M 15% /boot
/dev/mapper/vg_data-lv_data 500G 50G 450G 10% /data
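
For scripted checks, `df -P` is easier to parse than `df -h` because it keeps every filesystem on a single line. The sketch below lists the mount points above a usage threshold. It assumes mount points contain no spaces, and `fs_over_threshold` is a hypothetical helper name.

```shell
#!/bin/bash
# List mount points whose usage percentage exceeds the given threshold.
# $1: threshold in percent. Expects `df -P` output on stdin.
fs_over_threshold() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "90%" -> "90"
            if (use + 0 > limit + 0) print $6 " " use "%"
        }'
}
```

It could be used as `df -P | fs_over_threshold 85`.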

# View disk I/O
# iostat -x 1 3
Linux 5.10.0-136.65.0.122.oe2203sp3.x86_64 (fgedu-server01) 04/03/2026 _x86_64_ (64 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
5.00 0.00 2.00 0.50 0.00 92.50

Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 10.00 20.00 512.00 1024.00 0.00 0.00 0.00 0.00 0.50 0.30 0.01 51.20 51.20 0.50 1.50
sdb 50.00 100.00 2560.00 5120.00 0.00 0.00 0.00 0.00 0.20 0.10 0.02 51.20 51.20 0.30 4.50

# View disk SMART information
# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-136.65.0.122.oe2203sp3.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Samsung SSD 970 EVO Plus
Device Model: Samsung SSD 970 EVO Plus 500GB
Serial Number: ABCD1234EFGH5678
LU WWN Device Id: 5 002538 8a000000
Firmware Version: 2B2QEXM7
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical, 512 bytes physical
Rotation Rate: Solid State Device
Form Factor: M.2
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1234
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 100
177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 1
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
187 Reported_Uncorrect 0x0012 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 065 060 000 Old_age Always - 35 (Min/Max 25/45)
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
232 Available_Reservd_Space 0x0013 099 099 010 Pre-fail Always - 99
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 1234567890
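
The overall PASSED result can hide early warning signs, so it is worth checking a few raw values as well. The sketch below reports three attributes when their raw value is non-zero. The choice of attributes is an assumption for illustration and should be adapted to the drive model, and `smart_nonzero_attrs` is a hypothetical helper name.

```shell
#!/bin/bash
# Print watched SMART attributes whose RAW_VALUE (column 10) is non-zero.
# Expects `smartctl -A` or `smartctl -a` output on stdin.
smart_nonzero_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Reported_Uncorrect" {
            if ($10 + 0 > 0) print $2 "=" $10
        }'
}
```

It could be used as `smartctl -A /dev/sda | smart_nonzero_attrs`; no output means all three counters are zero.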

6. Network Monitoring

Network monitoring covers traffic, packet loss, errors, and other metrics.

# View network interface statistics
# cat /proc/net/dev
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
lo: 12345678 123456 0 0 0 0 0 0 12345678 123456 0 0 0 0 0 0
eth0: 1234567890 1234567 0 0 0 0 0 0 987654321 987654 0 0 0 0 0 0
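
After the interface name, each line carries eight receive counters followed by eight transmit counters, in the order given by the header. A minimal sketch that pulls out only the error and drop counters follows; `net_error_counters` is a hypothetical helper name.

```shell
#!/bin/bash
# Print per-interface error and drop counters.
# Expects /proc/net/dev content on stdin (two header lines, then one line
# per interface).
net_error_counters() {
    awk '
        NR > 2 {
            sub(/:/, " ")   # split "eth0:" from the first counter; fields are re-parsed
            print $1, "rx_errs=" $4, "rx_drop=" $5, "tx_errs=" $12, "tx_drop=" $13
        }'
}
```

It could be used as `net_error_counters < /proc/net/dev`.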

# View network traffic
# iftop -t -s 10 -n
interface: eth0
IP traffic monitor
Listening on eth0
10.0s 20.0s 30.0s 40.0s 50.0s 60.0s
===========================================================================
192.168.1.100 => 192.168.1.200 1.23Mb 1.45Mb 1.56Mb
<= 512Kb 678Kb 789Kb
192.168.1.100 => 192.168.1.201 512Kb 678Kb 789Kb
<= 256Kb 345Kb 456Kb
===========================================================================
Total send rate: 1.74Mb 2.13Mb 2.35Mb
Total receive rate: 768Kb 1.02Mb 1.25Mb
Total send and receive: 2.51Mb 3.15Mb 3.60Mb

# View a summary of network connections
# ss -s
Total: 1000
TCP: 500 (estab 300, closed 100, orphaned 50, timewait 50)

Transport Total IP IPv6
RAW 1 0 1
UDP 10 5 5
TCP 400 200 200
INET 411 205 206
FRAG 0 0 0

# View network errors
# netstat -i
Kernel Interface table
Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 123456 0 0 0 98765 0 0 0 BMRU
lo 65536 12345 0 0 0 12345 0 0 0 LRU

# View network latency
# ping -c 5 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.521 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=0.432 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=0.398 ms
64 bytes from 192.168.1.1: icmp_seq=4 ttl=64 time=0.456 ms
64 bytes from 192.168.1.1: icmp_seq=5 ttl=64 time=0.489 ms

--- 192.168.1.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4004ms
rtt min/avg/max/mdev = 0.398/0.459/0.521/0.045 ms
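
The packet-loss figure in the ping summary is the value most worth alerting on. A small sketch that extracts it follows, assuming the Linux iputils summary format ("N% packet loss"); `ping_loss_percent` is a hypothetical helper name.

```shell
#!/bin/bash
# Print the packet-loss percentage from ping's summary line.
# Expects `ping` output on stdin; prints nothing if no summary is found.
ping_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

It could be used as `ping -c 5 192.168.1.1 | ping_loss_percent`.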

7. Temperature Monitoring

Temperature monitoring ensures the server runs within a safe temperature range.

# View CPU temperature
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +45.0°C (high = +95.0°C, crit = +100.0°C)
Core 0: +44.0°C (high = +95.0°C, crit = +100.0°C)
Core 1: +45.0°C (high = +95.0°C, crit = +100.0°C)
Core 2: +43.0°C (high = +95.0°C, crit = +100.0°C)
Core 3: +44.0°C (high = +95.0°C, crit = +100.0°C)

# View system temperatures
# ipmitool sdr type temperature
CPU1 Temp | 45 degrees C | ok
CPU2 Temp | 43 degrees C | ok
System Temp | 28 degrees C | ok
Ambient Temp | 22 degrees C | ok

# View fan speeds
# ipmitool sdr type fan
Fan1 | 4560 RPM | ok
Fan2 | 4560 RPM | ok
Fan3 | 4560 RPM | ok
Fan4 | 4560 RPM | ok
Fan5 | 4560 RPM | ok
Fan6 | 4560 RPM | ok

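# Temperature alert script
# cat > /opt/scripts/temp_monitor.sh << 'EOF'
#!/bin/bash
TEMP_THRESHOLD=80
LOG_FILE="/var/log/temp_monitor.log"

# Field 4 of the "Package id 0" line looks like "+45.0°C"; keep the integer part
CPU_TEMP=$(sensors | grep "Package id 0" | awk '{print $4}' | cut -d'.' -f1)

if [ "$CPU_TEMP" -gt "$TEMP_THRESHOLD" ]; then
    echo "$(date): CPU temperature too high: ${CPU_TEMP}°C" >> "$LOG_FILE"
    # Send the alert; --data-urlencode encodes the spaces and the degree sign
    curl -s -G "http://alert.fgedu.net.cn/api/send" --data-urlencode "msg=CPU temperature too high: ${CPU_TEMP}°C"
fi
EOF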

# chmod +x /opt/scripts/temp_monitor.sh

# Add a cron job (runs every 5 minutes)
# crontab -e
*/5 * * * * /opt/scripts/temp_monitor.sh

8. Power Monitoring

Power monitoring ensures the server's power supply is stable.

# View power supply status
# ipmitool sdr type "Power Supply"
Power Supply 1 | 0x00 | ok
Power Supply 2 | 0x00 | ok

# View power consumption
# ipmitool dcmi power reading
Instantaneous power reading: 350 Watts
Minimum during sampling period: 320 Watts
Maximum during sampling period: 450 Watts
Average power reading over sample period: 340 Watts
IPMI timestamp: Fri Apr 3 10:00:00 2026
Sampling period: 1000000 ms.

# View power history
# ipmitool dcmi power statistics
Power Measurement : Active
Current Value : 350 Watts
Minimum Value : 320 Watts
Maximum Value : 450 Watts
Average Value : 340 Watts
Time Stamp : Fri Apr 3 10:00:00 2026
Reporting Period : 1000000 ms

# View UPS status (if a UPS is connected)
# upsc ups@192.168.1.10
battery.charge: 100
battery.runtime: 3600
battery.voltage: 48.0
device.type: ups
driver.name: nutdrv_qx
driver.version: 2.7.4
input.frequency: 50.0
input.voltage: 220.0
output.voltage: 220.0
ups.load: 25
ups.status: OL
ups.temperature: 25.0
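
In NUT, `ups.status` holds space-separated flags such as `OL` (on line power), `OB` (on battery) and `LB` (low battery). The sketch below reads a single variable and tests for battery operation; `ups_var` and `ups_on_battery` are hypothetical helper names.

```shell
#!/bin/bash
# Print the value of one variable from `upsc` output read on stdin.
# $1: variable name, e.g. ups.status
ups_var() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}

# Succeed (exit 0) when the given ups.status string contains the OB flag.
ups_on_battery() {
    case " $1 " in
        *" OB "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example: `status=$(upsc ups@192.168.1.10 | ups_var ups.status); ups_on_battery "$status" && echo "UPS is running on battery"`.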

# Power monitoring script
# cat > /opt/scripts/power_monitor.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/power_monitor.log"

# Field 4 of "Instantaneous power reading: 350 Watts" is the wattage
POWER=$(ipmitool dcmi power reading | grep "Instantaneous" | awk '{print $4}')
echo "$(date): current power draw: ${POWER}W" >> "$LOG_FILE"

if [ "$POWER" -gt 500 ]; then
    echo "$(date): power draw too high: ${POWER}W" >> "$LOG_FILE"
    curl -s -G "http://alert.fgedu.net.cn/api/send" --data-urlencode "msg=Server power draw too high: ${POWER}W"
fi
EOF

# chmod +x /opt/scripts/power_monitor.sh

9. Alert Configuration

Alert configuration ensures hardware problems are detected promptly.

# Configure email alerts
# cat > /opt/scripts/hardware_alert.sh << 'EOF'
#!/bin/bash
ALERT_EMAIL="admin@fgedu.net.cn"
LOG_FILE="/var/log/hardware_alert.log"

# Alert when the CPU package temperature exceeds 80°C
check_cpu_temp() {
    TEMP=$(sensors | grep "Package id 0" | awk '{print $4}' | cut -d'.' -f1)
    if [ "$TEMP" -gt 80 ]; then
        echo "$(date): CPU temperature alert: ${TEMP}°C" >> "$LOG_FILE"
        echo "CPU temperature too high: ${TEMP}°C" | mail -s "Hardware alert - CPU temperature" "$ALERT_EMAIL"
    fi
}

# Alert when memory usage (used/total) exceeds 90%
check_memory() {
    MEM_USAGE=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100}')
    if [ "$MEM_USAGE" -gt 90 ]; then
        echo "$(date): memory usage alert: ${MEM_USAGE}%" >> "$LOG_FILE"
        echo "Memory usage too high: ${MEM_USAGE}%" | mail -s "Hardware alert - memory" "$ALERT_EMAIL"
    fi
}

# Alert when the root filesystem is more than 85% full
check_disk() {
    DISK_USAGE=$(df -h / | tail -1 | awk '{print $5}' | cut -d'%' -f1)
    if [ "$DISK_USAGE" -gt 85 ]; then
        echo "$(date): disk usage alert: ${DISK_USAGE}%" >> "$LOG_FILE"
        echo "Disk usage too high: ${DISK_USAGE}%" | mail -s "Hardware alert - disk" "$ALERT_EMAIL"
    fi
}

check_cpu_temp
check_memory
check_disk
EOF

# chmod +x /opt/scripts/hardware_alert.sh

# Configure the cron job
# crontab -e
*/5 * * * * /opt/scripts/hardware_alert.sh

# Configure IPMI alerts: enable alerting on channel 1, then generate a test event
# ipmitool lan set 1 alert on
# ipmitool event 1

10. Health Check

A health check gives a full assessment of the server's hardware status.

# Hardware health check script
# cat > /opt/scripts/health_check.sh << 'EOF'
#!/bin/bash
echo "=========================================="
echo "Server Hardware Health Check Report"
echo "Check time: $(date)"
echo "=========================================="

echo ""
echo "1. CPU status check"
echo "------------------------------------------"
CPU_TEMP=$(sensors | grep "Package id 0" | awk '{print $4}')
echo "CPU temperature: $CPU_TEMP"
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
echo "CPU utilization: ${CPU_USAGE}%"
CPU_LOAD=$(awk '{print $1,$2,$3}' /proc/loadavg)
echo "CPU load: $CPU_LOAD"

echo ""
echo "2. Memory status check"
echo "------------------------------------------"
free -h
# Sum the corrected-error counters of all memory controllers
MEM_ERRORS=$(cat /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null | awk '{s += $1} END {print s + 0}')
echo "Memory ECC errors: $MEM_ERRORS"

echo ""
echo "3. Disk status check"
echo "------------------------------------------"
df -h
echo ""
echo "SMART status:"
smartctl -H /dev/sda | grep "SMART overall-health"

echo ""
echo "4. Network status check"
echo "------------------------------------------"
ip addr show eth0 | grep "inet "
ping -c 1 192.168.1.1 > /dev/null 2>&1 && echo "Network connectivity: OK" || echo "Network connectivity: FAILED"

echo ""
echo "5. Temperature status check"
echo "------------------------------------------"
ipmitool sdr type temperature

echo ""
echo "6. Fan status check"
echo "------------------------------------------"
ipmitool sdr type fan

echo ""
echo "7. Power status check"
echo "------------------------------------------"
ipmitool sdr type "Power Supply"
ipmitool dcmi power reading | grep "Instantaneous"

echo ""
echo "8. IPMI event log"
echo "------------------------------------------"
ipmitool sel list | tail -10

echo ""
echo "=========================================="
echo "Health check complete"
echo "=========================================="
EOF

# chmod +x /opt/scripts/health_check.sh

# Run the health check
# /opt/scripts/health_check.sh
==========================================
Server Hardware Health Check Report
Check time: Fri Apr 3 10:00:00 CST 2026
==========================================

1. CPU status check
------------------------------------------
CPU temperature: +45.0°C
CPU utilization: 5.0%
CPU load: 0.50 0.30 0.10

2. Memory status check
------------------------------------------
total used free shared buff/cache available
Mem: 128Gi 10Gi 117Gi 1.0Gi 1.0Gi 117Gi
Swap: 32Gi 0B 32Gi
Memory ECC errors: 0

3. Disk status check
------------------------------------------
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_data-lv_root 100G 10G 90G 10% /
/dev/sda1 1014M 150M 865M 15% /boot
/dev/mapper/vg_data-lv_data 500G 50G 450G 10% /data

SMART status:
SMART overall-health self-assessment test result: PASSED

4. Network status check
------------------------------------------
inet 192.168.1.100/24 brd 192.168.1.255 scope global eth0
Network connectivity: OK

5. Temperature status check
------------------------------------------
CPU1 Temp | 45 degrees C | ok
CPU2 Temp | 43 degrees C | ok
System Temp | 28 degrees C | ok

6. Fan status check
------------------------------------------
Fan1 | 4560 RPM | ok
Fan2 | 4560 RPM | ok
Fan3 | 4560 RPM | ok

7. Power status check
------------------------------------------
Power Supply 1 | 0x00 | ok
Power Supply 2 | 0x00 | ok
Instantaneous power reading: 350 Watts

8. IPMI event log
------------------------------------------
1 | 04/03/2026 | 10:00:00 | Temperature #0x1 | Upper Critical going high | Asserted
2 | 04/03/2026 | 10:05:00 | Temperature #0x1 | Upper Critical going high | Deasserted

==========================================
Health check complete
==========================================

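Fengge's recommendation for production environments: deploy a comprehensive hardware monitoring system, set sensible alert thresholds, check hardware health regularly, establish a hardware maintenance plan, and retain hardware event logs for failure analysis.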

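This article was compiled and published by Fengge Tutorials (风哥教程) and is intended for learning and testing only. When reposting, please cite the source: http://www.fgedu.net.cn/10327.html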
