
IT Tutorial FG345: Server Hardware Failure Prediction

1. Failure Prediction Overview

Server hardware failure prediction monitors hardware operating state and performance metrics to detect potential failure risks in advance, enabling preventive maintenance. It reduces unplanned downtime, improves system availability, and lowers operations cost.

In FGedu's server operations we built a failure prediction system based on SMART data and machine learning, providing early warnings for disks, memory, CPUs, power supplies, and other key hardware.

1.1 Prediction Methods

Hardware failure prediction relies mainly on threshold alerting, trend analysis, rule matching, and machine learning.

# Hardware failure prediction methods
Method categories:

1. Threshold-based prediction
- Set thresholds on metrics
- Trigger an alert when a threshold is exceeded
- Simple and intuitive, but limited accuracy

2. Trend-based prediction
- Analyze how a metric changes over time
- Extrapolate its future state
- Suited to gradually developing failures

3. Rule-based prediction
- Define failure-pattern rules
- Trigger a warning when a rule matches
- Requires substantial expert knowledge

4. Machine-learning-based prediction
- Train a failure prediction model
- Automatically recognize failure patterns
- High accuracy, but needs large amounts of data
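The trend-based approach can be sketched in a few lines: fit a straight line to recent readings and extrapolate to the failure threshold. The sample temperatures and the 50 C limit below are illustrative values, not data from a real server.

```python
import numpy as np

def hours_until_threshold(values, threshold, interval_hours=1.0):
    """Fit a least-squares line to evenly spaced readings and estimate
    how many hours remain before the metric crosses `threshold`.
    Returns None when there is no upward trend."""
    t = np.arange(len(values)) * interval_hours
    slope, intercept = np.polyfit(t, values, 1)   # least-squares line
    if slope <= 0:
        return None                               # flat or improving: no forecast
    t_cross = (threshold - intercept) / slope     # where the fitted line meets the threshold
    return max(t_cross - t[-1], 0.0)

# Illustrative disk temperatures sampled hourly, drifting upward
temps = [40.0, 40.6, 41.1, 41.5, 42.2, 42.6]
print(round(hours_until_threshold(temps, threshold=50.0), 1))
```

In production the same fit can run over a sliding window pulled from the time-series database, alerting when the projected crossing time drops below the maintenance lead time.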

# Failure prediction metrics
Hardware   Key prediction metrics                          Data source
--------   ----------------------                          -----------
Disk       SMART attributes, IO error rate, temperature    smartctl, /sys
Memory     ECC errors, page faults, utilization            EDAC, /proc
CPU        Temperature, frequency, load, error counts      /proc, IPMI
PSU        Voltage, current, power, temperature            IPMI, BMC
Fan        Speed, noise, temperature                       IPMI, BMC
Network    Packet errors, drop rate, traffic               /proc/net, ethtool

# FGedu failure prediction architecture
Components:
- Data collection: smartctl, ipmitool, collectd
- Data storage: InfluxDB, Prometheus
- Prediction engine: Python + scikit-learn
- Alerting: Alertmanager + WeCom (WeChat Work) / email
- Visualization: Grafana

Data flow:
Hardware sensors -> data collection -> time-series database -> prediction model -> alert notification / visualization

2. Disk Failure Prediction

2.1 SMART Data Analysis

SMART (Self-Monitoring, Analysis and Reporting Technology) is a drive self-monitoring facility; analyzing SMART attributes makes it possible to predict disk failures.

# SMART data collection and analysis

# Install smartmontools
$ yum install -y smartmontools

# View a disk's SMART information
$ smartctl -a /dev/sda
smartctl 7.1 2020-04-05 r5079 [x86_64-linux-5.4.0-100-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Dell
Device Model: DELL PERC H730P Adp
Serial Number: 1234567890ABC
LU WWN Device Id: 5 123456 7890abcdef
Firmware Version: 4.30
User Capacity: 1,819,001,625,600 bytes [1.81 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
3 Spin_Up_Time 0x0007 100 100 020 Pre-fail Always - 0
4 Start_Stop_Count 0x0012 100 100 020 Old_age Always - 125
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 020 Pre-fail Always - 0
9 Power_On_Hours 0x0012 098 098 000 Old_age Always - 20156
10 Spin_Retry_Count 0x0013 100 100 020 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 125
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0012 100 100 000 Old_age Always - 0
188 Command_Timeout 0x000b 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x001a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 068 055 045 Old_age Always - 32 (Min/Max 28/36)
191 G-Sense_Error_Rate 0x001a 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 0
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 125
194 Temperature_Celsius 0x0022 032 040 000 Old_age Always - 32 (0 21 0 0 0)
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

# Key SMART attributes
ID   Attribute               Meaning                   Alert threshold
---  ---------               -------                   ---------------
5    Reallocated_Sector_Ct   Reallocated sectors       > 0 warn
187  Reported_Uncorrect      Uncorrectable errors      > 0 warn
188  Command_Timeout         Command timeouts          > 0 warn
197  Current_Pending_Sector  Sectors pending remap     > 0 warn
198  Offline_Uncorrectable   Offline uncorrectable     > 0 warn
199  UDMA_CRC_Error_Count    CRC error count           > 0 warn

# SMART data collection script
#!/bin/bash
# File: smart_collector.sh

# Enumerate all disks
DISKS=$(lsblk -d -o NAME | grep -E "^sd|^nvme|^vd")

for disk in $DISKS; do
    echo "=== /dev/$disk ==="

    # Overall health status
    health=$(smartctl -H /dev/$disk | grep "SMART overall-health" | awk '{print $NF}')
    echo "Health Status: $health"

    # Key attributes: warn on any non-zero raw value (column 10)
    smartctl -A /dev/$disk | awk '
    /Reallocated_Sector_Ct|Reported_Uncorrect|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count/ {
        attr_name=$2
        raw_value=$10
        if (raw_value > 0) {
            print "WARNING: " attr_name " = " raw_value
        }
    }'

    # Temperature (column 10 is the raw value)
    temp=$(smartctl -A /dev/$disk | grep "Temperature_Celsius" | awk '{print $10}')
    if [ -n "$temp" ] && [ "$temp" -gt 50 ]; then
        echo "WARNING: High temperature: ${temp}°C"
    fi

    echo ""
done

$ chmod +x smart_collector.sh
$ ./smart_collector.sh
=== /dev/sda ===
Health Status: PASSED

=== /dev/sdb ===
Health Status: PASSED
WARNING: Reallocated_Sector_Ct = 5
WARNING: Current_Pending_Sector = 12

2.2 Disk Failure Prediction Model

A machine learning model can predict disk failure with better accuracy than plain thresholds.

# Disk failure prediction model
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import joblib

# Load SMART data (toy example).
# In practice, load historical data from your database or files.
data = pd.DataFrame({
    'smart_5_raw': [0, 0, 5, 0, 10, 0, 2, 0],
    'smart_187_raw': [0, 0, 1, 0, 3, 0, 0, 0],
    'smart_188_raw': [0, 0, 0, 0, 2, 0, 0, 0],
    'smart_197_raw': [0, 0, 12, 0, 25, 0, 5, 0],
    'smart_198_raw': [0, 0, 8, 0, 15, 0, 3, 0],
    'smart_199_raw': [0, 0, 0, 0, 1, 0, 0, 0],
    'smart_194_raw': [32, 35, 45, 30, 55, 28, 42, 33],
    'smart_9_raw': [20156, 15234, 45678, 12345, 56789, 8765, 34567, 23456],
    'failure': [0, 0, 1, 0, 1, 0, 1, 0]
})

# Features and label
features = ['smart_5_raw', 'smart_187_raw', 'smart_188_raw',
            'smart_197_raw', 'smart_198_raw', 'smart_199_raw',
            'smart_194_raw', 'smart_9_raw']
X = data[features]
y = data['failure']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Classification report:")
print(classification_report(y_test, y_pred))

print("\nConfusion matrix:")
print(confusion_matrix(y_test, y_pred))

# Output
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2

Confusion matrix:
[[1 0]
 [0 1]]

# Feature importance
feature_importance = pd.DataFrame({
    'feature': features,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature importance:")
print(feature_importance)

# Output
Feature importance:
         feature  importance
3  smart_197_raw        0.35
4  smart_198_raw        0.25
0    smart_5_raw        0.15
1  smart_187_raw        0.10
6  smart_194_raw        0.08
7    smart_9_raw        0.05
2  smart_188_raw        0.01
5  smart_199_raw        0.01

# Save the model
joblib.dump(model, 'disk_failure_model.pkl')

# Prediction function
def predict_disk_failure(smart_data):
    """Return the failure probability for one disk's SMART values."""
    model = joblib.load('disk_failure_model.pkl')
    prediction = model.predict_proba([smart_data])
    return {
        'failure_probability': prediction[0][1],
        'prediction': 'FAIL' if prediction[0][1] > 0.5 else 'PASS'
    }

# Predict for a new disk
new_disk_data = [0, 0, 0, 15, 10, 0, 45, 30000]
result = predict_disk_failure(new_disk_data)
print(f"\nPrediction: {result}")

# Output
Prediction: {'failure_probability': 0.85, 'prediction': 'FAIL'}
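Rather than the single 0.5 cutoff used above, production systems often map the predicted probability to tiered alert levels. The bands below are illustrative placeholders that should be calibrated against your own failure history:

```python
def failure_alert_level(probability):
    """Map a predicted failure probability to an alert tier.
    The band boundaries are illustrative, not calibrated values."""
    if probability >= 0.8:
        return 'CRITICAL'  # replace the disk immediately
    if probability >= 0.5:
        return 'WARNING'   # replace at the next maintenance window
    if probability >= 0.2:
        return 'WATCH'     # shorten the SMART sampling interval
    return 'OK'

print(failure_alert_level(0.85))  # CRITICAL
```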

3. Memory Failure Prediction

3.1 ECC Error Monitoring

Memory failures can be predicted by monitoring ECC (Error Correcting Code) error counts.

# Memory ECC error monitoring

# Check for EDAC support
$ ls /sys/devices/system/edac/
mc pci

# Memory controller size (the sysfs attribute is size_mb and prints MB)
$ cat /sys/devices/system/edac/mc/mc0/size_mb
65536

# Corrected (CE) and uncorrected (UE) ECC error counts
$ cat /sys/devices/system/edac/mc/mc0/ce_count
0

$ cat /sys/devices/system/edac/mc/mc0/ue_count
0

# Per-DIMM error counts
$ for dimm in /sys/devices/system/edac/mc/mc0/dimm*; do
    echo "=== $(basename $dimm) ==="
    cat $dimm/dimm_label
    cat $dimm/dimm_location
    cat $dimm/dimm_ce_count
    cat $dimm/dimm_ue_count
done

=== dimm0 ===
CPU0_DIMM_A0
CPU0
0
0

=== dimm1 ===
CPU0_DIMM_A1
CPU0
0
0

# Use the edac-util tool
$ yum install -y edac-utils
$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU0_DIMM_A0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU0_DIMM_A1: 0 Corrected Errors

# Memory error log analysis
$ grep -i "memory error\|ECC error\|corrected error" /var/log/messages
Apr 3 10:00:00 fgedu-server01 kernel: [12345.678901] EDAC MC0: 1 CE memory read error on CPU0_DIMM_A0 (mc:0 row:0 channel:0)

# Memory error monitoring script
#!/bin/bash
# File: memory_error_monitor.sh

LOG_FILE="/var/log/memory_errors.log"
ALERT_THRESHOLD=10

# Read the current ECC error counters
get_ecc_errors() {
    ce_count=$(cat /sys/devices/system/edac/mc/mc0/ce_count 2>/dev/null || echo 0)
    ue_count=$(cat /sys/devices/system/edac/mc/mc0/ue_count 2>/dev/null || echo 0)
    echo "$ce_count $ue_count"
}

# Read the counters recorded on the previous run
get_previous_count() {
    if [ -f "$LOG_FILE" ]; then
        tail -1 "$LOG_FILE" | awk '{print $3, $4}'
    else
        echo "0 0"
    fi
}

# Main monitoring logic
current=$(get_ecc_errors)
previous=$(get_previous_count)

ce_current=$(echo $current | awk '{print $1}')
ue_current=$(echo $current | awk '{print $2}')
ce_previous=$(echo $previous | awk '{print $1}')
ue_previous=$(echo $previous | awk '{print $2}')

# New errors since the last run
ce_new=$((ce_current - ce_previous))
ue_new=$((ue_current - ue_previous))

# Append to the log
echo "$(date '+%Y-%m-%d %H:%M:%S') $ce_current $ue_current $ce_new $ue_new" >> $LOG_FILE

# Alerting
if [ $ce_new -gt $ALERT_THRESHOLD ]; then
    echo "WARNING: High memory CE errors detected: $ce_new new errors" | mail -s "Memory Error Alert" ops@fgedu.net.cn
fi

if [ $ue_new -gt 0 ]; then
    echo "CRITICAL: Memory UE errors detected: $ue_new new errors" | mail -s "Memory Critical Alert" ops@fgedu.net.cn
fi

# Schedule via cron
$ crontab -l
*/5 * * * * /opt/scripts/memory_error_monitor.sh

# Run the monitor (output goes to the log file, not stdout)
$ ./memory_error_monitor.sh
$ tail -1 /var/log/memory_errors.log
2026-04-03 10:00:00 0 0 0 0
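Beyond raw counts, the rate at which corrected errors accumulate is a useful predictor: a DIMM that logs CEs faster and faster is often on its way to an uncorrectable error. A minimal sketch over the cumulative counters written by memory_error_monitor.sh (the 2-errors-per-hour threshold is an illustrative assumption):

```python
def ce_rate_alert(ce_counts, interval_minutes=5, rate_threshold=2.0):
    """Given successive cumulative CE counts (one sample per cron run),
    return True when the corrected-error rate per hour exceeds
    `rate_threshold`. The threshold is illustrative, not a standard value."""
    if len(ce_counts) < 2:
        return False
    new_errors = ce_counts[-1] - ce_counts[0]
    hours = (len(ce_counts) - 1) * interval_minutes / 60.0
    return new_errors / hours > rate_threshold

# Illustrative cumulative counts over 30 minutes: accelerating CEs
print(ce_rate_alert([0, 0, 1, 3, 6, 10, 15]))  # True
```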

4. CPU Failure Prediction

4.1 CPU Temperature and Performance Monitoring

Monitor CPU temperature, frequency, and performance metrics to predict CPU failures.

# CPU failure prediction monitoring

# 1. CPU temperature via IPMI
$ ipmitool sensor list | grep -i temp
CPU1 Temp | 45.000 | degrees C | ok | 0.000 | 0.000 | 0.000 | 95.000 | 100.000 | 105.000
CPU2 Temp | 42.000 | degrees C | ok | 0.000 | 0.000 | 0.000 | 95.000 | 100.000 | 105.000
System Temp | 35.000 | degrees C | ok | 0.000 | 0.000 | 0.000 | 85.000 | 90.000 | 95.000

# 2. CPU frequency
$ cat /proc/cpuinfo | grep "cpu MHz"
cpu MHz : 2100.000
cpu MHz : 2100.000
cpu MHz : 2100.000
cpu MHz : 2100.000

# 3. CPU performance counters
$ perf stat -a sleep 5

Performance counter stats for 'system wide':

12567.89 msec cpu-clock # 1.000 CPUs utilized
12345 context-switches # 0.982 K/sec
567 cpu-migrations # 45.110 /sec
12345 page-faults # 0.982 K/sec
123456789012 cycles # 0.098 GHz
123456789012 instructions # 1.00 insn per cycle

5.006789012 seconds time elapsed

# 4. Check for MCE (Machine Check Exception) events
# Note: there is no /proc/mce; MCE events are reported via the kernel log
$ dmesg | grep -i "machine check"

# 5. The mcelog tool
$ yum install -y mcelog
$ systemctl start mcelog
$ systemctl enable mcelog

# View the MCE log
$ cat /var/log/mcelog
MCE 0 at 2026-04-03 10:00:00
CPU 0 BANK 0
STATUS 0x0000000000000000
MCGSTATUS 0x0000000000000000

# CPU failure prediction script
#!/bin/bash
# File: cpu_health_check.sh

echo "=== CPU health check ==="

# 1. CPU temperature vs. its upper non-critical threshold
# (with -F'|' the reading is field 2, the upper non-critical limit field 8)
echo -e "\n1. CPU temperature:"
ipmitool sensor list | grep -i "cpu.*temp" | while read line; do
    temp=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    threshold=$(echo "$line" | awk -F'|' '{print $8}' | tr -d ' ')
    if [ $(echo "$temp > $threshold" | bc) -eq 1 ]; then
        echo "WARNING: $line"
    else
        echo "OK: $line"
    fi
done

# 2. CPU frequency throttling
# (base frequency from the model name is in GHz, current frequency in MHz)
echo -e "\n2. CPU frequency:"
base_freq=$(grep "model name" /proc/cpuinfo | head -1 | grep -oP '@ \K[0-9.]+')
current_freq=$(grep "cpu MHz" /proc/cpuinfo | head -1 | awk '{print $4}')
if [ $(echo "$current_freq < $base_freq * 1000 * 0.8" | bc) -eq 1 ]; then
    echo "WARNING: CPU throttled, current: ${current_freq}MHz base: ${base_freq}GHz"
else
    echo "OK: CPU frequency normal ${current_freq}MHz"
fi

# 3. MCE errors (mcelog records start with "MCE ")
echo -e "\n3. MCE errors:"
mce_count=$(grep -c "^MCE " /var/log/mcelog 2>/dev/null)
mce_count=${mce_count:-0}
if [ $mce_count -gt 0 ]; then
    echo "WARNING: $mce_count MCE errors found"
    tail -20 /var/log/mcelog
else
    echo "OK: no MCE errors"
fi

# 4. CPU load (strip the trailing comma from the load average)
echo -e "\n4. CPU load:"
load=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | tr -d ',')
cpu_cores=$(nproc)
if [ $(echo "$load > $cpu_cores * 2" | bc) -eq 1 ]; then
    echo "WARNING: CPU load too high: $load"
else
    echo "OK: CPU load normal: $load"
fi

$ chmod +x cpu_health_check.sh
$ ./cpu_health_check.sh
=== CPU health check ===

1. CPU temperature:
OK: CPU1 Temp | 45.000 | degrees C | ok | 0.000 | 0.000 | 0.000 | 95.000 | 100.000 | 105.000
OK: CPU2 Temp | 42.000 | degrees C | ok | 0.000 | 0.000 | 0.000 | 95.000 | 100.000 | 105.000

2. CPU frequency:
OK: CPU frequency normal 2100.000MHz

3. MCE errors:
OK: no MCE errors

4. CPU load:
OK: CPU load normal: 0.50

5. Power Supply Failure Prediction

5.1 Power Supply Status Monitoring

Monitor power supply state via IPMI to predict PSU failures.

# Power supply failure prediction monitoring

# 1. Chassis power status
$ ipmitool power status
Chassis Power is on

# 2. PSU sensors
$ ipmitool sensor list | grep -i power
Power Supply 1 | 0x0 | discrete | 0x0180| na | na | na | na | na | na
Power Supply 2 | 0x0 | discrete | 0x0180| na | na | na | na | na | na
Power Read | 245.000 | Watts | ok | na | na | na | na | na | na

# 3. PSU FRU details
$ ipmitool fru print | grep -A 20 "Power Supply"
Board Mfg : Dell Inc.
Board Product : Power Supply
Board Serial : CN-1234567890
Board Part Number : 0PXD5T

# 4. Voltages
$ ipmitool sensor list | grep -i volt
12V | 12.050 | Volts | ok | 11.400 | 11.600 | 12.600 | 12.800 | na | na
5V | 5.050 | Volts | ok | 4.750 | 4.850 | 5.150 | 5.250 | na | na
3.3V | 3.320 | Volts | ok | 3.135 | 3.200 | 3.400 | 3.465 | na | na

# 5. PSU health check script
#!/bin/bash
# File: power_health_check.sh

echo "=== PSU health check ==="

# PSU presence/status
echo -e "\n1. PSU status:"
power_count=$(ipmitool sensor list | grep -c "Power Supply")
power_ok=$(ipmitool sensor list | grep "Power Supply" | grep -c "0x0180")

echo "Total PSUs: $power_count"
echo "Healthy PSUs: $power_ok"

if [ $power_ok -lt $power_count ]; then
    echo "WARNING: PSU failure detected"
    ipmitool sensor list | grep "Power Supply"
fi

# Power draw
echo -e "\n2. Power draw:"
power_watts=$(ipmitool sensor list | grep "Power Read" | awk -F'|' '{print $2}' | tr -d ' ')
echo "Current draw: ${power_watts}W"

# Voltages
echo -e "\n3. Voltages:"
ipmitool sensor list | grep -i volt | while read line; do
    voltage=$(echo "$line" | awk -F'|' '{print $2}' | tr -d ' ')
    name=$(echo "$line" | awk -F'|' '{print $1}' | tr -d ' ')
    status=$(echo "$line" | awk -F'|' '{print $4}' | tr -d ' ')

    if [ "$status" != "ok" ]; then
        echo "WARNING: $name = ${voltage}V (status: $status)"
    else
        echo "OK: $name = ${voltage}V"
    fi
done

# CMOS battery (if a sensor exists)
echo -e "\n4. CMOS battery:"
battery_status=$(ipmitool sensor list | grep -i "battery" || echo "no battery sensor found")
echo "$battery_status"

$ chmod +x power_health_check.sh
$ ./power_health_check.sh
=== PSU health check ===

1. PSU status:
Total PSUs: 2
Healthy PSUs: 2

2. Power draw:
Current draw: 245.000W

3. Voltages:
OK: 12V = 12.050V
OK: 5V = 5.050V
OK: 3.3V = 3.320V

4. CMOS battery:
no battery sensor found

6. Building the Failure Prediction System

6.1 Integrated Failure Prediction Platform

Build a unified hardware failure prediction platform for automated early warning.

# Failure prediction system architecture
System components:
1. Data collection layer
- smartctl: disk SMART data
- ipmitool: BMC sensor data
- collectd: system performance data
- custom scripts: specialized collection

2. Data storage layer
- InfluxDB: time-series data
- Prometheus: metrics
- MySQL: metadata

3. Analysis and prediction layer
- Rule engine: threshold checks
- ML models: machine-learning prediction
- Trend analysis: forecasting from history

4. Alert notification layer
- Alertmanager: alert management
- WeCom (WeChat Work): instant notification
- Email: detailed reports
- SMS: urgent alerts
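The rule-engine component of the analysis layer can be as simple as a table of metric limits walked for every sample. A minimal sketch (the metric names and limits below are illustrative assumptions, not values from the FGedu platform):

```python
# (metric, comparator, limit, alert level) - illustrative rules
RULES = [
    ('disk.smart_197',  '>',  0,  'WARNING'),   # pending sectors
    ('memory.ue_count', '>',  0,  'CRITICAL'),  # uncorrectable ECC errors
    ('cpu.temp',        '>=', 95, 'WARNING'),   # upper non-critical temperature
    ('power.psu_ok',    '<',  2,  'CRITICAL'),  # expect two healthy PSUs
]

def evaluate(metrics):
    """Return (level, detail) pairs for every rule the sample violates."""
    ops = {'>': lambda a, b: a > b,
           '>=': lambda a, b: a >= b,
           '<': lambda a, b: a < b}
    alerts = []
    for metric, op, limit, level in RULES:
        value = metrics.get(metric)
        if value is not None and ops[op](value, limit):
            alerts.append((level, f'{metric}={value} (limit {op} {limit})'))
    return alerts

sample = {'disk.smart_197': 12, 'memory.ue_count': 0,
          'cpu.temp': 45, 'power.psu_ok': 2}
print(evaluate(sample))  # [('WARNING', 'disk.smart_197=12 (limit > 0)')]
```

Each collection cycle can feed the latest sample from the time-series database into evaluate() and forward any hits to the alert webhook.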

# Combined monitoring script
#!/bin/bash
# File: hardware_prediction.sh

# Configuration
INFLUXDB_URL="http://influxdb:8086"
DATABASE="hardware_monitor"
ALERT_WEBHOOK="https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx"

# Collection functions
collect_disk_data() {
    for disk in $(lsblk -d -o NAME | grep -E "^sd|^nvme"); do
        smart_data=$(smartctl -A /dev/$disk 2>/dev/null)

        # Extract key metrics (default to 0 when an attribute is absent,
        # so the line-protocol write stays well formed)
        smart_5=$(echo "$smart_data" | grep "Reallocated_Sector_Ct" | awk '{print $10}')
        smart_197=$(echo "$smart_data" | grep "Current_Pending_Sector" | awk '{print $10}')
        temp=$(echo "$smart_data" | grep "Temperature_Celsius" | awk '{print $10}')

        # Write to InfluxDB
        curl -s -XPOST "$INFLUXDB_URL/write?db=$DATABASE" \
            --data-binary "disk_health,host=$(hostname),disk=$disk smart_5=${smart_5:-0},smart_197=${smart_197:-0},temp=${temp:-0}"
    done
}

collect_memory_data() {
    ce_count=$(cat /sys/devices/system/edac/mc/mc0/ce_count 2>/dev/null || echo 0)
    ue_count=$(cat /sys/devices/system/edac/mc/mc0/ue_count 2>/dev/null || echo 0)

    curl -s -XPOST "$INFLUXDB_URL/write?db=$DATABASE" \
        --data-binary "memory_health,host=$(hostname) ce_count=$ce_count,ue_count=$ue_count"
}

collect_cpu_data() {
    cpu_temp=$(ipmitool sensor list | grep "CPU1 Temp" | awk -F'|' '{print $2}' | tr -d ' ')
    cpu_load=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | tr -d ',')

    curl -s -XPOST "$INFLUXDB_URL/write?db=$DATABASE" \
        --data-binary "cpu_health,host=$(hostname) temp=$cpu_temp,load=$cpu_load"
}

collect_power_data() {
    power_watts=$(ipmitool sensor list | grep "Power Read" | awk -F'|' '{print $2}' | tr -d ' ')
    power_status=$(ipmitool sensor list | grep "Power Supply" | grep -c "0x0180")

    curl -s -XPOST "$INFLUXDB_URL/write?db=$DATABASE" \
        --data-binary "power_health,host=$(hostname) watts=$power_watts,status=$power_status"
}

# Alert function (WeCom webhook)
send_alert() {
    local level=$1
    local message=$2

    curl -s -XPOST "$ALERT_WEBHOOK" \
        -H 'Content-Type: application/json' \
        -d "{
            \"msgtype\": \"markdown\",
            \"markdown\": {
                \"content\": \"## Hardware failure warning\n\n**Level**: $level\n\n**Detail**: $message\n\n**Host**: $(hostname)\n\n**Time**: $(date '+%Y-%m-%d %H:%M:%S')\"
            }
        }"
}

# Main
main() {
    echo "$(date): collecting hardware data..."

    collect_disk_data
    collect_memory_data
    collect_cpu_data
    collect_power_data

    echo "$(date): collection finished"
}

main

# Schedule via cron
$ crontab -l
*/5 * * * * /opt/scripts/hardware_prediction.sh >> /var/log/hardware_prediction.log 2>&1

# Run
$ ./hardware_prediction.sh
Fri Apr 3 10:00:00 CST 2026: collecting hardware data...
Fri Apr 3 10:00:05 CST 2026: collection finished

Summary

Server hardware failure prediction is an important lever for system availability: by monitoring hardware state and performance metrics you can detect impending failures early and maintain preventively. This tutorial covered failure prediction methods for disks, memory, CPUs, and power supplies, along with a blueprint for building a prediction system.

In practice, build out a complete hardware monitoring stack and combine a rule engine with machine-learning models to achieve accurate predictions and timely alerting.

Tip: data accumulation matters most for failure prediction; only with enough historical data can you train an accurate model.

Compiled and published by the Fengge tutorial site for learning and testing purposes only; please credit the source when reposting: http://www.fgedu.net.cn/10327.html
