Contents
1. Basic Concepts
1.1 What Is Monitoring
Monitoring is the real-time collection, analysis, and visualization of a system's runtime state, performance metrics, and events. It lets you detect problems early and keep the system running reliably.
1.2 What Is Alerting
Alerting notifies the responsible people, through channels such as email, SMS, or phone calls, when the system behaves abnormally, so that operators can find and fix problems promptly.
1.3 Monitoring Stack Components
- Prometheus: an open-source monitoring system that scrapes and stores metric data.
- Grafana: an open-source visualization tool for displaying monitoring data.
- AlertManager: handles alerts fired by Prometheus and routes them to different notification channels.
- Node Exporter: exposes host-level metrics.
- Blackbox Exporter: probes network endpoints.
1.4 Metric Categories
- System metrics: host-level metrics such as CPU, memory, disk, and network.
- Database metrics: performance metrics of the TiDB, TiKV, and PD components.
- Business metrics: application-level metrics such as QPS, response time, and error rate.
2. Planning Recommendations
2.1 Monitoring Architecture
- Centralized: all monitoring data is stored and managed in one place.
- Distributed: monitoring data is stored where it is produced and aggregated on demand.
- Hybrid: combines the strengths of the centralized and distributed approaches.
2.2 Metric Planning
- Core metrics: must-watch key indicators such as QPS, response time, and error rate.
- Secondary metrics: supporting indicators such as connection count and cache hit ratio.
- System metrics: host and operating-system indicators such as CPU, memory, and disk.
2.3 Alerting Strategy Planning
- Alert severity: grade alerts by impact, for example critical, major, warning, and info.
- Trigger conditions: choose thresholds and durations that avoid both false positives and missed alerts.
- Notification channels: match the channel to the severity, for example email, SMS, or phone.
- Handling process: define a response workflow so that every alert is acted on in time.
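The severity-to-channel mapping described above can be expressed directly in an Alertmanager route tree. A minimal sketch (the receiver names, addresses, and webhook URL below are illustrative placeholders, not values from any real deployment):

```yaml
route:
  receiver: 'email'            # default channel for everything
  group_by: ['alertname']
  routes:
    - match:
        severity: critical     # critical alerts go to the paging gateway instead
      receiver: 'oncall-webhook'

receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@example.com'
  - name: 'oncall-webhook'
    webhook_configs:
      - url: 'http://sms-gateway:8080/send'
```

Child routes match top-down and the first match wins, so anything not tagged `severity: critical` falls through to the default email receiver.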
2.4 Storage Planning
- Capacity: size the storage from the expected volume of monitoring data.
- Storage type: pick an appropriate medium, such as local disk, SSD, or cloud storage.
- Retention period: set a data retention window that matches business needs.
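Capacity and retention interact through Prometheus's documented sizing rule: needed disk ≈ retention time × ingested samples per second × bytes per sample. A back-of-the-envelope sketch (the ingestion rate and bytes-per-sample figures are illustrative assumptions, not measurements):

```shell
# Rough Prometheus disk sizing; all input figures are assumed examples.
retention_days=15
samples_per_sec=10000      # total ingestion rate across all scrape targets (assumed)
bytes_per_sample=2         # Prometheus compresses to roughly 1-2 bytes per sample
retention_sec=$(( retention_days * 24 * 3600 ))
needed_bytes=$(( retention_sec * samples_per_sec * bytes_per_sample ))
echo "estimated disk: $(( needed_bytes / 1024 / 1024 / 1024 )) GiB"   # prints: estimated disk: 24 GiB
```

In practice, measure the real ingestion rate with the `prometheus_tsdb_head_samples_appended_total` counter before committing to a capacity figure.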
3. Implementation
3.1 Deploying the Monitoring Stack
Deploying the monitoring components with TiUP
# Deploying a TiDB cluster with TiUP enables monitoring components by default
tiup cluster deploy tidb-cluster v7.5.0 topology.yaml --user root -p
# Check the status of the monitoring components
tiup cluster display tidb-cluster
Cluster type:    tidb
Cluster name:    tidb-cluster
Cluster version: v7.5.0
...
Monitor nodes:
192.168.1.10:9090  RUNNING
192.168.1.10:3000  RUNNING
192.168.1.10:9093  RUNNING
Deploying the monitoring stack manually
# Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar -xzf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64
./prometheus --config.file=prometheus.yml
# Install Grafana
wget https://dl.grafana.com/oss/release/grafana-9.5.10.linux-amd64.tar.gz
tar -xzf grafana-9.5.10.linux-amd64.tar.gz
cd grafana-9.5.10
./bin/grafana-server
# Install AlertManager
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar -xzf alertmanager-0.25.0.linux-amd64.tar.gz
cd alertmanager-0.25.0.linux-amd64
./alertmanager --config.file=alertmanager.yml
3.2 Configuring Monitoring Metrics
Configuring Prometheus
# Edit the Prometheus configuration file
cat > prometheus.yml << EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'tidb'
    static_configs:
      - targets: ['192.168.1.13:10080']
  - job_name: 'tikv'
    static_configs:
      - targets: ['192.168.1.14:20180', '192.168.1.15:20180', '192.168.1.16:20180']
  - job_name: 'pd'
    static_configs:
      - targets: ['192.168.1.10:2379', '192.168.1.11:2379', '192.168.1.12:2379']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100', '192.168.1.12:9100', '192.168.1.13:9100', '192.168.1.14:9100', '192.168.1.15:9100', '192.168.1.16:9100']
EOF
# Restart Prometheus
systemctl restart prometheus
Configuring Grafana
# Open Grafana
# URL: http://localhost:3000
# Default username: admin, password: admin
# Add the Prometheus data source
# Configuration > Data sources > Add data source > Prometheus
# URL: http://localhost:9090
# Import the TiDB dashboard
# Dashboards > Import
# Enter dashboard ID: 12906 (TiDB Overview)
# Select the Prometheus data source
# Import the other dashboards
# TiKV: 12907
# PD: 12908
# Node Exporter: 1860
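Instead of clicking through the UI, the data source can also be provisioned from a file, which is easier to version-control and survives rebuilds. A sketch of Grafana's datasource provisioning format (the file path and URL assume the single-host layout used in this document; adjust them to your environment):

```yaml
# conf/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://localhost:9090
    access: proxy
    isDefault: true
```

Grafana loads this directory at startup, so a restart picks up the data source without any manual steps.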
3.3 Configuring Alert Rules
Configuring AlertManager
# Edit the AlertManager configuration file
cat > alertmanager.yml << EOF
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
EOF
# Restart AlertManager
systemctl restart alertmanager
Configuring Prometheus alert rules
# Create the alert rules file (quote EOF so the shell does not expand {{ $labels }})
cat > rules/tidb-alerts.yml << 'EOF'
groups:
  - name: tidb-alerts
    rules:
      - alert: TiDBServerDown
        expr: up{job="tidb"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TiDB server down"
          description: "TiDB server {{ $labels.instance }} has been down for more than 5 minutes"
      - alert: TiKVServerDown
        expr: up{job="tikv"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TiKV server down"
          description: "TiKV server {{ $labels.instance }} has been down for more than 5 minutes"
      - alert: PDServerDown
        expr: up{job="pd"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PD server down"
          description: "PD server {{ $labels.instance }} has been down for more than 5 minutes"
      - alert: HighCPULoad
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load"
          description: "CPU load on {{ $labels.instance }} has been above 80% for more than 5 minutes"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage on {{ $labels.instance }} has been above 80% for more than 5 minutes"
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage"
          description: "Disk usage on {{ $labels.instance }} has been above 80% for more than 5 minutes"
EOF
# Restart Prometheus
systemctl restart prometheus
3.4 Querying Monitoring Metrics
Querying metrics with PromQL
# TiDB QPS
rate(tidb_server_query_total[5m])
# TiDB average query latency
avg(tidb_server_handle_query_duration_seconds_sum / tidb_server_handle_query_duration_seconds_count)
# TiKV CPU usage (CPU cores consumed per instance)
sum by (instance) (rate(tikv_thread_cpu_seconds_total[5m]))
# Number of PD leaders (should be 1)
count(pd_server_is_leader == 1)
# Node CPU usage (%)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Node memory usage (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Node disk usage (%)
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
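What these expressions compute can be checked by hand: `rate()` over a counter is simply the value delta divided by the time delta across the window, and the usage expressions are plain percentages. A quick sketch with made-up sample values (none of these numbers come from a real cluster):

```shell
# rate(): a query counter went from 1200 to 1800 over a 300 s window (assumed samples)
v1=1200; v2=1800; dt=300
echo "qps: $(( (v2 - v1) / dt ))"                       # prints: qps: 2

# memory usage %: (total - available) / total * 100, with assumed byte counts
total=$(( 16 * 1024 * 1024 * 1024 ))   # 16 GiB
avail=$((  4 * 1024 * 1024 * 1024 ))   #  4 GiB
echo "mem used: $(( (total - avail) * 100 / total ))%"  # prints: mem used: 75%
```

This is also a useful sanity check when an alert threshold fires unexpectedly: pull the raw samples from the Prometheus UI and redo the arithmetic.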
Querying metrics in Grafana
# Open Grafana
# URL: http://localhost:3000
# Create a dashboard
# Dashboards > New dashboard > Add panel
# Configure the query
# Data source: Prometheus
# Query: rate(tidb_server_query_total[5m])
# Configure the panel
# Type: Graph
# Title: TiDB QPS
# Unit: req/s
# Save the dashboard
# Click the save button and enter a dashboard name
3.5 Monitoring Automation
Deploying the monitoring stack with Ansible
# Create the Ansible playbook
cat > deploy-monitoring.yml << 'EOF'
---
- hosts: monitoring
  become: yes
  tasks:
    - name: Install Prometheus
      unarchive:
        src: https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
        dest: /opt
        remote_src: yes
    - name: Configure Prometheus
      copy:
        content: |
          global:
            scrape_interval: 15s
            evaluation_interval: 15s
          rule_files:
            - "rules/*.yml"
          alerting:
            alertmanagers:
              - static_configs:
                  - targets: ['localhost:9093']
          scrape_configs:
            - job_name: 'tidb'
              static_configs:
                - targets: ['192.168.1.13:10080']
            - job_name: 'tikv'
              static_configs:
                - targets: ['192.168.1.14:20180', '192.168.1.15:20180', '192.168.1.16:20180']
            - job_name: 'pd'
              static_configs:
                - targets: ['192.168.1.10:2379', '192.168.1.11:2379', '192.168.1.12:2379']
            - job_name: 'node'
              static_configs:
                - targets: ['192.168.1.10:9100', '192.168.1.11:9100', '192.168.1.12:9100', '192.168.1.13:9100', '192.168.1.14:9100', '192.168.1.15:9100', '192.168.1.16:9100']
        dest: /opt/prometheus-2.45.0.linux-amd64/prometheus.yml
    - name: Start Prometheus
      shell: |
        cd /opt/prometheus-2.45.0.linux-amd64
        nohup ./prometheus --config.file=prometheus.yml > prometheus.log 2>&1 &
    - name: Install Grafana
      unarchive:
        src: https://dl.grafana.com/oss/release/grafana-9.5.10.linux-amd64.tar.gz
        dest: /opt
        remote_src: yes
    - name: Start Grafana
      shell: |
        cd /opt/grafana-9.5.10
        nohup ./bin/grafana-server > grafana.log 2>&1 &
    - name: Install AlertManager
      unarchive:
        src: https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
        dest: /opt
        remote_src: yes
    - name: Configure AlertManager
      copy:
        content: |
          global:
            resolve_timeout: 5m
            smtp_smarthost: 'smtp.example.com:587'
            smtp_from: 'alertmanager@example.com'
            smtp_auth_username: 'alertmanager'
            smtp_auth_password: 'password'
          route:
            group_by: ['alertname']
            group_wait: 30s
            group_interval: 5m
            repeat_interval: 1h
            receiver: 'email'
          receivers:
            - name: 'email'
              email_configs:
                - to: 'admin@example.com'
                  send_resolved: true
          inhibit_rules:
            - source_match:
                severity: 'critical'
              target_match:
                severity: 'warning'
              equal: ['alertname', 'instance']
        dest: /opt/alertmanager-0.25.0.linux-amd64/alertmanager.yml
    - name: Start AlertManager
      shell: |
        cd /opt/alertmanager-0.25.0.linux-amd64
        nohup ./alertmanager --config.file=alertmanager.yml > alertmanager.log 2>&1 &
EOF
# Run the Ansible playbook
ansible-playbook -i hosts.ini deploy-monitoring.yml
4. Hands-On Cases
4.1 Deploying Monitoring in Production
Scenario: a company needs to deploy a monitoring system in production to keep its TiDB cluster running reliably.
Step 1: Deploy the monitoring components
# Monitoring is enabled when the TiDB cluster is deployed with TiUP
tiup cluster deploy tidb-prod v7.5.0 topology.yaml --user root -p
# Check cluster status
tiup cluster display tidb-prod
Cluster type:    tidb
Cluster name:    tidb-prod
Cluster version: v7.5.0
...
Monitor nodes:
192.168.1.10:9090  RUNNING
192.168.1.10:3000  RUNNING
192.168.1.10:9093  RUNNING
Step 2: Configure monitoring metrics
# Edit the cluster configuration (including Prometheus)
tiup cluster edit-config tidb-prod
# Add the monitoring targets in the configuration file
# Save the configuration and reload
tiup cluster reload tidb-prod -R prometheus
Step 3: Configure alert rules
# Create the alert rules file (quote EOF so the shell does not expand {{ $labels }})
cat > tidb-alerts.yml << 'EOF'
groups:
  - name: tidb-prod-alerts
    rules:
      - alert: TiDBServerDown
        expr: up{job="tidb"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TiDB server down"
          description: "TiDB server {{ $labels.instance }} has been down for more than 5 minutes"
      - alert: TiKVServerDown
        expr: up{job="tikv"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TiKV server down"
          description: "TiKV server {{ $labels.instance }} has been down for more than 5 minutes"
      - alert: PDServerDown
        expr: up{job="pd"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PD server down"
          description: "PD server {{ $labels.instance }} has been down for more than 5 minutes"
      - alert: HighCPULoad
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load"
          description: "CPU load on {{ $labels.instance }} has been above 80% for more than 5 minutes"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage on {{ $labels.instance }} has been above 80% for more than 5 minutes"
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage"
          description: "Disk usage on {{ $labels.instance }} has been above 80% for more than 5 minutes"
      - alert: SlowQueries
        expr: rate(tidb_server_slow_query_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries"
          description: "Slow query rate on {{ $labels.instance }} has been above 10/s for more than 5 minutes"
EOF
# Copy the rules file into the Prometheus configuration directory
cp tidb-alerts.yml /tidb-deploy/tidb-prod-monitor-9090/conf/rules/
# Reload the Prometheus configuration
tiup cluster reload tidb-prod -R prometheus
Step 4: Configure AlertManager
# Edit the AlertManager configuration file
cat > alertmanager.yml << EOF
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
  routes:
    # critical alerts go to the SMS gateway; everything else falls through to email
    - match:
        severity: 'critical'
      receiver: 'sms'

receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true
  - name: 'sms'
    webhook_configs:
      - url: 'http://sms-gateway:8080/send'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
EOF
# Copy the AlertManager configuration file into place
cp alertmanager.yml /tidb-deploy/tidb-prod-monitor-9093/conf/
# Reload the AlertManager configuration
tiup cluster reload tidb-prod -R alertmanager
Step 5: Verify the monitoring stack
# Open Prometheus
# URL: http://192.168.1.10:9090
# Open Grafana
# URL: http://192.168.1.10:3000
# Default username: admin, password: admin
# Open AlertManager
# URL: http://192.168.1.10:9093
# Test alerting
# Run the query up{job="tidb"} == 0 in Prometheus
# Verify that the corresponding alert fires
4.2 Handling a Monitoring Alert
Scenario: the monitoring system has fired an alert that needs to be handled promptly.
Step 1: Receive the alert
# Email alert
# Subject: [FIRING:1] TiDB server down
# Body: TiDB server 192.168.1.13:10080 has been down for more than 5 minutes
# SMS alert
# Text: [TiDB Alert] TiDB server 192.168.1.13:10080 is down
Step 2: Analyze the alert
# Log in to the TiDB server
ssh root@192.168.1.13
# Check the TiDB process status (TiUP names the systemd unit after the service port)
systemctl status tidb-4000
# Check the TiDB log
tail -n 100 /tidb-deploy/tidb-prod-tidb-4000/log/tidb.log
# Check system state
free -m
top
df -h
Step 3: Handle the alert
# Restart the TiDB service
systemctl restart tidb-4000
# Check the TiDB process status again
systemctl status tidb-4000
# Verify that TiDB is serving queries
mysql -h 192.168.1.13 -P 4000 -u root -p -e "SELECT tidb_version();"
# Confirm recovery
# In AlertManager, confirm the alert state has changed to resolved
5. Lessons Learned
5.1 Monitoring and Alerting Best Practices
- Monitor comprehensively: cover every component and key metric of the system
- Tune alert conditions: set thresholds and durations that avoid both false positives and missed alerts
- Notify over multiple channels: use several channels so alerts are delivered on time
- Check regularly: review the health of the monitoring stack itself on a schedule
- Automate: handle common alerts with automation tools
- Record and analyze: keep alert history and look for recurring patterns
- Keep optimizing: refine the monitoring and alerting strategy as conditions change
- Drill and test: rehearse the alert-handling process regularly to keep responses fast
5.2 Common Problems and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Monitoring data lost | Prometheus storage misconfigured | Adjust the Prometheus storage settings and add capacity |
| False alerts | Trigger conditions too sensitive | Raise thresholds or lengthen the `for:` duration |
| Missed alerts | Alert rules incomplete | Extend the rules to cover all key metrics |
| Monitoring stack slow | Prometheus misconfigured or data volume too large | Tune the Prometheus configuration, add resources, or use remote storage |
| Delayed notifications | Network issues or misconfigured channels | Check network connectivity and tune the notification channel configuration |
5.3 Monitoring and Alerting Checklist
| Check item | Requirement | Status |
|---|---|---|
| Monitoring deployed | Prometheus, Grafana, and AlertManager are deployed | □ |
| Metrics configured | All key metrics are being collected | □ |
| Alert rules configured | Sensible alert rules are in place | □ |
| Notifications configured | Multiple notification channels are set up | □ |
| Monitoring performance | The monitoring stack itself performs well | □ |
| Handling process | A complete alert-handling workflow is defined | □ |
| Regular checks | The monitoring stack is checked on a schedule | □ |
| Drills | The alert-handling process is rehearsed regularly | □ |
© 2024 TiDB Database Training Materials
Compiled and published by Fenge Tutorials for learning and testing only; credit the source when republishing: http://www.fgedu.net.cn/10327.html
