fgedu.net.cn
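1. Basic Concepts
1.1 What monitoring means
Monitoring is the process of observing and recording, in real time, the running state, performance metrics, and health of a TiDB cluster. It lets operators spot abnormal cluster behavior early and keep the system running stably.
1.2 What alerting means
Alerting is the mechanism by which the system automatically sends a notification when a monitored metric reaches a preset threshold. It helps operators find and handle potential problems promptly, before they turn into outages.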
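The threshold idea can be made concrete with a small sketch. The class below is a hypothetical helper (not part of Prometheus or TiDB) that models a threshold alert with a hold time, in the spirit of the `for:` clause used in Prometheus alerting rules: a breach is first `pending` and only becomes `firing` once it has persisted long enough. The 0.8 threshold and 300-second hold are illustrative values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Hypothetical helper: fires once a value has stayed above `threshold` for `hold_seconds`."""

    threshold: float
    hold_seconds: float
    breach_started: Optional[float] = None  # timestamp of the first sample above the threshold

    def observe(self, timestamp: float, value: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one new sample."""
        if value <= self.threshold:
            self.breach_started = None  # condition cleared, so the alert resolves
            return "inactive"
        if self.breach_started is None:
            self.breach_started = timestamp
        if timestamp - self.breach_started >= self.hold_seconds:
            return "firing"
        return "pending"


alert = ThresholdAlert(threshold=0.8, hold_seconds=300)
samples = [(0, 0.90), (150, 0.95), (300, 0.85), (315, 0.50)]
states = [alert.observe(t, v) for t, v in samples]
print(states)
```

The hold time is what keeps a short spike from paging anyone: only the third sample, taken 300 seconds into the breach, reaches `firing`.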
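1.3 Monitoring components
- Prometheus: an open-source monitoring system that collects and stores metrics
- Grafana: an open-source visualization tool that displays monitoring data
- AlertManager: an alert management tool that processes and sends alerts
- TiDB Dashboard: the official TiDB visual management tool
- Node Exporter: collects host-level metrics
- Blackbox Exporter: probes whether network services are reachable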
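2. Planning Recommendations
2.1 Monitoring architecture
- Monitoring layers: host level, cluster level, application level
- Metrics: system metrics, TiDB metrics, TiKV metrics, PD metrics
- Storage: where monitoring data is stored and how long it is retained
- High availability: a highly available design for the monitoring system itself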
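For the storage item, a rough capacity estimate helps when sizing the Prometheus data disk. The sketch below applies the sizing rule from the Prometheus documentation (retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The function name, the assumed 1500 series per host, and the 30-day retention are illustrative assumptions, not measurements from a real cluster.

```python
def prometheus_disk_bytes(
    retention_days: float,
    samples_per_second: float,
    bytes_per_sample: float = 2.0,  # upper end of the 1-2 bytes/sample rule of thumb
) -> float:
    """Estimate the on-disk size of Prometheus data for a given retention period."""
    retention_seconds = retention_days * 86400
    return retention_seconds * samples_per_second * bytes_per_sample


# Illustrative numbers only: 9 hosts, about 1500 series each (assumed), scraped every 15 s.
series = 9 * 1500
samples_per_second = series / 15
needed = prometheus_disk_bytes(retention_days=30, samples_per_second=samples_per_second)
print(f"{needed / 1024**3:.1f} GiB")
```

Real series counts vary widely between components, so it is worth replacing the assumed figure with the value Prometheus reports for your own cluster and leaving headroom on top of the result.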
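2.2 Alerting strategy
- Alert levels: emergency, critical, warning, info
- Alert thresholds: set thresholds that fit the business requirements
- Alert channels: email, SMS, phone call, instant messaging
- Alert inhibition: prevent alert storms
- Alert recovery: resolve the alert automatically once the problem is fixed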
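Inhibition is the least intuitive item in this list, so here is a minimal sketch of the idea: a warning is suppressed while a critical alert with the same values for a chosen set of labels is firing. This is a simplified model, not AlertManager's implementation; the alert names and the choice of `instance` as the shared label are hypothetical.

```python
from typing import Dict, List

Alert = Dict[str, str]  # an alert is modelled as its label set


def is_inhibited(target: Alert, firing: List[Alert], equal: List[str]) -> bool:
    """True if a firing critical alert shares every `equal` label with a warning `target`."""
    if target.get("severity") != "warning":
        return False
    for source in firing:
        if source.get("severity") != "critical":
            continue
        # A missing label is treated like an empty value on both sides.
        if all(source.get(key, "") == target.get(key, "") for key in equal):
            return True
    return False


# Hypothetical alerts: the host is down, so its disk warning adds no information.
firing = [{"alertname": "NodeDown", "instance": "192.168.1.15:9100", "severity": "critical"}]
same_host = {"alertname": "HighDiskUsage", "instance": "192.168.1.15:9100", "severity": "warning"}
other_host = {"alertname": "HighDiskUsage", "instance": "192.168.1.16:9100", "severity": "warning"}

print(is_inhibited(same_host, firing, equal=["instance"]))
print(is_inhibited(other_host, firing, equal=["instance"]))
```

The choice of `equal` labels decides how wide the suppression is: fewer labels suppress more alerts, so it is usually safer to start narrow.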
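2.3 Metrics to monitor
- System metrics: CPU, memory, disk, network
- TiDB metrics: QPS, response time, connection count, slow queries
- TiKV metrics: operation latency, disk I/O, memory usage
- PD metrics: cluster state, scheduling activity
- Application metrics: business QPS, response time, error rate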
3. Implementation
3.1 Deploying the monitoring system
Deploy the monitoring components with TiUP
# Generate a topology file from the template
tiup cluster template > monitor-topology.yaml
# Edit the topology file
# and add the monitoring node entries
# Deploy the monitoring cluster
tiup cluster deploy tidb-monitor v7.5.0 monitor-topology.yaml --user root -p
# Start the monitoring cluster
tiup cluster start tidb-monitor
# Check the status of the monitoring cluster
tiup cluster display tidb-monitor
Cluster type:    monitor
Cluster name:    tidb-monitor
Cluster version: v7.5.0
ID                 Role           Host          Ports  Status  Data Dir                 Deploy Dir
--                 ----           ----          -----  ------  --------                 ----------
192.168.1.18:9090  prometheus     192.168.1.18  9090   Up      /tidb/data/prometheus    /tidb/deploy/prometheus-9090
192.168.1.18:3000  grafana        192.168.1.18  3000   Up      -                        /tidb/deploy/grafana-3000
192.168.1.18:9093  alertmanager   192.168.1.18  9093   Up      /tidb/data/alertmanager  /tidb/deploy/alertmanager-9093
192.168.1.18:9100  node_exporter  192.168.1.18  9100   Up      -                        /tidb/deploy/node_exporter-9100
3.2 Monitoring configuration
Prometheus configuration
# View the Prometheus configuration
cat /tidb/deploy/prometheus-9090/conf/prometheus.yml
# Example configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'tidb'
    static_configs:
      - targets: ['192.168.1.13:10080', '192.168.1.14:10080']
  - job_name: 'tikv'
    static_configs:
      - targets: ['192.168.1.15:20180', '192.168.1.16:20180', '192.168.1.17:20180']
  - job_name: 'pd'
    static_configs:
      - targets: ['192.168.1.10:2379', '192.168.1.11:2379', '192.168.1.12:2379']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100', '192.168.1.12:9100', '192.168.1.13:9100', '192.168.1.14:9100', '192.168.1.15:9100', '192.168.1.16:9100', '192.168.1.17:9100', '192.168.1.18:9100']
# Restart Prometheus
tiup cluster restart tidb-monitor --node 192.168.1.18:9090
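After a restart it is worth confirming that every scrape target is actually being collected. The sketch below reads the JSON returned by the Prometheus HTTP API endpoint `/api/v1/targets` and lists the targets whose health is not `up`. The helper names are hypothetical, and the `sample` payload is a hand-written stand-in that contains only the two fields the code uses, so the example runs without a live server.

```python
import json
import urllib.request
from typing import Any, Dict, List


def fetch_targets(base_url: str) -> Dict[str, Any]:
    """Fetch the target list from a running Prometheus, e.g. http://192.168.1.18:9090."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload: Dict[str, Any]) -> List[str]:
    """Return the scrape URLs of all active targets that are not healthy."""
    return sorted(
        target["scrapeUrl"]
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    )


# Hand-written sample response (trimmed to the fields used above).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://192.168.1.13:10080/metrics", "health": "up"},
            {"scrapeUrl": "http://192.168.1.15:20180/metrics", "health": "down"},
        ]
    },
}
print(unhealthy_targets(sample))
```

Against a real deployment, `unhealthy_targets(fetch_targets("http://192.168.1.18:9090"))` would return an empty list when all jobs are being scraped.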
Alert rule configuration
# View the alert rules
cat /tidb/deploy/prometheus-9090/conf/rules/tidb.rules.yml
# Example configuration
groups:
  - name: tidb-alerts
    rules:
      - alert: TiDBServerDown
        expr: up{job="tidb"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TiDB Server Down"
          description: "TiDB server {{ $labels.instance }} has been down for more than 5 minutes."
      - alert: TiKVServerDown
        expr: up{job="tikv"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TiKV Server Down"
          description: "TiKV server {{ $labels.instance }} has been down for more than 5 minutes."
      - alert: PDServerDown
        expr: up{job="pd"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PD Server Down"
          description: "PD server {{ $labels.instance }} has been down for more than 5 minutes."
      - alert: HighCPUUsage
        # Idle fraction below 0.2 means CPU usage above 80%
        expr: avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU Usage"
          description: "CPU usage on {{ $labels.instance }} is above 80% for more than 5 minutes."
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Memory Usage"
          description: "Memory usage on {{ $labels.instance }} is above 80% for more than 5 minutes."
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Disk Usage"
          description: "Disk usage on {{ $labels.instance }} is above 80% for more than 5 minutes."
      - alert: SlowQueries
        # rate() is per second; multiply by 60 to compare against a per-minute threshold
        expr: rate(tidb_server_slow_query_total[5m]) * 60 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow Queries"
          description: "TiDB server {{ $labels.instance }} has more than 10 slow queries per minute."
      - alert: HighQPS
        expr: sum(rate(tidb_server_query_total[5m])) by (instance) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High QPS"
          description: "TiDB server {{ $labels.instance }} has QPS above 10000 for more than 5 minutes."
# Restart Prometheus
tiup cluster restart tidb-monitor --node 192.168.1.18:9090
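Several of the rules above are built on `rate()`, which returns a per-second value, so a threshold phrased "per minute" needs a conversion. The sketch below shows the arithmetic on two counter samples. It is a deliberate simplification: the real PromQL `rate()` also handles counter resets and extrapolates to the window edges, and the function name and sample values here are illustrative.

```python
from typing import Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def per_second_rate(first: Sample, last: Sample) -> float:
    """Average per-second increase of a counter between two samples (no reset handling)."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


# Illustrative counter: 150 new slow queries over a 5-minute window.
rate = per_second_rate((0.0, 1000.0), (300.0, 1150.0))
per_minute = rate * 60
print(rate, per_minute)
```

With these numbers the counter grows by 0.5 per second, which is 30 per minute: a rule written as `rate(...) > 10` would stay silent, while a per-minute threshold of 10 is clearly exceeded.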
AlertManager configuration
# View the AlertManager configuration
cat /tidb/deploy/alertmanager-9093/conf/alertmanager.yml
# Example configuration
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'corpid'
  wechat_api_secret: 'corpsecret'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'wechat'
  routes:
    # Critical alerts go to email; all other alerts use the default receiver above
    - match:
        severity: critical
      receiver: 'email'
      continue: true
receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true
  - name: 'wechat'
    wechat_configs:
      - corp_id: 'corpid'
        to_party: '1'
        agent_id: '1000001'
        api_secret: 'api_secret'
        message: '{{ template "wechat.default.message" . }}'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
# Restart AlertManager
tiup cluster restart tidb-monitor --node 192.168.1.18:9093
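The routing tree is easy to misread, so the sketch below models how a route with child routes picks receivers: children are tried in order, a matching child handles the alert, `continue: true` lets later siblings be tried as well, and the parent's receiver is used only when no child matches. This is a simplified model of the documented behavior (exact-match `match` only, no regex or `matchers`), and the function name is hypothetical.

```python
from typing import Any, Dict, List

Route = Dict[str, Any]


def resolve_receivers(route: Route, labels: Dict[str, str]) -> List[str]:
    """Return the receivers that would handle an alert with the given labels."""
    matched: List[str] = []
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            matched.extend(resolve_receivers(child, labels))
            if not child.get("continue", False):
                break  # stop at the first matching child unless it asks to continue
    # Fall back to this node's own receiver only when no child matched.
    return matched or [route["receiver"]]


# Same shape as the example configuration above.
route = {
    "receiver": "wechat",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "email", "continue": True},
    ],
}
print(resolve_receivers(route, {"severity": "critical"}))
print(resolve_receivers(route, {"severity": "warning"}))
```

Under this model a critical alert reaches only `email`: with no later sibling route, `continue: true` has nothing to continue to. If critical alerts should reach both channels, one option is a second child route that sends them to `wechat`.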
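3.3 Grafana configuration
Accessing Grafana
# Open Grafana
# URL: http://192.168.1.18:3000
# Default username: admin, password: admin
# Import the TiDB dashboards
# Dashboards > Import > enter dashboard ID 12048 (TiDB Overview)
# Dashboards > Import > enter dashboard ID 12049 (TiKV Overview)
# Dashboards > Import > enter dashboard ID 12050 (PD Overview)
# Dashboards > Import > enter dashboard ID 12051 (Node Exporter Full)
# Configure the data source
# Configuration > Data sources > Add data source > Prometheus
# URL: http://192.168.1.18:9090
# Save & test
Creating a custom panel
# Create a custom panel
# Dashboards > New panel > Add query
# Example query: sum(rate(tidb_server_query_total[5m])) by (instance)
# Visualization type: Graph
# Save the panel to a dashboard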
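Before saving a panel, it can help to run the same PromQL against the Prometheus HTTP API (`/api/v1/query`) and confirm that it returns series at all. The sketch below builds the request URL and flattens an instant-vector response into a dictionary keyed by one label. The helper names are hypothetical, and `sample` is a hand-written response in the documented shape rather than output from a real server.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict


def build_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query; the expression must be URL-encoded."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(base_url: str, promql: str) -> Dict[str, Any]:
    """Execute the query against a live Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=5) as resp:
        return json.load(resp)


def vector_by_label(payload: Dict[str, Any], label: str) -> Dict[str, float]:
    """Map one label's value to the sample value for each series in an instant vector."""
    return {
        series["metric"].get(label, ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


url = build_query_url("http://192.168.1.18:9090", 'up{job="tidb"}')
print(url)

# Hand-written sample response: values arrive as [timestamp, "string"] pairs.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "192.168.1.13:10080"}, "value": [1700000000, "1"]},
            {"metric": {"instance": "192.168.1.14:10080"}, "value": [1700000000, "0"]},
        ],
    },
}
print(vector_by_label(sample, "instance"))
```

An empty `result` list usually means the metric name or label selector is wrong, which is quicker to spot here than in an empty Grafana panel.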
4. Worked Example
4.1 Deploying and configuring the monitoring system
Scenario: a company needs to deploy a TiDB monitoring system that provides real-time monitoring and alerting for its cluster.
Step 1: Deploy the monitoring cluster
# Generate a topology file from the template
tiup cluster template > monitor-topology.yaml
# Edit the topology file
# and add the monitoring node entries
# Deploy the monitoring cluster
tiup cluster deploy tidb-monitor v7.5.0 monitor-topology.yaml --user root -p
# Start the monitoring cluster
tiup cluster start tidb-monitor
# Check the status of the monitoring cluster
tiup cluster display tidb-monitor
Cluster type:    monitor
Cluster name:    tidb-monitor
Cluster version: v7.5.0
ID                 Role           Host          Ports  Status  Data Dir                 Deploy Dir
--                 ----           ----          -----  ------  --------                 ----------
192.168.1.18:9090  prometheus     192.168.1.18  9090   Up      /tidb/data/prometheus    /tidb/deploy/prometheus-9090
192.168.1.18:3000  grafana        192.168.1.18  3000   Up      -                        /tidb/deploy/grafana-3000
192.168.1.18:9093  alertmanager   192.168.1.18  9093   Up      /tidb/data/alertmanager  /tidb/deploy/alertmanager-9093
192.168.1.18:9100  node_exporter  192.168.1.18  9100   Up      -                        /tidb/deploy/node_exporter-9100
Step 2: Configure the alert rules
# Write the alert rule file
# (the quoted 'EOF' stops the shell from expanding $labels inside the templates)
cat > /tidb/deploy/prometheus-9090/conf/rules/tidb.rules.yml << 'EOF'
groups:
  - name: tidb-alerts
    rules:
      - alert: TiDBServerDown
        expr: up{job="tidb"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TiDB Server Down"
          description: "TiDB server {{ $labels.instance }} has been down for more than 5 minutes."
      - alert: TiKVServerDown
        expr: up{job="tikv"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TiKV Server Down"
          description: "TiKV server {{ $labels.instance }} has been down for more than 5 minutes."
      - alert: PDServerDown
        expr: up{job="pd"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PD Server Down"
          description: "PD server {{ $labels.instance }} has been down for more than 5 minutes."
      - alert: HighCPUUsage
        expr: avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU Usage"
          description: "CPU usage on {{ $labels.instance }} is above 80% for more than 5 minutes."
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Memory Usage"
          description: "Memory usage on {{ $labels.instance }} is above 80% for more than 5 minutes."
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Disk Usage"
          description: "Disk usage on {{ $labels.instance }} is above 80% for more than 5 minutes."
      - alert: SlowQueries
        expr: rate(tidb_server_slow_query_total[5m]) * 60 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow Queries"
          description: "TiDB server {{ $labels.instance }} has more than 10 slow queries per minute."
      - alert: HighQPS
        expr: sum(rate(tidb_server_query_total[5m])) by (instance) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High QPS"
          description: "TiDB server {{ $labels.instance }} has QPS above 10000 for more than 5 minutes."
EOF
# Restart Prometheus
tiup cluster restart tidb-monitor --node 192.168.1.18:9090
Step 3: Configure AlertManager
# Write the AlertManager configuration file
cat > /tidb/deploy/alertmanager-9093/conf/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
EOF
# Restart AlertManager
tiup cluster restart tidb-monitor --node 192.168.1.18:9093
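Step 4: Verify the monitoring system
# Open Grafana
# URL: http://192.168.1.18:3000
# Import the TiDB dashboards
# Dashboards > Import > enter dashboard ID 12048 (TiDB Overview)
# Dashboards > Import > enter dashboard ID 12049 (TiKV Overview)
# Dashboards > Import > enter dashboard ID 12050 (PD Overview)
# Check the monitoring data
# Confirm that the metrics of every component are displayed correctly
# Test alerting
# Simulate a TiDB node failure and confirm that the alert fires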
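For the alert test, the Prometheus HTTP API endpoint `/api/v1/alerts` shows whether a rule is still `pending` or already `firing`, which makes the check scriptable. The sketch below filters such a response for one alert name. The helper name is hypothetical, and `sample` is a hand-written payload in the documented shape, trimmed to the fields used here.

```python
from typing import Any, Dict, List


def firing_instances(payload: Dict[str, Any], alertname: str) -> List[str]:
    """Instances for which the named alert is currently firing (pending alerts are ignored)."""
    return sorted(
        alert["labels"].get("instance", "")
        for alert in payload["data"]["alerts"]
        if alert["state"] == "firing" and alert["labels"].get("alertname") == alertname
    )


# Hand-written sample: one node has been down long enough to fire, the other is still pending.
sample = {
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {"alertname": "TiDBServerDown", "instance": "192.168.1.13:10080"},
                "state": "firing",
            },
            {
                "labels": {"alertname": "TiDBServerDown", "instance": "192.168.1.14:10080"},
                "state": "pending",
            },
        ]
    },
}
print(firing_instances(sample, "TiDBServerDown"))
```

With `for: 5m` in the rule, a freshly stopped node is expected to sit in `pending` for about five minutes before it appears in this list, so an empty result right after the simulated failure does not by itself mean the rule is broken.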
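5. Lessons Learned
5.1 Best practices for monitoring and alerting
- Monitor comprehensively: cover the host, cluster, and application layers
- Set sensible alert thresholds: derive them from business requirements
- Use multiple alert channels: make sure notifications arrive in time
- Inhibit alerts: prevent alert storms so that the alerts that do arrive are meaningful
- Check regularly: review the health of the monitoring system itself on a schedule
- Plan monitoring data storage: size the storage and choose the retention period deliberately
- Build visual dashboards: clear dashboards make the system state quick to read
5.2 Metric priorities
- Key metrics: cluster state, node health, service availability
- Performance metrics: QPS, response time, resource utilization
- Business metrics: business QPS, error rate, user experience
- Predictive metrics: resource usage trends, capacity planning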
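The predictive item can be illustrated with a simple trend projection: fit a straight line to recent usage samples and estimate when it reaches capacity. PromQL offers `predict_linear` for this kind of projection; the sketch below is a standalone least-squares version with a hypothetical function name and made-up sample values, intended only to show the calculation.

```python
from typing import List, Optional, Tuple


def seconds_until_full(samples: List[Tuple[float, float]], capacity: float) -> Optional[float]:
    """Seconds from the last sample until the fitted line reaches `capacity`.

    `samples` holds (timestamp_seconds, used_amount) pairs in time order.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / variance
    if slope <= 0:
        return None  # flat or shrinking usage never reaches capacity
    intercept = mean_v - slope * mean_t
    full_at = (capacity - intercept) / slope
    return full_at - samples[-1][0]


# Made-up data: disk usage in GiB sampled hourly, growing by 10 GiB per hour.
usage = [(0.0, 100.0), (3600.0, 110.0), (7200.0, 120.0)]
eta = seconds_until_full(usage, capacity=200.0)
print(f"{eta / 3600:.1f} hours until full")
```

A linear fit is only a first approximation: compaction, batch jobs, and data expiry make real disk usage uneven, so a projection like this is better suited to early warnings than to exact capacity dates.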
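5.3 Common problems and solutions
| Problem | Cause | Solution |
|---|---|---|
| Monitoring data is lost | Insufficient Prometheus storage, or misconfiguration | Add storage capacity; check the configuration |
| Alert storm | Badly chosen alert thresholds, or cascading failures | Adjust the thresholds; configure alert inhibition |
| Delayed alerts | Network latency, or AlertManager misconfiguration | Check network connectivity; tune the AlertManager configuration |
| Monitoring system performs poorly | Prometheus misconfiguration, or insufficient resources | Tune the Prometheus configuration; add resources |
| Metrics are incomplete | Misconfiguration, or a failed exporter | Check the configuration; restart the exporter |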
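5.4 Monitoring and alerting checklist
| Item | Requirement | Status |
|---|---|---|
| Monitoring system deployment | Prometheus, Grafana, and AlertManager deployed successfully | □ |
| Data source configuration | Prometheus data source configured correctly | □ |
| Alert rules | Sensible alert rules configured | □ |
| Alert channels | Multiple alert channels configured | □ |
| Dashboards | TiDB dashboards imported and configured | □ |
| Monitoring data storage | Storage and retention planned | □ |
| Alert testing | Alerting mechanism tested regularly | □ |
| Monitoring high availability | High availability configured for the monitoring system | □ |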
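© 2024 TiDB Database Training Documentation
Compiled and published by Feng Ge Tutorials (风哥教程) for study and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html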
