TiDB Tutorial FG053 - TiDB Prometheus Monitoring Configuration

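In this document Fengge (风哥) introduces Prometheus monitoring configuration for TiDB: the concepts and features of Prometheus, its architecture, the TiDB monitoring metrics, deployment planning, installation and configuration, and the configuration of monitoring targets. The tutorial is written with reference to the monitoring and alerting sections of the official TiDB documentation and is intended for DBAs to use in study and testing. Verify everything yourself before applying it to a production environment.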

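Part01-Basic Concepts and Theory

1.1 Prometheus Concepts and Features

Prometheus is an open-source monitoring system with a built-in time series database, used to collect and store monitoring metrics.

Features of Prometheus:

Open source and free
Built on a time series database
A powerful query language (PromQL)
Several ways of collecting data
Built-in alerting support
Easy to integrate
Well suited to cloud-native environments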

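1.2 Prometheus Architecture

The Prometheus architecture consists mainly of the following components:

# Prometheus architecture components
1. Prometheus server:
- Scrapes and stores monitoring metrics
- Provides the query interface
- Loads and evaluates alerting rules

2. Data collection:
- Pull mode: Prometheus actively fetches data from the targets
- Push mode: a Pushgateway receives data from short-lived jobs

3. Storage:
- Local storage: a time series database on local disk
- Remote storage: integration with external storage systems

4. Alerting:
- Alertmanager: handles the alerts sent by Prometheus (deduplication, grouping, routing of notifications)
- Alerting rules: define the alert conditions

5. Visualization:
- Grafana: displays the monitoring data
- Prometheus UI: a simple built-in UI

6. Service discovery:
- Static configuration
- Dynamic service discovery (Consul, Kubernetes, etc.)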

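1.3 Introduction to TiDB Monitoring Metrics

TiDB exposes a rich set of monitoring metrics, mainly:

TiDB metrics: QPS, latency, error rate, connection count, etc.
TiKV metrics: storage usage, read/write latency, concurrent requests, etc.
PD metrics: cluster status, scheduling activity, storage capacity, etc.
TiFlash metrics: read/write latency, storage usage, etc.
Host metrics: CPU, memory, disk, network, etc.

Fengge's tip: TiDB exposes a very large number of metrics. Learn the meaning and purpose of the main ones so that you can monitor and manage the cluster effectively.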

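Part02-Production Environment Planning and Recommendations

2.1 Prometheus Deployment Planning

Key points of Prometheus deployment planning:

# Deployment mode
- Single instance: suitable for small clusters (3-5 nodes)
- High availability: suitable for medium and large clusters (6+ nodes)
- Federation: suitable for very large clusters or multi-cluster environments

# Deployment location
- Dedicated server: separate from the TiDB cluster
- Resource isolation: make sure the monitoring system does not affect cluster performance
- Network connectivity: make sure bandwidth is sufficient and latency is low

# Storage planning
- Local storage: SSD, because the IOPS requirement is high
- Storage capacity: calculate it from the data retention time and the scrape frequency
- Backup strategy: back up the monitoring data regularly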
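
The capacity point above can be made concrete. The Prometheus storage documentation gives a rough sizing rule: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below simply applies that rule. The function name, the series count and the 2-byte default are illustrative assumptions, not measurements from a TiDB cluster.

```python
def estimate_disk_bytes(active_series: int,
                        scrape_interval_s: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB size: retention seconds x samples per second x bytes per sample."""
    # Each active series yields one sample per scrape interval.
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical cluster: 150,000 active series, 15s interval, 15 days retention.
    size = estimate_disk_bytes(150_000, 15, 15)
    print(f"{size / 1e9:.1f} GB")  # prints "25.9 GB"
```

Treat the result as a lower bound for planning: leave headroom for the write-ahead log, compaction and growth in the number of series.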

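2.2 Prometheus Resource Configuration

Recommended Prometheus resource configuration:

# Resource configuration
- CPU:
  - Small cluster: 2-4 cores
  - Medium cluster: 4-8 cores
  - Large cluster: 8-16 cores

- Memory:
  - Small cluster: 8-16 GB
  - Medium cluster: 16-32 GB
  - Large cluster: 32-64 GB

- Disk:
  - Small cluster: 100-500 GB
  - Medium cluster: 500 GB-1 TB
  - Large cluster: 1-2 TB

# Network bandwidth
- Recommended: 1 Gbps or more
- Estimate the transfer volume: scrape frequency × number of metrics × size of each sample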
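
The bandwidth rule of thumb above (scrape frequency × number of metrics × data size) can be sketched as a small calculation. The 100 bytes per exposed series is an illustrative assumption about the uncompressed text format, and the function and parameter names are hypothetical. Real traffic is often lower when targets serve gzip-compressed responses.

```python
def estimate_scrape_bandwidth_bps(series_per_target: int,
                                  targets: int,
                                  scrape_interval_s: float,
                                  bytes_per_series: float = 100.0) -> float:
    """Average inbound bits per second generated by scraping all targets."""
    # Bytes transferred in one full round of scrapes across every target.
    bytes_per_round = series_per_target * targets * bytes_per_series
    # One round happens every scrape interval; convert bytes to bits.
    return bytes_per_round * 8 / scrape_interval_s


if __name__ == "__main__":
    # Hypothetical setup: 10 targets exposing 20,000 series each, scraped every 15s.
    bps = estimate_scrape_bandwidth_bps(20_000, 10, 15)
    print(f"{bps / 1e6:.1f} Mbit/s")
```

Under these assumptions the scrape traffic is a small fraction of a 1 Gbps link, so the bandwidth recommendation mostly provides headroom for queries, remote write and other traffic on the same host.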

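2.3 Prometheus High-Availability Configuration

Recommendations for Prometheus high availability:

Multiple instances: deploy several Prometheus instances, each scraping the same targets
Load balancing: use a load balancer to distribute query requests
Data consistency: keep the scrape configuration identical on every instance so that they collect equivalent data
Failover: when one instance fails, switch automatically to another instance
Remote storage: use remote storage to keep the data safe

Production recommendation: for important production environments, deploy a highly available Prometheus setup so that the monitoring system itself is reliable. Plan resources carefully as well, so that Prometheus can handle the volume of monitoring data.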

Part03-Production Implementation Plan

3.1 Prometheus Installation and Configuration

3.1.1 Installing Prometheus

# Step 1: download Prometheus
$ wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz

# Step 2: extract the archive
$ tar -xzf prometheus-2.45.0.linux-amd64.tar.gz
$ cd prometheus-2.45.0.linux-amd64

# Step 3: check the version
$ ./prometheus --version

# Sample output
prometheus, version 2.45.0 (branch: HEAD, revision: e9761e6)
build user: root@6564e5c9c040
build date: 2023-06-12T12:31:34Z
go version: go1.20.5
platform: linux/amd64

# Step 4: start Prometheus
$ ./prometheus --config.file=prometheus.yml

# Step 5: verify that it started
$ curl http://localhost:9090/metrics

# Sample output
# HELP prometheus_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which Prometheus was built.
# TYPE prometheus_build_info gauge
prometheus_build_info{branch="HEAD",goversion="go1.20.5",revision="e9761e6",version="2.45.0"} 1
# HELP prometheus_tsdb_head_samples_appended_total Total number of samples appended to the head.
# TYPE prometheus_tsdb_head_samples_appended_total counter
prometheus_tsdb_head_samples_appended_total 12345
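
The /metrics output above is the plain-text exposition format that Prometheus pulls from every target. To make the format tangible, here is a minimal sketch of a parser for such lines. It is a simplification written for this tutorial (it ignores timestamps and leaves escape sequences in label values untouched), and the names `parse_metrics`, `SAMPLE_RE` and `LABEL_RE` are hypothetical. For real use, a maintained parser such as the one in the `prometheus_client` package is a better choice.

```python
import re

# Metric name, optional {labels}, then the sample value (a trailing timestamp is ignored).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair: name="value", where the value may contain escaped characters.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse sample lines, skipping blank lines and # HELP / # TYPE comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    example = (
        '# TYPE prometheus_build_info gauge\n'
        'prometheus_build_info{branch="HEAD",version="2.45.0"} 1\n'
        'prometheus_tsdb_head_samples_appended_total 12345\n'
    )
    for sample in parse_metrics(example):
        print(sample)
```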

3.1.2 Prometheus Configuration File

# Sample prometheus.yml
global:
  # Scrape interval
  scrape_interval: 15s
  # Rule evaluation interval
  evaluation_interval: 15s
  # External labels
  external_labels:
    monitor: 'tidb-monitor'

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Alerting rule files
rule_files:
  - "alerts/*.yml"

# Scrape configuration
scrape_configs:
  # TiDB component monitoring
  - job_name: 'tidb'
    static_configs:
      - targets: ['192.168.1.10:10080', '192.168.1.11:10080']

  - job_name: 'tikv'
    static_configs:
      - targets: ['192.168.1.20:20180', '192.168.1.21:20180', '192.168.1.22:20180']

  - job_name: 'pd'
    static_configs:
      - targets: ['192.168.1.30:2379', '192.168.1.31:2379', '192.168.1.32:2379']

  - job_name: 'tiflash'
    static_configs:
      - targets: ['192.168.1.40:20292', '192.168.1.41:20292']

  # Host monitoring
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100', '192.168.1.20:9100', '192.168.1.21:9100', '192.168.1.22:9100', '192.168.1.30:9100', '192.168.1.31:9100', '192.168.1.32:9100', '192.168.1.40:9100', '192.168.1.41:9100']

  # Monitoring of other components
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']

  - job_name: 'grafana'
    static_configs:
      - targets: ['localhost:3000']

3.2 Configuring Prometheus Monitoring Targets

3.2.1 Static Configuration

# Static configuration example
scrape_configs:
  - job_name: 'tidb'
    static_configs:
      - targets: ['192.168.1.10:10080', '192.168.1.11:10080']
        labels:
          cluster: 'fgedu-cluster'
          environment: 'production'

  - job_name: 'tikv'
    static_configs:
      - targets: ['192.168.1.20:20180', '192.168.1.21:20180', '192.168.1.22:20180']
        labels:
          cluster: 'fgedu-cluster'
          environment: 'production'

  - job_name: 'pd'
    static_configs:
      - targets: ['192.168.1.30:2379', '192.168.1.31:2379', '192.168.1.32:2379']
        labels:
          cluster: 'fgedu-cluster'
          environment: 'production'

3.2.2 Dynamic Service Discovery

# DNS service discovery example
scrape_configs:
  - job_name: 'tidb'
    dns_sd_configs:
      - names:
          - 'tidb.fgedu.net.cn'
        type: 'A'
        port: 10080

# File-based service discovery example
scrape_configs:
  - job_name: 'tikv'
    file_sd_configs:
      - files:
          - 'targets/tikv_targets.yml'
        refresh_interval: 5m

# Sample contents of tikv_targets.yml
- targets: ['192.168.1.20:20180', '192.168.1.21:20180']
  labels:
    cluster: 'fgedu-cluster'
    environment: 'production'

- targets: ['192.168.1.22:20180']
  labels:
    cluster: 'fgedu-cluster'
    environment: 'production'
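
File-based service discovery is most useful when the targets file is generated by a script, for example from an inventory. Prometheus accepts these files in JSON as well as YAML, so the sketch below writes the JSON form using only the standard library. The helper names and the output file name `tikv_targets.json` are assumptions for illustration, and the `files` entry in `file_sd_configs` would have to point at the `.json` file.

```python
import json
import os
import tempfile


def build_target_groups(hosts: list[str], port: int, labels: dict[str, str]) -> list[dict]:
    """Return a single file_sd target group covering the given hosts."""
    return [{"targets": [f"{host}:{port}" for host in hosts], "labels": dict(labels)}]


def write_targets_file(path: str, groups: list[dict]) -> None:
    """Write via a temporary file and rename, so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False, suffix=".tmp") as tmp:
        json.dump(groups, tmp, indent=2)
        tmp_path = tmp.name
    os.replace(tmp_path, path)


if __name__ == "__main__":
    groups = build_target_groups(
        ["192.168.1.20", "192.168.1.21", "192.168.1.22"],
        20180,
        {"cluster": "fgedu-cluster", "environment": "production"},
    )
    # Print instead of writing; call write_targets_file("targets/tikv_targets.json", groups)
    # once the targets directory exists.
    print(json.dumps(groups, indent=2))
```

Prometheus picks up changes to such files without a restart, which is the main advantage over editing static_configs by hand.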

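3.3 Prometheus Storage Configuration

3.3.1 Local Storage Configuration

# Specify the storage settings at startup
$ ./prometheus \
    --config.file=prometheus.yml \
    --storage.tsdb.path=/tidb/fgdata/prometheus \
    --storage.tsdb.retention.time=15d \
    --storage.tsdb.retention.size=100GB

# Option descriptions
- storage.tsdb.path: where the data is stored
- storage.tsdb.retention.time: how long data is kept
- storage.tsdb.retention.size: maximum size of the stored data; when both limits are set, whichever is reached first triggers cleanup

# Storage path planning
- SSD storage is recommended
- Path: /tidb/fgdata/prometheus
- Permissions: the prometheus user needs read and write access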

3.3.2 Remote Storage Configuration

# Remote storage configuration example
remote_write:
  - url: "http://remote-storage:9090/api/v1/write"
    basic_auth:
      username: "prometheus"
      password: "password"

remote_read:
  - url: "http://remote-storage:9090/api/v1/read"
    basic_auth:
      username: "prometheus"
      password: "password"

# Common remote storage options
- Thanos
- Cortex
- InfluxDB
- Elasticsearch

Fengge's tip: the storage configuration matters a great deal. Set the retention time and size limit according to your actual workload so that the disk does not fill up. For long-term storage, use a remote storage solution.

Part04-Production Cases and Hands-On Walkthrough

4.1 Building a Prometheus Monitoring System

4.1.1 Single-Instance Deployment

# Step 1: prepare the server
$ hostnamectl set-hostname prometheus.fgedu.net.cn
$ ip addr add 192.168.1.50/24 dev eth0
$ systemctl restart network

# Step 2: install Prometheus
$ cd /tidb/app
$ wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
$ tar -xzf prometheus-2.45.0.linux-amd64.tar.gz
$ mv prometheus-2.45.0.linux-amd64 prometheus

# Step 3: create the configuration file
$ mkdir -p /tidb/app/prometheus/alerts
$ cat > /tidb/app/prometheus/prometheus.yml << EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'tidb'
    static_configs:
      - targets: ['192.168.1.10:10080', '192.168.1.11:10080']

  - job_name: 'tikv'
    static_configs:
      - targets: ['192.168.1.20:20180', '192.168.1.21:20180', '192.168.1.22:20180']

  - job_name: 'pd'
    static_configs:
      - targets: ['192.168.1.30:2379', '192.168.1.31:2379', '192.168.1.32:2379']

  - job_name: 'tiflash'
    static_configs:
      - targets: ['192.168.1.40:20292', '192.168.1.41:20292']

  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100', '192.168.1.20:9100', '192.168.1.21:9100', '192.168.1.22:9100', '192.168.1.30:9100', '192.168.1.31:9100', '192.168.1.32:9100', '192.168.1.40:9100', '192.168.1.41:9100']
EOF

# Step 4: create the systemd service
$ cat > /etc/systemd/system/prometheus.service << EOF
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
ExecStart=/tidb/app/prometheus/prometheus --config.file=/tidb/app/prometheus/prometheus.yml --storage.tsdb.path=/tidb/fgdata/prometheus --storage.tsdb.retention.time=15d
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Step 5: start the service
$ systemctl daemon-reload
$ systemctl enable prometheus
$ systemctl start prometheus

# Step 6: verify the service status
$ systemctl status prometheus

# Sample output
● prometheus.service - Prometheus
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2026-04-09 10:00:00 CST; 5min ago
 Main PID: 12345 (prometheus)
    Tasks: 10
   Memory: 256.0M
      CPU: 10%
   CGroup: /system.slice/prometheus.service
           └─12345 /tidb/app/prometheus/prometheus --config.file=/tidb/app/prometheus/prometheus.yml --storage.tsdb.path=/tidb/fgdata/prometheus --storage.tsdb.retention.time=15d

4.1.2 High-Availability Deployment

# Step 1: prepare two servers
# Server1: 192.168.1.50
# Server2: 192.168.1.51

# Step 2: install Prometheus on both servers (same steps as the single-instance deployment)

# Step 3: configure load balancing with HAProxy
# (run HAProxy on a separate host or change the bind port, because Prometheus itself listens on 9090)
$ yum install haproxy -y
$ cat > /etc/haproxy/haproxy.cfg << EOF
global
    log 127.0.0.1 local0
    maxconn 4000
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    timeout connect 5000
    timeout client 50000
    timeout server 50000

frontend prometheus
    bind *:9090
    default_backend prometheus_servers

backend prometheus_servers
    balance roundrobin
    server prometheus1 192.168.1.50:9090 check
    server prometheus2 192.168.1.51:9090 check
EOF

$ systemctl enable haproxy
$ systemctl start haproxy

# Step 4: configure remote storage with Thanos
# Install Thanos
$ wget https://github.com/thanos-io/thanos/releases/download/v0.32.2/thanos-0.32.2.linux-amd64.tar.gz
$ tar -xzf thanos-0.32.2.linux-amd64.tar.gz
$ mv thanos-0.32.2.linux-amd64/thanos /tidb/app/bin/

# Start the Thanos Sidecar
$ /tidb/app/bin/thanos sidecar \
    --prometheus.url=http://localhost:9090 \
    --tsdb.path=/tidb/fgdata/prometheus \
    --objstore.config-file=/tidb/app/thanos/objstore.yml

# Sample objstore.yml
type: S3
config:
  bucket: "thanos"
  endpoint: "s3.amazonaws.com"
  access_key: "access-key"
  secret_key: "secret-key"

4.2 Prometheus Usage Examples

4.2.1 Querying with PromQL

# Open the Prometheus UI
http://192.168.1.50:9090

# Common PromQL query examples

## TiDB QPS
sum(rate(tidb_server_query_total[5m])) by (instance)

## TiKV request latency (99th percentile)
histogram_quantile(0.99, sum(rate(tikv_grpc_msg_duration_seconds_bucket[5m])) by (instance, le))

## PD cluster status (number of stores that are down)
pd_cluster_status{type="store_down_count"}

## Host CPU usage (%)
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100

## Disk usage (%)
100 - (node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

## Network traffic (bytes per second)
sum(rate(node_network_receive_bytes_total[5m])) by (instance)
sum(rate(node_network_transmit_bytes_total[5m])) by (instance)
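
Queries like these can also be run outside the UI through the Prometheus HTTP API endpoint `/api/v1/query`. PromQL contains characters such as spaces, brackets and quotes, so the expression has to be URL-encoded first. The sketch below shows one way to do that with the standard library. The function names are hypothetical, and the base URL is the example address used in this tutorial.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, expr: str) -> str:
    """URL-encode a PromQL expression for the /api/v1/query endpoint."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def instant_query(base_url: str, expr: str, timeout: float = 10.0) -> list[dict]:
    """Run an instant query and return the result list from the JSON response."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error')}")
    return body["data"]["result"]


if __name__ == "__main__":
    # Only builds the URL; instant_query() needs a reachable Prometheus server.
    print(build_query_url("http://192.168.1.50:9090", 'up{job="tidb"}'))
```

With a running server, `instant_query("http://192.168.1.50:9090", 'up{job="tidb"}')` would return one entry per TiDB target, each carrying its labels and the latest value.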

4.2.2 Configuring Alerting Rules

# Create the alerting rule file (the quoted 'EOF' keeps the shell from expanding $labels and $value)
$ cat > /tidb/app/prometheus/alerts/tidb_alerts.yml << 'EOF'
groups:
  - name: tidb
    rules:
      - alert: TiDBDown
        expr: up{job="tidb"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TiDB down"
          description: "TiDB instance {{ $labels.instance }} is down for more than 5 minutes"

      - alert: TiDBHighQPS
        expr: sum(rate(tidb_server_query_total[5m])) by (instance) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TiDB high QPS"
          description: "TiDB instance {{ $labels.instance }} has high QPS: {{ $value }}"

      - alert: TiDBHighErrorRate
        expr: sum(rate(tidb_server_execute_error_total[5m])) by (instance) / sum(rate(tidb_server_query_total[5m])) by (instance) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TiDB high error rate"
          description: "TiDB instance {{ $labels.instance }} has high error rate: {{ $value }}"
EOF

# Reload the configuration (this endpoint works only when Prometheus was started with --web.enable-lifecycle; otherwise send SIGHUP to the process)
$ curl -X POST http://localhost:9090/-/reload

# View the alert status
http://192.168.1.50:9090/alerts
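
The `for: 5m` clause in these rules means an alert does not fire the moment its expression becomes true. It first enters the pending state and fires only after the condition has held continuously for the whole duration. The sketch below is a simplified model of that behaviour for a single series. It is not Prometheus code, and the function name is hypothetical.

```python
def alert_states(evaluations: list[tuple[float, bool]], for_seconds: float) -> list[str]:
    """Return 'inactive', 'pending' or 'firing' for each (timestamp, condition) evaluation."""
    states = []
    active_since = None  # timestamp at which the condition most recently became true
    for timestamp, condition in evaluations:
        if not condition:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # Condition true at 0s and 60s, false at 120s, true again from 180s onward.
    timeline = [(0, True), (60, True), (120, False), (180, True), (480, True)]
    print(alert_states(timeline, for_seconds=300))
    # prints ['pending', 'pending', 'inactive', 'pending', 'firing']
```

The reset at 120s shows why a flapping target may never trigger TiDBDown: each recovery restarts the five-minute countdown.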

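4.3 Troubleshooting Common Prometheus Problems

4.3.1 Prometheus Fails to Start

# Symptom: Prometheus fails to start

# Troubleshooting steps:
1. Check the system log
$ journalctl -u prometheus

2. Check the configuration file
$ /tidb/app/prometheus/promtool check config /tidb/app/prometheus/prometheus.yml

3. Check whether the port is in use
$ netstat -tlnp | grep 9090

4. Check disk space
$ df -h

5. Check memory usage
$ free -m

# Common errors and solutions:
- Configuration file error: check the file format and validate it with promtool
- Port in use: stop the process occupying the port or change the Prometheus port
- Insufficient disk space: free up disk space or change the storage path
- Insufficient memory: add memory to the server or adjust the Prometheus configuration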

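4.3.2 Monitoring Data Loss

# Symptom: monitoring data is missing

# Troubleshooting steps:
1. Check the Prometheus storage settings (these are command-line flags in the service unit, not prometheus.yml entries)
$ grep storage /etc/systemd/system/prometheus.service

2. Check the permissions of the storage path
$ ls -la /tidb/fgdata/prometheus/

3. Check disk space
$ df -h /tidb/fgdata

4. Check the Prometheus log
$ journalctl -u prometheus | grep -i error

5. Check the status of the scrape targets
$ curl http://localhost:9090/api/v1/targets

# Solutions:
- Adjust the retention time: change --storage.tsdb.retention.time
- Clean up expired data: manually remove expired TSDB files
- Add disk space: expand the storage capacity
- Check the scrape targets: make sure every target can be scraped successfully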
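
When gaps appear in the data, the first thing to establish is which targets are failing to scrape. The Prometheus HTTP API endpoint `/api/v1/targets` returns a JSON document whose `data.activeTargets` entries carry `labels`, `health` and `lastError` fields. The sketch below filters such a response for targets that are not up. The function name is hypothetical and the sample payload is hand-written and heavily trimmed for illustration.

```python
def unhealthy_targets(response: dict) -> list[tuple[str, str, str]]:
    """Return (job, instance, lastError) for every active target whose health is not 'up'."""
    problems = []
    for target in response.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append((
                labels.get("job", ""),
                labels.get("instance", ""),
                target.get("lastError", ""),
            ))
    return problems


if __name__ == "__main__":
    # Trimmed, hand-written sample of an /api/v1/targets response.
    sample = {
        "status": "success",
        "data": {
            "activeTargets": [
                {"labels": {"job": "tidb", "instance": "192.168.1.10:10080"},
                 "health": "up", "lastError": ""},
                {"labels": {"job": "tikv", "instance": "192.168.1.21:20180"},
                 "health": "down", "lastError": "connection refused"},
            ]
        },
    }
    for job, instance, error in unhealthy_targets(sample):
        print(f"{job} {instance}: {error}")
```

In practice the dictionary would come from parsing the body of the `curl` call in step 5 with `json.loads`.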

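4.3.3 Query Performance Problems

# Symptom: PromQL queries are slow

# Troubleshooting steps:
1. Check Prometheus memory usage
$ top -p $(pgrep -d, -x prometheus)

2. Check the complexity of the query
$ curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=sum(rate(tidb_server_query_total[5m])) by (instance)'

3. Check storage performance
$ iostat -x

4. Check the scrape frequency
$ grep scrape_interval /tidb/app/prometheus/prometheus.yml

# Solutions:
- Optimize the query: use more efficient PromQL expressions
- Add memory: give Prometheus more memory
- Use SSD storage: improve storage performance
- Adjust the scrape frequency: lower it where appropriate
- Use aggregation: use aggregation functions in queries to reduce the amount of data

Production recommendation: check the running state of Prometheus regularly and deal with common problems promptly so that the monitoring system stays stable. Establish a maintenance routine for Prometheus that covers regular backups, cleanup of expired data and configuration tuning.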

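Part05-Fengge's Experience and Takeaways

5.1 Prometheus Best Practices

Prometheus best practices:

Set scrape frequency sensibly: use different scrape intervals according to the importance of the metrics
Manage with labels: use labels consistently to make querying and filtering easier
Tune the storage configuration: set a reasonable retention time and storage size
Build in high availability: for important environments, deploy a highly available Prometheus setup
Use remote storage: for long-term data, use a remote storage solution
Back up regularly: back up the Prometheus configuration and data on a schedule
Monitor Prometheus itself: keep an eye on the running state of Prometheus

5.2 Prometheus Performance Optimization

Recommendations for Prometheus performance optimization:

Add resources: increase CPU and memory according to actual demand
Optimize scraping: drop metrics that do not need to be collected
Use SSD storage: improve storage I/O performance
Tune the WAL: review the write-ahead log settings
Use compression: enable WAL compression (--storage.tsdb.wal-compression)
Set the block size sensibly: adjust the TSDB block duration only when there is a clear need
Use query caching: Prometheus itself has no query result cache, so put a caching layer such as Thanos Query Frontend in front of it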

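5.3 Prometheus Maintenance Recommendations

# Prometheus maintenance recommendations

## Regular checks
- Check the running state of Prometheus every week
- Check storage usage every month
- Check every quarter whether the configuration needs updating

## Regular cleanup
- Expired data: removed automatically according to the retention policy
- Log files: clean up Prometheus logs regularly
- Unused alerting rules: remove rules that are no longer needed

## Regular backups
- Configuration files: back up once a week
- Data: back up important data once a month
- Alerting rules: back up after every change

## Version updates
- Keep Prometheus up to date to get new features and bug fixes
- Validate new versions in a test environment first
- Plan updates so that they avoid business peak hours

## Failure drills
- Run failure drills regularly: simulate a Prometheus outage
- Test high-availability failover: verify that the HA configuration actually works
- Test data recovery: verify that backups can be restored

## Documentation and training
- Maintain Prometheus operations documentation
- Train the operations staff
- Share maintenance experience and best practices

Fengge's tip: Prometheus is the core component of the TiDB monitoring system and needs regular maintenance and tuning. Establish a maintenance routine for it to keep the monitoring system reliable and stable.

Continuous improvement: configuring and tuning Prometheus is an ongoing process that has to follow actual operating conditions. Review the running state of Prometheus regularly and adjust the configuration to improve the efficiency and reliability of the monitoring system.

This article was compiled and published by Fengge Tutorials (风哥教程) for study and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html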
