Prometheus Monitoring System
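What Is Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It collects metrics by pulling (scraping) them from targets over HTTP, stores them as time series in a multi-dimensional data model, and supports querying with PromQL and rule-based alerting.
Core Concepts
- Metric: a named numeric measurement recorded over time
- Label: a key-value pair that adds a dimension to a metric
- Target: an endpoint that Prometheus monitors
- Scrape: a periodic fetch of metrics from a target
- PromQL: the Prometheus query language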
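To make the relationship between metrics and labels concrete, here is a minimal Python sketch of how one sample might be rendered in the text exposition format, and how a metric name plus its label set identifies a time series. The helper names `format_sample` and `series_key` are illustrative and not part of any Prometheus library.

```python
def format_sample(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    if not labels:
        return f"{name} {value}"

    def escape(text):
        # Label values escape backslash, double quote, and newline.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    pairs = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


def series_key(name, labels):
    """A time series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


print(format_sample("http_requests_total", {"method": "GET", "status": "200"}, 1234))
```

Two samples with the same name but different label values belong to different series, which is why adding a label with many distinct values multiplies the number of series stored.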
Installing Prometheus
Binary Installation
# Download
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
# Extract
tar xzf prometheus-2.47.0.linux-amd64.tar.gz
cd prometheus-2.47.0.linux-amd64
# Start
./prometheus --config.file=prometheus.yml
Docker Installation
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus
Docker Compose
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
  prometheus-data:
  grafana-data:
Configuration File
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - "rules/*.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']
    metrics_path: /metrics
  - job_name: 'app'
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
    metrics_path: /actuator/prometheus
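As a rough illustration of what the scrape configuration means in practice, the following Python sketch derives the URL that would be requested for each target. It assumes the default scheme `http` and the default path `/metrics`; the function name `scrape_url` and the `jobs` dictionary are hypothetical and only mirror the jobs configured above.

```python
def scrape_url(target, metrics_path="/metrics", scheme="http"):
    """Build the URL requested when scraping one target."""
    return f"{scheme}://{target}{metrics_path}"


# Mirrors the 'node' and 'app' jobs from prometheus.yml.
jobs = {
    "node": {"targets": ["node1:9100", "node2:9100"], "metrics_path": "/metrics"},
    "app": {"targets": ["app1:8080", "app2:8080"], "metrics_path": "/actuator/prometheus"},
}

for job, cfg in jobs.items():
    for target in cfg["targets"]:
        print(job, scrape_url(target, cfg["metrics_path"]))
```

Requesting one of these URLs by hand with curl is a quick way to confirm that a target is reachable before looking at Prometheus itself.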
Metric Types
Counter
# A cumulative value that only increases (it resets to zero when the process restarts)
http_requests_total{method="GET", status="200"} 1234
Gauge
# A value that can go up and down
cpu_usage_percent{instance="node1"} 75.5
Histogram
# Distribution of request durations, counted in cumulative buckets
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 1500
http_request_duration_seconds_bucket{le="+Inf"} 2000
Summary
# Similar to a histogram, but quantiles are computed on the client side
http_request_duration_seconds{quantile="0.5"} 0.1
http_request_duration_seconds{quantile="0.99"} 0.5
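Histogram buckets are cumulative: each `le` bucket counts every observation less than or equal to its bound, so the `+Inf` bucket always equals the total count. The Python sketch below shows this with a handful of made-up durations; `cumulative_buckets` is an illustrative helper, not a client library function.

```python
import math


def cumulative_buckets(observations, bounds):
    """Count observations into cumulative buckets (value <= upper bound)."""
    uppers = sorted(bounds) + [math.inf]
    return {le: sum(1 for value in observations if value <= le) for le in uppers}


# Hypothetical request durations in seconds.
durations = [0.05, 0.08, 0.3, 0.7, 2.0]
buckets = cumulative_buckets(durations, [0.1, 0.5])

for le, count in buckets.items():
    label = "+Inf" if le == math.inf else le
    print(f'http_request_duration_seconds_bucket{{le="{label}"}} {count}')
```

Because the counts are cumulative, the number of observations that fall between two bounds is the difference between the two bucket values.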
PromQL Queries
Basic Queries
# Select a metric
http_requests_total
# Filter by label
http_requests_total{method="GET"}
# Range selector (samples from the last 5 minutes)
http_requests_total[5m]
# Aggregation
sum(http_requests_total)
Common Queries
# CPU usage (%) per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
# Disk usage (%)
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)
# HTTP request rate (requests per second)
rate(http_requests_total[5m])
# HTTP error ratio (fraction of requests answered with a 5xx status)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
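Most of the queries above rely on rate(), which turns a counter into a per-second increase over a time window. The Python sketch below shows the core idea, including how a counter reset is handled. It is deliberately simplified: unlike the real rate(), it does not extrapolate to the window boundaries, and the name `simple_rate` and the sample data are made up for illustration.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplified: handles counter resets, but does not extrapolate to the
    edges of the window the way PromQL's rate() does.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # A drop means the counter was reset (for example after a restart),
            # so the new value is the increase since the reset.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical samples scraped every 15 seconds, with a reset before t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))
```

The reset handling is the reason rate() should be applied to raw counters before any aggregation such as sum(): summing first would hide the resets of individual series.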
Alerting Rules
rules/alerts.yml
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes."
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes."
      - alert: HighMemory
        expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% for more than 5 minutes."
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for more than 5 minutes."
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "99th percentile latency is above 1 second."
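The HighLatency rule depends on histogram_quantile(), which estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The Python sketch below shows that idea under stated assumptions: 0 < q <= 1, positive bucket bounds, and a final +Inf bucket. It is a simplified illustration, not the exact Prometheus implementation, and it operates on plain bucket counts rather than on per-second rates.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes 0 < q <= 1, positive bounds, and a +Inf bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == math.inf:
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Bucket values taken from the Histogram example earlier in this document.
example = [(0.1, 1000), (0.5, 1500), (math.inf, 2000)]
print(histogram_quantile(0.7, example))
print(histogram_quantile(0.99, example))
```

With these buckets the 0.99 quantile lands in the +Inf bucket, so the estimate is capped at 0.5. This is why bucket bounds should extend beyond the latency threshold an alert checks against; otherwise the alert can never fire.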
Alertmanager Configuration
alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://webhook:5001/'
  - name: 'email'
    email_configs:
      - to: 'admin@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
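The inhibit rule above mutes a warning alert while a critical alert with the same alertname and instance is firing. The Python sketch below models that matching logic with alerts represented as plain label dictionaries. The function `is_inhibited` and the sample alerts are hypothetical, and the sketch ignores details of the real Alertmanager such as regex matchers.

```python
def is_inhibited(target, active_alerts, rule):
    """Return True if `target` is muted by an active source alert under `rule`."""

    def matches(alert, matchers):
        return all(alert.get(key) == value for key, value in matchers.items())

    if not matches(target, rule["target_match"]):
        return False
    for source in active_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # Every label listed in `equal` must have the same value on both alerts.
        if all(source.get(label) == target.get(label) for label in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}
critical = {"alertname": "DiskFull", "instance": "node1", "severity": "critical"}
warning = {"alertname": "DiskFull", "instance": "node1", "severity": "warning"}
print(is_inhibited(warning, [critical, warning], rule))
```

Note that because `equal` includes alertname, a critical NodeDown alert would not mute a warning HighCPU alert on the same instance; dropping alertname from the list would be one way to get that behavior.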
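Practical Example
Monitoring a Spring Boot Application
# application.yml, after adding the micrometer-registry-prometheus dependency
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  # Spring Boot 2.x property; Spring Boot 3.x uses management.prometheus.metrics.export.enabled
  metrics:
    export:
      prometheus:
        enabled: true
Prometheus Configuration
scrape_configs:
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['app:8080']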
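Exporters
Node Exporter
# Download and extract
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xzf node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
# Start (listens on port 9100 by default)
./node_exporter
MySQL Exporter
# Download and extract
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.0/mysqld_exporter-0.15.0.linux-amd64.tar.gz
tar xzf mysqld_exporter-0.15.0.linux-amd64.tar.gz
cd mysqld_exporter-0.15.0.linux-amd64
# Start, reading MySQL credentials from .my.cnf
./mysqld_exporter --config.my-cnf=.my.cnf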
Common Commands
# Validate the configuration file
promtool check config prometheus.yml
# Run rule unit tests
promtool test rules test.yml
# Run an instant query
curl 'http://localhost:9090/api/v1/query?query=up'
# List scrape targets
curl 'http://localhost:9090/api/v1/targets'
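The same instant query can be issued from a script. The Python sketch below uses only the standard library to build the query URL and to parse the JSON response into (labels, value) pairs. It assumes a server at `http://localhost:9090` and a query that returns an instant vector; the function names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


def parse_vector(body):
    """Turn an instant-vector JSON response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in payload["data"]["result"]]


def instant_query(base_url, promql):
    """Run the query against a live server (requires network access)."""
    with urlopen(query_url(base_url, promql), timeout=10) as response:
        return parse_vector(response.read())


print(query_url("http://localhost:9090", 'up{job="node"}'))
```

Calling `instant_query("http://localhost:9090", "up")` against a running server would return one pair per target, with a value of 1.0 for targets that were scraped successfully.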
Troubleshooting
Scrape Failures
# Check target status
curl http://localhost:9090/api/v1/targets
# Check that the target is reachable over the network
curl http://node1:9100/metrics
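The targets endpoint returns a fairly large JSON document, so it can help to filter it down to the targets that are failing. The Python sketch below extracts the job, scrape URL, and last error for every active target whose health is not "up". The function name and the sample response are made up for illustration; the field names follow the targets API response.

```python
import json


def unhealthy_targets(body):
    """List (job, scrape URL, last error) for targets whose health is not "up"."""
    payload = json.loads(body)
    return [
        (target["labels"].get("job"), target["scrapeUrl"], target.get("lastError", ""))
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]


# Hypothetical, trimmed-down response from /api/v1/targets.
sample = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "node1:9100"},
             "scrapeUrl": "http://node1:9100/metrics",
             "lastError": "", "health": "up"},
            {"labels": {"job": "node", "instance": "node2:9100"},
             "scrapeUrl": "http://node2:9100/metrics",
             "lastError": "connection refused", "health": "down"},
        ]
    },
})

for job, url, error in unhealthy_targets(sample):
    print(job, url, error)
```

The last error string usually points at the cause directly, such as a refused connection, a timeout, or a response that is not valid exposition format.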
Missing Metrics
# Validate the configuration file
promtool check config prometheus.yml
# List all metric names known to Prometheus
curl 'http://localhost:9090/api/v1/label/__name__/values'
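Best Practices
- Choose a scrape interval that matches how quickly each metric changes
- Keep alert rules actionable, with thresholds and durations that avoid noisy alerts
- Use labels to organize metrics
- Set a retention period (--storage.tsdb.retention.time) so that old data is removed
- Monitor Prometheus itself
Summary
Prometheus covers metric collection, querying, and alerting in one system. With scrape targets, PromQL queries, and alert rules configured sensibly, it forms a complete monitoring and alerting setup that helps keep systems running reliably.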