Prometheus Monitoring System

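What Is Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It collects metrics by scraping targets over HTTP (a pull model), stores them as time series in a multi-dimensional data model, and lets you query them with PromQL and define alerts on the results.

Core Concepts

Metric: a named numeric measurement, stored as a time series
Label: a key-value pair that adds a dimension to a metric
Target: an endpoint that Prometheus monitors
Scrape: the periodic fetch of metrics from a target
PromQL: the Prometheus query language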
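
To make the metric and label concepts concrete, the following is a minimal sketch that splits one sample line of the text exposition format into its metric name, labels, and value. The helper name parse_sample is hypothetical, and the parser is deliberately simplified: it ignores timestamps, comment lines, and escaped characters inside label values.

```python
import re

# One sample line looks like:  metric_name{label="value",...} sample_value
# Simplified on purpose: no timestamps, no escapes inside label values.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample('http_requests_total{method="GET", status="200"} 1234')
print(name, labels, value)
```

Each distinct combination of metric name and label values identifies its own time series, which is why labels are described as dimensions.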

Installing Prometheus

Binary Installation

# Download
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz

# Extract
tar xzf prometheus-2.47.0.linux-amd64.tar.gz
cd prometheus-2.47.0.linux-amd64

# Start
./prometheus --config.file=prometheus.yml

Docker Installation

docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus

Docker Compose

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

volumes:
  prometheus-data:
  grafana-data:

Configuration File

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']
    metrics_path: /metrics

  - job_name: 'app'
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
    metrics_path: /actuator/prometheus
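
As a rough illustration of how a scrape job turns into HTTP requests, the sketch below derives the scrape URL for each target and the job and instance labels that Prometheus attaches to every scraped series. The dictionaries are a hand-written mirror of two jobs from the file above, and scrape_plan is a hypothetical helper, not part of Prometheus. It assumes the defaults of scheme http and metrics path /metrics.

```python
# Hand-written mirror of the 'node' and 'app' jobs from prometheus.yml.
SCRAPE_CONFIGS = [
    {"job_name": "node", "targets": ["node1:9100", "node2:9100"]},
    {
        "job_name": "app",
        "targets": ["app1:8080", "app2:8080"],
        "metrics_path": "/actuator/prometheus",
    },
]


def scrape_plan(job):
    """Return (scrape URL, labels attached to the scraped series) for each target."""
    scheme = job.get("scheme", "http")            # default scheme
    path = job.get("metrics_path", "/metrics")    # default metrics path
    return [
        (f"{scheme}://{target}{path}", {"job": job["job_name"], "instance": target})
        for target in job["targets"]
    ]


for job in SCRAPE_CONFIGS:
    for url, labels in scrape_plan(job):
        print(url, labels)
```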

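Metric Types

Counter

# A cumulative value that only increases (or resets to zero on restart)
http_requests_total{method="GET", status="200"} 1234

Gauge

# A value that can go up or down
cpu_usage_percent{instance="node1"} 75.5

Histogram

# Distribution of request durations in cumulative buckets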
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 1500
http_request_duration_seconds_bucket{le="+Inf"} 2000
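
Because the buckets are cumulative, a quantile can be estimated from them by linear interpolation inside the bucket that contains the target rank. The sketch below applies that idea to the three sample buckets above. It is a simplified illustration of the approach behind histogram_quantile, not the real implementation, which works on bucket rates and handles more edge cases; the function name estimate_quantile is hypothetical.

```python
import math

# Cumulative buckets from the sample above: (upper bound "le", observations <= le).
BUCKETS = [(0.1, 1000), (0.5, 1500), (math.inf, 2000)]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets sorted by bound."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Rank falls into the +Inf bucket (or an empty one):
                # report the highest finite bound seen so far.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


for q in (0.5, 0.6, 0.99):
    print(q, estimate_quantile(q, BUCKETS))
```

With these buckets the 0.99 quantile lands in the +Inf bucket, so the estimate is capped at 0.5 seconds. This is why bucket bounds should cover the latencies you actually care about.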

Summary

# Similar to a histogram, but quantiles are computed on the client side
http_request_duration_seconds{quantile="0.5"} 0.1
http_request_duration_seconds{quantile="0.99"} 0.5

PromQL Queries

Basic Queries

# Select a metric
http_requests_total

# Filter by label
http_requests_total{method="GET"}

# Range vector: samples from the last 5 minutes
http_requests_total[5m]

# Aggregation
sum(http_requests_total)

Common Queries

# CPU usage (%), per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)

# Disk usage (%)
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

# HTTP request rate (requests per second)
rate(http_requests_total[5m])

# HTTP error ratio (share of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
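
Several of the queries above rely on rate(), so it may help to see roughly what it computes. The following sketch calculates the per-second increase of a counter from raw samples and treats any decrease as a counter reset. It is a simplification: the real rate() also extrapolates towards the window boundaries. The name simple_rate and the sample values are illustrative.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    Simplification: divides by the time between the first and last sample.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (for example a process restart): count from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    return increase / elapsed


# Illustrative samples 15 s apart; the counter resets between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))
```

The reset handling is the reason rate() should be applied to counters only, and before any aggregation such as sum().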

Alerting Rules

rules/alerts.yml

groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes."
      
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes."
      
      - alert: HighMemory
        expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% for more than 5 minutes."

  - name: app-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for more than 5 minutes."
      
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "99th percentile latency is above 1 second."
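
The for field in each rule delays firing until the expression has been true continuously for that long. A minimal sketch of the resulting state transitions is shown below, assuming the 15 second evaluation interval from prometheus.yml and a 5 minute hold. The function alert_states is hypothetical and ignores details such as keep_firing_for.

```python
def alert_states(condition_results, eval_interval_s=15, for_s=300):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation (True = expr returned a result).
    """
    states = []
    active_since = None
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            # A single false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


states = alert_states([True] * 21)
print(states[0], states[19], states[20])
```

With these settings the alert stays pending for the first 20 evaluations and fires on the 21st, which is 300 seconds after the condition first became true.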

Alertmanager Configuration

alertmanager.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://webhook:5001/'
  
  - name: 'email'
    email_configs:
      - to: 'admin@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'
  
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'

inhibit_rules:
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: ['alertname', 'instance']
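
The inhibit rule above mutes a warning alert while a critical alert with the same alertname and instance is firing. The sketch below models that check for this one rule only. The function is_inhibited and the DiskFull alerts are hypothetical examples, and a missing label is treated the same as an empty one.

```python
def is_inhibited(target, firing_alerts, equal=("alertname", "instance")):
    """True if a firing critical alert with the same `equal` labels mutes this warning."""
    if target.get("severity") != "warning":
        return False
    for source in firing_alerts:
        if source.get("severity") != "critical":
            continue
        # A missing label and an empty label count as equal.
        if all(source.get(label, "") == target.get(label, "") for label in equal):
            return True
    return False


# Hypothetical alerts: the same alertname raised at two severities.
firing = [{"alertname": "DiskFull", "instance": "node1:9100", "severity": "critical"}]
warning = {"alertname": "DiskFull", "instance": "node1:9100", "severity": "warning"}
print(is_inhibited(warning, firing))
```

Because alertname is part of the equal list, this rule only helps when the same alert is defined at both severities. A critical NodeDown would not mute a warning HighCPU on the same instance.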

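Practical Examples

Monitoring a Spring Boot Application

# application.yml, after adding the Micrometer Prometheus registry dependency (micrometer-registry-prometheus)
# Property names below are for Spring Boot 2.x; in 3.x the export switch is management.prometheus.metrics.export.enabled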
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    export:
      prometheus:
        enabled: true

Prometheus Configuration

scrape_configs:
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['app:8080']

Exporter

Node Exporter

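# Download and extract
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xzf node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64

# Start (listens on port 9100 by default)
./node_exporter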

MySQL Exporter

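# Download and extract
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.0/mysqld_exporter-0.15.0.linux-amd64.tar.gz
tar xzf mysqld_exporter-0.15.0.linux-amd64.tar.gz
cd mysqld_exporter-0.15.0.linux-amd64

# Start (reads MySQL credentials from .my.cnf)
./mysqld_exporter --config.my-cnf=.my.cnf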

Common Commands

# Validate the configuration
promtool check config prometheus.yml

# Run rule unit tests
promtool test rules test.yml

# Run an instant query
curl 'http://localhost:9090/api/v1/query?query=up'

# List targets
curl 'http://localhost:9090/api/v1/targets'
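
The same query endpoint can be called from a script. The sketch below uses only the Python standard library to build the request URL and to parse an instant-vector response. The helper names are hypothetical, and SAMPLE is a hand-written payload shaped like a response to query=up, with illustrative values rather than output from a real server.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, promql):
    """Build the instant-query URL, URL-encoding the PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(body):
    """Turn an instant-vector JSON response into a list of (labels, float value)."""
    payload = json.loads(body)
    if payload["status"] != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return [(item["metric"], float(item["value"][1])) for item in payload["data"]["result"]]


def instant_query(base_url, promql):
    """Run the query against a live server (defined here, not called)."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=5) as response:
        return parse_vector(response.read())


# Hand-written sample shaped like a response to query=up (illustrative values).
SAMPLE = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{"__name__":"up","instance":"localhost:9090","job":"prometheus"},'
    '"value":[1700000000.0,"1"]}]}}'
)
print(parse_vector(SAMPLE))
```

Against a running server, instant_query("http://localhost:9090", "up") would return one entry per scraped target. Note that sample values arrive as strings in the JSON and need converting to numbers.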

Troubleshooting

Scrape Failures

# Check target status
curl http://localhost:9090/api/v1/targets

# Check network connectivity to the target
curl http://node1:9100/metrics
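
The targets response is verbose, so it can be easier to filter it down to the targets that are not healthy. The following sketch assumes the JSON returned by /api/v1/targets has been saved or fetched as a string. The function unhealthy_targets is hypothetical, and SAMPLE is a hand-written, trimmed payload with an illustrative error message.

```python
import json


def unhealthy_targets(body):
    """Return (job, scrape URL, last error) for every active target that is not up."""
    payload = json.loads(body)
    return [
        (target["labels"].get("job"), target["scrapeUrl"], target["lastError"])
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]


# Hand-written, trimmed sample of a targets response (illustrative values).
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "node1:9100"},
             "scrapeUrl": "http://node1:9100/metrics",
             "lastError": "", "health": "up"},
            {"labels": {"job": "node", "instance": "node2:9100"},
             "scrapeUrl": "http://node2:9100/metrics",
             "lastError": "connection refused", "health": "down"},
        ],
        "droppedTargets": [],
    },
})

for job, url, error in unhealthy_targets(SAMPLE):
    print(job, url, error)
```

The lastError field usually points at the cause, such as a refused connection, a timeout, or an unexpected response from the metrics path.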

Missing Metrics

# Validate the configuration
promtool check config prometheus.yml

# List all metric names known to the server
curl 'http://localhost:9090/api/v1/label/__name__/values'

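Best Practices

Choose a scrape interval that matches how quickly the metrics change
Write alerting rules with sensible thresholds and hold durations
Use labels to organize metrics
Set a retention period (for example with --storage.tsdb.retention.time) so old data is removed
Monitor Prometheus itself

Summary

Prometheus is a powerful monitoring system. With well-configured scraping, querying, and alerting, it provides a complete monitoring and alerting setup that helps keep systems running reliably.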