Alertmanager: Alert Management for Prometheus

What Is Alertmanager

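Alertmanager is the alert-handling component of the Prometheus ecosystem. It receives the alerts that Prometheus sends, then deduplicates, groups, and routes them before delivering notifications. It ships with integrations for channels such as email, Slack, and WeChat Work (WeCom); other channels, such as DingTalk, are typically reached through a webhook receiver.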

Installing Alertmanager

Docker Installation

docker run -d \
  --name=alertmanager \
  -p 9093:9093 \
  -v "$(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
  prom/alertmanager:latest

Binary Installation

# Download Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64

# Create a dedicated user and directories (run as root)
useradd --no-create-home --shell /bin/false alertmanager
mkdir -p /etc/alertmanager /var/lib/alertmanager
cp alertmanager amtool /usr/local/bin/
cp alertmanager.yml /etc/alertmanager/
chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

Alert Configuration

Base Configuration File

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'
  smtp_require_tls: true

# Notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Routing configuration
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  
  routes:
    - match:
        severity: critical
      receiver: 'critical-receiver'
      group_wait: 30s
      
    - match:
        severity: warning
      receiver: 'warning-receiver'

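# Inhibition rules: mute warning alerts while a critical alert with the
# same alertname and instance is firing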
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
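
To make the effect of `group_by` concrete, the following Python sketch buckets alerts the way the route above would: alerts that share the same `alertname`, `cluster`, and `service` values land in one group and are delivered as a single notification. It is a simplified illustration, not Alertmanager's implementation; the function name `group_alerts` and the sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels.

    A missing label is treated as an empty string, so alerts that lack
    a label are grouped together on that label (a simplifying assumption).
    """
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical sample alerts, each represented by its label set.
alerts = [
    {"alertname": "HighCPUUsage", "cluster": "prod", "service": "api", "instance": "node1"},
    {"alertname": "HighCPUUsage", "cluster": "prod", "service": "api", "instance": "node2"},
    {"alertname": "NodeDown", "cluster": "prod", "service": "api", "instance": "node3"},
]

groups = group_alerts(alerts, ["alertname", "cluster", "service"])
for key, members in groups.items():
    print(key, "->", [a["instance"] for a in members])
```

Here the two HighCPUUsage alerts share a group, so they would arrive in one notification. `group_wait` then controls how long Alertmanager waits before sending the first notification for a new group, and `group_interval` how long it waits before sending updates about that group.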

Receiver Configuration

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true
        
  - name: 'critical-receiver'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/ops/send'
        send_resolved: true
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
        
  - name: 'warning-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

Prometheus Alerting Rules

# prometheus-rules.yml
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been down for more than 1 minute"
          
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU usage is above 80%, current value {{ $value }}%"
          
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Instance {{ $labels.instance }} root filesystem usage is above 80%"
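
The arithmetic behind the HighCPUUsage expression can be checked by hand. The Python sketch below approximates `100 - avg(rate(idle)) * 100` for one instance from two readings of each core's idle-seconds counter. It is only an approximation under stated assumptions: the real `rate()` function also handles counter resets and extrapolates to the window boundaries, which this sketch ignores. The function name `cpu_usage_percent` and the sample readings are hypothetical.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate 100 - avg(rate(idle)) * 100 for a single instance.

    idle_start and idle_end map each CPU core to its idle-seconds
    counter at the start and end of the window.
    """
    idle_rates = [
        (idle_end[cpu] - idle_start[cpu]) / window_seconds
        for cpu in idle_start
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical counter readings taken 300 seconds (5m) apart.
idle_start = {"0": 1000.0, "1": 2000.0}
idle_end = {"0": 1015.0, "1": 2045.0}

usage = cpu_usage_percent(idle_start, idle_end, 300)
print(f"CPU usage: {usage:.1f}%  fires: {usage > 80}")
```

With these readings the cores were idle for 15 and 45 of the 300 seconds, so the average idle fraction is 0.1 and the computed usage is about 90%, which is above the 80 threshold in the rule.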

Alert Routing Strategies

Advanced Routing Configuration

route:
  receiver: 'default'
  routes:
    # Send critical alerts to PagerDuty, then keep evaluating sibling routes
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
      
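    # Send critical and warning alerts to Slack as well; continue is needed
    # here so that the database route below is still evaluated
    - match_re:
        severity: (critical|warning)
      receiver: 'slack-general'
      continue: true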
      
    # Route alerts for a specific service to the team that owns it
    - match:
        service: database
      receiver: 'dba-team'
      group_by: ['alertname', 'instance']
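
Route order and the `continue` flag decide which receivers an alert reaches. The Python sketch below models only top-level routes: they are checked in order, and evaluation stops at the first match unless that route sets `continue`. It is a simplified model, not Alertmanager's code, and it ignores nested child routes. The sample route list is hypothetical: it resembles the configuration above but deliberately leaves `continue` off the Slack route to show what happens to routes placed after it.

```python
import re


def route_alert(labels, routes, default_receiver):
    """Return the receivers an alert reaches among top-level routes.

    Routes are checked in order. Evaluation stops at the first matching
    route unless that route sets continue=True. If nothing matches, the
    alert falls back to the default receiver.
    """
    receivers = []
    for route in routes:
        exact_ok = all(
            labels.get(name, "") == value
            for name, value in route.get("match", {}).items()
        )
        # Regex matchers are anchored, so the whole value must match.
        regex_ok = all(
            re.fullmatch(pattern, labels.get(name, "")) is not None
            for name, pattern in route.get("match_re", {}).items()
        )
        if exact_ok and regex_ok:
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break
    return receivers or [default_receiver]


# Hypothetical route list; the Slack route has no continue flag.
routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
    {"match_re": {"severity": "(critical|warning)"}, "receiver": "slack-general"},
    {"match": {"service": "database"}, "receiver": "dba-team"},
]

print(route_alert({"severity": "critical"}, routes, "default"))
print(route_alert({"severity": "warning", "service": "database"}, routes, "default"))
print(route_alert({"severity": "info", "service": "database"}, routes, "default"))
```

In this model a warning alert from the database service stops at the Slack route and never reaches `dba-team`. Setting `continue: true` on the Slack route, or moving the more specific database route above it, lets both receivers see the alert.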

Time-Based Suppression

route:
  receiver: 'default'
  routes:
    - match:
        severity: warning
      receiver: 'warning-offhours'
      
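# Suppress warning alerts on an instance while a MaintenanceWindow alert is
# firing for that same instance. The MaintenanceWindow alert must be raised
# separately, for example by a Prometheus rule that is only true off-hours.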
inhibit_rules:
  - source_match:
      alertname: 'MaintenanceWindow'
    target_match:
      severity: 'warning'
    equal: ['instance']
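
The inhibition logic can be sketched in a few lines of Python. A target alert is muted when it matches `target_match`, some firing alert matches `source_match`, and both carry the same values for every label listed in `equal`. This is a simplified model rather than Alertmanager's implementation; the self-inhibition guard follows the behaviour described in the Alertmanager documentation as I understand it, and the function name `is_inhibited` and the sample alerts are hypothetical.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` is muted by `rule` given the firing alerts."""

    def matches(labels, wanted):
        return all(labels.get(name, "") == value for name, value in wanted.items())

    if not matches(target, rule["target_match"]):
        return False

    # An alert matching both sides of the rule cannot be inhibited by
    # alerts that also match both sides (this includes itself).
    target_is_source = matches(target, rule["source_match"])

    for source in firing_alerts:
        if not matches(source, rule["source_match"]):
            continue
        if target_is_source and matches(source, rule["target_match"]):
            continue
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"alertname": "MaintenanceWindow"},
    "target_match": {"severity": "warning"},
    "equal": ["instance"],
}

# Hypothetical set of currently firing alerts.
firing = [
    {"alertname": "MaintenanceWindow", "instance": "node1", "severity": "info"},
    {"alertname": "HighCPUUsage", "instance": "node1", "severity": "warning"},
    {"alertname": "HighCPUUsage", "instance": "node2", "severity": "warning"},
    {"alertname": "NodeDown", "instance": "node1", "severity": "critical"},
]

for alert in firing:
    print(alert["alertname"], alert["instance"], is_inhibited(alert, firing, rule))
```

Only the warning on `node1` is muted here: the warning on `node2` has a different `instance` value, and the critical alert does not match `target_match`.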

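Management Tools

# Manage Alertmanager with amtool
# (amtool locates the server via --alertmanager.url or its config file)

# List current alerts
amtool alert

# Show the routing tree
amtool config routes

# Add a silence (mute notifications for matching alerts for 2 hours)
amtool silence add alertname=NodeDown instance=node1 --duration=2h --comment="under maintenance"

# List silences
amtool silence query

# Expire a silence
amtool silence expire <silence-id>

# Validate a configuration file
amtool check-config alertmanager.yml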

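Monitoring Alertmanager

# Alertmanager exposes its own metrics
curl http://localhost:9093/metrics

# Key metrics
# alertmanager_notifications_total - total number of attempted notifications
# alertmanager_notifications_failed_total - number of notifications that failed to send
# alertmanager_alerts_received_total - total number of alerts received
# alertmanager_alerts{state="active"} - number of alerts currently active

# Configure Prometheus to scrape Alertmanager (in prometheus.yml)
scrape_configs:
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']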
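
As a quick health check, the notification failure ratio can be derived from the two notification counters. The Python sketch below sums a metric across its label sets in Prometheus text-format output and divides failures by attempts. It assumes samples carry no trailing timestamp, and the sample text is invented for illustration; the function names `sum_metric` and `failure_ratio` are hypothetical.

```python
def sum_metric(exposition_text, metric_name):
    """Sum every sample of `metric_name` across its label sets."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
    return total


def failure_ratio(exposition_text):
    """Return failed / attempted notifications, or 0.0 if none were attempted."""
    attempted = sum_metric(exposition_text, "alertmanager_notifications_total")
    failed = sum_metric(exposition_text, "alertmanager_notifications_failed_total")
    return failed / attempted if attempted else 0.0


# Invented sample in the Prometheus text exposition format.
SAMPLE = """\
# TYPE alertmanager_notifications_total counter
alertmanager_notifications_total{integration="email"} 40
alertmanager_notifications_total{integration="slack"} 60
# TYPE alertmanager_notifications_failed_total counter
alertmanager_notifications_failed_total{integration="email"} 0
alertmanager_notifications_failed_total{integration="slack"} 5
"""

print(f"failure ratio: {failure_ratio(SAMPLE):.2%}")
```

Against a live instance, the text would come from the `/metrics` endpoint shown above. In practice the same ratio is usually computed in Prometheus itself with `rate()` over both counters, which also accounts for counter resets.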

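High-Availability Deployment

# Build an Alertmanager cluster; peers reach each other by DNS name
# (here, the Compose service names) on the cluster port 9094
# Prometheus should be pointed at every instance, not at a load balancer

# docker-compose.yml
services:
  alertmanager1:
    image: prom/alertmanager
    command:
      # Setting command replaces the image defaults, so pass the config path explicitly
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.peer=alertmanager2:9094'
      - '--cluster.peer=alertmanager3:9094'
      
  alertmanager2:
    image: prom/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.peer=alertmanager1:9094'
      - '--cluster.peer=alertmanager3:9094'
      
  alertmanager3:
    image: prom/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.peer=alertmanager1:9094'
      - '--cluster.peer=alertmanager2:9094'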

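Troubleshooting

# View Alertmanager logs
docker logs alertmanager

# List active alerts (neither silenced nor inhibited)
amtool alert query --active

# Check which receiver an alert with the given labels would be routed to
amtool config routes test alertname=NodeDown

# Show the configuration currently loaded, including receivers
amtool config show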
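
When notifications do not arrive, it can help to push a synthetic alert straight into Alertmanager and watch it travel through routing and the receivers. The Python sketch below builds a payload for the `POST /api/v2/alerts` endpoint and sends it with the standard library. The base URL, the alert name, and the helper names `build_test_alert` and `send_alerts` are assumptions made for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname, labels=None, minutes=5):
    """Build the JSON body for POST /api/v2/alerts: a list of alert objects."""
    now = datetime.now(timezone.utc)
    alert = {
        "labels": {"alertname": alertname, **(labels or {})},
        "annotations": {"summary": "Synthetic test alert"},
        "startsAt": now.isoformat(),
        # The alert resolves by itself once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }
    return [alert]


def send_alerts(alerts, base_url="http://localhost:9093"):
    """POST alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Example (requires a reachable Alertmanager at base_url):
# send_alerts(build_test_alert("PipelineTest", {"severity": "warning"}))
```

Choosing labels that match a specific route, such as `severity: warning`, makes it possible to test one receiver at a time without waiting for a real rule to fire.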