Alertmanager: Alert Management for Prometheus
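What Is Alertmanager
Alertmanager is the alert-handling component of the Prometheus ecosystem. It receives the alerts that Prometheus fires, then deduplicates, groups, and routes them before sending notifications. It supports many notification channels: email, Slack, and WeCom have built-in integrations, and others such as DingTalk can be reached through the generic webhook receiver.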
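Installing Alertmanager
Docker Installation
# Bind-mount the config with an absolute path; a bare file name would be treated as a named volume
docker run -d \
  --name=alertmanager \
  -p 9093:9093 \
  -v "$(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
  prom/alertmanager:latest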
Binary Installation
# Download and unpack Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64
# Create the service user and directories (run as root)
useradd --no-create-home --shell /bin/false alertmanager
mkdir -p /etc/alertmanager /var/lib/alertmanager
cp alertmanager amtool /usr/local/bin/
cp alertmanager.yml /etc/alertmanager/
chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
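The binary install above stops before the service is started. As a minimal sketch, assuming a systemd-based host, the script below writes a unit file that runs the binary as the `alertmanager` user with the directories created earlier. The unit contents, the `Restart` policy, and the `UNIT_FILE` variable are assumptions for illustration, not part of the original steps.

```shell
#!/bin/sh
# Sketch: generate a systemd unit for the binary install.
# UNIT_FILE is a hypothetical variable; by default the unit is written to the
# current directory so it can be reviewed before being installed.
UNIT_FILE="${UNIT_FILE:-./alertmanager.service}"

cat > "$UNIT_FILE" <<'EOF'
[Unit]
Description=Prometheus Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

echo "Wrote $UNIT_FILE"
```

To use it, copy the file to /etc/systemd/system/ as root, run systemctl daemon-reload, and then systemctl enable --now alertmanager.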
Alert Configuration
Base Configuration File
# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'
  smtp_require_tls: true
# Notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'
# Routing
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    # match still works in v0.26 but is deprecated in favor of matchers
    - match:
        severity: critical
      receiver: 'critical-receiver'
      group_wait: 30s
    - match:
        severity: warning
      receiver: 'warning-receiver'
# Inhibition rules: a firing critical alert mutes the warning alert
# that has the same alertname and instance
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
Receiver Configuration
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true
  - name: 'critical-receiver'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/ops/send'
        send_resolved: true
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'warning-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
Prometheus Alerting Rules
# prometheus-rules.yml
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been down for more than 1 minute"
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage on instance {{ $labels.instance }} is above 80% (current value: {{ printf \"%.1f\" $value }}%)"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Disk usage on / of instance {{ $labels.instance }} is above 80%"
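The rules only take effect once Prometheus loads the rule file and knows where Alertmanager is. The sketch below writes the two prometheus.yml sections that do this. The rule file path, the `alertmanager:9093` target, and the `SNIPPET` variable are assumptions; adjust them to the actual deployment.

```shell
#!/bin/sh
# Sketch: the prometheus.yml sections that load the rule file and point
# Prometheus at Alertmanager. SNIPPET is a hypothetical output path.
SNIPPET="${SNIPPET:-./prometheus-alerting-snippet.yml}"

cat > "$SNIPPET" <<'EOF'
# Load the alerting rules (path is an assumption)
rule_files:
  - /etc/prometheus/prometheus-rules.yml

# Send fired alerts to Alertmanager (target is an assumption)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
EOF

cat "$SNIPPET"
```

Merge the two sections into prometheus.yml. Before reloading Prometheus, the rule file can be validated with promtool check rules prometheus-rules.yml.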
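Alert Routing Strategies
Advanced Routing Configuration
route:
  receiver: 'default'
  routes:
    # Critical alerts go to PagerDuty; continue lets them also match the routes below
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    # All critical and warning alerts go to Slack
    - match_re:
        severity: (critical|warning)
      receiver: 'slack-general'
      # Keep evaluating; otherwise database alerts with these severities would stop here
      continue: true
    # Service-specific routing
    - match:
        service: database
      receiver: 'dba-team'
      group_by: ['alertname', 'instance']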
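Time-Based Suppression
route:
  receiver: 'default'
  routes:
    - match:
        severity: warning
      receiver: 'warning-offhours'
# Suppress warning alerts while a MaintenanceWindow alert is firing on the same instance.
# Alertmanager does not create that alert; a Prometheus rule must fire it during
# the window (for example, outside working hours).
inhibit_rules:
  - source_match:
      alertname: 'MaintenanceWindow'
    target_match:
      severity: 'warning'
    equal: ['instance']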
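The inhibition approach above depends on a separate MaintenanceWindow alert. A possible alternative, assuming Alertmanager v0.24 or later, is to mute by schedule with `time_intervals` and a route-level `mute_time_intervals`. The sketch below writes a small standalone config that mutes warning notifications on weekends and outside 09:00 to 18:00 on weekdays. The interval name `offhours`, the hours, the empty receivers, and the `OFFHOURS_CONFIG` variable are illustrative assumptions. Times are interpreted as UTC unless a location is configured.

```shell
#!/bin/sh
# Sketch: schedule-based muting with time_intervals (assumes Alertmanager >= 0.24).
# OFFHOURS_CONFIG is a hypothetical output path.
OFFHOURS_CONFIG="${OFFHOURS_CONFIG:-./alertmanager-offhours.yml}"

cat > "$OFFHOURS_CONFIG" <<'EOF'
route:
  receiver: default
  routes:
    - matchers:
        - severity="warning"
      receiver: warning-offhours
      # No notifications are sent for this route during the named intervals
      mute_time_intervals:
        - offhours

# Receivers are left empty here; fill in real integrations
receivers:
  - name: default
  - name: warning-offhours

time_intervals:
  - name: offhours
    time_intervals:
      # Whole weekend
      - weekdays: ['saturday', 'sunday']
      # Weekday nights (UTC by default)
      - weekdays: ['monday:friday']
        times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '18:00'
            end_time: '24:00'
EOF

echo "Wrote $OFFHOURS_CONFIG"
```

The generated file can be validated with amtool check-config before its sections are merged into the main configuration.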
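Management Tools
# Manage Alertmanager with amtool
# (pass --alertmanager.url=http://localhost:9093 unless the URL is set in amtool's config file)
# List current alerts
amtool alert
# Show the routing tree
amtool config routes
# Add a silence (mute notifications for matching alerts for 2 hours)
amtool silence add alertname=NodeDown instance=node1 --duration=2h --comment="Under maintenance"
# List silences
amtool silence query
# Expire a silence
amtool silence expire <silence-id>
# Validate the configuration file
amtool check-config alertmanager.yml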
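The amtool commands above need to know which Alertmanager to talk to. One way to avoid repeating the URL flag is an amtool config file. The sketch below writes one to the current directory. The `author` value and the `AMTOOL_CONFIG` variable are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch: an amtool config file so the Alertmanager URL need not be passed
# on every command. AMTOOL_CONFIG is a hypothetical output path.
AMTOOL_CONFIG="${AMTOOL_CONFIG:-./amtool-config.yml}"

cat > "$AMTOOL_CONFIG" <<'EOF'
# Alertmanager instance that amtool talks to
alertmanager.url: "http://localhost:9093"
# Default author recorded on new silences (placeholder value)
author: ops@example.com
# Refuse to create a silence without a comment
comment_required: true
EOF

echo "Wrote $AMTOOL_CONFIG"
```

amtool reads this file from $HOME/.config/amtool/config.yml or /etc/amtool/config.yml, so move it to one of those locations after review.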
Monitoring Alertmanager
# Alertmanager metrics
curl http://localhost:9093/metrics
# Key metrics
# alertmanager_notifications_total - total notification attempts
# alertmanager_notifications_failed_total - notifications that failed to send
# alertmanager_alerts_received_total - total alerts received
# alertmanager_alerts{state="active"} - alerts currently active
# Scrape Alertmanager from Prometheus (add to prometheus.yml)
scrape_configs:
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
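Once the metrics are scraped, they can drive an alert of their own. As a hedged sketch, the script below writes a Prometheus rule that fires when notifications keep failing. The alert name, the 10-minute `for` duration, the severity, and the `META_RULES` variable are assumptions, not recommendations from the original text.

```shell
#!/bin/sh
# Sketch: a Prometheus rule that watches Alertmanager's own failure counter.
# META_RULES is a hypothetical output path.
META_RULES="${META_RULES:-./alertmanager-meta-rules.yml}"

cat > "$META_RULES" <<'EOF'
groups:
  - name: alertmanager-meta
    rules:
      - alert: AlertmanagerNotificationsFailing
        # Any sustained failure rate on a notification integration
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager {{ $labels.instance }} is failing to send {{ $labels.integration }} notifications"
EOF

echo "Wrote $META_RULES"
```

Because this alert travels through the same Alertmanager it monitors, it is worth routing it to a receiver that uses a different integration from the one that is failing.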
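High-Availability Deployment
# Alertmanager instances form a cluster by gossiping with the peers given in --cluster.peer
# Peers are addressed by Compose service name, which Docker's DNS resolves
# docker-compose.yml
services:
  alertmanager1:
    image: prom/alertmanager
    command:
      # Overriding command drops the image's default flags, so pass the config file explicitly
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.peer=alertmanager2:9094'
      - '--cluster.peer=alertmanager3:9094'
  alertmanager2:
    image: prom/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.peer=alertmanager1:9094'
      - '--cluster.peer=alertmanager3:9094'
  alertmanager3:
    image: prom/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--cluster.peer=alertmanager1:9094'
      - '--cluster.peer=alertmanager2:9094'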
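A cluster only deduplicates notifications correctly if every instance receives every alert. Prometheus should therefore list all Alertmanager instances directly rather than go through a load balancer. The sketch below writes the matching `alerting` section, assuming the three Compose service names above and the default port 9093. The `HA_SNIPPET` variable is hypothetical.

```shell
#!/bin/sh
# Sketch: point Prometheus at every Alertmanager instance in the cluster.
# HA_SNIPPET is a hypothetical output path.
HA_SNIPPET="${HA_SNIPPET:-./prometheus-ha-alerting.yml}"

cat > "$HA_SNIPPET" <<'EOF'
alerting:
  alertmanagers:
    - static_configs:
        # List each instance; do not put a load balancer in front of them
        - targets:
            - alertmanager1:9093
            - alertmanager2:9093
            - alertmanager3:9093
EOF

cat "$HA_SNIPPET"
```

Each instance then receives the same alerts, and the gossip between peers keeps them from sending duplicate notifications.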
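Troubleshooting
# View Alertmanager logs
docker logs alertmanager
# List active alerts (silenced and inhibited alerts are hidden by default)
amtool alert query
# Test which receiver an alert with the given labels is routed to
amtool config routes test alertname=NodeDown
# Show the configuration Alertmanager has loaded, including receivers
amtool config show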
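When it is unclear whether a receiver works at all, pushing a hand-made alert through the pipeline can help. The sketch below builds a payload for Alertmanager's POST /api/v2/alerts endpoint and defines a helper that sends it with curl. The function names `build_test_alert` and `send_test_alert`, the label values, and the annotation text are hypothetical. The script only prints the payload and sends nothing until the helper is called.

```shell
#!/bin/sh
# Sketch: build and optionally send a manual test alert.
# build_test_alert and send_test_alert are hypothetical helper names.

build_test_alert() {
  # $1 = alertname, $2 = severity, $3 = instance
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"description":"manual test alert"}}]' "$1" "$2" "$3"
}

send_test_alert() {
  # $1 = Alertmanager base URL; remaining arguments go to build_test_alert
  url="$1"
  shift
  build_test_alert "$@" |
    curl -sS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "$url/api/v2/alerts"
}

# Print the payload for inspection; nothing is sent here
build_test_alert TestAlert warning node1
echo
```

To send it, call send_test_alert http://localhost:9093 TestAlert warning node1, then check the Alertmanager logs and the target receiver. Because no end time is set, the test alert resolves on its own after resolve_timeout.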