Alerting Architecture: Tiered Inhibition and On-Call Design
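Alerting Architecture Overview
The alerting system is the eyes of operations: it notifies the right people promptly when the system misbehaves. A good alerting architecture reduces noise, fires precisely, and responds according to severity.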
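Alerting flow:
Monitoring data → Rule evaluation → Alert firing → Routing and grouping → Notification delivery → Human response → Incident handling
- Monitoring data: Prometheus
- Rule evaluation: alerting rules, evaluation engine
- Alert firing: threshold checks, deduplication
- Routing and grouping: inhibition/grouping, aggregation
- Notification delivery: multi-channel notification
- Human response: On-Call rotation
- Incident handling: incident management, escalation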
Alert Severity Tiers
Severity Level Definitions
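severity_levels:
  P1_Critical:
    description: "Core business completely unavailable"
    examples:
      - Payment system outage
      - Database primary node failure
      - Security vulnerability being exploited
    response_time: "5 minutes"
    escalation: "Escalate to management immediately"
    channels: ["phone", "SMS", "instant messaging"]
  P2_High:
    description: "Core functionality severely degraded"
    examples:
      - API error rate > 10%
      - Response latency > 5 seconds
      - Disk usage > 90%
    response_time: "15 minutes"
    escalation: "Escalate if no response within 30 minutes"
    channels: ["instant messaging", "email"]
  P3_Medium:
    description: "Non-core functionality impaired"
    examples:
      - Background job failures
      - Non-critical service degradation
      - Slight performance drop
    response_time: "1 hour"
    escalation: "Escalate if no response within 4 hours"
    channels: ["instant messaging"]
  P4_Low:
    description: "Informational alerts"
    examples:
      - SSL certificate about to expire
      - Configuration change notification
      - Capacity warning
    response_time: "24 hours"
    escalation: "None"
    channels: ["email", "ticket"]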
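The severity table above is effectively a lookup from level to response target and channels. A minimal sketch of that lookup follows; `SeverityPolicy`, `SEVERITY_POLICY`, and `channels_for` are hypothetical names, the values are copied from the table, and rejecting unknown levels is an assumption rather than something the table specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    response_minutes: int  # target time to first response
    channels: tuple        # notification channels for this level


# Values mirror the severity table; the data structure itself is illustrative.
SEVERITY_POLICY = {
    "P1": SeverityPolicy(5, ("phone", "SMS", "instant messaging")),
    "P2": SeverityPolicy(15, ("instant messaging", "email")),
    "P3": SeverityPolicy(60, ("instant messaging",)),
    "P4": SeverityPolicy(24 * 60, ("email", "ticket")),
}


def channels_for(severity: str) -> tuple:
    """Return the notification channels for a severity level."""
    try:
        return SEVERITY_POLICY[severity].channels
    except KeyError:
        # Assumption: an unknown level is a configuration bug, so fail loudly.
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Failing on unknown levels, rather than silently defaulting to a quiet channel, keeps a mislabeled critical alert from being routed to email only.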
Alert Inhibition Strategy
Prometheus Inhibition Rules
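# Inhibition configuration in alertmanager.yml
# Label values in inhibit rules are not templated; the `equal` list is what ties
# a target alert to the source alert's service, cluster, or instance.
# Newer Alertmanager releases prefer source_matchers/target_matchers over these keys.
inhibit_rules:
  # When a service is completely down, inhibit that service's other alerts
  - source_match:
      alertname: ServiceDown
      severity: critical
    target_match_re:
      alertname: '.+'
    equal: ['service']
  # When the database primary is down, inhibit replica-related alerts in the same cluster
  - source_match:
      alertname: DatabasePrimaryDown
    target_match_re:
      alertname: 'Database.*'
    equal: ['cluster']
  # When a host is down, inhibit all Pod alerts that carry the same instance label
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: '.+'
    equal: ['instance']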
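To make the inhibition semantics concrete, here is a small Python model of how a rule is evaluated: the target must match the target side, some firing alert must match the source side, and both must agree on every label in `equal`. This is a simplified sketch of Alertmanager's documented behavior, not its implementation; the function names and the example alerts are hypothetical.

```python
import re
from typing import Optional


def matches(labels: dict, exact: Optional[dict], regex: Optional[dict]) -> bool:
    """True if labels satisfy every exact matcher and every fully anchored regex matcher."""
    exact_ok = all(labels.get(k, "") == v for k, v in (exact or {}).items())
    regex_ok = all(re.fullmatch(p, labels.get(k, "")) for k, p in (regex or {}).items())
    return exact_ok and regex_ok


def is_source(labels: dict, rule: dict) -> bool:
    return matches(labels, rule.get("source_match"), rule.get("source_match_re"))


def is_target(labels: dict, rule: dict) -> bool:
    return matches(labels, rule.get("target_match"), rule.get("target_match_re"))


def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    """Check whether `target` is muted by any firing source alert under `rule`."""
    if not is_target(target, rule):
        return False
    for source in firing:
        if not is_source(source, rule):
            continue
        # Self-inhibition guard: an alert matching both sides is not muted
        # by another alert that also matches both sides (including itself).
        if is_source(target, rule) and is_target(source, rule):
            continue
        # Missing labels are treated as empty strings, so two absent labels are equal.
        if all(source.get(l, "") == target.get(l, "") for l in rule.get("equal", [])):
            return True
    return False


rule = {
    "source_match": {"alertname": "NodeDown"},
    "target_match_re": {"alertname": ".+"},
    "equal": ["instance"],
}
firing = [{"alertname": "NodeDown", "instance": "node-1"}]
pod_alert = {"alertname": "PodCrashLooping", "instance": "node-1"}
print(is_inhibited(pod_alert, firing, rule))
```

The guard in the loop is why a broad target matcher such as `'.+'` does not cause `NodeDown` to silence itself.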
Silence Rules
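# Maintenance-window silence
# Fields follow the Alertmanager v2 silences API payload, shown here as YAML.
# Note that alertname ".*" silences every alert for the duration of the window.
matchers:
  - name: alertname
    value: ".*"
    isRegex: true
startsAt: "2024-01-20T02:00:00Z"
endsAt: "2024-01-20T06:00:00Z"
createdBy: "maintenance-bot"
comment: "Planned maintenance window"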
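A silence is just a set of matchers plus a time window, which the following sketch evaluates against an alert's labels. It uses the field names of Alertmanager's v2 silences API (`startsAt`, `endsAt`, `matchers` with `name`, `value`, `isRegex`); the helper names `parse_utc` and `is_silenced` are hypothetical, and the model ignores details such as negative matchers.

```python
import re
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def is_silenced(labels: dict, silence: dict, now: datetime) -> bool:
    """True if `now` falls inside the window and every matcher matches the labels."""
    if not (parse_utc(silence["startsAt"]) <= now < parse_utc(silence["endsAt"])):
        return False
    for m in silence["matchers"]:
        value = labels.get(m["name"], "")
        if m.get("isRegex"):
            if not re.fullmatch(m["value"], value):
                return False
        elif value != m["value"]:
            return False
    return True


silence = {
    "matchers": [{"name": "alertname", "value": ".*", "isRegex": True}],
    "startsAt": "2024-01-20T02:00:00Z",
    "endsAt": "2024-01-20T06:00:00Z",
    "createdBy": "maintenance-bot",
    "comment": "Planned maintenance window",
}
during = datetime(2024, 1, 20, 3, 0, tzinfo=timezone.utc)
print(is_silenced({"alertname": "HighLatency"}, silence, during))
```

In practice a maintenance silence is usually scoped with an extra matcher (for example a cluster or service label) so that unrelated incidents still page during the window.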
On-Call Rotation Design
Rotation Strategy
# oncall-schedule.yml
rotation:
  primary:
    duration: "7d"  # one-week shifts
    schedule:
      - user: alice
        start: "2024-01-01"
      - user: bob
        start: "2024-01-08"
  secondary:
    duration: "7d"
    offset: "2d"  # staggered against primary; matches the start dates below
    schedule:
      - user: charlie
        start: "2024-01-03"
      - user: diana
        start: "2024-01-10"
escalation:
  - level: 1
    wait: "5m"
    notify: ["primary"]
  - level: 2
    wait: "10m"
    notify: ["primary", "secondary"]
  - level: 3
    wait: "15m"
    notify: ["primary", "secondary", "team-lead"]
  - level: 4
    wait: "30m"
    notify: ["primary", "secondary", "team-lead", "vp-engineering"]
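Given a rotation like the one above, resolving who is on duty on a given day is simple date arithmetic. The sketch below assumes fixed-length shifts and that the listed users repeat round-robin after the last listed entry, which the schedule file does not state; `on_call` is a hypothetical helper.

```python
from datetime import date


def on_call(users: list, first_start: date, shift_days: int, day: date) -> str:
    """Return the user on duty on `day` for a fixed-length round-robin rotation."""
    if day < first_start:
        raise ValueError("rotation has not started yet")
    shift_index = (day - first_start).days // shift_days
    # Assumption: once the list is exhausted, the rotation wraps around.
    return users[shift_index % len(users)]


primary = on_call(["alice", "bob"], date(2024, 1, 1), 7, date(2024, 1, 9))
secondary = on_call(["charlie", "diana"], date(2024, 1, 3), 7, date(2024, 1, 9))
print(primary, secondary)
```

Because the secondary rotation starts mid-week relative to the primary, the two handoffs never fall on the same day, so at least one person on duty always has context from the previous days.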
Escalation Policy
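# Alert escalation logic
from dataclasses import dataclass


@dataclass
class Alert:
    id: str


class EscalationPolicy:
    def __init__(self):
        # wait_minutes: how long each level waits for a response before escalating
        self.levels = [
            {"wait_minutes": 5, "contacts": ["oncall_primary"]},
            {"wait_minutes": 10, "contacts": ["oncall_primary", "oncall_secondary"]},
            {"wait_minutes": 15, "contacts": ["oncall_primary", "oncall_secondary", "team_lead"]},
            {"wait_minutes": 30, "contacts": ["all", "management"]},
        ]

    async def handle_alert(self, alert: Alert):
        for level in self.levels:
            # Stop if the alert has already been resolved
            if await self.is_resolved(alert.id):
                return
            # Send notifications to this level's contacts
            await self.notify(level["contacts"], alert)
            # Wait for a response before moving to the next level
            responded = await self.wait_for_response(alert.id, level["wait_minutes"])
            if responded:
                return
        # No level responded: trigger the emergency process
        await self.emergency_escalation(alert)

    # Integration hooks: implement these against your paging and incident tools
    async def is_resolved(self, alert_id: str) -> bool:
        raise NotImplementedError

    async def notify(self, contacts: list, alert: Alert) -> None:
        raise NotImplementedError

    async def wait_for_response(self, alert_id: str, wait_minutes: int) -> bool:
        raise NotImplementedError

    async def emergency_escalation(self, alert: Alert) -> None:
        raise NotImplementedError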
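One consequence of this loop is that the waits accumulate: each level is notified only after all earlier waits have elapsed. The standalone sketch below computes that timeline without any I/O; `escalation_timeline` is a hypothetical helper and the level data is copied from the policy above.

```python
def escalation_timeline(levels: list, ack_after_minutes=None) -> list:
    """Return [(minute_notified, contacts)] for each level reached before an acknowledgement."""
    timeline, elapsed = [], 0
    for level in levels:
        # An acknowledgement inside an earlier wait window stops the escalation.
        if ack_after_minutes is not None and ack_after_minutes <= elapsed:
            break
        timeline.append((elapsed, level["contacts"]))
        elapsed += level["wait_minutes"]
    return timeline


levels = [
    {"wait_minutes": 5, "contacts": ["oncall_primary"]},
    {"wait_minutes": 10, "contacts": ["oncall_primary", "oncall_secondary"]},
    {"wait_minutes": 15, "contacts": ["oncall_primary", "oncall_secondary", "team_lead"]},
    {"wait_minutes": 30, "contacts": ["all", "management"]},
]
for minute, contacts in escalation_timeline(levels):
    print(f"t+{minute:>2} min -> {', '.join(contacts)}")
```

With these values the team lead is first paged 15 minutes after the alert fires and management 30 minutes after, which is worth checking against the response targets of each severity level.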
Managing Alert Fatigue
Alert Quality Metrics
alert_quality_metrics:
  signal_to_noise_ratio:
    description: "Valid alerts / total alerts"
    target: "> 0.8"
  mttr:
    description: "Mean time to respond"
    target: "< 15 minutes (P1)"
  alert_frequency:
    description: "How often a single alert fires"
    target: "< 5 times/day"
  false_positive_rate:
    description: "False positive rate"
    target: "< 10%"
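These metrics can be computed from a log of past alerts once each alert has been labeled during review. The sketch below is illustrative: `AlertRecord` and `quality_report` are hypothetical names, and treating a "valid" alert as one that required human action is an assumption about how the ratio is defined.

```python
from dataclasses import dataclass


@dataclass
class AlertRecord:
    name: str
    actionable: bool           # assumption: "valid" means a human had to act
    false_positive: bool       # nothing was actually wrong
    minutes_to_respond: float  # time from firing to first human response


def quality_report(records: list) -> dict:
    """Compute the alert quality metrics over a non-empty list of records."""
    total = len(records)
    if total == 0:
        raise ValueError("no alert records to evaluate")
    return {
        "signal_to_noise_ratio": sum(r.actionable for r in records) / total,
        "false_positive_rate": sum(r.false_positive for r in records) / total,
        "mean_minutes_to_respond": sum(r.minutes_to_respond for r in records) / total,
    }


records = [
    AlertRecord("HighErrorRate", True, False, 2),
    AlertRecord("DiskAlmostFull", True, False, 4),
    AlertRecord("ServiceDown", True, False, 6),
    AlertRecord("CPUSpike", False, True, 8),
]
print(quality_report(records))
```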
Alert Optimization Strategies
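optimization_strategies:
  - name: "Reduce noise"
    actions:
      - Aggregate similar alerts
      - Require the condition to hold longer before firing
      - Use multi-window, multi-burn-rate alerting
  - name: "Improve precision"
    actions:
      - Tune thresholds
      - Combine multiple conditions
      - Use machine-learning anomaly detection
  - name: "Improve the responder experience"
    actions:
      - Provide Runbook links
      - Include contextual information
      - Tier notification channels by severity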
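Of the noise-reduction actions above, multi-window multi-burn-rate alerting is the least self-explanatory. The idea is to page only when the error budget is being consumed quickly over both a long window (the problem is significant) and a short window (it is still happening). The sketch below shows the core check; the 99.9% SLO and the 14.4 threshold are illustrative values from the commonly cited Google SRE Workbook scheme (1-hour long window, 5-minute short window), not taken from this article, and the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float) -> bool:
    """Page only if both windows exceed the burn-rate threshold."""
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)


SLO = 0.999        # illustrative availability target
THRESHOLD = 14.4   # illustrative: spends 2% of a 30-day budget in one hour

# Sustained incident: 2% errors over the last hour and 3% over the last 5 minutes.
print(should_page(0.02, 0.03, SLO, THRESHOLD))
# Already recovered: the long window is still hot but the short window is healthy.
print(should_page(0.02, 0.0005, SLO, THRESHOLD))
```

The short window is what makes the alert stop firing soon after recovery, which removes a common source of stale pages.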
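Best Practices
- Tiered response: use different notification channels and response times for each severity level
- Inhibition strategy: prevent alert storms and reduce noise
- Fair rotation: distribute the On-Call load sensibly to avoid burnout
- Continuous improvement: review alert rules regularly and delete ineffective alerts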