Error Budgets: A Reliability Management Strategy
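What is an error budget
An error budget is the amount of failure that a service's SLO (service level objective) permits. For example, a 99.9% availability target allows about 43.2 minutes of downtime in a 30-day window, or about 43.8 minutes in an average calendar month of 30.44 days. The error budget mechanism helps teams balance reliability against release velocity.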
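As a quick check on the arithmetic above, the following minimal sketch computes the downtime allowance for a 99.9% target over two window lengths. The function name `allowed_downtime_minutes` and the 30.44-day average month are illustrative assumptions, not part of any standard tool.

```python
def allowed_downtime_minutes(target_percent: float, window_days: float) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - target_percent) / 100


if __name__ == "__main__":
    # Compare a fixed 30-day window with an average calendar month (assumed 30.44 days)
    for label, days in [("30-day window", 30), ("average month", 30.44)]:
        minutes = allowed_downtime_minutes(99.9, days)
        print(f"{label}: {minutes:.1f} minutes")
```

The two results differ only because of the window length, which is why both 43.2 and 43.8 minutes are commonly quoted for the same 99.9% target.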
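Calculating the error budget
Basic calculation
#!/bin/bash
# error-budget-calculator.sh
calculate_budget() {
    local target=$1   # availability target (percent)
    local window=$2   # time window (days)
    local total_minutes allowed_downtime failures_per_million
    total_minutes=$(echo "$window * 24 * 60" | bc)
    # Subtract before dividing: with scale=2, "$target / 100" would be truncated to .99
    allowed_downtime=$(echo "scale=2; $total_minutes * (100 - $target) / 100" | bc)
    # Multiply before dividing so that integer division does not truncate the result to 0
    failures_per_million=$(echo "scale=0; (100 - $target) * 1000000 / 100" | bc)
    echo "Availability target: ${target}%"
    echo "Time window: ${window} days (${total_minutes} minutes)"
    echo "Allowed downtime: ${allowed_downtime} minutes"
    echo "Allowed failed requests: ${failures_per_million} per million"
}
# Compute the budget for several targets
calculate_budget 99.9 30    # 99.9% over 30 days
calculate_budget 99.95 30   # 99.95% over 30 days
calculate_budget 99.99 30   # 99.99% over 30 days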
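The per-million figure printed by the script points at a request-based view of the same budget. A minimal sketch of that view, assuming hypothetical request counts and a helper named `request_budget` that does not appear in the original script:

```python
def request_budget(target_percent: float, total_requests: int, failed_requests: int) -> dict:
    """Express the error budget as a number of failed requests instead of minutes."""
    allowed_failures = total_requests * (100 - target_percent) / 100
    if allowed_failures > 0:
        consumed_percent = failed_requests / allowed_failures * 100
    else:
        # No traffic or a 100% target: any failure exceeds the budget
        consumed_percent = float("inf") if failed_requests else 0.0
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": max(0.0, allowed_failures - failed_requests),
        "consumed_percent": consumed_percent,
    }


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 requests in the window, 250 of them failed
    result = request_budget(99.9, 1_000_000, 250)
    print(f"Allowed failures: {result['allowed_failures']:.0f}")
    print(f"Budget consumed: {result['consumed_percent']:.1f}%")
```

A request-based budget tends to fit services with uneven traffic better than a time-based one, because an outage during a quiet period costs fewer failed requests than the same outage at peak.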
Advanced calculation
#!/usr/bin/env python3
# error_budget.py


class ErrorBudget:
    def __init__(self, availability_target, window_days):
        # A 100% target leaves no budget and would make the consumption ratio undefined
        if not 0 < availability_target < 100:
            raise ValueError("availability_target must be a percentage between 0 and 100")
        self.target = availability_target  # availability target in percent, e.g. 99.9
        self.window = window_days  # window length in days
        self.total_minutes = window_days * 24 * 60
        self.allowed_downtime = self.total_minutes * (1 - self.target / 100)

    def calculate_budget_consumed(self, actual_downtime):
        """Return the percentage of the error budget consumed so far."""
        return (actual_downtime / self.allowed_downtime) * 100

    def calculate_remaining_budget(self, actual_downtime):
        """Return the remaining error budget in minutes (never negative)."""
        remaining = self.allowed_downtime - actual_downtime
        return max(0, remaining)

    def get_budget_status(self, consumed_percent):
        """Map a consumed percentage to a budget status."""
        if consumed_percent < 50:
            return "healthy"
        elif consumed_percent < 80:
            return "warning"
        elif consumed_percent < 100:
            return "critical"
        else:
            return "exhausted"

    def can_release(self, current_consumed, additional_risk):
        """Decide whether a release is allowed; both arguments are percentages."""
        projected = current_consumed + additional_risk
        return projected < 80  # 80% threshold


# Usage example
budget = ErrorBudget(availability_target=99.9, window_days=30)
consumed = budget.calculate_budget_consumed(actual_downtime=20)
print(f"Budget consumed: {consumed:.2f}%")
print(f"Remaining budget: {budget.calculate_remaining_budget(20):.2f} minutes")
print(f"Budget status: {budget.get_budget_status(consumed)}")
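The `can_release` check takes an `additional_risk` figure but does not say where it comes from. One possible input is a simple projection of where consumption will end up if the current pace continues. The sketch below assumes linear extrapolation over a fixed window; the function name and the linear model are illustrative assumptions, not part of the class above.

```python
def projected_consumption_percent(
    consumed_minutes: float,
    elapsed_days: float,
    window_days: float,
    allowed_minutes: float,
) -> float:
    """Extrapolate end-of-window budget consumption from the pace so far."""
    if elapsed_days <= 0 or allowed_minutes <= 0:
        raise ValueError("elapsed_days and allowed_minutes must be positive")
    burn_per_day = consumed_minutes / elapsed_days
    projected_minutes = burn_per_day * window_days
    return projected_minutes / allowed_minutes * 100


if __name__ == "__main__":
    # 20 minutes of downtime after 10 days of a 30-day window with a 43.2-minute budget
    projection = projected_consumption_percent(20, 10, 30, 43.2)
    print(f"Projected consumption at end of window: {projection:.1f}%")
```

A projection above 100% means the budget would run out before the window ends even though the current consumption still looks acceptable.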
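Prometheus monitoring
Error budget metrics
# prometheus-rules.yaml
groups:
  - name: error-budget
    rules:
      # Availability: share of requests that did not return a 5xx status
      - record: service:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      # Fraction of the 30-day error budget consumed:
      # observed error ratio divided by the allowed error ratio.
      # Averaging the 5m ratio over 30d is a time-weighted approximation.
      - record: service:error_budget:consumed:ratio
        expr: |
          (1 - avg_over_time(service:availability:ratio_rate5m[30d]))
          /
          0.001  # allowed error ratio for a 99.9% SLO (1 - 0.999)
      # Remaining budget in minutes
      - record: service:error_budget:remaining:minutes
        expr: |
          (1 - service:error_budget:consumed:ratio)
          *
          43.2  # allowed downtime in minutes per 30 days (99.9% SLO)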
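The relationship these rules are meant to capture is that consumed budget equals the observed error ratio divided by the allowed error ratio. It can be checked offline with a few made-up availability values. The function names below are illustrative, and nothing in this sketch queries Prometheus.

```python
def consumed_ratio(availability: float, slo: float = 0.999) -> float:
    """Fraction of the error budget used: observed error ratio / allowed error ratio."""
    return (1 - availability) / (1 - slo)


def remaining_minutes(consumed: float, budget_minutes: float = 43.2) -> float:
    """Remaining budget in minutes; negative once the budget is overspent."""
    return (1 - consumed) * budget_minutes


if __name__ == "__main__":
    # Hypothetical 30-day availability values
    for availability in (1.0, 0.9995, 0.999, 0.998):
        used = consumed_ratio(availability)
        print(f"availability={availability}: consumed={used:.2f}, "
              f"remaining={remaining_minutes(used):.1f} min")
```

With a 99.9% SLO, an availability of 99.95% uses half the budget, 99.9% uses all of it, and 99.8% overspends it by a factor of two.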
Alerting rules
# error-budget-alerts.yaml
groups:
  - name: error-budget-alerts
    rules:
      - alert: ErrorBudgetWarning
        expr: service:error_budget:consumed:ratio > 0.5
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Error budget consumption above 50%"
          description: "Service {{ $labels.service }} has consumed {{ $value | humanizePercentage }} of its error budget"
      - alert: ErrorBudgetCritical
        expr: service:error_budget:consumed:ratio > 0.8
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Error budget consumption above 80%"
          description: "Service {{ $labels.service }} has consumed {{ $value | humanizePercentage }} of its error budget"
          action: "Stop non-critical changes"
      - alert: ErrorBudgetExhausted
        expr: service:error_budget:consumed:ratio >= 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget exhausted"
          description: "Service {{ $labels.service }} has fully exhausted its error budget"
          action: "Address the reliability problem immediately"
Grafana dashboard
Error budget dashboard JSON
{
  "title": "Error Budget Dashboard",
  "panels": [
    {
      "title": "Error budget consumed",
      "type": "gauge",
      "targets": [{
        "expr": "service:error_budget:consumed:ratio",
        "legendFormat": "{{ service }}"
      }],
      "thresholds": [
        {"value": 0, "color": "green"},
        {"value": 0.5, "color": "yellow"},
        {"value": 0.8, "color": "orange"},
        {"value": 1, "color": "red"}
      ]
    },
    {
      "title": "Remaining budget (minutes)",
      "type": "stat",
      "targets": [{
        "expr": "service:error_budget:remaining:minutes",
        "legendFormat": "{{ service }}"
      }]
    },
    {
      "title": "Availability trend",
      "type": "timeseries",
      "targets": [{
        "expr": "service:availability:ratio_rate5m",
        "legendFormat": "{{ service }}"
      }]
    }
  ]
}
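Error budget policy
Policy configuration
# error-budget-policy.yaml
policies:
  - name: "Healthy"
    condition: "consumed < 50%"
    actions:
      - "Release normally"
      - "Experiments are allowed"
      - "Carry out routine changes"
  - name: "Warning"
    condition: "50% <= consumed < 80%"
    actions:
      - "Limit release frequency"
      - "Increase test coverage"
      - "Tune monitoring and alerting"
  - name: "Critical"
    condition: "80% <= consumed < 100%"
    actions:
      - "Freeze non-critical changes"
      - "Focus on reliability improvements"
      - "Run postmortems"
  - name: "Exhausted"
    condition: "consumed >= 100%"
    actions:
      - "Freeze all changes"
      - "Put all effort into restoring reliability"
      - "Escalate to management"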
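The four bands in the policy can be treated as an ordered lookup table. The sketch below mirrors the thresholds from the configuration above; the `Policy` type, the `POLICIES` table, and `select_policy` are hypothetical names introduced for illustration, and the policy file itself is not parsed.

```python
from typing import NamedTuple, Tuple


class Policy(NamedTuple):
    name: str
    upper_bound: float  # exclusive upper bound on consumed percent
    actions: Tuple[str, ...]


# Ordered from lowest to highest band, matching the policy conditions
POLICIES = (
    Policy("healthy", 50, ("release normally", "experiments allowed", "routine changes")),
    Policy("warning", 80, ("limit release frequency", "increase test coverage", "tune alerting")),
    Policy("critical", 100, ("freeze non-critical changes", "focus on reliability", "run postmortems")),
    Policy("exhausted", float("inf"), ("freeze all changes", "restore reliability", "escalate to management")),
)


def select_policy(consumed_percent: float) -> Policy:
    """Return the first policy whose upper bound exceeds the consumed percentage."""
    for policy in POLICIES:
        if consumed_percent < policy.upper_bound:
            return policy
    return POLICIES[-1]


if __name__ == "__main__":
    for value in (12.5, 50, 80, 100):
        policy = select_policy(value)
        print(f"{value}% consumed: {policy.name} -> {', '.join(policy.actions)}")
```

Keeping the bands in one ordered table means the boundaries (50, 80, 100) are defined once, which avoids the thresholds drifting apart between alerts, scripts, and release gates.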
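Automating the policy
#!/bin/bash
# check-error-budget.sh - automated error budget check
SERVICE=$1
# -G with --data-urlencode URL-encodes the PromQL selector (braces and quotes)
RATIO=$(curl -sG "http://prometheus/api/v1/query" \
    --data-urlencode "query=service:error_budget:consumed:ratio{service=\"$SERVICE\"}" \
    | jq -r '.data.result[0].value[1] // empty')
if [ -z "$RATIO" ]; then
    echo "No error budget data found for service: $SERVICE" >&2
    exit 4
fi
# The recording rule yields a ratio (0-1); convert it to a percentage for the thresholds below
BUDGET_CONSUMED=$(echo "scale=2; $RATIO * 100 / 1" | bc -l)
echo "Service: $SERVICE"
echo "Error budget consumed: ${BUDGET_CONSUMED}%"
if (( $(echo "$BUDGET_CONSUMED < 50" | bc -l) )); then
    echo "Status: healthy - normal releases allowed"
    exit 0
elif (( $(echo "$BUDGET_CONSUMED < 80" | bc -l) )); then
    echo "Status: warning - limit release frequency"
    # Send a Slack notification
    curl -X POST https://hooks.slack.com/services/xxx \
        -H 'Content-Type: application/json' \
        -d '{"text": "Warning: error budget consumption above 50%"}'
    exit 1
elif (( $(echo "$BUDGET_CONSUMED < 100" | bc -l) )); then
    echo "Status: critical - freeze non-critical changes"
    # Trigger the freeze process
    kubectl label namespace production freeze=true --overwrite
    exit 2
else
    echo "Status: exhausted - put all effort into restoring reliability"
    # Trigger the emergency response
    curl -X POST http://pagerduty/api/incidents \
        -d '{"title": "Error budget exhausted", "severity": "critical"}'
    exit 3
fi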
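The script depends on the shape of the Prometheus instant-query response, where each sample is a `[timestamp, "value"]` pair and the value is a string holding a ratio between 0 and 1 rather than a percentage. A minimal sketch of the parsing and threshold step in Python, using a hand-written sample response instead of a live query; the function names and the sample payload are assumptions for illustration:

```python
import json
from typing import Optional


def extract_consumed_ratio(response_text: str) -> Optional[float]:
    """Pull the first sample value out of a Prometheus instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        return None
    results = payload.get("data", {}).get("result", [])
    if not results:
        return None  # no series matched the query
    # Each instant-vector sample is [timestamp, "value-as-string"]
    return float(results[0]["value"][1])


def exit_code_for(ratio: float) -> int:
    """Map a consumed ratio (0-1) to the exit codes used by the shell script."""
    if ratio < 0.5:
        return 0  # healthy
    if ratio < 0.8:
        return 1  # warning
    if ratio < 1.0:
        return 2  # critical
    return 3  # exhausted


if __name__ == "__main__":
    # Hand-written sample payload; a real one would come from /api/v1/query
    sample = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{"metric": {"service": "myapp"}, "value": [1700000000, "0.46"]}],
        },
    })
    ratio = extract_consumed_ratio(sample)
    if ratio is None:
        print("no data")
    else:
        print(f"consumed={ratio:.0%}, exit code={exit_code_for(ratio)}")
```

Returning `None` for an empty result keeps "no data" distinct from "0% consumed", so a missing metric is not silently treated as a healthy budget.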
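CI/CD integration
Release gate
# .github/workflows/release-gate.yml
name: Release Gate
on:
  push:
    branches: [main]
jobs:
  check-error-budget:
    runs-on: ubuntu-latest
    steps:
      - name: Check Error Budget
        run: |
          BUDGET=$(curl -sG "http://prometheus/api/v1/query" \
            --data-urlencode 'query=service:error_budget:consumed:ratio{service="myapp"}' \
            | jq -r '.data.result[0].value[1] // empty')
          # Fail closed when the metric is missing
          if [ -z "$BUDGET" ]; then
            echo "No error budget data available, blocking release"
            exit 1
          fi
          # The metric is a ratio (0-1), so 0.8 corresponds to 80% of the budget
          if (( $(echo "$BUDGET > 0.8" | bc -l) )); then
            echo "Error budget consumption above 80%, blocking release"
            exit 1
          fi
          echo "Error budget check passed"
      - name: Deploy
        if: success()
        run: |
          # Run the deployment
          echo "Starting deployment..."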
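Postmortems
Postmortem template
# Postmortem report
## Incident summary
- **Incident ID**: INC-2024-001
- **Impact period**: 2024-01-15 14:00 - 14:45
- **Scope**: API service
- **Error budget consumed**: 15 minutes of 43.8 minutes
## Root cause
Database connection pool exhaustion caused API responses to time out
## Timeline
- 14:00 - Traffic spike
- 14:05 - Connection pool exhausted
- 14:10 - Alert fired
- 14:15 - Response started
- 14:30 - Temporary capacity added
- 14:45 - System recovered
## Improvement actions
1. Automatic scaling
2. Connection pool tuning
3. Traffic control
## Error budget impact
This incident consumed 34% of the monthly error budget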
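The budget impact line in the template is a single division. A minimal sketch that reproduces the figure from the numbers in the report; the helper name `incident_budget_impact` is an assumption made for illustration:

```python
def incident_budget_impact(downtime_minutes: float, budget_minutes: float) -> float:
    """Percentage of the error budget consumed by one incident."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return downtime_minutes / budget_minutes * 100


if __name__ == "__main__":
    # Figures taken from the template: 15 minutes against a 43.8-minute monthly budget
    impact = incident_budget_impact(15, 43.8)
    print(f"Budget consumed by the incident: {impact:.0f}%")
```

The budget figure should be the one defined for the SLO window in use: against a 30-day budget of 43.2 minutes, the same incident comes out slightly higher.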
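Best practices
- Transparency: publish error budget usage openly
- Automation: automate budget monitoring and alerting
- Policy enforcement: apply the budget policy strictly
- Continuous improvement: use budget data to improve the system
- Team culture: build a reliability-first culture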