SRE Architecture: SLI/SLO and Error Budget in Practice
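SRE core concepts
SRE (Site Reliability Engineering) is an approach to operations that originated at Google. It applies software engineering methods to infrastructure and operations problems. At its core, it uses data to balance reliability against the speed of innovation.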
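The SRE balance model:
├── Reliability target (SLO)
│   └── 99.9% availability ≈ 43 minutes of unavailability per 30-day month
├── Error budget
│   └── Allowed downtime or number of failed requests
└── Release cadence
    └── Release faster while budget remains; stop releasing once it is exhausted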
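The 43-minute figure in the model above follows from simple arithmetic on the SLO target. A minimal sketch, assuming a 30-day window; the function name `allowed_downtime_minutes` is illustrative and does not come from any library:

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO permits."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    window_minutes = window_days * 24 * 60
    # The error budget is the fraction of the window that is allowed to fail.
    return window_minutes * (1.0 - slo_target)


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} -> {minutes:.1f} min per 30 days")
```

For a 99.9% target this gives roughly 43.2 minutes. Each additional nine cuts the allowance by a factor of ten.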
Defining SLIs
An SLI (Service Level Indicator) is a quantitative measure of service reliability:
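# Common SLI types
sli_types:
  availability:
    description: "Share of successful requests"
    formula: "successful requests / total requests"
    example: "Share of HTTP 2xx responses"
  latency:
    description: "Request latency distribution"
    formula: "requests completed within the latency threshold / total requests"
    example: "99% of requests complete within 200 ms"
  correctness:
    description: "Share of correct responses"
    formula: "correct responses / total requests"
    example: "Share of requests that return correct data"
  throughput:
    description: "Throughput"
    formula: "requests processed per unit of time"
    example: "QPS > 1000"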
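The ratio-style SLIs above can be computed directly from raw request records. The sketch below is one possible shape, not a standard API: the `Request` record and both function names are hypothetical, and "success" is assumed to mean any non-5xx status (a stricter definition could count only 2xx).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape, for illustration)."""
    status: int        # HTTP status code
    latency_s: float   # end-to-end latency in seconds


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        raise ValueError("no requests observed")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_s: float) -> float:
    """Fraction of requests that completed within threshold_s seconds."""
    if not requests:
        raise ValueError("no requests observed")
    fast = sum(1 for r in requests if r.latency_s <= threshold_s)
    return fast / len(requests)


if __name__ == "__main__":
    sample = [
        Request(200, 0.05),
        Request(200, 0.30),
        Request(500, 0.10),
        Request(404, 0.02),
    ]
    print("availability:", availability_sli(sample))
    print("latency (<= 200 ms):", latency_sli(sample, 0.2))
```

Expressing every SLI as "good events / total events" keeps the value between 0 and 1, which makes it directly comparable to an SLO target.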
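PromQL SLI queries
# Availability SLI: share of requests that did not return a 5xx
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI check: P99 below 200 ms
# (buckets are aggregated by le so the quantile covers all instances)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
) < 0.2

# Error-rate SLI: share of requests that returned a 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))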
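To make the latency query less opaque, the following sketch shows the idea behind estimating a quantile from cumulative histogram buckets: find the bucket that contains the target rank, then interpolate linearly inside it. This is a simplified illustration and not the Prometheus implementation; the function name `estimate_quantile` is hypothetical, and the lowest bucket is assumed to start at 0.

```python
from typing import Sequence, Tuple

INF = float("inf")


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets holds (upper_bound, cumulative_count) pairs sorted by
    upper_bound, and the last upper bound must be +Inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != INF:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if bound == INF or in_bucket == 0:
                # Open-ended or empty bucket: fall back to the last finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    observed = [(0.1, 50), (0.2, 90), (0.5, 99), (INF, 100)]
    print("estimated P99 (s):", estimate_quantile(0.99, observed))
```

Because the estimate depends on bucket boundaries, placing a boundary exactly at the SLO threshold (0.2 s here) makes the comparison against the target more reliable.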
Setting SLO targets
An SLO (Service Level Objective) sets the target value for an SLI and is the team's commitment to a level of reliability:
# Example SLO configuration
slo_config:
  service: "payment-api"
  slos:
    - name: "availability"
      sli: "http_requests_success_rate"
      target: 0.999   # 99.9% of requests succeed
      window: "30d"
    - name: "latency"
      sli: "http_request_duration_p99"
      target: 0.2     # P99 at or below 0.2 s (200 ms)
      window: "30d"
    - name: "freshness"
      sli: "data_freshness"
      target: 300     # data at most 300 s (5 minutes) old
      window: "7d"
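The targets in the configuration above do not all point the same way: availability must stay at or above its target, while latency and freshness must stay at or below theirs. A minimal sketch of an evaluator that makes this explicit follows. The `higher_is_better` flag and the `evaluate` helper are assumptions introduced for illustration and are not part of the configuration format shown.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SLO:
    name: str
    target: float
    higher_is_better: bool  # hypothetical flag: direction of the comparison

    def is_met(self, measured: float) -> bool:
        """Compare a measured SLI value against the target."""
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


# Mirrors the payment-api targets; units follow the config comments.
SLOS: List[SLO] = [
    SLO("availability", 0.999, higher_is_better=True),   # success ratio
    SLO("latency", 0.2, higher_is_better=False),         # P99, seconds
    SLO("freshness", 300, higher_is_better=False),       # data age, seconds
]


def evaluate(measurements: Dict[str, float]) -> Dict[str, bool]:
    """Return pass/fail per SLO for every SLO that has a measurement."""
    return {
        slo.name: slo.is_met(measurements[slo.name])
        for slo in SLOS
        if slo.name in measurements
    }


if __name__ == "__main__":
    print(evaluate({"availability": 0.9995, "latency": 0.25, "freshness": 120}))
```

Recording the comparison direction next to each target avoids the common mistake of treating a latency target as a ratio.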
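Error budget policy
The error budget is 1 minus the SLO target and represents how much unreliability is allowed. For a 99.9% SLO, the budget is 0.1% of the requests (or time) in the window: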
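# Error budget calculation
class ErrorBudget:
    def __init__(self, slo_target: float, window_days: int):
        if not 0.0 <= slo_target < 1.0:
            raise ValueError("slo_target must be in [0, 1); a 100% target leaves no budget")
        self.slo_target = slo_target
        self.window_days = window_days
        self.budget = 1 - slo_target  # allowed failure fraction, e.g. 0.001 for 99.9%

    def calculate_remaining(self, current_error_rate: float) -> float:
        """Return the fraction of the error budget that is still unspent.

        current_error_rate is the failure ratio measured over the whole SLO
        window. The request total cancels out of
        (allowed_errors - error_requests) / allowed_errors, so no request
        count is needed. The result is negative once the budget is overspent.
        """
        return 1 - current_error_rate / self.budget

    def get_release_policy(self, remaining_budget: float) -> str:
        """Choose a release policy from the remaining budget fraction."""
        if remaining_budget > 0.5:
            return "normal_release"      # release as usual
        elif remaining_budget > 0.2:
            return "cautious_release"    # release with extra care
        elif remaining_budget > 0:
            return "no_new_releases"     # stop new releases
        else:
            return "reliability_focus"   # all effort goes to reliability fixes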
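A worked example with request counts may make the budget arithmetic more concrete. The sketch below is standalone, and the function name `remaining_budget_fraction` is illustrative. It assumes the failed and total request counts for the SLO window are already known.

```python
def remaining_budget_fraction(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget left after failed_requests failures.

    Negative values mean the budget has been overspent.
    """
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Number of failures the SLO tolerates over this many requests.
    allowed_failures = total_requests * (1.0 - slo_target)
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 1,000,000 requests at a 99.9% SLO tolerate about 1,000 failures.
    left = remaining_budget_fraction(0.999, 1_000_000, 400)
    print(f"remaining budget: {left:.1%}")
```

With 400 failures against an allowance of about 1,000, roughly 60% of the budget remains, which falls in the normal-release band of the policy above. At 1,500 failures the result is about -50%, meaning the SLO has already been missed for the window.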
SLO alerting strategy
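# Multi-window, multi-burn-rate alerts
# 0.001 is the error budget of a 99.9% SLO. The job:http_requests_error_rate:*
# series are recording rules assumed to be defined elsewhere.
groups:
  - name: slo-alerts
    rules:
      # Fast burn: error rate above 14.4x the budget over both the 5m and 1h windows
      # (sustained for 1h, this spends about 2% of a 30-day budget)
      - alert: HighErrorBudgetBurn
        expr: |
          (
            job:http_requests_error_rate:rate5m{job="myapp"}
              > (14.4 * 0.001)
          ) and (
            job:http_requests_error_rate:rate1h{job="myapp"}
              > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning fast"
      # Slow burn: error rate above 1x the budget over both the 30m and 6h windows
      # (at 1x, the budget is used up exactly at the end of the SLO window)
      - alert: LowErrorBudgetBurn
        expr: |
          (
            job:http_requests_error_rate:rate30m{job="myapp"}
              > (1 * 0.001)
          ) and (
            job:http_requests_error_rate:rate6h{job="myapp"}
              > (1 * 0.001)
          )
        for: 15m
        labels:
          severity: warning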
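The burn-rate multipliers in these rules are easier to reason about with the arithmetic written out. Burn rate is the observed error rate divided by the error budget, and a two-window alert fires only when both the short and the long window exceed the threshold. The sketch below assumes a 30-day SLO window; all function names are hypothetical.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def budget_consumed(burn: float, hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the whole budget spent if `burn` is sustained for `hours`."""
    return burn * hours / (slo_window_days * 24)


def should_alert(
    short_window_rate: float,
    long_window_rate: float,
    slo_target: float,
    threshold: float,
) -> bool:
    """Fire only when both windows burn faster than the threshold."""
    return (
        burn_rate(short_window_rate, slo_target) > threshold
        and burn_rate(long_window_rate, slo_target) > threshold
    )


if __name__ == "__main__":
    # A 14.4x burn sustained for one hour spends about 2% of a 30-day budget.
    print(f"budget spent in 1h at 14.4x: {budget_consumed(14.4, 1):.1%}")
    # 2% errors over 5m and 1.6% over 1h, against a 99.9% SLO.
    print("page:", should_alert(0.02, 0.016, 0.999, 14.4))
```

Requiring the short window as well means the alert clears soon after the error rate recovers, while the long window keeps brief spikes from paging anyone.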
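Reliability management best practices
- Progressive SLOs: start with a loose target and tighten it over time
- Error-budget-driven releases: speed up innovation while budget remains, and focus on reliability once it is exhausted
- Postmortems: run a root cause analysis and define improvements for every incident
- Automation: use tooling to automate SLO calculation and alerting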