SRE Practice Methodology
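What Is SRE
SRE (Site Reliability Engineering) is the practice of applying software engineering methods to infrastructure and operations problems.
Core Concepts
SLI (Service Level Indicator)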
# Availability SLI
slis:
  availability:
    type: success_ratio
    good_events: "http_requests_total{status!~'5..'}"
    total_events: "http_requests_total"
  # Latency SLI
  latency:
    type: histogram
    metric: "http_request_duration_seconds"
    threshold: 0.5  # 500ms
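To make the two SLI types concrete, here is a minimal Python sketch that computes a success-ratio SLI from request counts and a latency SLI from raw request durations. The function names and sample numbers are illustrative assumptions, not part of any monitoring tool's API.

```python
from typing import Sequence


def success_ratio_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were good (for example, non-5xx responses)."""
    if total_events == 0:
        # No traffic: treated as fully available here (an assumption;
        # some teams prefer to report "no data" instead)
        return 1.0
    return good_events / total_events


def latency_sli(durations_seconds: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of requests that completed within the threshold (0.5 s by default)."""
    if not durations_seconds:
        return 1.0
    fast = sum(1 for d in durations_seconds if d <= threshold)
    return fast / len(durations_seconds)


# Hypothetical sample data
print(success_ratio_sli(good_events=9990, total_events=10000))  # 0.999
print(latency_sli([0.12, 0.30, 0.48, 0.75]))  # 0.75
```

In a real setup these ratios would normally be computed by the monitoring system from the metrics named in the config above; the sketch only shows the arithmetic behind them.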
SLO (Service Level Objective)
# SLO configuration
slos:
  - name: availability
    target: 0.999  # 99.9%
    window: 30d
  - name: latency
    target: 0.99  # 99% of requests < 500ms
    window: 30d
Error Budget
Error budget = 1 - SLO
For example: SLO 99.9% → error budget 0.1%
Allowed downtime over 30 days: 30 * 24 * 60 * 0.001 = 43.2 minutes
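The same arithmetic can be expressed in terms of requests rather than minutes, which is often how an error budget is tracked day to day. The following is a minimal sketch; the function names and the traffic figures are hypothetical.

```python
def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: 1 - SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full downtime the budget permits over the window."""
    return error_budget(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = error_budget(slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


print(f"{allowed_downtime_minutes(0.999, 30):.1f} minutes")  # 43.2 minutes
# Hypothetical traffic: 1,000,000 requests with 250 failures
print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of budget left")  # 75% of budget left
```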
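Availability Calculation
# Availability calculation
def calculate_availability(downtime_minutes, total_minutes):
    """Return availability as a fraction of the total window."""
    return 1 - (downtime_minutes / total_minutes)

# 30-day availability
availability = calculate_availability(43.2, 30 * 24 * 60)
print(f"Availability: {availability * 100:.1f}%")  # 99.9%

# Annual downtime allowance (rounded on output to hide floating-point noise)
annual_downtime = (1 - 0.999) * 365 * 24 * 60
print(f"Allowed downtime per year: {annual_downtime:.1f} minutes")  # 525.6 minutes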
Alert Design
SLO-Based Alerting
groups:
  - name: slo-alerts
    rules:
      - alert: SLOBreach
        expr: |
          (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) < 0.999
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO breach detected"
          description: "Availability dropped below 99.9%"
      - alert: ErrorBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > 0.001 * 14.4
        for: 5m
        labels:
          severity: warning
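The factor 14.4 in the second rule is a burn-rate multiplier: the observed error rate divided by the error budget. A burn rate of 1 spends the budget exactly over the SLO window; higher values spend it proportionally faster. The sketch below works through that arithmetic for a 30-day window. The function names and the 1.44% error rate are hypothetical.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1.0 - slo_target)


def budget_consumed(rate: float, alert_window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the whole error budget spent if `rate` is sustained for the alert window."""
    return rate * alert_window_hours / (slo_window_days * 24)


# Hypothetical observation: 1.44% of requests failed over the last hour
rate = burn_rate(error_rate=0.0144, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")  # burn rate: 14.4x
print(f"budget spent in 1h: {budget_consumed(rate, 1):.1%}")  # budget spent in 1h: 2.0%
print(f"time to exhaust the budget: {30 * 24 / rate:.0f}h")  # time to exhaust the budget: 50h
```

In other words, the threshold `0.001 * 14.4` fires when one hour of errors would consume about 2% of the 30-day budget.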
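Chaos Engineering and SRE
# Chaos experiment tied to the SLO
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-experiment
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: myapp
  # Note: the inline scheduler field is Chaos Mesh 1.x syntax;
  # Chaos Mesh 2.x moves scheduling into a separate Schedule resource
  scheduler:
    cron: '@every 24h'
# SLO verification
# Before the experiment: SLO 99.9%
# After the experiment: verify that the SLO still holds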
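The "verify that the SLO still holds" step could be sketched as a comparison of cumulative request counters sampled just before and just after the experiment. The `CounterSnapshot` type, the function name, and the counter values are hypothetical; in practice the numbers would come from the monitoring system.

```python
from dataclasses import dataclass


@dataclass
class CounterSnapshot:
    """Cumulative request counters sampled at one point in time."""
    total: int
    errors: int


def slo_held_during(before: CounterSnapshot, after: CounterSnapshot,
                    slo_target: float = 0.999) -> bool:
    """True if availability over the interval between the snapshots met the SLO."""
    total = after.total - before.total
    errors = after.errors - before.errors
    if total <= 0:
        raise ValueError("no traffic observed during the experiment")
    availability = 1 - errors / total
    return availability >= slo_target


# Hypothetical counters around a pod-kill run
before = CounterSnapshot(total=1_000_000, errors=800)
after = CounterSnapshot(total=1_050_000, errors=830)
print(slo_held_during(before, after))  # True: 30 errors in 50,000 requests is 99.94%
```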
Change Management
Progressive Delivery
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 600
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 30s
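To illustrate how `stepWeight`, `maxWeight`, and `threshold` interact, here is a simplified simulation of a canary analysis loop. It is a toy model under stated assumptions (failures are counted cumulatively, and the weight is held on a failed check), not Flagger's actual implementation; `run_canary` is a hypothetical name.

```python
from typing import Iterable


def run_canary(check_results: Iterable[bool], step_weight: int = 10,
               max_weight: int = 50, threshold: int = 5) -> str:
    """Toy model of a canary analysis loop.

    Each item in check_results is the outcome of one analysis interval:
    True if all metric checks passed, False otherwise.
    """
    weight = 0
    failed_checks = 0
    for passed in check_results:
        if not passed:
            failed_checks += 1
            if failed_checks >= threshold:
                return f"rolled back at weight {weight}%"
            continue  # hold the current weight and retry on the next interval
        weight += step_weight
        if weight >= max_weight:
            return f"promoted after reaching {max_weight}%"
    return f"still in progress at weight {weight}%"


print(run_canary([True] * 5))                  # promoted after reaching 50%
print(run_canary([True, True] + [False] * 5))  # rolled back at weight 20%
```

With the values from the config above, a clean rollout needs five passing intervals (10% per step up to 50%), while five failed checks trigger a rollback.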
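Postmortem
Postmortem Template
# Postmortem Report
## Incident Summary
- Start time:
- End time:
- Duration:
- Scope of impact:
## Timeline
- HH:MM - Incident begins
- HH:MM - Problem detected
- HH:MM - Response begins
- HH:MM - Problem resolved
## Root Cause
-
## Impact
- User impact:
- Data impact:
- SLO impact:
## Action Items
- [ ] Short-term fix
- [ ] Long-term improvement
- [ ] Monitoring improvements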
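The timeline entries in the template lend themselves to a few derived numbers, such as time to detect, time to resolve, and the share of the error budget the incident consumed. The sketch below uses a hypothetical timeline and assumes a full outage for the whole incident; the dictionary keys and the helper name are illustrative.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Hypothetical timeline for a single incident
timeline = {"started": "14:02", "detected": "14:09", "mitigating": "14:15", "resolved": "14:47"}

time_to_detect = minutes_between(timeline["started"], timeline["detected"])
time_to_resolve = minutes_between(timeline["started"], timeline["resolved"])
budget_minutes = 43.2  # 30-day downtime allowance for a 99.9% SLO

print(f"time to detect: {time_to_detect} min")    # time to detect: 7 min
print(f"time to resolve: {time_to_resolve} min")  # time to resolve: 45 min
# Assumes the service was fully down for the entire incident
print(f"budget consumed: {time_to_resolve / budget_minutes:.0%}")  # budget consumed: 104%
```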
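In Practice: The SRE Workflow
# 1. Define SLIs and SLOs
# 2. Configure monitoring and alerting
# 3. Set up error budget tracking
# 4. Run chaos engineering experiments
# 5. Roll out changes progressively
# 6. Hold postmortems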
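Best Practices
- Automate everything that can be automated
- Let monitoring data drive decisions
- Make changes progressively
- Embrace failure
- Improve continuously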
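Summary
SRE is a core methodology of modern operations. Practices such as SLIs and SLOs, error budgets, and chaos engineering make it possible to improve system reliability systematically.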