SRE in Practice: Site Reliability Engineering
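What Is SRE
SRE (Site Reliability Engineering) is an operations methodology introduced by Google that applies software engineering practices to infrastructure and operations problems. Instead of handling operational work by hand, SRE teams build software to do it, with the goal of making systems reliable and scalable.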
SRE Core Principles
SRE core principles:
├── 1. Everything as code
├── 2. Automation first
├── 3. Gradual change
├── 4. Failure is normal
└── 5. Continuous improvement
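Error Budget Mechanism
Error Budget Calculation
#!/bin/bash
# calculate-error-budget.sh
# Error budget formula: allowed downtime = window length * (1 - availability target)
# Availability target: 99.9%
# Monthly allowed downtime: 43.8 minutes (average month of about 30.44 days)
# Quarterly allowed downtime: 131.5 minutes
# This script uses a 30-day month (43200 minutes), so it prints 43.20 minutes.
# Usage: ./calculate-error-budget.sh <actual downtime in minutes>
AVAILABILITY_TARGET=99.9
MONTH_MINUTES=43200 # 30 days * 24 hours * 60 minutes
ACTUAL_DOWNTIME="${1:-0}" # measured downtime this month, in minutes
# Multiply before dividing: with scale=2, bc truncates 99.9 / 100 to .99
ALLOWED_DOWNTIME=$(echo "scale=2; $MONTH_MINUTES * (100 - $AVAILABILITY_TARGET) / 100" | bc)
echo "Monthly allowed downtime: ${ALLOWED_DOWNTIME} minutes"
# Track error budget consumption
BUDGET_CONSUMED=$(echo "scale=2; $ACTUAL_DOWNTIME * 100 / $ALLOWED_DOWNTIME" | bc)
echo "Error budget consumed: ${BUDGET_CONSUMED}%"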
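The same arithmetic can be generalized to any target and window. The sketch below is one possible way to do it with awk floating-point math instead of bc; the function names `allowed_downtime_minutes` and `budget_remaining_percent` are hypothetical helpers, not part of any tool.

```shell
#!/bin/bash
# Sketch only: helper names are illustrative assumptions.

# allowed_downtime_minutes <target_percent> <window_days>
# Prints the downtime allowed by the target over the window, in minutes.
allowed_downtime_minutes() {
  awk -v target="$1" -v days="$2" \
    'BEGIN { printf "%.2f\n", days * 24 * 60 * (100 - target) / 100 }'
}

# budget_remaining_percent <target_percent> <window_days> <actual_downtime_minutes>
# Prints the share of the error budget that is still unspent.
budget_remaining_percent() {
  awk -v target="$1" -v days="$2" -v down="$3" 'BEGIN {
    allowed = days * 24 * 60 * (100 - target) / 100
    if (allowed <= 0) { print "0.00"; exit }   # a 100% target leaves no budget
    printf "%.2f\n", (1 - down / allowed) * 100
  }'
}

allowed_downtime_minutes 99.9 30        # 30-day window at 99.9%
allowed_downtime_minutes 99.95 30       # 30-day window at 99.95%
budget_remaining_percent 99.9 30 10.8   # after 10.8 minutes of downtime
```

A negative result from `budget_remaining_percent` means the budget is overspent.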
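Error Budget Policy
# error-budget-policy.yaml
policies:
  - name: "High-consumption policy"
    condition: "budget_remaining < 50%"
    actions:
      - "Freeze non-critical changes"
      - "Increase test coverage"
      - "Run chaos engineering experiments"
  - name: "Critical policy"
    condition: "budget_remaining < 10%"
    actions:
      - "Stop new feature releases"
      - "Focus on reliability improvements"
      - "Conduct postmortems"
  - name: "Exhausted policy"
    condition: "budget_remaining <= 0%"
    actions:
      - "Freeze all changes"
      - "Devote all effort to restoring reliability"
      - "Escalate to management"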
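Because the three conditions overlap (a budget below 10% is also below 50%), an evaluator has to test the strictest condition first. The following is a minimal sketch of that ordering; `select_policy` and its short labels are hypothetical, and it takes the remaining budget as a plain percentage.

```shell
#!/bin/bash
# Sketch only: select_policy and its labels are illustrative assumptions.

# select_policy <budget_remaining_percent>
# Prints the strictest policy whose condition matches.
select_policy() {
  awk -v r="$1" 'BEGIN {
    if (r <= 0)      print "exhausted"
    else if (r < 10) print "critical"
    else if (r < 50) print "high-consumption"
    else             print "normal"
  }'
}

for remaining in 75 42 8 0; do
  echo "budget_remaining=${remaining}% -> $(select_policy "$remaining")"
done
```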
SLI/SLO Definitions
Service Level Indicators (SLI)
# Latency and availability SLIs
apiVersion: v1
kind: ConfigMap
metadata:
  name: sli-definitions
data:
  latency.yaml: |
    name: http_request_duration
    type: histogram
    metric: http_request_duration_seconds_bucket
    labels:
      service: api-gateway
    thresholds:
      - value: 0.1 # 100ms
        label: fast
      - value: 0.5 # 500ms
        label: acceptable
      - value: 1.0 # 1s
        label: slow
  availability.yaml: |
    name: http_requests_total
    type: counter
    metric: http_requests_total
    labels:
      service: api-gateway
    good: "status_code{code=~'2..'}"
    total: "status_code"
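Both SLIs above reduce to the same ratio: good events divided by total events. As a rough sketch, with a hypothetical `sli_percent` helper and made-up sample counts:

```shell
#!/bin/bash
# Sketch only: sli_percent and the sample counts are illustrative assumptions.

# sli_percent <good_events> <total_events>
# Prints the SLI as a percentage with four decimals.
sli_percent() {
  awk -v good="$1" -v total="$2" 'BEGIN {
    # Assumption: a window with no traffic is treated as fully compliant.
    if (total <= 0) { print "100.0000"; exit }
    printf "%.4f\n", good / total * 100
  }'
}

sli_percent 999500 1000000   # availability: 2xx responses / all responses
sli_percent 9870 10000       # latency: requests at or under 0.5s / all requests
```

For the latency SLI, the "good" count would come from the histogram bucket whose upper bound matches the chosen threshold.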
Service Level Objectives (SLO)
# slo-config.yaml
slos:
  - name: "API availability SLO"
    sli: "availability"
    target: 99.95
    window: "30d"
  - name: "API latency SLO"
    sli: "latency"
    target: 99.0
    window: "30d"
    threshold: "500ms"
  - name: "Error rate SLO"
    sli: "error_rate"
    target: 99.9
    window: "30d"
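One way to compare a measured SLI against its target is a pass/fail check plus a burn rate, taken here as the observed error fraction divided by the error fraction the target allows. This is a sketch under that definition; `slo_met` and `burn_rate` are hypothetical names.

```shell
#!/bin/bash
# Sketch only: slo_met and burn_rate are illustrative assumptions.

# slo_met <sli_percent> <target_percent>
# Exit status 0 when the SLI meets or exceeds the target.
slo_met() {
  awk -v sli="$1" -v target="$2" 'BEGIN { exit !(sli + 0 >= target + 0) }'
}

# burn_rate <sli_percent> <target_percent>
# 1.00 means the budget is being spent exactly as fast as the target allows.
burn_rate() {
  awk -v sli="$1" -v target="$2" 'BEGIN {
    budget = 100 - target
    if (budget <= 0) { print "inf"; exit }   # a 100% target has no budget to burn
    printf "%.2f\n", (100 - sli) / budget
  }'
}

if slo_met 99.96 99.95; then echo "availability SLO met"; else echo "availability SLO missed"; fi
burn_rate 99.90 99.95
```

A burn rate above 1 sustained over the whole window exhausts the budget before the window ends.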
On-Call and Incident Response
On-Call Rotation
# oncall-schedule.yaml
schedule:
  primary:
    - name: "Alice"
      period: "2024-01-01 to 2024-01-07"
      phone: "+86-138-xxxx-xxxx"
      email: "alice@example.com"
    - name: "Bob"
      period: "2024-01-08 to 2024-01-14"
      phone: "+86-139-xxxx-xxxx"
      email: "bob@example.com"
  secondary:
    - name: "Charlie"
      period: "2024-01-01 to 2024-01-14"
      phone: "+86-137-xxxx-xxxx"
      email: "charlie@example.com"
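A schedule like this is usually queried by date. The sketch below assumes the primary rotation has been flattened into a hypothetical `name|start|end` list (the `ROTATION` variable and `oncall_for` function are illustrative) and relies on ISO dates sorting correctly as plain strings.

```shell
#!/bin/bash
# Sketch only: ROTATION and oncall_for are illustrative assumptions.

# Flat copy of the primary rotation: name|start|end (ISO 8601 dates).
ROTATION="Alice|2024-01-01|2024-01-07
Bob|2024-01-08|2024-01-14"

# oncall_for <YYYY-MM-DD>
# Prints the primary on-call name; exits non-zero if nobody covers the date.
oncall_for() {
  printf '%s\n' "$ROTATION" | awk -F'|' -v day="$1" '
    # ISO dates compare correctly as strings, so no date parsing is needed.
    day >= $2 && day <= $3 { print $1; found = 1; exit }
    END { exit !found }'
}

oncall_for 2024-01-10
```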
Incident Response Process
#!/bin/bash
# incident-response.sh - incident response script
INCIDENT_ID=$(date +%Y%m%d%H%M%S)
echo "Starting incident response: $INCIDENT_ID"
# 1. Assess impact
echo "Assessing incident impact..."
AFFECTED_SERVICES=$(curl -s http://monitor/api/affected-services)
AFFECTED_USERS=$(curl -s http://monitor/api/affected-users)
echo "Affected services: $AFFECTED_SERVICES; affected users: $AFFECTED_USERS"
# 2. Notify responders
# Assumes AFFECTED_SERVICES is a plain string that needs no JSON escaping
curl -X POST http://pagerduty/api/incidents \
  -H "Content-Type: application/json" \
  -d "{
    \"title\": \"Production incident $INCIDENT_ID\",
    \"severity\": \"critical\",
    \"services\": \"$AFFECTED_SERVICES\"
  }"
# 3. Collect evidence
mkdir -p "/tmp/incidents/$INCIDENT_ID"
cp /var/log/app/*.log "/tmp/incidents/$INCIDENT_ID/"
kubectl get events --all-namespaces > "/tmp/incidents/$INCIDENT_ID/events.txt"
# 4. Run recovery
echo "Running recovery..."
kubectl rollout undo deployment/api-server
# 5. Verify recovery
sleep 30
HEALTH=$(curl -s -o /dev/null -w "%{http_code}" http://api/health)
if [ "$HEALTH" = "200" ]; then
  echo "System has recovered"
else
  echo "Further investigation needed"
fi
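The verification step waits a fixed 30 seconds and probes once, which can report a failure for a service that is still starting. A polling loop is one alternative; in this sketch `wait_until` and `check_health` are hypothetical helpers, and the attempt count and delay are arbitrary.

```shell
#!/bin/bash
# Sketch only: wait_until and check_health are illustrative assumptions.

# wait_until <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts run out.
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# check_health <url>
# Succeeds only when the endpoint answers HTTP 200 within 5 seconds.
check_health() {
  [ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")" = "200" ]
}

# Usage (not executed here): probe up to 6 times, 10 seconds apart.
# wait_until 6 10 check_health http://api/health && echo "System has recovered"
```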
Change Management
Change Process
# change-request.yaml
change:
  id: "CHG-2024-001"
  title: "Upgrade API service to v2.0"
  type: "standard"
  risk: "medium"
  approval:
    required: true
    approvers:
      - "sre-team-lead"
      - "api-team-lead"
  rollout:
    strategy: "canary"
    steps:
      - percent: 10
        duration: "30m"
        monitor: "error_rate < 1%"
      - percent: 50
        duration: "30m"
        monitor: "error_rate < 0.5%"
      - percent: 100
        monitor: "error_rate < 0.1%"
  rollback:
    automatic: true
    trigger: "error_rate > 2%"
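Each rollout step combines two thresholds: the step's own `monitor` limit and the global rollback trigger. A gate that reads them might look like the sketch below; `canary_decision` is a hypothetical helper, and treating the band between the two thresholds as "hold" is an assumption the YAML does not spell out.

```shell
#!/bin/bash
# Sketch only: canary_decision and the "hold" outcome are illustrative assumptions.

# canary_decision <observed_error_rate_pct> <step_limit_pct> <rollback_trigger_pct>
canary_decision() {
  awk -v obs="$1" -v step="$2" -v rb="$3" 'BEGIN {
    if (obs > rb)        print "rollback"   # matches trigger: error_rate > 2%
    else if (obs < step) print "promote"    # matches monitor: error_rate < step limit
    else                 print "hold"       # neither condition met: stay at this step
  }'
}

canary_decision 0.3 1 2     # 10% step, healthy
canary_decision 0.8 0.5 2   # 50% step, above its limit
canary_decision 2.5 1 2     # rollback trigger exceeded
```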
Canary Release
# canary-deployment.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-server
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  progressDeadlineSeconds: 600
  analysis:
    interval: 30s
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 30s
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 30s
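The analysis settings determine how long a rollout takes. Flagger's documentation gives the minimum promotion time as interval * (maxWeight / stepWeight) and the time to roll back as interval * threshold; the sketch below applies those two formulas to the values above. `canary_timing` is a hypothetical helper and expects the interval in seconds.

```shell
#!/bin/bash
# Sketch only: canary_timing is an illustrative assumption.

# canary_timing <interval_seconds> <max_weight> <step_weight> <threshold>
canary_timing() {
  awk -v interval="$1" -v max="$2" -v step="$3" -v thr="$4" 'BEGIN {
    printf "promote_min_seconds=%d\n", interval * (max / step)
    printf "rollback_seconds=%d\n", interval * thr
  }'
}

# Values from the Canary resource: interval 30s, maxWeight 50, stepWeight 10, threshold 5.
canary_timing 30 50 10 5
```

With these settings the traffic weight moves through 10, 20, 30, 40, and 50 percent, one step per successful check.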
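Chaos Engineering Practice
#!/bin/bash
# chaos-experiment.sh - automated chaos experiment
SERVICE="api-server"
NAMESPACE="default"
# 1. Verify steady state
echo "Verifying steady state..."
HEALTH=$(curl -s http://api/health | jq -r .status)
if [ "$HEALTH" != "healthy" ]; then
  echo "System is not in steady state, aborting experiment"
  exit 1
fi
# 2. Inject fault (deletes every running pod that matches the label)
echo "Injecting pod failure..."
kubectl delete pod -n "$NAMESPACE" -l "app=$SERVICE" --field-selector=status.phase=Running
# 3. Monitor metrics
echo "Monitoring system metrics..."
# sum() on both sides so the division does not depend on matching label sets
QUERY='sum(rate(http_requests_total{code=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))'
for i in {1..30}; do
  # -G with --data-urlencode keeps the PromQL braces and brackets away from the shell
  ERROR_RATE=$(curl -s -G http://prometheus/api/v1/query --data-urlencode "query=$QUERY" | jq -r '.data.result[0].value[1] // "0"')
  echo "Error rate: $ERROR_RATE"
  if (( $(echo "$ERROR_RATE > 0.1" | bc -l) )); then
    echo "Error rate too high, raising alert"
    break
  fi
  sleep 10
done
# 4. Record results
echo "Experiment complete, recording results..."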
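The final step only prints a message. One way to make it concrete is to reduce the sampled error rates to a verdict; in this sketch `summarize_samples` is a hypothetical helper that reads one sample per line and reuses the 0.1 threshold from the monitoring loop.

```shell
#!/bin/bash
# Sketch only: summarize_samples is an illustrative assumption.

# summarize_samples <threshold>
# Reads one error-rate sample per line on stdin and prints a one-line summary.
summarize_samples() {
  awk -v thr="$1" '
    NF { n++; if ($1 + 0 > max) max = $1 + 0 }   # track the worst sample seen
    END {
      if (n == 0) { print "samples=0 verdict=inconclusive"; exit }
      printf "samples=%d max=%.4f verdict=%s\n", n, max, (max > thr ? "fail" : "pass")
    }'
}

printf '0.00\n0.02\n0.05\n' | summarize_samples 0.1
```

In the experiment script, each `ERROR_RATE` value could be appended to a file inside the loop and that file piped into the function afterwards.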
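Documentation and Knowledge Management
# Operations manual structure
## 1. Service Overview
- Service architecture diagram
- Key component descriptions
- Dependencies
## 2. Runbook
- Routine operations
- Troubleshooting guide
- Scale-up and scale-down procedures
## 3. Incident Handling
- Incident response process
- Contact list
- Escalation matrix
## 4. SLO Documentation
- SLI definitions
- SLO targets
- Error budget policy
## 5. Change Records
- Change history
- Rollback steps
- Verification methods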
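Best Practices
- Embrace automation: automate repetitive work
- Focus on user experience: define SLOs from the user's point of view
- Improve continuously: review and refine processes on a regular schedule
- Share knowledge: build an operations knowledge base
- Balance speed and reliability: spend the error budget deliberately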