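SLO/SLI: Service Level Management
Core Concepts
Service level management hierarchy:
├── SLI (Service Level Indicator): a measured metric of service behavior
├── SLO (Service Level Objective): an internal target for an SLI
└── SLA (Service Level Agreement): a contract with customers, with consequences if missed
Defining SLIs
Common SLI Types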
# SLI configuration
slis:
  # Availability SLI
  - name: availability
    type: availability
    description: "Proportion of successful requests"
    good_requests: "http_requests_total{status_code!~'5..'}"
    total_requests: "http_requests_total"
  # Latency SLI
  - name: latency
    type: latency
    description: "Request response time"
    metric: "http_request_duration_seconds"
    threshold: "500ms"
    percentile: 99
  # Correctness SLI
  - name: correctness
    type: correctness
    description: "Proportion of correct responses"
    good_requests: "http_requests_total{correctness='true'}"
    total_requests: "http_requests_total"
  # Throughput SLI
  - name: throughput
    type: throughput
    description: "Requests per second"
    metric: "rate(http_requests_total[1m])"
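The availability and correctness SLIs above are both "good events divided by total events" ratios, and the latency SLI can be expressed the same way by counting requests that finish within the threshold. The following is a minimal sketch of that arithmetic; the function names and the sample numbers are illustrative assumptions, not part of any tool's API.

```python
def ratio_sli(good_events: float, total_events: float) -> float:
    """Return good/total as a fraction in [0, 1].

    Assumption: a window with no traffic counts as fully good (1.0).
    """
    if total_events <= 0:
        return 1.0
    return good_events / total_events


def latency_sli(durations_s: list[float], threshold_s: float = 0.5) -> float:
    """Fraction of requests that completed within threshold_s seconds."""
    fast = sum(1 for d in durations_s if d <= threshold_s)
    return ratio_sli(fast, len(durations_s))


if __name__ == "__main__":
    # Hypothetical counts: 99,950 non-5xx responses out of 100,000 requests.
    print(f"availability = {ratio_sli(99_950, 100_000):.4%}")
    # Hypothetical request durations in seconds; three of four are under 500 ms.
    print(f"latency SLI  = {latency_sli([0.12, 0.30, 0.48, 0.75]):.2%}")
```

Expressing every SLI as a ratio keeps the later SLO and error budget calculations uniform.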
Prometheus SLI Configuration
# prometheus-sli-recording-rules.yaml
groups:
  - name: sli-availability
    rules:
      # Availability over a 5-minute window
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status_code!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      # Availability over a 30-minute window
      - record: sli:availability:ratio_rate30m
        expr: |
          sum(rate(http_requests_total{status_code!~"5.."}[30m])) by (service)
          /
          sum(rate(http_requests_total[30m])) by (service)
      # Availability over a 1-hour window
      - record: sli:availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status_code!~"5.."}[1h])) by (service)
          /
          sum(rate(http_requests_total[1h])) by (service)
  - name: sli-latency
    rules:
      # Latency percentiles (in seconds)
      - record: sli:latency:p99
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
      - record: sli:latency:p95
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
      - record: sli:latency:p50
        expr: |
          histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
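The latency rules rely on histogram_quantile, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below mirrors that idea in simplified form to show why results depend on bucket boundaries; it is not the Prometheus implementation, and the bucket counts are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs that includes at
    least one finite bound and a final (math.inf, total_count) entry.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    # The lowest bucket is assumed to start at 0.
    lower, below = buckets[i - 1] if i > 0 else (0.0, 0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical cumulative counts per "le" bound, in seconds.
EXAMPLE_BUCKETS = [(0.1, 900), (0.25, 970), (0.5, 990), (1.0, 998), (math.inf, 1000)]

if __name__ == "__main__":
    for label, q in (("p50", 0.50), ("p95", 0.95), ("p99", 0.99)):
        print(f"{label} ~ {histogram_quantile(q, EXAMPLE_BUCKETS):.3f}s")
```

Because the estimate can never be finer than the bucket layout, it helps to place a bucket boundary exactly at the SLO threshold (0.5 seconds here).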
SLO Configuration
SLO Definition File
# slo-config.yaml
slos:
  # API availability SLO
  - name: "api-availability"
    service: "api-server"
    sli: "availability"
    target: 99.95
    window: "30d"
    description: "99.95% availability for the API service"
  # API latency SLO
  - name: "api-latency"
    service: "api-server"
    sli: "latency"
    target: 99.0
    window: "30d"
    threshold: "500ms"
    description: "99% of requests respond within 500ms"
  # Database availability SLO
  - name: "database-availability"
    service: "database"
    sli: "availability"
    target: 99.99
    window: "30d"
    description: "99.99% availability for the database"
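Each target above implies an error budget: the share of the window that is allowed to fail. A small sketch of the conversion from a percentage target to minutes of full downtime follows; the function name is an illustrative assumption, and it treats the budget as time-based even though the availability SLI itself is request-based.

```python
def error_budget_minutes(target_percent: float, window_days: int) -> float:
    """Minutes of total outage a target allows over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    # Targets taken from slo-config.yaml, all over a 30-day window.
    for name, target in (("api-availability", 99.95),
                         ("api-latency", 99.0),
                         ("database-availability", 99.99)):
        minutes = error_budget_minutes(target, 30)
        print(f"{name}: {target}% leaves {minutes:.2f} minutes of budget")
```

Seen this way, moving from 99.95% to 99.99% cuts the 30-day budget from roughly 21.6 minutes to roughly 4.3 minutes, which is the cost side of choosing a stricter target.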
Rolling-Window SLOs
# rolling-slo.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-server
  namespace: default
spec:
  service: "api-server"
  labels:
    team: "backend"
  slos:
    - name: "availability-30d"
      objective: 99.95
      description: "99.95% availability over a 30-day rolling window"
      sli:
        events:
          # The Kubernetes CRD uses camelCase field names.
          # status_code matches the label used in the recording rules.
          errorQuery: sum(rate(http_requests_total{status_code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total[{{.window}}]))
    - name: "latency-30d"
      objective: 99.0
      description: "99% of requests complete within 500ms"
      sli:
        events:
          # Error events = all requests minus those that finished within 0.5s
          errorQuery: |
            sum(rate(http_request_duration_seconds_count[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[{{.window}}]))
          totalQuery: sum(rate(http_request_duration_seconds_count[{{.window}}]))
SLI/SLO Monitoring
Grafana Dashboard
The JSON below is a simplified sketch of two panels. An importable Grafana dashboard needs the full schema (for example, thresholds live under fieldConfig.defaults.thresholds.steps).
{
  "title": "SLI/SLO Monitoring",
  "panels": [
    {
      "title": "Availability SLI",
      "type": "timeseries",
      "targets": [{
        "expr": "sli:availability:ratio_rate5m",
        "legendFormat": "{{ service }}"
      }],
      "thresholds": [
        {"value": 0.9995, "color": "green"},
        {"value": 0.999, "color": "yellow"},
        {"value": 0.99, "color": "red"}
      ]
    },
    {
      "title": "Latency SLI P99",
      "type": "timeseries",
      "targets": [{
        "expr": "sli:latency:p99",
        "legendFormat": "{{ service }}"
      }],
      "thresholds": [
        {"value": 0.5, "color": "green"},
        {"value": 1.0, "color": "yellow"},
        {"value": 2.0, "color": "red"}
      ]
    }
  ]
}
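Alerting Rules
# slo-alerts.yaml
groups:
  - name: slo-alerts
    rules:
      # Availability SLO alert
      - alert: AvailabilitySLOBreach
        expr: sli:availability:ratio_rate5m < 0.999
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Availability SLO at risk of breach"
          description: "{{ $labels.service }} availability is currently {{ $value | humanizePercentage }}"
      # Latency SLO alert
      - alert: LatencySLOBreach
        expr: sli:latency:p99 > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Latency SLO at risk of breach"
          description: "{{ $labels.service }} P99 latency is currently {{ $value }}s"
      # Error budget alert: share of the 30-day budget already consumed.
      # Averaging the 5-minute ratio over 30d is an approximation that is
      # not weighted by traffic volume.
      - alert: ErrorBudgetLow
        expr: (1 - avg_over_time(sli:availability:ratio_rate5m[30d])) / (1 - 0.9995) > 0.5
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "More than 50% of the error budget consumed"
          description: "{{ $labels.service }} has consumed more than 50% of its error budget"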
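The ratio (1 - availability) / (1 - 0.9995) in the ErrorBudgetLow expression compares the observed error ratio with the allowed one. Measured over a short window this quantity is commonly called the burn rate; measured over the whole SLO window it equals the fraction of the budget consumed. The sketch below works through that relationship with assumed numbers; the function names are illustrative.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Fraction of the period's budget used if `rate` holds for window_hours."""
    return rate * window_hours / period_hours


if __name__ == "__main__":
    # Hypothetical: 0.72% of requests failed during the last hour, SLO 99.95%.
    rate = burn_rate(0.0072, 0.9995)
    used = budget_consumed(rate, window_hours=1)
    print(f"burn rate ~ {rate:.1f}x, 1 hour at this rate uses ~ {used:.1%} of a 30-day budget")
```

A burn rate of 1 spends the budget exactly at the end of the window, so alerting on a high burn rate over a short window catches fast incidents well before the 30-day total is exhausted.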
SLA Management
SLA Definition
# sla-contract.yaml
sla:
  service: "api-server"
  provider: "Company Inc."
  customer: "Enterprise Customer"
  availability:
    target: 99.95
    measurement_period: "monthly"
    excluded_downtime:
      - "planned_maintenance"
      - "force_majeure"
  latency:
    target: 99.0
    threshold: "500ms"
    measurement: "p99"
  support:
    response_time:
      critical: "15 minutes"
      high: "1 hour"
      medium: "4 hours"
      low: "1 business day"
  penalties:
    availability_below_99.9: "10% credit"
    availability_below_99.0: "25% credit"
    availability_below_95.0: "50% credit"
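The penalties section defines tiers, and only the most severe tier that applies should be charged. A minimal sketch of that lookup follows, assuming the tiers are evaluated from the lowest threshold upward; the names CREDIT_TIERS and sla_credit_percent are hypothetical.

```python
# (availability threshold in percent, service credit in percent),
# ordered from the most severe tier to the least severe.
CREDIT_TIERS = [(95.0, 50), (99.0, 25), (99.9, 10)]


def sla_credit_percent(availability_percent: float) -> int:
    """Return the credit owed for a month's measured availability."""
    for threshold, credit in CREDIT_TIERS:
        if availability_percent < threshold:
            return credit
    return 0  # No tier applies.


if __name__ == "__main__":
    for measured in (99.97, 99.92, 99.5, 98.0, 90.0):
        print(f"{measured}% availability -> {sla_credit_percent(measured)}% credit")
```

Note that, as the contract is written, a month between 99.9% and 99.95% misses the target without triggering any credit.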
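SLA Monitoring Script
#!/bin/bash
# sla-monitor.sh - SLA monitoring script (requires GNU date, curl, jq, bc)
SERVICE="api-server"
SLA_TARGET=99.95
PROMETHEUS_URL="http://prometheus"
MONTH_START=$(date -d "$(date +%Y-%m-01)" +%s)
NOW=$(date +%s)
# Days since the start of the month, at least 1 so the range selector is valid
DAYS_ELAPSED=$(( (NOW - MONTH_START) / 86400 + 1 ))
# Fetch month-to-date availability as a percentage.
# The recording rule is a 0-1 ratio, so multiply by 100 before comparing.
# avg_over_time is an approximation that is not weighted by traffic volume.
QUERY="avg_over_time(sli:availability:ratio_rate5m{service=\"${SERVICE}\"}[${DAYS_ELAPSED}d]) * 100"
# --data-urlencode escapes the braces and brackets in the PromQL expression
AVAILABILITY=$(curl -sG "${PROMETHEUS_URL}/api/v1/query" --data-urlencode "query=${QUERY}" | jq -r '.data.result[0].value[1] // empty')
if [ -z "$AVAILABILITY" ]; then
  echo "No availability data returned for ${SERVICE}" >&2
  exit 1
fi
# Evaluate SLA status
if (( $(echo "$AVAILABILITY >= $SLA_TARGET" | bc -l) )); then
  echo "SLA met: ${AVAILABILITY}% >= ${SLA_TARGET}%"
else
  echo "SLA missed: ${AVAILABILITY}% < ${SLA_TARGET}%"
fi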
SLO Tools
Sloth (Kubernetes Operator)
# Install Sloth
helm repo add sloth https://slok.github.io/sloth
helm install sloth sloth/sloth -n monitoring
# Create an SLO
kubectl apply -f - << 'EOF'
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-server
  namespace: default
spec:
  service: api-server
  labels:
    team: backend
  slos:
    - name: availability
      objective: 99.95
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status_code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total[{{.window}}]))
EOF
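Best Practices
- Choose the right SLIs: pick metrics that reflect user experience
- Set reasonable SLO targets: weigh cost against benefit
- Define an error budget policy: agree in advance on how the budget may be spent
- Monitor continuously: track SLI/SLO status in real time
- Review regularly: revisit and adjust SLO targets periodically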
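An error budget policy is easier to follow when it is written down as explicit thresholds. The sketch below shows one possible shape for such a policy; the 50% and 10% cut-offs and the action names are purely hypothetical and would need to be agreed by the teams involved.

```python
def remaining_budget(availability: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed = 1.0 - slo_target
    used = 1.0 - availability
    return 1.0 - used / allowed


def policy_action(remaining: float) -> str:
    """Map the remaining budget to an action (hypothetical thresholds)."""
    if remaining > 0.5:
        return "release normally"
    if remaining > 0.1:
        return "limit risky changes"
    return "freeze feature releases"


if __name__ == "__main__":
    # Hypothetical 30-day availability readings against a 99.95% SLO.
    for availability in (0.9998, 0.9996, 0.9990):
        left = remaining_budget(availability, 0.9995)
        print(f"availability {availability:.2%}: {left:.0%} budget left -> {policy_action(left)}")
```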