SLO/SLI: Service Level Management

Core Concepts

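The service level management hierarchy:
  ├── SLI (Service Level Indicator): a measured metric of service behavior
  ├── SLO (Service Level Objective): the target value set for an SLI
  └── SLA (Service Level Agreement): the contractual commitment made to customers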

Defining SLIs

Common SLI Types

# SLI configuration
slis:
  # Availability SLI
  - name: availability
    type: availability
    description: "Proportion of successful requests"
    good_requests: "http_requests_total{status_code!~'5..'}"
    total_requests: "http_requests_total"

  # Latency SLI
  - name: latency
    type: latency
    description: "Request response time"
    metric: "http_request_duration_seconds"
    threshold: "500ms"
    percentile: 99

  # Correctness SLI
  - name: correctness
    type: correctness
    description: "Proportion of correct responses"
    good_requests: "http_requests_total{correctness='true'}"
    total_requests: "http_requests_total"

  # Throughput SLI
  - name: throughput
    type: throughput
    description: "Requests per second"
    metric: "rate(http_requests_total[1m])"
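
The ratio-style SLIs above reduce to simple arithmetic: good events divided by total events. The following Python sketch illustrates that for the availability and latency SLIs. The `Request` type and the sample data are hypothetical and exist only for this illustration; a real setup would read these counts from the metrics system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status_code: int
    duration_seconds: float


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for r in requests if not 500 <= r.status_code <= 599)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_seconds: float = 0.5) -> float:
    """Fraction of requests answered within the latency threshold."""
    if not requests:
        return 1.0  # same assumption as above
    fast = sum(1 for r in requests if r.duration_seconds <= threshold_seconds)
    return fast / len(requests)


# Hypothetical sample: one 5xx response and one slow response out of four.
sample = [
    Request(200, 0.12),
    Request(200, 0.48),
    Request(503, 0.05),
    Request(200, 0.91),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

Treating an empty window as 100% is a design choice; some teams prefer to report "no data" instead so that a silent outage is not mistaken for good health.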

Prometheus SLI Configuration

# prometheus-sli-recording-rules.yaml
groups:
  - name: sli-availability
    rules:
      # Availability over a 5-minute window
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status_code!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      
      # Availability over a 30-minute window
      - record: sli:availability:ratio_rate30m
        expr: |
          sum(rate(http_requests_total{status_code!~"5.."}[30m])) by (service)
          /
          sum(rate(http_requests_total[30m])) by (service)
      
      # Availability over a 1-hour window
      - record: sli:availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status_code!~"5.."}[1h])) by (service)
          /
          sum(rate(http_requests_total[1h])) by (service)
  
  - name: sli-latency
    rules:
      # Latency percentiles
      - record: sli:latency:p99
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
      
      - record: sli:latency:p95
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
      
      - record: sli:latency:p50
        expr: |
          histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
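
The latency rules above rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counters by interpolating linearly inside the bucket that contains the target rank. The Python sketch below is a simplified illustration of that idea, not the Prometheus implementation: it skips several edge cases the real function handles, and the bucket values are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Assumes the highest bucket has upper bound +Inf, as Prometheus
    histograms do, and that the lowest bucket starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for http_request_duration_seconds_bucket.
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(round(histogram_quantile(0.90, buckets), 3))  # 0.417
print(histogram_quantile(0.99, buckets))            # 1.0
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries. If 500ms is the SLO threshold, having a bucket with `le="0.5"` keeps the comparison against that threshold exact.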

SLO Configuration

SLO Definition File

# slo-config.yaml
slos:
  # API availability SLO
  - name: "api-availability"
    service: "api-server"
    sli: "availability"
    target: 99.95
    window: "30d"
    description: "99.95% availability for the API service"

  # API latency SLO
  - name: "api-latency"
    service: "api-server"
    sli: "latency"
    target: 99.0
    window: "30d"
    threshold: "500ms"
    description: "99% of requests answered within 500ms"

  # Database availability SLO
  - name: "database-availability"
    service: "database"
    sli: "availability"
    target: 99.99
    window: "30d"
    description: "99.99% availability for the database"
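
Each availability target implies an error budget: the share of the window in which the service may fail while still meeting the SLO. As a rough sketch, the Python snippet below converts the targets and the 30-day window from the definitions above into allowed downtime. The `error_budget` helper is a name introduced here for illustration.

```python
from datetime import timedelta


def error_budget(target_percent: float, window: timedelta) -> timedelta:
    """Downtime allowed within the window while still meeting the target."""
    return window * (1 - target_percent / 100)


window = timedelta(days=30)
for name, target in [("api-availability", 99.95), ("database-availability", 99.99)]:
    budget = error_budget(target, window)
    print(f"{name}: {budget.total_seconds() / 60:.1f} minutes per 30 days")
# api-availability: 21.6 minutes per 30 days
# database-availability: 4.3 minutes per 30 days
```

The gap between the two results shows why each additional "nine" is costly: moving from 99.95% to 99.99% cuts the tolerated downtime to one fifth.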

Time-Based SLOs

# rolling-slo.yaml
# Note: the Kubernetes CRD uses camelCase field names (errorQuery, totalQuery).
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-server
  namespace: default
spec:
  service: "api-server"
  labels:
    team: "backend"
  slos:
    - name: "availability-30d"
      objective: 99.95
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status_code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total[{{.window}}]))
      description: "99.95% availability over a 30-day rolling window"

    - name: "latency-30d"
      objective: 99.0
      sli:
        events:
          # Error events are requests slower than 500ms: all requests minus those in the le="0.5" bucket
          errorQuery: |
            sum(rate(http_request_duration_seconds_count[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[{{.window}}]))
          totalQuery: sum(rate(http_request_duration_seconds_count[{{.window}}]))
      description: "99% of requests complete within 500ms"

SLI/SLO Monitoring

Grafana Dashboard

{
  "title": "SLI/SLO Monitoring",
  "panels": [
    {
      "title": "Availability SLI",
      "type": "timeseries",
      "targets": [{
        "expr": "sli:availability:ratio_rate5m",
        "legendFormat": "{{ service }}"
      }],
      "thresholds": [
        {"value": 0.9995, "color": "green"},
        {"value": 0.999, "color": "yellow"},
        {"value": 0.99, "color": "red"}
      ]
    },
    {
      "title": "Latency SLI P99",
      "type": "timeseries",
      "targets": [{
        "expr": "sli:latency:p99",
        "legendFormat": "{{ service }}"
      }],
      "thresholds": [
        {"value": 0.5, "color": "green"},
        {"value": 1.0, "color": "yellow"},
        {"value": 2.0, "color": "red"}
      ]
    }
  ]
}

Alert Rules

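# slo-alerts.yaml
groups:
  - name: slo-alerts
    rules:
      # Availability SLO alert
      - alert: AvailabilitySLOBreach
        expr: sli:availability:ratio_rate5m < 0.999
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Availability SLO is at risk of being breached"
          description: "{{ $labels.service }} availability is currently {{ $value | humanizePercentage }}"

      # Latency SLO alert
      - alert: LatencySLOBreach
        expr: sli:latency:p99 > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Latency SLO is at risk of being breached"
          description: "{{ $labels.service }} P99 latency is currently {{ $value }} seconds"

      # Error budget alert: share of the 30-day budget (99.95% SLO) already spent.
      # Averaging the 5m ratio over 30d is a time-weighted approximation of 30-day
      # availability; ratio_rate30m would only cover the last 30 minutes.
      - alert: ErrorBudgetLow
        expr: (1 - avg_over_time(sli:availability:ratio_rate5m[30d])) / (1 - 0.9995) > 0.5
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "More than 50% of the error budget has been consumed"
          description: "{{ $labels.service }} has consumed more than 50% of its error budget"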
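
The error budget alert above divides the observed error ratio by the error ratio the SLO permits. The Python sketch below works through that arithmetic in two forms: a burn rate for a short window, and the fraction of budget consumed from request counts over the full SLO window. Both function names and all the numbers are illustrative assumptions.

```python
def burn_rate(availability: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    A value of 1.0 means the budget would run out exactly at the end of
    the SLO window if the current error ratio persisted.
    """
    return (1 - availability) / (1 - slo_target)


def budget_consumed(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget used, from counts over the whole SLO window."""
    if total == 0:
        return 0.0
    allowed_errors = total * (1 - slo_target)
    return (total - good) / allowed_errors


# Availability of 99.9% against a 99.95% target burns budget about twice as fast as allowed.
print(round(burn_rate(0.999, 0.9995), 2))                   # 2.0

# 300 failed requests out of 1,000,000, where the SLO allows 500, is about 60% of the budget.
print(round(budget_consumed(999_700, 1_000_000, 0.9995), 2))  # 0.6
```

The distinction matters when choosing alert windows: a short window measures how fast the budget is burning, while only a window matching the SLO period shows how much of the budget is gone.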

SLA Management

SLA Definition

# sla-contract.yaml
sla:
  service: "api-server"
  provider: "Company Inc."
  customer: "Enterprise Customer"
  
  availability:
    target: 99.95
    measurement_period: "monthly"
    excluded_downtime:
      - "planned_maintenance"
      - "force_majeure"
  
  latency:
    target: 99.0
    threshold: "500ms"
    measurement: "p99"
  
  support:
    response_time:
      critical: "15 minutes"
      high: "1 hour"
      medium: "4 hours"
      low: "1 business day"
  
  penalties:
    availability_below_99.9: "10% credit"
    availability_below_99.0: "25% credit"
    availability_below_95.0: "50% credit"
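
The penalty tiers in the contract above can be read as a lookup from measured monthly availability to a service credit. A minimal Python sketch of that lookup follows; the function name is hypothetical, and the thresholds are copied from the `penalties` section.

```python
def sla_credit_percent(availability_percent: float) -> int:
    """Return the service credit (in percent) owed for a measured availability."""
    # Tiers from the contract, checked from most to least severe.
    tiers = [(95.0, 50), (99.0, 25), (99.9, 10)]
    for threshold, credit in tiers:
        if availability_percent < threshold:
            return credit
    return 0


for measured in (99.97, 99.92, 99.5, 98.0, 90.0):
    print(measured, sla_credit_percent(measured))
# 99.97 0
# 99.92 0
# 99.5 10
# 98.0 25
# 90.0 50
```

Note that 99.92% misses the 99.95% target yet earns no credit, because the first penalty tier in the contract starts below 99.9%.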

SLA Monitoring Script

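#!/bin/bash
# sla-monitor.sh - SLA monitoring script (needs curl, jq, bc, and GNU date)

SERVICE="api-server"
SLA_TARGET=99.95   # percent
PROMETHEUS_URL="http://prometheus"

# Measure month-to-date: whole days elapsed since the 1st, with a 1-day minimum
MONTH_START=$(date -d "$(date +%Y-%m-01)" +%s)
NOW=$(date +%s)
DAYS_ELAPSED=$(( (NOW - MONTH_START) / 86400 ))
if (( DAYS_ELAPSED < 1 )); then DAYS_ELAPSED=1; fi

# Fetch current availability as a percentage (the recorded SLI is a 0-1 ratio).
# --data-urlencode avoids curl's URL globbing on the [..] range selector.
QUERY="avg_over_time(sli:availability:ratio_rate5m{service=\"${SERVICE}\"}[${DAYS_ELAPSED}d]) * 100"
AVAILABILITY=$(curl -sG "${PROMETHEUS_URL}/api/v1/query" --data-urlencode "query=${QUERY}" \
  | jq -r '.data.result[0].value[1] // empty')

if [[ -z "$AVAILABILITY" ]]; then
  echo "No availability data returned for ${SERVICE}" >&2
  exit 1
fi

# Evaluate SLA status
if (( $(echo "$AVAILABILITY >= $SLA_TARGET" | bc -l) )); then
  echo "SLA met: ${AVAILABILITY}% >= ${SLA_TARGET}%"
else
  echo "SLA missed: ${AVAILABILITY}% < ${SLA_TARGET}%"
fi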

SLO Tooling

Sloth (Kubernetes Operator)

# Install Sloth
helm repo add sloth https://slok.github.io/sloth
helm install sloth sloth/sloth -n monitoring

# Create an SLO (the CRD uses camelCase field names: errorQuery, totalQuery)
kubectl apply -f - << 'EOF'
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-server
  namespace: default
spec:
  service: api-server
  labels:
    team: backend
  slos:
    - name: availability
      objective: 99.95
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{status_code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total[{{.window}}]))
EOF

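Best Practices

Choose the right SLIs: pick metrics that reflect the user experience.
Set reasonable SLO targets: weigh the cost of higher reliability against its benefit.
Define an error budget policy: agree in advance on how the budget may be spent.
Monitor continuously: track SLI/SLO status in real time.
Review regularly: revisit and adjust SLO targets on a fixed schedule.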