Advanced Chaos Engineering Practices
Advanced Fault Types
Resource Exhaustion
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: myapp
  stressors:
    cpu:
      workers: 4
      load: 80
  duration: "5m"
Clock Skew
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: time-shift
spec:
  mode: all
  selector:
    labelSelectors:
      app: myapp
  timeOffset: "1h"
  clockIds: ["CLOCK_REALTIME"]
  duration: "10m"
Network Partition
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
spec:
  action: partition
  mode: all
  selector:
    labelSelectors:
      app: myapp
  direction: both
  target:
    selector:
      labelSelectors:
        app: database
    mode: all
  duration: "5m"
Experiment Design
Steady-State Hypothesis
# Steady-state metric definitions
steady_state:
  metrics:
    - name: availability
      query: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      threshold: 0.999  # lower bound: at least 99.9% of requests are not 5xx
    - name: latency_p99
      query: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
      threshold: 0.5  # upper bound, in seconds
    - name: error_rate
      query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      threshold: 0.01  # upper bound: at most 1% of requests are 5xx
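The three thresholds point in different directions: availability is a floor, while P99 latency and error rate are ceilings. A minimal shell sketch of that comparison follows. `check_metric` is a hypothetical helper name, and it assumes the metric value has already been read from Prometheus as a plain decimal. `awk` does the comparison because shell arithmetic is integer-only.

```shell
#!/bin/bash
# check_metric NAME VALUE
# Hypothetical helper: returns 0 when VALUE satisfies the steady-state
# threshold for NAME, 1 when it violates it, 2 for an unknown metric.
check_metric() {
  local name="$1" value="$2"
  case "$name" in
    # Availability must stay at or above 99.9%.
    availability) awk -v v="$value" 'BEGIN { exit !(v >= 0.999) }' ;;
    # P99 latency must stay at or below 0.5 seconds.
    latency_p99)  awk -v v="$value" 'BEGIN { exit !(v <= 0.5) }' ;;
    # Error rate must stay at or below 1%.
    error_rate)   awk -v v="$value" 'BEGIN { exit !(v <= 0.01) }' ;;
    *) echo "unknown metric: $name" >&2; return 2 ;;
  esac
}
```

For example, `check_metric availability 0.9995` succeeds, while `check_metric error_rate 0.02` fails.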
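Experiment Configuration
# Complete experiment configuration
experiment:
  name: pod-kill-experiment
  description: Verify service resilience when a Pod fails
  hypothesis:
    - "When a Pod fails, Kubernetes automatically restarts or replaces it"
    - "The service recovers within 30 seconds"
    - "The error rate does not exceed 1%"
  steady_state:
    - metric: availability
      threshold: 0.999
  fault:
    type: pod-kill
    target:
      labelSelector:
        app: myapp
    duration: 5m
  rollback:
    automatic: true
    timeout: 10m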
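The second hypothesis (recovery within 30 seconds) is only testable if recovery time is actually measured. The sketch below polls an arbitrary health-check command until it succeeds or a deadline passes, then prints the elapsed seconds. `wait_for_recovery` is a hypothetical helper, and the `kubectl wait` call in the usage comment assumes the Pods carry the `app=myapp` label and run in a `production` namespace.

```shell
#!/bin/bash
# wait_for_recovery TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND repeatedly until it exits 0.
# Prints the elapsed whole seconds and returns 0 on recovery;
# returns 1 if TIMEOUT_SECONDS pass without a successful check.
wait_for_recovery() {
  local timeout="$1" interval="$2"
  shift 2
  local start elapsed
  start=$(date +%s)
  while :; do
    # The check runs at least once, even with a timeout of 0.
    if "$@" >/dev/null 2>&1; then
      elapsed=$(( $(date +%s) - start ))
      echo "$elapsed"
      return 0
    fi
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Example (not executed here; requires cluster access):
#   wait_for_recovery 30 2 kubectl wait --for=condition=Ready pod \
#     -l app=myapp -n production --timeout=2s
```

The clock starts when the function is called, so it should be invoked immediately after the fault is injected for the printed value to approximate recovery time.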
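Practice: Complete Experiment Workflow
#!/bin/bash
# 1. Pre-experiment checks
echo "=== Pre-experiment checks ==="
# Check service health
kubectl get pods -n production
# Check the current steady state (assumes a Prometheus series named "availability" exists)
curl -s "http://prometheus:9090/api/v1/query?query=availability"
# 2. Run the experiment
echo "=== Running chaos experiment ==="
kubectl apply -f experiment.yaml
# 3. Monitor the experiment
echo "=== Monitoring experiment ==="
# Poll every 5 seconds for the 5-minute experiment window.
# Unlike `watch`, which never returns, this loop lets the script continue.
for _ in $(seq 60); do
  kubectl get pods -n production
  curl -s "http://prometheus:9090/api/v1/query?query=availability"
  sleep 5
done
# 4. Verify steady state
echo "=== Verifying steady state ==="
# Check the result now that the experiment window has elapsed
curl -s "http://prometheus:9090/api/v1/query?query=availability"
# 5. Post-experiment cleanup
echo "=== Post-experiment cleanup ==="
kubectl delete -f experiment.yaml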
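The `curl` calls in the script pass a bare metric name, which works only if a series with that name exists (for example, from a recording rule). For a full PromQL expression the `query` parameter has to be URL-encoded, and the number has to be pulled out of the JSON response. A rough sketch, assuming the same `http://prometheus:9090` address as the script: `prom_query` and `prom_value` are hypothetical helper names, and the `sed` extraction handles only a single-series instant-vector result (a JSON tool such as `jq` would be more robust where available).

```shell
#!/bin/bash
# Assumed Prometheus base URL, taken from the script; override as needed.
PROM_URL="${PROM_URL:-http://prometheus:9090}"

# prom_query EXPR: send an instant query; curl URL-encodes the expression.
prom_query() {
  curl -s --get --data-urlencode "query=$1" "$PROM_URL/api/v1/query"
}

# prom_value: read a query response on stdin and print the sample value
# of the single series it contains. Prints nothing if there is no sample.
prom_value() {
  sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"].*/\1/p'
}

# Example (not executed here; requires a reachable Prometheus):
#   prom_query 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' | prom_value
```

Keeping the HTTP call and the parsing in separate functions means the parsing can be checked against a saved response without touching the network.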
Experiment Metrics
# Prometheus queries
metrics:
  # Availability
  availability: |
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
  # Error rate
  error_rate: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
  # P99 latency
  latency_p99: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
  # Success rate (1 - error rate; the same quantity as availability)
  success_rate: |
    1 - (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    )
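To see how these ratios relate, it helps to run the arithmetic on concrete counts. The sketch below mirrors the PromQL formulas using plain request totals instead of rates. `error_rate` and `availability` are hypothetical shell functions, and the counts in the usage note are made-up illustration values, not measurements.

```shell
#!/bin/bash
# error_rate TOTAL ERRORS: fraction of requests that returned 5xx.
error_rate() {
  awk -v total="$1" -v errors="$2" 'BEGIN {
    if (total <= 0) exit 1          # no traffic: the ratio is undefined
    printf "%.6f\n", errors / total
  }'
}

# availability TOTAL ERRORS: fraction of requests that did not return 5xx.
# Algebraically equal to 1 - error_rate, i.e. the success rate.
availability() {
  awk -v total="$1" -v errors="$2" 'BEGIN {
    if (total <= 0) exit 1          # no traffic: the ratio is undefined
    printf "%.6f\n", (total - errors) / total
  }'
}
```

With 10,000 requests of which 5 failed, `error_rate 10000 5` prints `0.000500` and `availability 10000 5` prints `0.999500`, which sits just above a 0.999 availability threshold.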
Automated Experiments
# Scheduled experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-chaos
spec:
  schedule: "0 2 * * 1"  # 02:00 every Monday
  historyLimit: 5
  concurrencyPolicy: Forbid
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: myapp
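Safety Practices
- Limit the blast radius
- Test in a non-production environment before running in production
- Configure automatic rollback
- Monitor the experiment's impact
- Record experiment results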
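Automatic rollback and impact monitoring can be combined into a simple guard that stops the experiment as soon as the error rate crosses a limit. The following is a minimal sketch, not a feature of any particular chaos tool. `guard_experiment` is a hypothetical function, and the metric reader and abort action are passed in as command names, so the loop itself makes no assumptions about Prometheus or Kubernetes.

```shell
#!/bin/bash
# guard_experiment MAX_ERROR_RATE CHECKS INTERVAL_SECONDS READ_CMD ABORT_CMD
# Hypothetical helper: calls READ_CMD (which must print the current error
# rate as a decimal) up to CHECKS times, sleeping INTERVAL_SECONDS between
# calls. If a reading is missing or exceeds MAX_ERROR_RATE, runs ABORT_CMD
# and returns 1. Returns 0 if every reading stayed within the limit.
guard_experiment() {
  local max="$1" checks="$2" interval="$3" read_cmd="$4" abort_cmd="$5"
  local i=0 rate
  while [ "$i" -lt "$checks" ]; do
    rate=$("$read_cmd")
    # Fail safe: an empty reading is treated like a breach.
    if [ -z "$rate" ] || awk -v r="$rate" -v m="$max" 'BEGIN { exit !(r > m) }'; then
      echo "error rate '$rate' is missing or exceeds $max, aborting" >&2
      "$abort_cmd"
      return 1
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 0
}

# Example wiring (not executed here; both functions are assumptions):
#   read_error_rate()  { print the current error rate as a decimal; }
#   abort_experiment() { kubectl delete -f experiment.yaml; }
#   guard_experiment 0.01 60 5 read_error_rate abort_experiment
```

Treating a missing reading as a breach is a deliberate choice here: if monitoring goes dark during an experiment, stopping is the safer default.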
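Best Practices
- Start small
- Increase complexity gradually
- Automate the experiment workflow
- Build a culture of experimentation
- Continuously improve resilience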
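Summary
Chaos engineering is an effective way to improve system resilience. With rigorous experiment design and automated execution, teams can continuously verify and strengthen a system's fault tolerance.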