Chaos Engineering: Testing System Resilience
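What Is Chaos Engineering
Chaos engineering is the practice of deliberately injecting faults into a system to verify its resilience. It helps teams find and fix latent weaknesses before they turn into production incidents, which improves reliability. Netflix was the first to propose and practice the idea.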
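Principles of Chaos Engineering
The core principles are:
1. Build a hypothesis around steady-state behavior
2. Simulate real-world events
3. Run experiments in production
4. Automate experiments to run continuously
5. Minimize the blast radius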
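The first principle becomes testable once the steady state is written down as explicit thresholds. The following is a minimal sketch in Python. The `SteadyState` class, the `holds` function, and the two thresholds (error rate and p99 latency) are illustrative assumptions, not part of any chaos tool.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical steady-state definition; pick thresholds from your own SLOs."""
    max_error_rate: float       # allowed fraction of failed requests (0.01 = 1%)
    max_p99_latency_ms: float   # allowed 99th-percentile latency in milliseconds


def percentile(values: Sequence[float], pct: int) -> float:
    """Return the nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def holds(state: SteadyState, latencies_ms: Sequence[float], failed: int) -> bool:
    """Check whether one observation window satisfies the steady state.

    latencies_ms holds one latency per request; failed counts failed requests.
    """
    p99 = percentile(latencies_ms, 99)          # raises on an empty window
    error_rate = failed / len(latencies_ms)
    return error_rate <= state.max_error_rate and p99 <= state.max_p99_latency_ms
```

Evaluating the same `holds` check before, during, and after fault injection keeps the hypothesis identical across all three phases.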
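Experiment Workflow
A chaos experiment proceeds through these steps:
├── 1. Define steady-state metrics
├── 2. Hypothesize how the system will behave
├── 3. Design the experiment
├── 4. Run the experiment
├── 5. Analyze the results
└── 6. Fix the problems found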
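The workflow above can be sketched as a small driver function. This is an illustration under assumptions: `run_experiment` and its three callbacks are hypothetical names, and a real experiment would also record metrics while the fault is active.

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run one experiment and return "aborted", "passed", or "weakness-found"."""
    # Steps 1-2: only start from a healthy system, otherwise results mean nothing.
    if not steady_state():
        return "aborted"
    try:
        # Step 4: inject the fault, then re-check the hypothesis while it is active.
        inject()
        held = steady_state()
    finally:
        # Always undo the fault, even if injection or the check raised.
        rollback()
    # Step 5: a broken hypothesis is a finding to fix (step 6), not a test error.
    return "passed" if held else "weakness-found"
```

Putting the rollback in a `finally` clause is the key design choice: the fault is removed no matter how the experiment ends.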
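Common Fault Types
Infrastructure Faults
# CPU pressure
# Use the stress tool to simulate high CPU load (4 workers for 60 seconds)
stress --cpu 4 --timeout 60s
# Memory pressure (2 workers allocating 512 MB each)
stress --vm 2 --vm-bytes 512M --timeout 60s
# Disk I/O pressure (--io spawns workers that call sync(); use --hdd for workers that write files)
stress --io 4 --timeout 60s
# Network latency
# Use tc to add 100ms of delay with ±50ms of jitter
sudo tc qdisc add dev eth0 root netem delay 100ms 50ms
# Packet loss (if a root qdisc already exists, remove it first: sudo tc qdisc del dev eth0 root)
sudo tc qdisc add dev eth0 root netem loss 10%
# Network partition (drops all traffic to and from 10.0.0.0/8; undo by repeating the rules with -D)
sudo iptables -A INPUT -s 10.0.0.0/8 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.0/8 -j DROP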
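Only one root qdisc can be attached to a device at a time, so delay and loss are easier to manage as a single netem rule with a matching cleanup command. The sketch below builds both argument lists in Python. `netem_add` and `netem_del` are hypothetical helper names, and the result is meant to be passed to something like `subprocess.run` with root privileges.

```python
from typing import List, Optional


def netem_add(dev: str, delay_ms: int = 0, jitter_ms: int = 0,
              loss_pct: Optional[float] = None) -> List[str]:
    """Build a `tc qdisc add ... netem` argument list combining delay and loss."""
    if delay_ms <= 0 and loss_pct is None:
        raise ValueError("specify a delay, a loss percentage, or both")
    if jitter_ms and delay_ms <= 0:
        raise ValueError("jitter requires a delay")
    if loss_pct is not None and not 0 < loss_pct <= 100:
        raise ValueError("loss_pct must be in (0, 100]")
    args = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if delay_ms > 0:
        args += ["delay", f"{delay_ms}ms"]
        if jitter_ms:
            args.append(f"{jitter_ms}ms")
    if loss_pct is not None:
        args += ["loss", f"{loss_pct:g}%"]
    return args


def netem_del(dev: str) -> List[str]:
    """Build the matching cleanup command that removes the root qdisc."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]
```

For example, `netem_add("eth0", delay_ms=100, jitter_ms=50, loss_pct=10)` yields the arguments for one rule that applies both the delay and the packet loss shown above.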
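Application-Layer Faults
# Kill a process
kill -9 <pid>
# Stop a service
systemctl stop nginx
# Make a port unreachable
sudo iptables -A INPUT -p tcp --dport 8080 -j DROP
# DNS failure: point the resolver at localhost, after backing up the original file.
# "sudo echo ... > file" fails because the redirection runs as the calling user, so pipe through tee.
sudo cp /etc/resolv.conf /etc/resolv.conf.bak
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf > /dev/null
# Fill the filesystem (writes a 1 GiB file; raise count to exhaust the disk, delete /tmp/fill to recover)
dd if=/dev/zero of=/tmp/fill bs=1M count=1024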
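File-based faults such as the DNS change above are safest when the original content is restored automatically. The following Python sketch shows one way to do that with a context manager. `temporarily_replaced` is a hypothetical helper, and it keeps the original only in memory, so it does not protect against the process being killed.

```python
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def temporarily_replaced(path: Path, content: str) -> Iterator[None]:
    """Overwrite `path` with `content`, then restore the original on exit."""
    original = path.read_bytes()    # in-memory backup; lost if the process dies
    try:
        path.write_text(content)
        yield
    finally:
        path.write_bytes(original)  # restore even if the body raised
```

A run against the resolver configuration would wrap the observation phase in `with temporarily_replaced(Path("/etc/resolv.conf"), "nameserver 127.0.0.1\n"):` and needs root privileges.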
Chaos Mesh
Installing Chaos Mesh
# Install with Helm
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-testing \
  --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock
Creating Experiments
# network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: my-service
  delay:
    latency: "100ms"
    correlation: "100"
    jitter: "10ms"
  duration: "300s"
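# pod-kill.yaml
# Note: spec.scheduler is Chaos Mesh 1.x syntax. Chaos Mesh 2.x uses a separate
# Schedule resource instead, so check the documentation for your version.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
  namespace: default
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: my-service
  scheduler:
    cron: "@every 5m"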
# cpu-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
  namespace: default
spec:
  mode: one
  selector:
    labelSelectors:
      app: my-service
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "300s"
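When many services need the same experiment, manifests like the NetworkChaos example above can be generated instead of copied by hand. The sketch below assumes only the fields shown in network-delay.yaml. `network_delay_manifest` is a hypothetical helper, and the output is JSON, which `kubectl apply -f -` accepts as well as YAML.

```python
import json
from typing import Any, Dict


def network_delay_manifest(name: str, namespace: str, app: str,
                           latency: str, jitter: str, duration: str) -> Dict[str, Any]:
    """Build a NetworkChaos delay manifest mirroring network-delay.yaml."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",
            # The label selector is what bounds the blast radius.
            "selector": {"labelSelectors": {"app": app}},
            "delay": {"latency": latency, "correlation": "100", "jitter": jitter},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    # Pipe the output into: kubectl apply -f -
    manifest = network_delay_manifest(
        "network-delay", "default", "my-service", "100ms", "10ms", "300s")
    print(json.dumps(manifest, indent=2))
```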
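Litmus Chaos
Installing Litmus
# Install Litmus Chaos (confirm the manifest URL against the Litmus docs for your release)
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-3.0.0.yaml
# Access the dashboard (the frontend service name may differ between installations)
kubectl port-forward -n litmus svc/litmus-frontend 9091:9091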
Creating a Chaos Experiment
# nginx-kill.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-kill
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=nginx
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total duration of the chaos, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # Interval between successive pod deletions, in seconds
            - name: CHAOS_INTERVAL
              value: '10'
Chaos Experiment Toolbox
Chaos Toolkit
{
  "title": "System Resilience Test",
  "description": "Verify that the system stays resilient when the database fails",
  "steady-state-hypothesis": {
    "title": "The system is running normally",
    "probes": [
      {
        "type": "probe",
        "name": "api-responds",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://api:8080/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-database",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pods.actions",
        "func": "delete_pods",
        "arguments": {
          "name": "postgres",
          "ns": "default",
          "label": "app=database"
        }
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "restore-database",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pods.actions",
        "func": "create_pod",
        "arguments": {
          "name": "postgres",
          "ns": "default"
        }
      }
    }
  ]
}
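Hand-written experiment files are prone to small structural slips, such as a mismatched bracket or a `method` that is not a list. The sketch below is a rough pre-flight check in Python. `check_experiment` is a hypothetical helper and only approximates the experiment format; Chaos Toolkit's own `chaos validate` command remains the authoritative check.

```python
import json
from typing import List


def check_experiment(text: str) -> List[str]:
    """Return a list of problems found in an experiment file; empty means none."""
    try:
        experiment = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(experiment, dict):
        return ["top level must be a JSON object"]
    problems = []
    for key in ("title", "description"):
        if not isinstance(experiment.get(key), str):
            problems.append(f"missing or non-string field: {key}")
    for key in ("method", "rollbacks"):
        # Both sections are sequences of activities, not single objects.
        if key in experiment and not isinstance(experiment[key], list):
            problems.append(f"{key} must be a list of activities")
    if "method" not in experiment:
        problems.append("missing field: method")
    return problems
```

Running this in CI before any experiment is scheduled catches malformed files without touching a cluster.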
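Experiment Script
#!/bin/bash
# chaos-experiment.sh - automated chaos experiment script
set -euo pipefail
SERVICE="my-service"
NAMESPACE="default"
DURATION=60
HEALTH_URL="http://api:8080/health"

# Print the HTTP status code of the health endpoint ("000" if the request fails)
check_health() {
  curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL" || true
}

# Remove the injected delay; also registered as an EXIT trap so it runs on errors and Ctrl-C
cleanup() {
  echo "Restoring the system..."
  kubectl exec -n "$NAMESPACE" "deploy/$SERVICE" -- \
    tc qdisc del dev eth0 root || true
}

echo "Starting chaos experiment..."
# 1. Check steady state
echo "Checking steady state..."
if [ "$(check_health)" != "200" ]; then
  echo "System is not in steady state; aborting experiment"
  exit 1
fi
# 2. Inject the fault (the container needs the tc binary and the NET_ADMIN capability)
echo "Injecting network delay..."
trap cleanup EXIT
kubectl exec -n "$NAMESPACE" "deploy/$SERVICE" -- \
  tc qdisc add dev eth0 root netem delay 100ms
# 3. Hold the fault for the configured duration
echo "Waiting ${DURATION} seconds..."
sleep "$DURATION"
# 4. Restore
trap - EXIT
cleanup
# 5. Verify recovery
sleep 10
if [ "$(check_health)" = "200" ]; then
  echo "Experiment passed: the system recovered"
else
  echo "Experiment failed: the system did not recover"
  exit 1
fi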
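The script waits a fixed 10 seconds before verifying recovery, which can be too short for slow services and wastes time for fast ones. A polling loop with a timeout is a common alternative, and it also yields a recovery time worth recording. The Python sketch below assumes a caller-supplied `is_healthy` callback. `wait_for_recovery` and its parameters are hypothetical names.

```python
import time
from typing import Callable


def wait_for_recovery(
    is_healthy: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until healthy; return seconds elapsed, or raise TimeoutError."""
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"no recovery within {timeout_s} seconds")
        sleep(interval_s)
```

The `clock` and `sleep` parameters exist so the loop can be tested with a fake clock; production callers can leave the defaults.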
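Monitoring and Observability
# Prometheus alerting rules
# chaos_experiment_status is assumed to be a gauge exported by your own experiment tooling (1 = passed, 0 = failed)
groups:
  - name: chaos-experiments
    rules:
      - alert: ChaosExperimentFailed
        expr: chaos_experiment_status == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Chaos experiment failed"
          description: "Experiment {{ $labels.experiment }} did not reach the expected result"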
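The alert rule above depends on a `chaos_experiment_status` series with an `experiment` label, which something has to publish. One possible approach, sketched here in Python, is to render the result in the Prometheus text exposition format. `render_status` and `escape_label` are hypothetical helpers, and the convention 1 = passed, 0 = failed is an assumption inferred from the rule's `== 0` condition.

```python
def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_status(experiment: str, passed: bool) -> str:
    """Render chaos_experiment_status as a gauge: 1 = passed, 0 = failed."""
    return (
        "# HELP chaos_experiment_status Result of the last chaos experiment run.\n"
        "# TYPE chaos_experiment_status gauge\n"
        f'chaos_experiment_status{{experiment="{escape_label(experiment)}"}} {int(passed)}\n'
    )
```

The rendered text could be written to a file read by the node_exporter textfile collector or pushed to a Pushgateway; which of the two fits depends on how the experiments are scheduled.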
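Safety Considerations
1. Limit the blast radius
   - Use label selectors to control exactly which workloads are affected
   - Configure a maximum fault duration
2. Monitor the experiment
   - Watch key metrics in real time
   - Define conditions that trigger an automatic rollback
3. Notify people
   - Tell the relevant teams before the experiment starts
   - Raise an alert immediately when something unexpected happens
4. Control access
   - Restrict who is allowed to run chaos experiments
   - Keep an audit log of every experiment action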
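Limiting the blast radius can be enforced in code before any fault is injected. The Python sketch below caps how many pods one experiment may target. `max_targets`, the 25% default fraction, and the hard cap of 5 are illustrative assumptions, not recommendations from any tool.

```python
import math


def max_targets(matching_pods: int, max_fraction: float = 0.25, hard_cap: int = 5) -> int:
    """Return how many pods an experiment may affect at once.

    Always leaves at least one pod untouched, so a selector that matches
    a single pod yields zero targets.
    """
    if matching_pods < 0:
        raise ValueError("matching_pods must not be negative")
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    by_fraction = math.floor(matching_pods * max_fraction)
    allowed = min(by_fraction, hard_cap, max(matching_pods - 1, 0))
    # With two or more pods, allow at least one target so small services can be tested.
    return max(allowed, 1) if matching_pods >= 2 else 0
```

An experiment runner would call this with the number of pods its selector matches and refuse to start when the result is zero.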
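Best Practices
- Start small: validate experiments in a test environment first
- Expand gradually: grow from a single service to the whole system
- Automate: run chaos experiments automatically on a regular schedule
- Improve continuously: use experiment results to harden the system
- Document: record every experiment and its results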