Chaos Engineering Fundamentals
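What Is Chaos Engineering
Chaos engineering is the practice of deliberately injecting faults into a system to verify that it stays resilient when things fail.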
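Core Principles
- Build a hypothesis around steady-state behavior
- Introduce real-world events
- Run experiments in production
- Automate experiments so they run continuously
- Minimize the blast radius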
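The principles above can be read as a loop: confirm steady state, inject a fault, check steady state again, and always undo the fault. The following is a minimal sketch of that loop, not a real tool's API. The function `run_experiment` and its three callables are hypothetical names introduced here for illustration.

```python
from typing import Callable


def run_experiment(
    steady_state_ok: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run one chaos experiment and report the outcome."""
    # Never inject a fault into a system that is already unhealthy.
    if not steady_state_ok():
        return "skipped"
    try:
        inject_fault()
        # The hypothesis holds if the system is still in steady state under fault.
        return "passed" if steady_state_ok() else "failed"
    finally:
        # Always undo the fault, even if the check above raised.
        rollback()


# Usage with stand-in callables (hypothetical system state held in a dict).
state = {"healthy": True, "fault_active": False}

outcome = run_experiment(
    steady_state_ok=lambda: state["healthy"],
    inject_fault=lambda: state.update(fault_active=True),
    rollback=lambda: state.update(fault_active=False),
)
print(outcome)
```

Putting the rollback in `finally` is the design choice that matters here: the fault is removed whether the hypothesis holds, fails, or the check itself crashes.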
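Chaos Engineering Tools
| Tool | Origin / Type | Characteristics |
|---|---|---|
| Chaos Monkey | Netflix | Randomly terminates instances |
| Litmus | CNCF | Kubernetes-native |
| Chaos Mesh | PingCAP | Feature-rich |
| Gremlin | Commercial | Enterprise-grade |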
Litmus Chaos
Installing Litmus
# Install Litmus
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-3.0.0.yaml
# Access the dashboard (confirm the frontend service name first with: kubectl get svc -n litmus)
kubectl port-forward svc/litmus-frontend -n litmus 9091:9091
Fault Experiments
# Pod fault injection
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=myapp
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
Chaos Mesh
Installing Chaos Mesh
# Install Chaos Mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --create-namespace
Network Faults
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: myapp
  delay:
    latency: '100ms'
    jitter: '10ms'
  duration: '5m'
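When the same delay experiment has to be repeated with different latencies or targets, it can help to generate the manifest rather than edit it by hand. The sketch below builds the NetworkChaos object shown above as a Python dictionary and prints it as JSON. The helper name `network_delay_manifest` is hypothetical, and the field values are the ones from the manifest above.

```python
import json


def network_delay_manifest(
    name: str,
    namespace: str,
    app_label: str,
    latency: str,
    jitter: str,
    duration: str,
) -> dict:
    """Build a NetworkChaos delay manifest with the same fields as the YAML above."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency, "jitter": jitter},
            "duration": duration,
        },
    }


manifest = network_delay_manifest(
    "network-delay", "default", "myapp", "100ms", "10ms", "5m"
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be piped into `kubectl apply -f -`.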
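Pod Faults
# Chaos Mesh 2.x replaced the inline `scheduler` field of 1.x with a separate
# Schedule resource, so the recurring pod-kill is wrapped in one here.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill
  namespace: default
spec:
  schedule: '@every 5m'
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        app: myapp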
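Practice: A Complete Set of Chaos Experiments
# 1. HTTP fault: replace responses to GET /api/* with HTTP 500
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: http-error
  namespace: default
spec:
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: myapp
  # replace.code only takes effect when the target is Response
  target: Response
  port: 8080
  path: /api/*
  method: GET
  replace:
    code: 500
  duration: '5m'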
---
# 2. I/O fault
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-delay
  namespace: default
spec:
  action: latency
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: myapp
  delay: '100ms'
  volumePath: /data
  duration: '5m'
---
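# 3. DNS fault (requires the Chaos Mesh DNS server component to be running)
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-error
  namespace: default
spec:
  action: error
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: myapp
  # Domains whose lookups should fail; the field is named `patterns`
  patterns:
    - github.com
  duration: '5m'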
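Steady-State Hypothesis
# Define the steady-state metrics
- Error rate < 1%
- P99 latency < 500ms
- Availability > 99.9%
# Verify the metrics: share of 5xx responses among all requests over the last 5 minutes
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'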
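To turn the three thresholds into an automated pass/fail check, the query result has to be parsed and compared against them. The following sketch assumes the standard Prometheus instant-query JSON shape (`data.result[0].value[1]` holds the sample as a string). The names `SteadyState` and `first_sample`, and the sample response body, are made up for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds taken from the steady-state list above."""
    max_error_rate: float = 0.01      # error rate < 1%
    max_p99_ms: float = 500.0         # P99 latency < 500 ms
    min_availability: float = 0.999   # availability > 99.9%

    def violations(self, error_rate: float, p99_ms: float, availability: float) -> list[str]:
        """Return the names of the metrics that break the hypothesis."""
        broken = []
        if error_rate >= self.max_error_rate:
            broken.append("error_rate")
        if p99_ms >= self.max_p99_ms:
            broken.append("p99_latency")
        if availability <= self.min_availability:
            broken.append("availability")
        return broken


def first_sample(body: str) -> float:
    """Extract the first sample value from a Prometheus instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    result = payload["data"]["result"]
    if not result:
        raise ValueError("query returned no series")
    # Each vector sample is [timestamp, "value-as-string"].
    return float(result[0]["value"][1])


# Made-up response body, for illustration only.
sample_body = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1700000000,"0.004"]}]}}'
)
error_rate = first_sample(sample_body)
broken = SteadyState().violations(error_rate, p99_ms=420.0, availability=0.9995)
print(broken or "steady state holds")
```

Returning the list of violated metrics rather than a single boolean makes the experiment report say which part of the hypothesis failed.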
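Safety Practices
- Limit the blast radius
- Test in a non-production environment first
- Have a rollback plan
- Monitor the impact of each experiment
- Record the experiment results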
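The first item on this list can be enforced with a simple pre-flight check before an experiment is applied. The sketch below is one possible guard, not part of Chaos Mesh or Litmus. The helpers `max_targets` and `within_blast_radius`, and the 25% default limit, are assumptions chosen for illustration.

```python
import math


def max_targets(total_pods: int, max_fraction: float) -> int:
    """Largest number of pods an experiment is allowed to touch."""
    if total_pods < 0 or not 0.0 <= max_fraction <= 1.0:
        raise ValueError("total_pods must be >= 0 and max_fraction within [0, 1]")
    # Round down so the limit is never exceeded.
    return math.floor(total_pods * max_fraction)


def within_blast_radius(total_pods: int, targeted: int, max_fraction: float = 0.25) -> bool:
    """Return True if the experiment stays inside the allowed blast radius."""
    return targeted <= max_targets(total_pods, max_fraction)


# With 10 replicas and a 25% limit, at most 2 pods may be affected.
print(max_targets(10, 0.25))
# `mode: one` targets a single pod, so it fits.
print(within_blast_radius(total_pods=10, targeted=1))
# `mode: all` targets all 10 pods, so it is rejected.
print(within_blast_radius(total_pods=10, targeted=10))
```

A check like this fits naturally in a CI step that inspects the manifest's `mode` and the current replica count before running `kubectl apply`.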
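Summary
Chaos engineering is an important way to improve system resilience. Injecting faults deliberately exposes a system's weak points so they can be fixed before they cause a real outage.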