Chaos Mesh: A Cloud-Native Chaos Engineering Platform
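What Is Chaos Mesh
Chaos Mesh is a cloud-native chaos engineering platform built for Kubernetes and hosted by the CNCF as an incubating project. It injects faults at several layers, including network, I/O, kernel, and resource stress, and ships with a web Dashboard for creating and observing experiments.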
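Architecture Components
Chaos Mesh architecture:
├── Controller Manager: schedules and manages experiments
├── Chaos Daemon: performs the fault injection on each node
├── Dashboard: web UI
└── DNS Server: supports DNS fault injection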
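Installation
Installing with Helm
# Add the Helm repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
# Install Chaos Mesh
# securityMode=false disables Dashboard token authentication; use it only in test clusters
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-testing \
  --create-namespace \
  --set dashboard.securityMode=false
# Verify the installation
kubectl get pods -n chaos-testing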
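Running pods do not by themselves show that the API server has registered the Chaos Mesh resource types. The helper below is a minimal sketch of such a check. The function name `check_chaos_kinds` is hypothetical, and the resource names passed to it are assumptions to verify against your installed version.

```shell
# Hypothetical helper: confirm that the given chaos-mesh.org resource types are registered.
check_chaos_kinds() {
  local registered kind missing=0
  # With -o name, resources are printed as "<plural>.<group>".
  registered="$(kubectl api-resources --api-group=chaos-mesh.org -o name)" || return 1
  for kind in "$@"; do
    if ! printf '%s\n' "$registered" | grep -qx "${kind}.chaos-mesh.org"; then
      echo "missing: $kind" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (resource names are assumptions; adjust to your version):
#   check_chaos_kinds networkchaos podchaos iochaos
```

A non-zero exit status means at least one of the requested types is not registered, which usually points to an incomplete CRD installation.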
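Advanced Configuration
# Install with a custom values file
cat > custom-values.yaml << 'EOF'
# Dashboard settings live under "dashboard", the same key used by --set dashboard.securityMode
dashboard:
  securityMode: true
  persistentVolume:
    enabled: true
    size: 10Gi
controllerManager:
  replicaCount: 3
chaosDaemon:
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
  tolerations:
    # Newer clusters taint these nodes with node-role.kubernetes.io/control-plane instead
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"
dnsServer:
  # The chart toggles optional components with "create"; confirm with `helm show values`
  create: true
  # Verify this image: the chart's default DNS server image appears to be a chaos-coredns build.
  # Pin a released tag rather than "latest" outside of test clusters.
  image: ghcr.io/chaos-mesh/chaos-mesh:latest
EOF
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-testing \
  --create-namespace \
  -f custom-values.yaml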
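Helm silently ignores values keys that the chart does not define, so a misspelled top-level key leaves a setting at its default without any error. The sketch below compares the top-level keys of a values file with the chart defaults printed by `helm show values`. The function name `unknown_top_level_keys` is hypothetical, and the check covers only unindented keys, not nested ones.

```shell
# Hypothetical helper: print top-level keys in a values file that are
# absent from the chart's default values.
unknown_top_level_keys() {
  local values_file="$1" chart="${2:-chaos-mesh/chaos-mesh}"
  local defaults key
  # `helm show values` prints the chart's default values.yaml.
  defaults="$(helm show values "$chart")" || return 1
  # Top-level keys are unindented lines of the form "name:".
  grep -oE '^[A-Za-z0-9_-]+:' "$values_file" | tr -d ':' | while read -r key; do
    if ! printf '%s\n' "$defaults" | grep -qE "^${key}:"; then
      echo "$key"
    fi
  done
}

# Usage: unknown_top_level_keys custom-values.yaml
```

Empty output means every top-level key in the file also exists in the chart defaults. Nested keys still need a manual comparison.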
Fault Types
Network Faults
# network-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-server
  delay:
    latency: "100ms"
    correlation: "100"
    jitter: "10ms"
  direction: to
  duration: "600s"
---
# network-loss.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss
spec:
  action: loss
  mode: one
  selector:
    labelSelectors:
      app: web-server
  loss:
    loss: "30"
    correlation: "50"
  duration: "300s"
---
# network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
spec:
  action: partition
  mode: all
  selector:
    labelSelectors:
      app: web-server
  direction: both
  duration: "180s"
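The manifests above take effect once they are applied to the cluster. The following is a minimal sketch of the apply, inspect, and stop cycle. The function names `start_experiment` and `stop_experiment` are hypothetical, and the `default` namespace is an assumption that matches manifests without an explicit `metadata.namespace`.

```shell
# Hypothetical helpers for running one of the manifests above.
start_experiment() {
  local manifest="$1" kind="$2" name="$3" ns="${4:-default}"
  kubectl apply -n "$ns" -f "$manifest" || return 1
  # The Status and Events sections report whether the injection succeeded.
  kubectl describe "$kind" "$name" -n "$ns"
}

# Deleting the object ends the experiment before its duration elapses.
stop_experiment() {
  local kind="$1" name="$2" ns="${3:-default}"
  kubectl delete "$kind" "$name" -n "$ns"
}

# Usage:
#   start_experiment network-latency.yaml networkchaos network-delay
#   stop_experiment networkchaos network-delay
```

Keeping the stop command at hand before starting an experiment doubles as a simple rollback plan.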
Pod Faults
# pod-kill.yaml
# Chaos Mesh 2.x removed the in-spec `scheduler` field; recurring runs use a Schedule object.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill
spec:
  schedule: "@every 5m"
  concurrencyPolicy: Forbid
  historyLimit: 5
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: web-server
---
# pod-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure
spec:
  action: pod-failure
  mode: all
  selector:
    labelSelectors:
      app: web-server
  duration: "300s"
---
# pod-memory-hog.yaml (memory pressure is a StressChaos, not a PodChaos)
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: web-server
  stressors:
    memory:
      workers: 1
      size: "256MB"
  duration: "600s"
I/O Faults
# io-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-delay
spec:
  action: latency
  mode: all
  selector:
    labelSelectors:
      app: database
  delay: "100ms"
  volumePath: /var/lib/mysql
  duration: "300s"
---
# io-mistake.yaml
# Caution: the mistake action corrupts data on reads and writes; run it only against disposable data.
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-error
spec:
  action: mistake
  mode: one
  selector:
    labelSelectors:
      app: database
  volumePath: /var/lib/mysql
  mistake:
    filling: zero
    maxOccurrences: 10
  duration: "60s"
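Time Faults
# time-shift.yaml
# Per the Chaos Mesh documentation, TimeChaos affects the container's main process (PID 1)
# and its children; a process started later with `kubectl exec` does not see the shift.
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: time-shift
spec:
  mode: all
  selector:
    labelSelectors:
      app: web-server
  timeOffset: "-3600s"
  clockIds: ["CLOCK_REALTIME"]
  duration: "600s"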
Advanced Features
Scheduled Experiments
# scheduled-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: daily-chaos
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  historyLimit: 5
  # The Schedule API expresses this deadline in seconds (600s = 10 minutes)
  startingDeadlineSeconds: 600
  type: NetworkChaos
  networkChaos:
    action: delay
    mode: all
    selector:
      labelSelectors:
        app: web-server
    delay:
      latency: "50ms"
    duration: "1800s"
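A recurring experiment sometimes has to be held back, for example during an incident or a release freeze. As a sketch, a Schedule can be paused and resumed through the `experiment.chaos-mesh.org/pause` annotation described in the Chaos Mesh documentation. The function names and the `default` namespace below are assumptions.

```shell
# Hypothetical helpers: pause or resume a Schedule through its pause annotation.
pause_schedule() {
  local name="$1" ns="${2:-default}"
  kubectl annotate schedule "$name" -n "$ns" \
    experiment.chaos-mesh.org/pause=true --overwrite
}

resume_schedule() {
  local name="$1" ns="${2:-default}"
  # A trailing "-" after the key removes the annotation.
  kubectl annotate schedule "$name" -n "$ns" experiment.chaos-mesh.org/pause-
}

# Usage:
#   pause_schedule daily-chaos
#   resume_schedule daily-chaos
```

Pausing keeps the Schedule object and its history in place, which is usually preferable to deleting and recreating it.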
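Workflows
# workflow.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: resilience-workflow
spec:
  # A Serial entry runs its children in order. With network-chaos itself as the
  # entry, the pod-chaos template would never be executed.
  entry: resilience-steps
  templates:
    - name: resilience-steps
      templateType: Serial
      children:
        - network-chaos
        - pod-chaos
    - name: network-chaos
      templateType: NetworkChaos
      # Inside a Workflow, the template-level deadline bounds how long the step runs
      deadline: 300s
      networkChaos:
        action: delay
        mode: all
        selector:
          labelSelectors:
            app: web-server
        delay:
          latency: "100ms"
    - name: pod-chaos
      templateType: PodChaos
      # Example value; it lets the workflow node finish after the kill
      deadline: 30s
      podChaos:
        action: pod-kill
        mode: one
        selector:
          labelSelectors:
            app: web-server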
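Each step of a running workflow is tracked as its own node object, which helps when working out which step is active or stuck. A minimal sketch for listing them follows. It assumes the `chaos-mesh.org/workflow` label that the documentation uses for this lookup, and the function name `workflow_nodes` is hypothetical.

```shell
# Hypothetical helper: list the node objects that belong to one workflow.
workflow_nodes() {
  local workflow="$1" ns="${2:-default}"
  kubectl get workflownode -n "$ns" \
    --selector="chaos-mesh.org/workflow=${workflow}"
}

# Usage: workflow_nodes resilience-workflow
```

Running `kubectl describe` on a single node from that list shows its conditions and events.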
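Using the Dashboard
# Access the Dashboard at http://localhost:2333 (port-forward blocks, so run it in a separate terminal)
kubectl port-forward -n chaos-testing svc/chaos-dashboard 2333:2333
# Create an experiment through the Dashboard HTTP API.
# This API is not a stable public interface and its paths differ between releases;
# /api/experiments appears to be the 2.x path, so confirm it for your version.
# With security mode enabled, also send -H "Authorization: Bearer <token>".
curl -X POST http://localhost:2333/api/experiments \
  -H "Content-Type: application/json" \
  -d '{
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "NetworkChaos",
    "metadata": {
      "name": "api-delay",
      "namespace": "default"
    },
    "spec": {
      "action": "delay",
      "mode": "all",
      "selector": {
        "labelSelectors": {
          "app": "api"
        }
      },
      "delay": {
        "latency": "50ms"
      },
      "duration": "120s"
    }
  }'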
Monitoring Integration
# Scrape Chaos Mesh metrics with Prometheus (requires the Prometheus Operator CRDs)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-mesh-monitor
  namespace: chaos-testing
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: controller-manager
  endpoints:
    - port: http
      path: /metrics
# Import the Grafana dashboard
# Chaos Mesh Dashboard ID: 13605
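Best Practices
Before the Experiment
- Confirm that monitoring and alerting are working
- Notify the people who may be affected
- Prepare a rollback plan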
Blast Radius Control
- Use precise label selectors
- Set a bounded, reasonable duration
- Limit the number of concurrent experiments
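One way to check a selector before it is used in an experiment is to ask Kubernetes which pods it matches. The sketch below does that with a plain label query. The function name `preview_targets` is hypothetical, and it assumes the experiment selects only by label within a single namespace.

```shell
# Hypothetical helper: list the pods a label selector matches before an experiment runs.
preview_targets() {
  local selector="$1" ns="${2:-default}"
  local pods
  pods="$(kubectl get pods -n "$ns" -l "$selector" -o name)" || return 1
  if [ -z "$pods" ]; then
    echo "no pods match '$selector' in namespace '$ns'" >&2
    return 1
  fi
  printf '%s\n' "$pods"
  # Print the count on stderr so stdout stays a clean list of pod names.
  echo "matched: $(printf '%s\n' "$pods" | wc -l | tr -d ' ')" >&2
}

# Usage: preview_targets app=web-server default
```

With `mode: all`, every pod in this list is affected, so an unexpectedly long list is a signal to tighten the selector or switch to `mode: one`.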
Security Considerations
- Enable Dashboard security mode
- Restrict who can run experiments
- Record every experiment operation
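Restricting who can run experiments comes down to ordinary Kubernetes RBAC on the `chaos-mesh.org` API group. The sketch below prints a namespace-scoped Role at one of two levels. The function name `chaos_role`, the role names, and the `viewer` and `manager` levels are illustrative assumptions, not something Chaos Mesh defines.

```shell
# Hypothetical helper: print a namespace-scoped Role for Chaos Mesh objects.
chaos_role() {
  local ns="$1" level="${2:-viewer}" verbs
  case "$level" in
    viewer)  verbs='"get", "list", "watch"' ;;
    manager) verbs='"get", "list", "watch", "create", "update", "patch", "delete"' ;;
    *) echo "unknown level: $level" >&2; return 1 ;;
  esac
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-${level}
  namespace: ${ns}
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: [${verbs}]
EOF
}

# Usage: chaos_role default viewer | kubectl apply -f -
```

The Role still has to be bound to a user or service account, for example with `kubectl create rolebinding`. The Dashboard may also need read access to pods for a token to be usable there.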
Experiment Design
- Start with simple faults
- Increase complexity gradually
- Verify the recovery mechanisms
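Verifying recovery can be as simple as checking that the targeted pods return to Ready once the experiment has ended. A minimal sketch follows. The function name `verify_recovery` and the 120-second default timeout are assumptions, and pod readiness is only a proxy for application health.

```shell
# Hypothetical helper: after an experiment ends, check that the target pods become Ready again.
verify_recovery() {
  local selector="$1" ns="${2:-default}" timeout="${3:-120s}"
  if kubectl wait --for=condition=Ready pod -l "$selector" -n "$ns" --timeout="$timeout"; then
    echo "recovered: $selector"
  else
    echo "NOT recovered within $timeout: $selector" >&2
    return 1
  fi
}

# Usage: verify_recovery app=web-server default 120s
```

An application-level check, such as a request against a health endpoint, gives a stronger signal where one is available.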