Chaos Mesh: A Cloud-Native Chaos Engineering Platform

What Is Chaos Mesh

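Chaos Mesh is a cloud-native chaos engineering platform built for Kubernetes and hosted as a CNCF incubating project. It injects faults at several layers, including network, I/O, kernel, and resource stress, and ships a web Dashboard for managing experiments visually.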

Architecture Components

Chaos Mesh architecture:
  ├── Controller Manager: experiment controller
  ├── Chaos Daemon: performs the fault injection
  ├── Dashboard: web UI
  └── DNS Server: supports DNS fault injection

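Installation and Deployment

Installing with Helm

# Add the Helm repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

# Install Chaos Mesh
# (securityMode=false turns off Dashboard token authentication; use it only in test clusters)
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-testing \
  --create-namespace \
  --set dashboard.securityMode=false

# Verify the installation
kubectl get pods -n chaos-testing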
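
The pod listing above is a one-time snapshot. For scripted installs it can help to block until the pods are ready. The following is a minimal sketch, assuming the `chaos-testing` namespace from the install command and a Helm release named `chaos-mesh` (so that pods carry the `app.kubernetes.io/instance=chaos-mesh` label). The function name `wait_for_chaos_mesh` and the 180-second timeout are illustrative choices, not part of Chaos Mesh.

```shell
# Block until every Chaos Mesh pod reports Ready, then list the installed CRDs.
# Assumption: the Helm release is named "chaos-mesh", so pods carry the label below.
wait_for_chaos_mesh() {
  ns="${1:-chaos-testing}"
  kubectl wait --for=condition=Ready pods \
    -l app.kubernetes.io/instance=chaos-mesh \
    -n "$ns" --timeout=180s || return 1
  # The experiment kinds (NetworkChaos, PodChaos, ...) are registered as CRDs.
  kubectl get crd -o name | grep 'chaos-mesh.org'
}
```

Call it as `wait_for_chaos_mesh chaos-testing`; a non-zero exit status means the pods did not become ready in time or no Chaos Mesh CRDs were found.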

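Advanced Configuration

# Install with a custom values file
cat > custom-values.yaml << 'EOF'
# The chart's key for the Dashboard is "dashboard" (as in --set dashboard.securityMode)
dashboard:
  securityMode: true
  persistentVolume:
    enabled: true
    size: 10Gi

controllerManager:
  replicaCount: 3

chaosDaemon:
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
  tolerations:
    # Newer clusters taint control-plane nodes with node-role.kubernetes.io/control-plane instead
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"

dnsServer:
  # "create" deploys the DNS server with the chart's default image
  create: true
EOF

# upgrade --install also works when the release from the previous step already exists
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-testing \
  --create-namespace \
  -f custom-values.yaml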

Fault Types

Network Faults

# network-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web-server
  delay:
    latency: "100ms"
    correlation: "100"
    jitter: "10ms"
  direction: to
  duration: "600s"

---
# network-loss.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss
spec:
  action: loss
  mode: one
  selector:
    labelSelectors:
      app: web-server
  loss:
    loss: "30"
    correlation: "50"
  duration: "300s"

---
# network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
spec:
  action: partition
  mode: all
  selector:
    labelSelectors:
      app: web-server
  direction: both
  duration: "180s"
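
The manifests above only describe the experiments; they take effect once applied to the cluster. The following is a minimal sketch of the apply, inspect, and stop cycle using standard `kubectl` commands. The function names `run_experiment` and `stop_experiment` are hypothetical helpers, and the file name is assumed to match the comment at the top of each manifest.

```shell
# Apply a chaos manifest and show its status.
# Arguments: manifest file, resource kind, resource name (all caller-supplied).
run_experiment() {
  file="$1"; kind="$2"; name="$3"
  kubectl apply -f "$file" || return 1
  # "describe" prints the Status and Events sections, where injection results appear.
  kubectl describe "$kind" "$name"
}

# Deleting the resource ends the experiment before its duration expires.
stop_experiment() {
  kubectl delete "$1" "$2"
}
```

For example, `run_experiment network-latency.yaml networkchaos network-delay` starts the delay experiment, and `stop_experiment networkchaos network-delay` removes it early.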

Pod Faults

# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: web-server
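  # One-shot action. The 1.x `scheduler` field was removed in Chaos Mesh 2.0;
  # to repeat the kill (for example "@every 5m"), wrap this spec in a Schedule resource.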

---
# pod-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure
spec:
  action: pod-failure
  mode: all
  selector:
    labelSelectors:
      app: web-server
  duration: "300s"

---
# pod-memory-hog.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: web-server
  stressors:
    memory:
      workers: 1
      size: "256MB"
  duration: "600s"
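
StressChaos can also load the CPU instead of memory. The following is a minimal sketch that writes such a manifest in the same heredoc style used for `custom-values.yaml`. The file name `cpu-stress.yaml`, the resource name, and the worker and load values are illustrative assumptions.

```shell
# Write a CPU-stress manifest as a companion to the memory example.
cat > cpu-stress.yaml << 'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: web-server
  stressors:
    cpu:
      workers: 2   # number of stress workers
      load: 50     # target load per worker, in percent
  duration: "300s"
EOF
```

Apply it with `kubectl apply -f cpu-stress.yaml` once the target pods are running.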

I/O Faults

# io-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-delay
spec:
  action: latency
  mode: all
  selector:
    labelSelectors:
      app: database
  delay: "100ms"
  volumePath: /var/lib/mysql
  duration: "300s"

---
# io-mistake.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-error
spec:
  action: mistake
  mode: one
  selector:
    labelSelectors:
      app: database
  volumePath: /var/lib/mysql
  mistake:
    filling: zero
    maxOccurrences: 10
  duration: "60s"
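
Besides `latency` and `mistake`, IOChaos has a `fault` action that makes file operations return an error code. The following is a minimal sketch under the same selector and volume as the examples above. The file name `io-fault.yaml`, the glob in `path`, and the 50 percent rate are illustrative assumptions.

```shell
# Write an IOChaos manifest that makes file operations fail with an error code.
cat > io-fault.yaml << 'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-fault
spec:
  action: fault
  mode: one
  selector:
    labelSelectors:
      app: database
  volumePath: /var/lib/mysql
  path: "/var/lib/mysql/**/*"   # glob of files to affect
  errno: 5                      # EIO (input/output error) on Linux
  percent: 50                   # inject into half of the matching operations
  duration: "60s"
EOF
```

A short duration is a sensible starting point here, since injected I/O errors can make a database crash or refuse writes.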

Time Faults

# time-shift.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: time-shift
spec:
  mode: all
  selector:
    labelSelectors:
      app: web-server
  timeOffset: "-3600s"
  clockIds: ["CLOCK_REALTIME"]
  duration: "600s"

Advanced Features

Chaos Scheduler

# scheduled-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: daily-chaos
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 600
  type: NetworkChaos
  networkChaos:
    action: delay
    mode: all
    selector:
      labelSelectors:
        app: web-server
    delay:
      latency: "50ms"
    duration: "1800s"
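
A scheduled experiment sometimes has to be suspended, for example during an incident. Chaos Mesh supports this through the `experiment.chaos-mesh.org/pause` annotation. The following is a minimal sketch; the function names `pause_chaos` and `resume_chaos` are hypothetical helpers around `kubectl annotate`.

```shell
# Pause a chaos resource (experiment or Schedule) by setting the pause annotation.
# Arguments: resource kind, resource name.
pause_chaos() {
  kubectl annotate "$1" "$2" experiment.chaos-mesh.org/pause=true --overwrite
}

# Resume it again; the trailing "-" tells kubectl to remove the annotation.
resume_chaos() {
  kubectl annotate "$1" "$2" experiment.chaos-mesh.org/pause-
}
```

For the Schedule above, `pause_chaos schedule daily-chaos` should stop new runs from being created until `resume_chaos schedule daily-chaos` is called.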

Workflows

# workflow.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: resilience-workflow
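spec:
  entry: entry
  templates:
    # Serial entry template: runs its children one after another.
    # Without it, pod-chaos would never be reached.
    - name: entry
      templateType: Serial
      children:
        - network-chaos
        - pod-chaos

    - name: network-chaos
      templateType: NetworkChaos
      # Inside a Workflow, the template-level deadline bounds how long the fault lasts
      deadline: 300s
      networkChaos:
        action: delay
        mode: all
        selector:
          labelSelectors:
            app: web-server
        delay:
          latency: "100ms"

    - name: pod-chaos
      templateType: PodChaos
      podChaos:
        action: pod-kill
        mode: one
        selector:
          labelSelectors:
            app: web-server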
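
While a Workflow runs, Chaos Mesh tracks each step as a WorkflowNode resource. The following is a minimal sketch for listing them, assuming the nodes carry a `chaos-mesh.org/workflow` label with the Workflow name. The function name `workflow_nodes` is a hypothetical helper.

```shell
# List the WorkflowNode resources that belong to one Workflow.
# Argument: Workflow name.
workflow_nodes() {
  kubectl get workflownode --selector="chaos-mesh.org/workflow=$1"
}
```

For the example above, `workflow_nodes resilience-workflow` should show one node per template that has been started so far.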

Using the Dashboard

# Access the Dashboard
kubectl port-forward -n chaos-testing svc/chaos-dashboard 2333:2333

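# Create an experiment through the Dashboard API
# (with security mode enabled, the request also needs an "Authorization: Bearer <token>" header)
curl -X POST http://localhost:2333/api/experiments \
  -H "Content-Type: application/json" \
  -d '{
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "NetworkChaos",
    "metadata": {
      "name": "api-delay",
      "namespace": "default"
    },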
    "spec": {
      "action": "delay",
      "mode": "all",
      "selector": {
        "labelSelectors": {
          "app": "api"
        }
      },
      "delay": {
        "latency": "50ms"
      },
      "duration": "120s"
    }
  }'
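
When `securityMode` is enabled, the Dashboard asks for a token tied to a Kubernetes account with permissions on the chaos resources. The following is a minimal sketch of a namespace-scoped account. The name `chaos-operator`, the `default` namespace, and the file name `chaos-rbac.yaml` are illustrative assumptions, and the verb list should be narrowed for read-only users.

```shell
# Write RBAC objects for a namespace-scoped Chaos Mesh operator account.
cat > chaos-rbac.yaml << 'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-operator
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-operator
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-operator
  namespace: default
subjects:
  - kind: ServiceAccount
    name: chaos-operator
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-operator
EOF
```

After `kubectl apply -f chaos-rbac.yaml`, a token can be issued with `kubectl create token chaos-operator -n default` (available in Kubernetes 1.24 and later) and entered in the Dashboard login form.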

Monitoring Integration

# Monitor Chaos Mesh with Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-mesh-monitor
  namespace: chaos-testing
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: controller-manager
  endpoints:
    - port: http
      path: /metrics

# Import the Grafana dashboard
# Chaos Mesh Dashboard ID: 13605
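
`ServiceMonitor` is a custom resource provided by the Prometheus Operator, so the manifest above can only be applied on clusters where that operator is installed. The following is a minimal sketch of a pre-check; the function name `has_servicemonitor_crd` is a hypothetical helper.

```shell
# Succeed only if the ServiceMonitor CRD from the Prometheus Operator is installed.
has_servicemonitor_crd() {
  kubectl get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1
}
```

It can guard the apply step, for example `has_servicemonitor_crd && kubectl apply -f chaos-mesh-monitor.yaml`, where the file name is an assumption for wherever the manifest is saved.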

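Best Practices

  1. Prepare before the experiment
    • Confirm that monitoring is working
    • Notify the people affected
    • Have a rollback plan ready
  2. Control the blast radius
    • Use precise label selectors
    • Set a reasonable duration
    • Limit the number of concurrent experiments
  3. Security considerations
    • Enable Dashboard security mode
    • Restrict who may run experiments
    • Record every experiment operation
  4. Experiment design
    • Start with simple faults
    • Increase complexity step by step
    • Verify the recovery mechanisms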
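
The blast-radius points above map directly onto fields of an experiment spec. The following is a minimal sketch, assuming the `web-server` label from the earlier examples. The file name `limited-delay.yaml`, the 25 percent share, and the 60-second duration are illustrative values.

```shell
# Write a NetworkChaos manifest with a deliberately small blast radius.
cat > limited-delay.yaml << 'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: limited-delay
spec:
  action: delay
  mode: fixed-percent   # affect only a share of the matching pods
  value: "25"           # 25 percent of them
  selector:
    namespaces:
      - default         # restrict the selector to one namespace
    labelSelectors:
      app: web-server
  delay:
    latency: "50ms"
  duration: "60s"       # short, bounded run
EOF
```

Starting with a small share and a short duration, then widening both once recovery has been verified, follows the "start simple" advice in the list.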