
Chaos Mesh in Practice: Pod Failure and Network Latency Injection



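Chaos Mesh in Practice: Overview

Chaos Mesh is a CNCF incubating chaos engineering platform that injects many kinds of faults into Kubernetes environments. This chapter focuses on hands-on scenarios and common failure modes.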

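Chaos Mesh fault types:
├── Pod faults
│   ├── Pod Kill (kill the Pod)
│   ├── Pod Failure (make the Pod unavailable)
│   └── Pod Stress (CPU/memory pressure, implemented by StressChaos)
├── Network faults
│   ├── Network Delay (latency)
│   ├── Network Loss (packet loss)
│   ├── Network Duplicate (duplicated packets)
│   └── Network Partition (network partition)
├── IO faults
│   ├── IO Latency (delayed IO)
│   ├── IO Fault (IO calls return an error)
│   └── IO Read/Write (read/write failures)
├── Time faults
│   └── Time Skew (clock offset)
└── Kernel faults
    ├── Kernel Panic
    └── Clock Skew (in practice a time fault, injected the same way as Time Skew above)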
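
The tree lists time faults, but the sections that follow do not show a manifest for one. As a minimal sketch, assuming a workload labeled `app: order-service` (the name and label here are placeholders, not taken from a real cluster), a TimeChaos resource that shifts the clock seen by one Pod could look like this:

```yaml
# Sketch: move the clock seen by one Pod back by 10 minutes, for 5 minutes.
# metadata.name and the app label are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: time-skew-example
spec:
  mode: one
  selector:
    labelSelectors:
      app: order-service
  timeOffset: "-10m"      # negative values move the clock backwards
  clockIds:
    - CLOCK_REALTIME      # only the wall clock is shifted
  duration: "5m"
```

Only processes inside the selected Pod see the shifted time; the node clock is left alone.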

Pod Fault Injection in Practice

Simulating a Pod Crash

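# Kill one random payment-service Pod every 10 minutes.
# Chaos Mesh 2.x runs recurring experiments through the Schedule resource;
# the spec.scheduler field on PodChaos only existed in 1.x.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-example
  namespace: production
spec:
  schedule: "@every 10m"  # run every 10 minutes
  type: PodChaos
  concurrencyPolicy: Forbid  # skip a run while the previous one is still active
  historyLimit: 5
  podChaos:
    action: pod-kill
    mode: one  # one: a single Pod, all: every matching Pod, fixed: a fixed count (set in value)
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service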
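
Besides `one`, `all`, and `fixed`, the `mode` field also accepts `fixed-percent` and `random-max-percent`, which scale the number of affected Pods with the size of the deployment. The following is a sketch only; the resource name and the 25% figure are illustrative assumptions:

```yaml
# Sketch: kill 25% of the matching Pods in a single one-off experiment.
# metadata.name and the percentage are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-percent-example
  namespace: production
spec:
  action: pod-kill
  mode: fixed-percent
  value: "25"             # percentage of matching Pods to kill
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
```

A percentage keeps the blast radius proportional when the replica count changes, whereas `fixed` always hits the same absolute number of Pods.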

Simulating an Unavailable Pod

# Simulate a Pod failure (unavailable for 5 minutes)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
spec:
  action: pod-failure
  mode: fixed
  value: "2"  # number of Pods to fail
  duration: "5m"
  selector:
    labelSelectors:
      app: order-service

CPU Stress Test

# Simulate high CPU load
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: user-service
  stressors:
    cpu:
      workers: 4
      load: 80  # each worker targets 80% load; total load is roughly workers x load
  duration: "10m"

Memory Stress Test

# Simulate memory pressure
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: cache-service
  stressors:
    memory:
      workers: 2
      size: "512MB"
  duration: "10m"

Network Fault Injection in Practice

Network Latency

# Simulate network latency (100ms with ±10ms jitter)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: api-gateway
  delay:
    latency: "100ms"
    jitter: "10ms"
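    correlation: "50"  # 50% correlation with the previous packet's delay
  direction: to  # to: outbound (selector to target), from: inbound (target to selector), both: both directions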
  target:
    selector:
      labelSelectors:
        app: payment-service
    mode: all
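
The manifest above sets no `duration`, so the delay stays in place until the resource is deleted or paused. As a hedged variation, the sketch below bounds the experiment to five minutes and delays traffic to a host outside the cluster through `externalTargets`; the resource name and the hostname are placeholders:

```yaml
# Sketch: delay traffic from api-gateway Pods to an external host for 5 minutes.
# metadata.name and the hostname are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-external
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: api-gateway
  delay:
    latency: "100ms"
    jitter: "10ms"
  direction: to            # externalTargets only works with direction: to
  externalTargets:
    - "www.example.com"
  duration: "5m"           # the fault is removed automatically after 5 minutes
```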

Packet Loss

# Simulate 10% packet loss
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss
spec:
  action: loss
  mode: one
  selector:
    labelSelectors:
      app: notification-service
  loss:
    loss: "10"
    correlation: "50"
  duration: "5m"
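
Packet duplication follows the same shape as packet loss: the action name changes and the parameters move under a `duplicate` key. This is a sketch with illustrative values; the resource name and the percentages are assumptions:

```yaml
# Sketch: duplicate about 40% of packets on one notification-service Pod.
# metadata.name and the percentages are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-duplicate
spec:
  action: duplicate
  mode: one
  selector:
    labelSelectors:
      app: notification-service
  duplicate:
    duplicate: "40"        # percentage of packets to duplicate
    correlation: "25"      # correlation with the previous packet's outcome
  duration: "5m"
```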

Network Partition

# Simulate a network partition (the services cannot reach each other)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
spec:
  action: partition
  mode: one
  selector:
    labelSelectors:
      app: user-service
  direction: both
  target:
    selector:
      labelSelectors:
        app: order-service
    mode: all
  duration: "3m"

IO Fault Injection

# Simulate IO latency
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-delay
spec:
  action: latency
  mode: one
  selector:
    labelSelectors:
      app: database
  volumePath: /data
  delay: "200ms"
  duration: "5m"
  path: "/data/**/*"  # glob matching the affected files (not a regular expression)
  percent: 50         # 50% of IO operations are affected (integer, not a string)
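
The IO Fault type from the overview returns an error code instead of delaying the call. A minimal sketch, assuming the same `/data` volume as above and using errno 5 (EIO, a generic input/output error); the resource name is a placeholder:

```yaml
# Sketch: make half of the IO calls under /data fail with EIO.
# metadata.name is hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-fault
spec:
  action: fault
  mode: one
  selector:
    labelSelectors:
      app: database
  volumePath: /data
  path: "/data/**/*"
  errno: 5                # EIO: input/output error
  percent: 50
  duration: "5m"
```

Start with a low `percent` on stateful workloads, since a database that receives EIO on its data files may refuse to continue until it is restarted.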

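Chaos Experiment Template

# A complete chaos experiment template: three faults run one after another.
# Selectors without a namespaces field match Pods in the Workflow's own namespace (chaos-testing here).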
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: api-resilience-test
  namespace: chaos-testing
spec:
  entry: api-test
  templates:
    - name: api-test
      templateType: Serial
      children:
        - network-delay
        - pod-kill
        - cpu-stress
    
    - name: network-delay
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        mode: all
        selector:
          labelSelectors:
            app: api
        delay:
          latency: 100ms
    
    - name: pod-kill
      templateType: PodChaos
      deadline: 3m
      podChaos:
        action: pod-kill
        mode: one
        selector:
          labelSelectors:
            app: api
    
    - name: cpu-stress
      templateType: StressChaos
      deadline: 10m
      stressChaos:
        mode: one
        selector:
          labelSelectors:
            app: api
        stressors:
          cpu:
            workers: 2
            load: 50
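
The Serial template above runs one fault at a time. Workflows also support `Parallel` and `Suspend` templates, so a variation can apply the network delay and the CPU stress together, wait for the system to settle, and only then kill a Pod. The sketch below assumes the same `app: api` workload; the workflow name, template names, and deadlines are illustrative:

```yaml
# Sketch: combined faults, a cool-down pause, then a Pod kill.
# All names and deadlines are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: api-combined-test
  namespace: chaos-testing
spec:
  entry: entry
  templates:
    - name: entry
      templateType: Serial
      deadline: 20m
      children:
        - combined-faults
        - cool-down
        - pod-kill

    - name: combined-faults
      templateType: Parallel     # children start at the same time
      deadline: 5m
      children:
        - network-delay
        - cpu-stress

    - name: cool-down
      templateType: Suspend      # wait without injecting anything
      deadline: 2m

    - name: network-delay
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        mode: all
        selector:
          labelSelectors:
            app: api
        delay:
          latency: 100ms

    - name: cpu-stress
      templateType: StressChaos
      deadline: 5m
      stressChaos:
        mode: one
        selector:
          labelSelectors:
            app: api
        stressors:
          cpu:
            workers: 2
            load: 50

    - name: pod-kill
      templateType: PodChaos
      deadline: 3m
      podChaos:
        action: pod-kill
        mode: one
        selector:
          labelSelectors:
            app: api
```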

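Best Practices

  1. Least privilege: run chaos experiments only in test environments; production runs require strict approval
  2. Blast radius control: limit the number of affected Pods and the duration
  3. Automatic rollback: monitor system health and stop the experiment automatically when anomalies appear
  4. Experiment records: document the hypothesis, procedure, and conclusion of every experiment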
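
One way to back the least-privilege item with configuration is a namespace-scoped RBAC role, so that the account that submits experiments can only touch chaos resources in a single namespace. This is a sketch under assumptions: the `chaos-testing` namespace is borrowed from the workflow example, and every account, role, and binding name is a placeholder.

```yaml
# Sketch: an account that can manage chaos experiments only inside chaos-testing.
# All names are hypothetical.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-operator
  namespace: chaos-testing
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-experiment-manager
  namespace: chaos-testing
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["chaos-mesh.org"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-experiment-manager-binding
  namespace: chaos-testing
subjects:
  - kind: ServiceAccount
    name: chaos-operator
    namespace: chaos-testing
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-experiment-manager
```

Because the binding is a RoleBinding rather than a ClusterRoleBinding, the same account has no permission to create experiments in other namespaces such as production.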