
Chaos Engineering Basics

What Is Chaos Engineering

Chaos engineering is the practice of deliberately injecting faults into a system to verify its resilience.

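Core Principles

  1. Build a hypothesis around steady-state behavior
  2. Introduce real-world events
  3. Run experiments in production
  4. Automate experiments to run continuously
  5. Minimize the blast radius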
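
The principles above suggest a simple loop: confirm steady state, inject a fault, check the hypothesis again, and always clean up. A minimal shell sketch of that loop follows. The helper `run_experiment` and its three command arguments are hypothetical names introduced for illustration and do not belong to any tool in this article.

```shell
# Hypothetical helper: run one chaos experiment around a steady-state check.
# $1 = command that exits 0 while the system is in steady state
# $2 = command that injects the fault
# $3 = command that removes the fault (rollback)
run_experiment() {
  check_cmd=$1
  inject_cmd=$2
  rollback_cmd=$3

  # Never start from an unhealthy baseline.
  if ! sh -c "$check_cmd"; then
    echo "steady state not met before injection; aborting"
    return 2
  fi

  sh -c "$inject_cmd"

  # Record whether the hypothesis held while the fault was active.
  if sh -c "$check_cmd"; then
    verdict=0
    echo "hypothesis held under fault"
  else
    verdict=1
    echo "hypothesis violated under fault"
  fi

  # Roll back regardless of the verdict.
  sh -c "$rollback_cmd"
  return "$verdict"
}
```

The return code separates three outcomes: 0 when the hypothesis held, 1 when it was violated, and 2 when the experiment never started because the baseline was already unhealthy.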

Chaos Engineering Tools

Tool          Origin / type  Notes
Chaos Monkey  Netflix        Randomly terminates instances
Litmus        CNCF           Kubernetes-native
Chaos Mesh    PingCAP        Feature-rich
Gremlin       Commercial     Enterprise-grade

Litmus Chaos

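Installing Litmus

# Install Litmus
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-3.0.0.yaml

# Access the dashboard (the frontend service name can vary by release;
# list the candidates with: kubectl get svc -n litmus)
kubectl port-forward svc/litmus-frontend -n litmus 9091:9091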

Fault Experiments

# Pod fault injection
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=myapp
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
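
Once the ChaosEngine is applied, Litmus records the outcome in a ChaosResult object named `<engine-name>-<experiment-name>`. The sketch below reads its verdict. It assumes the manifest above is saved as `pod-delete-engine.yaml` (a hypothetical file name), and the helper function names are illustrative only.

```shell
# Litmus names the ChaosResult "<engine-name>-<experiment-name>".
chaosresult_name() {
  printf '%s-%s\n' "$1" "$2"
}

# Print the verdict (for example Pass, Fail, or Awaited) of an experiment.
# $1 = ChaosEngine name, $2 = experiment name, $3 = namespace
litmus_verdict() {
  kubectl get chaosresult "$(chaosresult_name "$1" "$2")" -n "$3" \
    -o jsonpath='{.status.experimentStatus.verdict}'
}

# Example usage (the file name is an assumption):
#   kubectl apply -f pod-delete-engine.yaml
#   litmus_verdict pod-delete pod-delete default
```

The verdict typically stays at Awaited while the experiment is running, so a script would poll until it changes.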

Chaos Mesh

Installing Chaos Mesh

# Install Chaos Mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --create-namespace

Network Faults

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: myapp
  delay:
    latency: '100ms'
    jitter: '10ms'
  duration: '5m'
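
A Chaos Mesh experiment starts when its object is created and stops when the object is deleted or its `duration` elapses. A hedged sketch of that lifecycle follows, assuming the manifest above is saved as `network-delay.yaml` (a hypothetical file name). The wrapper function names are illustrative.

```shell
# Apply a chaos manifest; the experiment starts as soon as it is created.
start_chaos() {
  kubectl apply -f "$1"
}

# Show status and recent events for an experiment.
# $1 = resource kind, $2 = name, $3 = namespace
inspect_chaos() {
  kubectl describe "$1" "$2" -n "$3"
}

# Delete the object to end the experiment before its duration elapses.
end_chaos() {
  kubectl delete "$1" "$2" -n "$3"
}

# Example usage:
#   start_chaos network-delay.yaml
#   inspect_chaos networkchaos network-delay default
#   end_chaos networkchaos network-delay default
```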

Pod Faults

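# Chaos Mesh 2.x has no inline `scheduler` field; recurring experiments
# are declared with a Schedule resource that wraps the PodChaos spec.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill
  namespace: default
spec:
  schedule: '@every 5m'
  type: PodChaos
  # Do not start a new run while the previous one is still active.
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        app: myapp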

Practice: A Complete Chaos Experiment

# 1. HTTP fault
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: http-error
  namespace: default
spec:
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: myapp
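  # Rewrite matching responses so that clients receive HTTP 500.
  # A bare `code` field only filters responses; `replace.code` injects the error.
  target: Response
  port: 8080
  path: /api/*
  method: GET
  replace:
    code: 500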
  duration: '5m'

---
# 2. I/O fault
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-delay
  namespace: default
spec:
  action: latency
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: myapp
  delay: '100ms'
  volumePath: /data
  duration: '5m'

---
# 3. DNS fault
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-error
  namespace: default
spec:
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: myapp
  action: error
  # `patterns` lists the domain names whose lookups should fail.
  patterns:
    - "github.com"
  duration: '5m'

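Steady-State Hypothesis

# Define the steady-state metrics
- Error rate < 1%
- P99 latency < 500ms
- Availability > 99.9%

# Verify the metrics (the query is URL-encoded so the shell does not
# interpret its braces, brackets, and parentheses)
curl -s -G http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'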
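
To turn a target such as "error rate < 1%" into an automated check, the queried value has to be compared against the limit. A minimal sketch follows. The helper names `below` and `prom_value` are hypothetical, the Prometheus address is the one used above, and `jq` is assumed to be installed.

```shell
# Return 0 when a measured value is strictly below a limit.
# awk does the floating-point comparison, which plain `[` cannot.
below() {
  awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v + 0 < limit + 0) }'
}

# Query Prometheus and print the first sample value.
# Assumes jq is installed and the query returns a single series.
prom_value() {
  curl -s -G "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=$1" | jq -r '.data.result[0].value[1]'
}

# Example usage against the 1% error-rate target:
#   rate=$(prom_value 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))')
#   below "$rate" 0.01 && echo "steady state holds"
```

An empty result or a `NaN` value (for example when there is no traffic) is not handled here and would need an explicit check before the comparison.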

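Safety Practices

  1. Limit the blast radius
  2. Test in a non-production environment first
  3. Have a rollback plan
  4. Monitor the experiment's impact
  5. Record the experiment results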
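
A rollback plan is easier to follow when it is already scripted. The sketch below pauses a Chaos Mesh experiment when a health check fails, using the `experiment.chaos-mesh.org/pause` annotation. The function names and the health-check command are hypothetical, introduced only for illustration.

```shell
# Pause a running Chaos Mesh experiment without deleting it.
# $1 = resource kind, $2 = name, $3 = namespace
pause_chaos() {
  kubectl annotate "$1" "$2" -n "$3" --overwrite \
    experiment.chaos-mesh.org/pause=true
}

# Resume a paused experiment by removing the annotation (trailing "-").
resume_chaos() {
  kubectl annotate "$1" "$2" -n "$3" experiment.chaos-mesh.org/pause-
}

# Pause the experiment when a health check fails.
# $1 = command that exits 0 while the system is healthy; $2..$4 as above
pause_if_unhealthy() {
  if sh -c "$1"; then
    echo "healthy; experiment continues"
  else
    echo "unhealthy; pausing $2/$3"
    pause_chaos "$2" "$3" "$4"
  fi
}
```

Pausing keeps the experiment object and its history in place, which helps when recording results; deleting the object is the stronger option when the fault must not resume.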

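Summary

Chaos engineering is an important way to improve system resilience. By deliberately injecting faults, you can find a system's weaknesses and fix them before they cause real incidents.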