Chaos Mesh in Practice: Pod Failure and Network Latency Injection
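Chaos Mesh in Practice: Overview
Chaos Mesh is a CNCF incubating chaos engineering platform that injects many kinds of faults into Kubernetes environments. This chapter focuses on hands-on scenarios and common failure modes.
Chaos Mesh fault types: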
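├── Pod faults
│ ├── Pod Kill (kill the Pod)
│ ├── Pod Failure (make the Pod unavailable)
│ └── Pod Stress (CPU/memory pressure, via StressChaos)
├── Network faults
│ ├── Network Delay (added latency)
│ ├── Network Loss (packet loss)
│ ├── Network Duplicate (duplicated packets)
│ └── Network Partition (network partition)
├── IO faults
│ ├── IO Latency (delayed IO)
│ ├── IO Fault (IO errors)
│ └── IO Read/Write (failed reads and writes)
├── Time faults
│ └── Time Skew (clock offset)
└── Kernel faults
  ├── Kernel Panic
  └── Clock Skew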
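Pod Fault Injection in Practice
Simulating a Pod crash
# Kill one randomly selected Pod
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: production
spec:
  action: pod-kill
  mode: one  # one: a single Pod, all: every matching Pod, fixed: a fixed number
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  scheduler:  # Chaos Mesh 1.x syntax; 2.x removed this field in favor of the Schedule resource
    cron: "@every 10m"  # run every 10 minutes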
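Chaos Mesh 2.x dropped the inline `scheduler` field, and recurring experiments are declared through a separate `Schedule` resource instead. The sketch below shows roughly how the pod-kill experiment above might be expressed that way. The resource name and the `historyLimit` and `concurrencyPolicy` values are illustrative assumptions, not settings from the original experiment.

```yaml
# Sketch: a recurring pod-kill for Chaos Mesh 2.x (values marked "assumed" are illustrative)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-every-10m      # assumed name
  namespace: production
spec:
  schedule: "*/10 * * * *"      # standard cron: every 10 minutes
  historyLimit: 5               # assumed: keep the last 5 runs
  concurrencyPolicy: Forbid     # assumed: skip a run if the previous one is still active
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: payment-service
```

Keeping the schedule in its own resource means the experiment spec under `podChaos` stays identical to a one-off PodChaos, so the same fault can be tried once by hand before it is put on a timer.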
Simulating Pod unavailability
# Make two Pods unavailable for 5 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
spec:
  action: pod-failure
  mode: fixed
  value: "2"  # number of Pods to fail
  duration: "5m"
  selector:
    labelSelectors:
      app: order-service
CPU stress test
# Simulate high CPU load
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: user-service
  stressors:
    cpu:
      workers: 4
      load: 80  # per-worker load; total is roughly workers x load = 320% of one core
  duration: "10m"
Memory stress test
# Simulate memory pressure
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: cache-service
  stressors:
    memory:
      workers: 2
      size: "512MB"
  duration: "10m"
Network Fault Injection in Practice
Network latency
# Add network latency (100ms with ±10ms jitter)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: api-gateway
  delay:
    latency: "100ms"
    jitter: "10ms"
    correlation: "50"  # 50% correlation with the previous packet's delay
  direction: to  # to: selector -> target, from: target -> selector, both: both directions
  target:
    selector:
      labelSelectors:
        app: payment-service
    mode: all
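When the dependency being slowed down lives outside the cluster, there is no Pod for `target` to select. NetworkChaos accepts an `externalTargets` list of IPv4 addresses or domains for that case, and it only takes effect with `direction: to`. A minimal sketch follows; the resource name, the host `payments.example.com`, and the 5-minute duration are hypothetical placeholders.

```yaml
# Sketch: delay traffic from api-gateway to an external host (host and name are hypothetical)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-external    # assumed name
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: api-gateway
  delay:
    latency: "100ms"
    jitter: "10ms"
  direction: to                   # externalTargets only works with direction "to"
  externalTargets:
    - "payments.example.com"      # hypothetical external dependency
  duration: "5m"                  # assumed: bound the experiment explicitly
```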
Packet loss
# Simulate 10% packet loss
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss
spec:
  action: loss
  mode: one
  selector:
    labelSelectors:
      app: notification-service
  loss:
    loss: "10"
    correlation: "50"
  duration: "5m"
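The overview also lists duplicated packets as a network fault. It follows the same shape as the loss experiment: the `action` changes to `duplicate` and the parameters move under a `duplicate` block. A sketch reusing the same selector, with an assumed resource name and the same illustrative 10% rate:

```yaml
# Sketch: duplicate roughly 10% of packets (name and rate are illustrative)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-duplicate         # assumed name
spec:
  action: duplicate
  mode: one
  selector:
    labelSelectors:
      app: notification-service
  duplicate:
    duplicate: "10"               # percentage of packets to duplicate
    correlation: "50"
  duration: "5m"
```

Duplicate delivery is a useful check on whether message handlers in the target service are idempotent.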
Network partition
# Simulate a network partition (the two services cannot reach each other)
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
spec:
  action: partition
  mode: one
  selector:
    labelSelectors:
      app: user-service
  direction: both
  target:
    selector:
      labelSelectors:
        app: order-service
    mode: all
  duration: "3m"
IO Fault Injection
# Simulate IO latency
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-delay
spec:
  action: latency
  mode: one
  selector:
    labelSelectors:
      app: database
  volumePath: /data
  delay: "200ms"
  duration: "5m"
  path: "/data/**/*"  # glob-style wildcard (not a regex) selecting the affected files
  percent: 50  # integer: 50% of IO operations are affected
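The IO Fault type from the overview uses the same resource with `action: fault` and an `errno` to return to the caller. The sketch below reuses the selector and volume from the latency example; the resource name is assumed, and errno 5 (EIO on Linux) is chosen only as a common illustration.

```yaml
# Sketch: make half of the IO operations under /data fail with EIO
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-fault                  # assumed name
spec:
  action: fault
  mode: one
  selector:
    labelSelectors:
      app: database
  volumePath: /data
  path: "/data/**/*"              # wildcard path, as in the latency example
  errno: 5                        # EIO (input/output error)
  percent: 50
  duration: "5m"
```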
Chaos Experiment Template
# A complete chaos experiment workflow: the three faults run one after another
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: api-resilience-test
  namespace: chaos-testing
spec:
  entry: api-test
  templates:
    - name: api-test
      templateType: Serial
      children:
        - network-delay
        - pod-kill
        - cpu-stress
    - name: network-delay
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        mode: all
        selector:
          labelSelectors:
            app: api
        delay:
          latency: 100ms
    - name: pod-kill
      templateType: PodChaos
      deadline: 3m
      podChaos:
        action: pod-kill
        mode: one
        selector:
          labelSelectors:
            app: api
    - name: cpu-stress
      templateType: StressChaos
      deadline: 10m
      stressChaos:
        mode: one
        selector:
          labelSelectors:
            app: api
        stressors:
          cpu:
            workers: 2
            load: 50
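Faults rarely arrive one at a time, so it can be worth overlapping them. Workflow templates also support `Parallel` and `Suspend` node types. The following sketch is a variation on the template above, not part of the original: it runs the delay and CPU stress together, pauses, then kills a Pod. The workflow name, node names `combined-faults` and `cool-down`, and the one-minute pause are assumptions.

```yaml
# Sketch: overlap two faults, wait, then kill a Pod (names and timings are assumed)
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: api-parallel-test         # assumed name
  namespace: chaos-testing
spec:
  entry: api-test
  templates:
    - name: api-test
      templateType: Serial
      children:
        - combined-faults
        - cool-down
        - pod-kill
    - name: combined-faults       # children start at the same time
      templateType: Parallel
      children:
        - network-delay
        - cpu-stress
    - name: cool-down             # wait before the next fault
      templateType: Suspend
      deadline: 1m
    - name: network-delay
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        mode: all
        selector:
          labelSelectors:
            app: api
        delay:
          latency: 100ms
    - name: cpu-stress
      templateType: StressChaos
      deadline: 5m
      stressChaos:
        mode: one
        selector:
          labelSelectors:
            app: api
        stressors:
          cpu:
            workers: 2
            load: 50
    - name: pod-kill
      templateType: PodChaos
      deadline: 3m
      podChaos:
        action: pod-kill
        mode: one
        selector:
          labelSelectors:
            app: api
```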
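Best Practices
- Least privilege: run chaos experiments in test environments by default; production runs require strict approval
- Blast radius control: cap both the number of affected Pods and the duration of each experiment
- Automatic abort: monitor system health and stop the experiment automatically when anomalies appear
- Experiment records: document the hypothesis, procedure, and conclusions of every experiment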
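Blast radius control in particular maps directly onto fields in the experiment spec. As a sketch under assumed values (the `staging` namespace, the 25% cap, and the 2-minute duration are illustrative), an experiment can restrict its namespace, cap the share of matching Pods with `mode: fixed-percent`, and bound its own lifetime with `duration`:

```yaml
# Sketch: a deliberately small blast radius (namespace, percentage, and duration are assumed)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: limited-pod-failure       # assumed name
  namespace: staging              # assumed: a non-production namespace
spec:
  action: pod-failure
  mode: fixed-percent
  value: "25"                     # at most 25% of the matching Pods
  duration: "2m"                  # the fault recovers on its own after 2 minutes
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: order-service
```

If health checks degrade mid-run, a running experiment can be paused by setting the `experiment.chaos-mesh.org/pause=true` annotation on the resource, or stopped outright by deleting it.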