chaos-engineering-resilience

Chaos Engineering & Resilience Testing

<default_to_action> When testing system resilience or injecting failures:
  1. DEFINE steady state (normal metrics: error rate, latency, throughput)
  2. HYPOTHESIZE that the system stays in steady state during the failure
  3. INJECT real-world failures (network, instance, disk, CPU)
  4. OBSERVE and measure deviation from steady state
  5. FIX weaknesses discovered, document runbooks, repeat
Quick Chaos Steps:
  • Start small: Dev → Staging → 1% prod → gradual rollout
  • Define clear rollback triggers (error_rate > 5%)
  • Measure blast radius, never exceed planned scope
  • Document findings → runbooks → improved resilience
Critical Success Factors:
  • Controlled experiments with automatic rollback
  • Steady state must be measurable
  • Start in non-production, graduate to production </default_to_action>
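
The five steps above can be sketched as a small control loop. This is a minimal illustration, not the API of any chaos tool: the `ExperimentHooks` callbacks, the `Outcome` shape, and the `runExperiment` name are all hypothetical, and a real harness would most likely use asynchronous hooks and timed sampling.

```typescript
// Hypothetical types for illustration only.
interface Sample {
  errorRate: number; // fraction of failed requests, e.g. 0.02 = 2%
}

interface ExperimentHooks {
  inject: () => void;    // step 3: start the failure
  rollback: () => void;  // stop the failure and restore the system
  observe: () => Sample; // step 4: read current metrics
}

interface Outcome {
  hypothesisHeld: boolean;  // every sample stayed within the steady-state bound
  rolledBackEarly: boolean; // the rollback trigger fired before all samples were taken
  samplesTaken: number;
}

function runExperiment(
  hooks: ExperimentHooks,
  steadyStateMaxErrorRate: number, // step 1: steady-state bound, e.g. 0.001
  rollbackErrorRate: number,       // rollback trigger, e.g. 0.05
  maxSamples: number
): Outcome {
  let hypothesisHeld = true;
  let rolledBackEarly = false;
  let samplesTaken = 0;

  hooks.inject();
  try {
    for (let i = 0; i < maxSamples; i++) {
      const sample = hooks.observe();
      samplesTaken++;
      // Step 2: the hypothesis fails as soon as one sample leaves steady state.
      if (sample.errorRate > steadyStateMaxErrorRate) hypothesisHeld = false;
      // Abort early when the rollback trigger is exceeded.
      if (sample.errorRate > rollbackErrorRate) {
        rolledBackEarly = true;
        break;
      }
    }
  } finally {
    hooks.rollback(); // always restore the system, even if observe() throws
  }
  return { hypothesisHeld, rolledBackEarly, samplesTaken };
}
```

Putting the rollback in `finally` keeps the experiment controlled even when the observation step itself fails.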

Quick Reference Card

When to Use

  • Distributed systems validation
  • Disaster recovery testing
  • Building confidence in fault tolerance
  • Pre-production resilience verification

Failure Types to Inject

| Category | Failures | Tools |
|---|---|---|
| Network | Latency, packet loss, partition | tc, toxiproxy |
| Infrastructure | Instance kill, disk failure, CPU | Chaos Monkey |
| Application | Exceptions, slow responses, leaks | Gremlin, LitmusChaos |
| Dependencies | Service outage, timeout | WireMock |
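
For the network row, latency and packet loss are commonly injected on Linux with `tc` and its `netem` queueing discipline. The sketch below only builds the argument lists; the `NetemOptions` type and both function names are hypothetical, and actually running the commands requires root privileges on the target host.

```typescript
// Hypothetical helper: builds arguments for the Linux `tc` command.
interface NetemOptions {
  device: string;       // network interface, e.g. "eth0"
  delayMs?: number;     // added latency in milliseconds
  lossPercent?: number; // packet loss percentage
}

// Arguments for: tc qdisc add dev <device> root netem [delay Nms] [loss N%]
function netemAddArgs(opts: NetemOptions): string[] {
  const args = ["qdisc", "add", "dev", opts.device, "root", "netem"];
  if (opts.delayMs !== undefined) args.push("delay", `${opts.delayMs}ms`);
  if (opts.lossPercent !== undefined) args.push("loss", `${opts.lossPercent}%`);
  return args;
}

// Arguments for the matching rollback: tc qdisc del dev <device> root netem
function netemDelArgs(device: string): string[] {
  return ["qdisc", "del", "dev", device, "root", "netem"];
}
```

Building the rollback command alongside the injection command makes it harder to start a fault without knowing how to remove it.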

Blast Radius Progression

Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
     ↓           ↓         ↓        ↓
  Learn      Validate   Careful   Full confidence
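
The progression can be encoded as an ordered list with a gate between stages. This is a sketch under one assumption that the source does not state: an experiment that fails stays at its current stage until the weakness is fixed. The stage identifiers and the `nextStage` name are hypothetical.

```typescript
// Stage names mirror the progression above; the identifiers are illustrative.
const STAGES = ["dev", "staging", "prod-1%", "prod-10%", "prod-50%", "prod-100%"] as const;
type Stage = (typeof STAGES)[number];

// Advance one stage only after a passing run; otherwise stay put.
function nextStage(current: Stage, lastRunPassed: boolean): Stage {
  if (!lastRunPassed) return current;
  const i = STAGES.indexOf(current);
  return STAGES[Math.min(i + 1, STAGES.length - 1)];
}
```

Moving at most one stage per passing run keeps the blast radius from jumping past the planned scope.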

Steady State Metrics

| Metric | Normal | Alert Threshold |
|---|---|---|
| Error rate | < 0.1% | > 1% |
| p99 latency | < 200ms | > 500ms |
| Throughput | baseline | -20% |
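
The alert thresholds in this table can be checked mechanically. The sketch below assumes that "-20%" means throughput more than 20% below the measured baseline; the `Metrics` shape and the `alerts` function are hypothetical names, not part of any monitoring API.

```typescript
interface Metrics {
  errorRate: number;     // fraction, 0.01 = 1%
  p99LatencyMs: number;  // 99th percentile latency in milliseconds
  throughputRps: number; // requests per second
}

// Returns the names of the metrics that crossed their alert threshold.
function alerts(current: Metrics, baselineThroughputRps: number): string[] {
  const fired: string[] = [];
  if (current.errorRate > 0.01) fired.push("error-rate");
  if (current.p99LatencyMs > 500) fired.push("p99-latency");
  // Assumption: "-20%" is read as "more than 20% below baseline".
  if (current.throughputRps < baselineThroughputRps * 0.8) fired.push("throughput");
  return fired;
}
```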

Chaos Experiment Structure

```typescript
// Chaos experiment definition
const experiment = {
  name: 'Database latency injection',
  hypothesis: 'System handles 500ms DB latency gracefully',
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};
```
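
A trigger string such as `'errorRate > 5%'` has to be turned into a comparison before it can drive an automatic rollback. The parser below is one possible sketch, assuming triggers always have the form `<metric> <op> <number>[%]`; the `Trigger` type and both function names are hypothetical and not part of the experiment format itself.

```typescript
interface Trigger {
  metric: string;
  op: "<" | ">";
  value: number; // percentages are converted to fractions: "5%" -> 0.05
}

// Parses expressions like "errorRate > 5%" or "p99Latency < 300".
function parseTrigger(expr: string): Trigger {
  const m = /^\s*(\w+)\s*([<>])\s*(\d+(?:\.\d+)?)\s*(%?)\s*$/.exec(expr);
  if (!m) throw new Error(`Unrecognized trigger: ${expr}`);
  const raw = Number(m[3]);
  return {
    metric: m[1],
    op: m[2] as "<" | ">",
    value: m[4] === "%" ? raw / 100 : raw,
  };
}

// Evaluates a parsed trigger against observed metric values.
// A metric missing from `observed` never fires the trigger.
function triggerFires(t: Trigger, observed: Record<string, number>): boolean {
  const v = observed[t.metric];
  if (v === undefined) return false;
  return t.op === ">" ? v > t.value : v < t.value;
}
```

For example, `triggerFires(parseTrigger('errorRate > 5%'), { errorRate: 0.08 })` evaluates to `true`, so a harness using this sketch would roll back.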

Agent-Driven Chaos

```typescript
// qe-chaos-engineer runs controlled experiments
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");

// Validates:
// - System recovers automatically
// - Error rate stays within threshold
// - No data loss
// - Alerts triggered appropriately
```

Agent Coordination Hints

Memory Namespace

aqe/chaos-engineering/
├── experiments/*       - Experiment definitions & results
├── steady-states/*     - Baseline measurements
├── runbooks/*          - Generated recovery procedures
└── blast-radius/*      - Impact analysis
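
Keys under this namespace can be built with a small helper so that agents do not hand-assemble paths. The helper below is a hypothetical sketch: the `memoryKey` name and the restriction on allowed id characters are assumptions, not part of the namespace definition.

```typescript
const NAMESPACE_ROOT = "aqe/chaos-engineering";

// The four areas listed in the namespace tree above.
type Area = "experiments" | "steady-states" | "runbooks" | "blast-radius";

// Builds a key such as "aqe/chaos-engineering/runbooks/db-failover".
function memoryKey(area: Area, id: string): string {
  // Assumption: ids are limited to characters that cannot introduce extra path segments.
  if (!/^[A-Za-z0-9._-]+$/.test(id)) {
    throw new Error(`Invalid id for memory key: ${id}`);
  }
  return `${NAMESPACE_ROOT}/${area}/${id}`;
}
```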

Fleet Coordination

```typescript
const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',          // Experiment execution
    'qe-performance-tester',      // Baseline metrics
    'qe-production-intelligence'  // Production monitoring
  ],
  topology: 'sequential'
});
```

Related Skills

  • shift-right-testing - Production testing
  • performance-testing - Load testing
  • test-environment-management - Environment stability

Remember

Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.
With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and comprehensive resilience validation. It generates runbooks from experiment results.