chaos-engineering-resilience
Chaos Engineering & Resilience Testing
<default_to_action>
When testing system resilience or injecting failures:
- DEFINE steady state (normal metrics: error rate, latency, throughput)
- HYPOTHESIZE that the system stays in steady state during the failure
- INJECT real-world failures (network, instance, disk, CPU)
- OBSERVE and measure deviation from steady state
- FIX weaknesses discovered, document runbooks, repeat
Quick Chaos Steps:
- Start small: Dev → Staging → 1% prod → gradual rollout
- Define clear rollback triggers (error_rate > 5%)
- Measure blast radius, never exceed planned scope
- Document findings → runbooks → improved resilience
Critical Success Factors:
- Controlled experiments with automatic rollback
- Steady state must be measurable
- Start in non-production, graduate to production </default_to_action>
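The rollback trigger in the steps above (`error_rate > 5%`) can be sketched as a guard over the observation windows collected while a fault is active. This is a minimal sketch under stated assumptions: `Sample`, `GuardResult`, `errorRate`, and `runGuarded` are hypothetical names, and the windows are passed in as an array rather than polled from a live monitoring system.

```typescript
// Hypothetical shapes for illustration; not taken from any chaos tool's API.
interface Sample {
  requests: number;
  errors: number;
}

interface GuardResult {
  completed: boolean;           // true when no window crossed the trigger
  abortedAtStep: number | null; // index of the window that triggered rollback
  peakErrorRate: number;        // highest error rate seen before stopping
}

// Error rate as a fraction (0.05 means 5%). An empty window counts as healthy.
function errorRate(sample: Sample): number {
  return sample.requests === 0 ? 0 : sample.errors / sample.requests;
}

// Checks the windows in order and calls rollback() at the first one whose
// error rate exceeds the threshold. Later windows are ignored after a breach.
function runGuarded(
  samples: Sample[],
  rollbackThreshold: number,
  rollback: () => void
): GuardResult {
  const breachIndex = samples.findIndex((s) => errorRate(s) > rollbackThreshold);
  const observed = breachIndex === -1 ? samples : samples.slice(0, breachIndex + 1);
  const peakErrorRate = observed.reduce((max, s) => Math.max(max, errorRate(s)), 0);
  if (breachIndex === -1) {
    return { completed: true, abortedAtStep: null, peakErrorRate };
  }
  rollback();
  return { completed: false, abortedAtStep: breachIndex, peakErrorRate };
}
```

Passing `0.05` as `rollbackThreshold` matches the 5% trigger. A real experiment would likely evaluate each window as it arrives so that rollback fires during the fault rather than after it.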
Quick Reference Card
When to Use
- Distributed systems validation
- Disaster recovery testing
- Building confidence in fault tolerance
- Pre-production resilience verification
Failure Types to Inject
| Category | Failures | Tools |
|---|---|---|
| Network | Latency, packet loss, partition | tc, toxiproxy |
| Infrastructure | Instance kill, disk failure, CPU | Chaos Monkey |
| Application | Exceptions, slow responses, leaks | Gremlin, LitmusChaos |
| Dependencies | Service outage, timeout | WireMock |
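For the Application row (exceptions, slow responses), failures can also be injected in-process by wrapping a call. The following is a hedged sketch, not the API of any tool listed in the table: `FaultConfig`, `decideFault`, and `withFaults` are hypothetical names, and the random source is a parameter so the decision step can be tested deterministically.

```typescript
// Hypothetical config shape for illustration only.
interface FaultConfig {
  errorProbability: number;   // 0..1 chance of throwing instead of calling through
  latencyProbability: number; // 0..1 chance of delaying the call
  latencyMs: number;          // delay applied when latency is injected
}

type FaultDecision =
  | { kind: 'error' }
  | { kind: 'delay'; ms: number }
  | { kind: 'none' };

// Pure decision step: `draw` is a random number in [0, 1).
function decideFault(config: FaultConfig, draw: number): FaultDecision {
  if (draw < config.errorProbability) return { kind: 'error' };
  if (draw < config.errorProbability + config.latencyProbability) {
    return { kind: 'delay', ms: config.latencyMs };
  }
  return { kind: 'none' };
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => {
    setTimeout(() => resolve(), ms);
  });

// Wraps an async function so that a fraction of calls fail or slow down.
function withFaults<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  config: FaultConfig,
  random: () => number = Math.random
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    const decision = decideFault(config, random());
    if (decision.kind === 'error') throw new Error('Injected fault');
    if (decision.kind === 'delay') await sleep(decision.ms);
    return fn(...args);
  };
}
```

A wrapper like this only covers application-level faults. Network partitions, instance kills, and disk failures need the infrastructure-level tools in the table.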
Blast Radius Progression
```
Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
    ↓           ↓          ↓                    ↓
  Learn     Validate    Careful          Full confidence
```
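The progression above can be sketched as a small stage machine plus a helper that converts a planned percentage into an instance count. The stage labels, `nextStage`, and `instancesInScope` are hypothetical names chosen for illustration. Rounding down is an assumption made here so that the computed scope never exceeds the planned blast radius.

```typescript
// Stage labels mirror the progression above; the identifiers are hypothetical.
const STAGES = ['dev', 'staging', 'prod-1%', 'prod-10%', 'prod-50%', 'prod-100%'] as const;
type Stage = (typeof STAGES)[number];

// Advance one stage only when the previous experiment held steady state.
// A failed run stays at the same stage so the fix can be re-verified there.
function nextStage(current: Stage, steadyStateHeld: boolean): Stage {
  if (!steadyStateHeld) return current;
  const index = STAGES.indexOf(current);
  return STAGES[Math.min(index + 1, STAGES.length - 1)] ?? current;
}

// Number of instances to target for a planned blast radius, given as a
// percentage of the fleet. Rounds down, so the result never exceeds the plan;
// 0 means the fleet is too small to run at that percentage.
function instancesInScope(totalInstances: number, percent: number): number {
  if (totalInstances <= 0 || percent <= 0) return 0;
  const capped = Math.min(percent, 100);
  return Math.floor((totalInstances * capped) / 100);
}
```

For example, `instancesInScope(200, 1)` gives 2 instances, while `instancesInScope(50, 1)` gives 0, which signals that a 1% experiment cannot be run on a 50-instance fleet without exceeding the planned scope.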
Steady State Metrics
| Metric | Normal | Alert Threshold |
|---|---|---|
| Error rate | < 0.1% | > 1% |
| p99 latency | < 200ms | > 500ms |
| Throughput | baseline | -20% |
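The thresholds in the table can be turned into a simple classifier. This is a sketch under assumptions: `Observation`, `Status`, and `classify` are hypothetical names, rates are fractions (0.001 means 0.1%), and the `degraded` status for values that fall between the Normal and Alert columns is introduced here for illustration, since the table does not name that band.

```typescript
type Status = 'normal' | 'degraded' | 'alert';

// Hypothetical observation shape; errorRate is a fraction, not a percentage.
interface Observation {
  errorRate: number;
  p99LatencyMs: number;
  throughput: number; // same unit as the measured baseline, e.g. requests/s
}

// Below the Normal bound is 'normal', above the Alert bound is 'alert',
// anything in between is 'degraded' (an assumed middle band).
function band(value: number, normalBelow: number, alertAbove: number): Status {
  if (value < normalBelow) return 'normal';
  if (value > alertAbove) return 'alert';
  return 'degraded';
}

// Thresholds are copied from the table; the baseline must come from measurement.
function classify(
  obs: Observation,
  baselineThroughput: number
): Record<keyof Observation, Status> {
  const drop =
    baselineThroughput > 0
      ? (baselineThroughput - obs.throughput) / baselineThroughput
      : 0;
  return {
    errorRate: band(obs.errorRate, 0.001, 0.01),
    p99LatencyMs: band(obs.p99LatencyMs, 200, 500),
    throughput: drop >= 0.2 ? 'alert' : 'normal',
  };
}
```

Measuring `baselineThroughput` before each experiment, rather than hard-coding it, keeps the throughput check meaningful as traffic patterns change.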
Chaos Experiment Structure
```typescript
// Chaos experiment definition
const experiment = {
  name: 'Database latency injection',
  hypothesis: 'System handles 500ms DB latency gracefully',
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};
```
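The definition above stores its conditions as strings such as `'errorRate > 5%'`. One way to make them machine-checkable is a small parser, sketched below. `Condition`, `parseCondition`, and `holds` are hypothetical helpers, not part of any framework, and the accepted grammar (metric name, comparator, number, optional `%`, `ms`, or `s` unit) is an assumption based only on the strings in the example.

```typescript
type Comparator = '<' | '>' | '<=' | '>=';

interface Condition {
  metric: string;
  comparator: Comparator;
  value: number;
  unit: string; // '%', 'ms', 's', or '' when no unit was given
}

// Parses strings such as 'errorRate > 5%'. Returns null for anything that
// does not match the assumed grammar instead of guessing.
function parseCondition(expr: string): Condition | null {
  const pattern = /^\s*([A-Za-z_]\w*)\s*(<=|>=|<|>)\s*(\d+(?:\.\d+)?)\s*(%|ms|s)?\s*$/;
  const match = pattern.exec(expr);
  if (match === null) return null;
  const metric = match[1];
  const comparator = match[2];
  const raw = match[3];
  if (metric === undefined || comparator === undefined || raw === undefined) return null;
  return {
    metric,
    comparator: comparator as Comparator, // the regex only admits the four comparators
    value: Number(raw),
    unit: match[4] ?? '',
  };
}

// True when the observed value satisfies the condition. The observed value
// must be in the condition's own unit (percent for '%', milliseconds for 'ms').
function holds(condition: Condition, observed: number): boolean {
  switch (condition.comparator) {
    case '<':
      return observed < condition.value;
    case '>':
      return observed > condition.value;
    case '<=':
      return observed <= condition.value;
    case '>=':
      return observed >= condition.value;
  }
}
```

The `steadyState` entries omit the metric name because the object key carries it, so a caller would prepend the key before parsing, for example `parseCondition('p99Latency ' + '< 300ms')`.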
Agent-Driven Chaos
```typescript
// qe-chaos-engineer runs controlled experiments
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");

// Validates:
// - System recovers automatically
// - Error rate stays within threshold
// - No data loss
// - Alerts triggered appropriately
```
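The first item in the validation list, automatic recovery, can be checked against a time series of the `success-rate` metric. The sketch below is independent of the `Task` call and of how the agent collects data: `RatePoint` and `recoveryTimeSeconds` are hypothetical names, and the series is assumed to be sorted by time.

```typescript
// Hypothetical sample shape; the series must be sorted by atSeconds.
interface RatePoint {
  atSeconds: number;
  successRate: number; // fraction, e.g. 0.99
}

// Seconds from the first sample below the threshold to the first sample
// after the last dip. Returns 0 when the rate never dropped, and null when
// the series ends while still below the threshold (no recovery observed).
function recoveryTimeSeconds(series: RatePoint[], threshold: number): number | null {
  const below = series.filter((p) => p.successRate < threshold);
  const firstDip = below[0];
  const lastDip = below[below.length - 1];
  if (firstDip === undefined || lastDip === undefined) return 0;
  const recovered = series.find((p) => p.atSeconds > lastDip.atSeconds);
  return recovered === undefined ? null : recovered.atSeconds - firstDip.atSeconds;
}
```

With the 0.99 threshold from the hypothesis, a `null` result means the system did not recover within the observed window, which is a finding to document rather than a measurement error.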
Agent Coordination Hints
Memory Namespace
```
aqe/chaos-engineering/
├── experiments/*      - Experiment definitions & results
├── steady-states/*    - Baseline measurements
├── runbooks/*         - Generated recovery procedures
└── blast-radius/*     - Impact analysis
```
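Keys under this namespace can be built with a small helper so that every agent writes to the same paths. This is only a sketch: `memoryKey` is a hypothetical function, the slug rule (lowercase, hyphen-separated) is an assumption, and nothing here reflects how the memory store itself is accessed.

```typescript
const NAMESPACE = 'aqe/chaos-engineering';

// The four sections listed in the namespace tree.
const SECTIONS = ['experiments', 'steady-states', 'runbooks', 'blast-radius'] as const;
type Section = (typeof SECTIONS)[number];

// Builds a key such as 'aqe/chaos-engineering/experiments/<slug>'.
// The id is lowercased and non-alphanumeric runs become single hyphens.
function memoryKey(section: Section, id: string): string {
  const slug = id
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
  if (slug === '') {
    throw new Error('id must contain at least one letter or digit');
  }
  return `${NAMESPACE}/${section}/${slug}`;
}
```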
Fleet Coordination
```typescript
const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',         // Experiment execution
    'qe-performance-tester',     // Baseline metrics
    'qe-production-intelligence' // Production monitoring
  ],
  topology: 'sequential'
});
```
Related Skills
- shift-right-testing - Production testing
- performance-testing - Load testing
- test-environment-management - Environment stability
Remember
Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.
With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and comprehensive resilience validation. It also generates runbooks from experiment results.