chaos-engineering

Chaos Engineering

Design experiments that surface real weaknesses in production systems — without becoming outages. Most "chaos engineering" attempts skip steady-state measurement, define no abort criteria, and have no blast-radius bound. This skill enforces the discipline that makes chaos experiments safe and useful.

When to use

Planning a chaos experiment (what to break, where, when, how to abort)
Calculating blast radius before running the experiment
Reviewing an existing experiment plan for safety
Choosing a chaos tool (Chaos Toolkit / Chaos Mesh / Litmus / Gremlin / AWS FIS)
Writing a chaos experiment postmortem
Running a Game Day exercise

When NOT to use

General incident response (use `incident-response`)
Threat hunting / red-team (use `red-team`, `threat-detection`)
Performance load testing (different goal — chaos is about failure modes, not capacity)
Production debugging (chaos discovers weaknesses preemptively, not after-the-fact)

Core principle: chaos without abort criteria is an outage

The 4 Principles of Chaos Engineering (Netflix, 2016):

Build a hypothesis around steady-state behavior. Not "what breaks?" but "X holds; will it still hold under fault Y?"
Vary real-world events. Inject realistic failures: kill nodes, slow networks, lose cache, throttle dependencies.
Run experiments in production. Staging never has the same failure modes. Start small.
Automate experiments to run continuously. One-off chaos is a press release; continuous chaos is engineering.

Add a fifth: Define abort criteria up front. A chaos experiment with no abort criteria is an outage by another name.

Quick start

```bash
SKILL=engineering/chaos-engineering/skills/chaos-engineering

# 1. Design an experiment
python "$SKILL/scripts/experiment_designer.py" --target "checkout-svc" --hypothesis "p99 latency stays <500ms" --attack latency --duration-min 15

# 2. Calculate blast radius
python "$SKILL/scripts/blast_radius_calculator.py" --traffic-share 0.05 --user-pop 1000000 --duration-min 15

# 3. Generate postmortem after the experiment
python "$SKILL/scripts/experiment_postmortem.py" --plan experiment.json --result-log results.txt
```

The 3 Python tools

All three are stdlib-only. Run any of them with `--help` to see its options.

experiment_designer.py

Generates a structured experiment plan from inputs. Enforces the required sections (hypothesis, steady-state metric, blast radius, abort criteria, rollback).

```bash
python scripts/experiment_designer.py \
  --target "checkout-svc" \
  --hypothesis "p99 latency stays <500ms when payment-svc is slow" \
  --attack latency \
  --magnitude "+200ms" \
  --duration-min 15 \
  --blast-radius "5% of US traffic" \
  --abort-if "p99 > 1000ms OR error_rate > baseline + 1pp"
```

Outputs a markdown plan with: hypothesis, steady-state, attack, magnitude, duration, blast radius, abort criteria, rollback procedure, monitoring dashboards, and learning question.
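
The "required sections" rule can be pictured as a simple completeness check. The sketch below is illustrative only and is not taken from `experiment_designer.py`; the key names (`steady_state_metric`, `abort_criteria`, and so on) and the `missing_sections` helper are assumptions chosen to mirror the sections listed above.

```python
# Hypothetical sketch: NOT the actual experiment_designer.py implementation.
# Key names are assumed; they mirror the required sections named in the text.
REQUIRED_SECTIONS = (
    "hypothesis",
    "steady_state_metric",
    "blast_radius",
    "abort_criteria",
    "rollback",
)


def missing_sections(plan: dict) -> list:
    """Return the required sections that are absent or blank in the plan."""
    return [
        key for key in REQUIRED_SECTIONS
        if not str(plan.get(key) or "").strip()
    ]


plan = {
    "hypothesis": "p99 latency stays <500ms when payment-svc is slow",
    "steady_state_metric": "p99 latency of checkout-svc",
    "blast_radius": "5% of US traffic",
    "abort_criteria": "p99 > 1000ms OR error_rate > baseline + 1pp",
    # "rollback" is intentionally left out to show the check failing.
}

print(missing_sections(plan))
```

A plan would only be considered ready for peer review once this kind of check returns an empty list.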

blast_radius_calculator.py

Computes the blast radius of a planned experiment. Given traffic share + user population + duration, calculates expected affected users, expected error budget burn, and a risk score.

```bash
python scripts/blast_radius_calculator.py \
  --traffic-share 0.05 \
  --user-pop 1000000 \
  --duration-min 15 \
  --baseline-availability 0.999 \
  --expected-impact-availability 0.95
```

Outputs:

Expected affected users
Error budget consumed (in minutes of error budget)
Risk score: GREEN / YELLOW / RED
Recommendation: PROCEED / REDUCE / ABORT

GREEN = <1% error budget; YELLOW = 1-10%; RED = >10%.
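
To make the arithmetic concrete, here is one plausible way such a calculation could work. This is a sketch under stated assumptions, not the script's actual formula: it assumes a 30-day error budget window, counts only the availability drop below baseline as budget burn, and maps GREEN/YELLOW/RED to PROCEED/REDUCE/ABORT. The function name `blast_radius` and its return keys are hypothetical.

```python
# Hypothetical sketch; the real blast_radius_calculator.py may differ.
WINDOW_MINUTES = 30 * 24 * 60  # ASSUMPTION: 30-day error budget window


def blast_radius(traffic_share, user_pop, duration_min,
                 baseline_availability, expected_impact_availability):
    """Estimate affected users and error budget burn for one experiment."""
    # Total error budget for the window, in minutes of full unavailability.
    budget_min = (1 - baseline_availability) * WINDOW_MINUTES
    if budget_min <= 0:
        raise ValueError("baseline availability leaves no error budget")

    affected_users = round(traffic_share * user_pop)

    # Extra unavailability caused by the experiment, scaled to exposed traffic.
    extra = max(0.0, baseline_availability - expected_impact_availability)
    burned_min = traffic_share * extra * duration_min
    burn_fraction = burned_min / budget_min

    # Thresholds from the text: GREEN <1%, YELLOW 1-10%, RED >10%.
    if burn_fraction < 0.01:
        risk, recommendation = "GREEN", "PROCEED"
    elif burn_fraction <= 0.10:
        risk, recommendation = "YELLOW", "REDUCE"
    else:
        risk, recommendation = "RED", "ABORT"

    return {
        "affected_users": affected_users,
        "burned_min": burned_min,
        "burn_fraction": burn_fraction,
        "risk": risk,
        "recommendation": recommendation,
    }


result = blast_radius(0.05, 1_000_000, 15, 0.999, 0.95)
print(result["affected_users"], result["risk"], result["recommendation"])
```

Under these assumptions, the example inputs expose 50,000 users and burn well under 1% of the monthly budget, so the experiment scores GREEN. Widening the traffic share or the duration raises the burn linearly, which is why both are the first levers to pull when a plan scores YELLOW.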

experiment_postmortem.py

Produces a structured postmortem from an experiment plan + results. Catches the common postmortem failure modes: no learning recorded, no follow-up actions, blame-laden language.

```bash
python scripts/experiment_postmortem.py --plan experiment.json --result-log results.txt
```

Outputs markdown with: summary, hypothesis (was it confirmed/refuted?), what we learned, what surprised us, follow-up actions with owners, and link to next experiment.

The 7 attack types (taxonomy)

Different attacks reveal different weaknesses. See `references/attack_taxonomy.md` for full detail.

| Attack | What it tests | Tooling |
|---|---|---|
| Latency | Timeouts, retries, circuit breakers | `tc`, Chaos Mesh `NetworkChaos` |
| Error | Error handling, fallback paths | Chaos Mesh `HTTPChaos`, Toxiproxy |
| Resource (CPU, memory, disk) | Saturation handling, autoscaling | Chaos Mesh `StressChaos`, stress-ng |
| Network partition | Split-brain, consensus, failover | Chaos Mesh `NetworkChaos` partition |
| Dependency failure | Graceful degradation, fallback | Service mesh fault injection |
| Time | Clock skew, NTP issues | libfaketime, Chaos Mesh `TimeChaos` |
| Infrastructure (kill instance) | Auto-recovery, failover | AWS FIS, Chaos Monkey |

Pick the attack that matches the hypothesis. "What happens if X is slow?" → latency. "What happens if X loses network?" → partition.

Tooling chooser

| Tool | Best for | Pricing | Stack |
|---|---|---|---|
| Chaos Toolkit | Lightweight, language-agnostic, JSON experiments | OSS | Any |
| Chaos Mesh | Kubernetes-native, rich CRDs, in-cluster | OSS | Kubernetes |
| Litmus | Kubernetes, Argo-integrated, large library | OSS + Enterprise | Kubernetes |
| Gremlin | Enterprise SaaS, multi-cloud, audit | Paid | Any |
| AWS FIS | AWS-native, IAM-integrated, EC2/ECS/EKS | Paid (AWS) | AWS |
| Custom | Niche needs, single-cloud, low budget | None | Any |

Decision rules:

k8s-only stack + OSS → Chaos Mesh or Litmus (Litmus has bigger experiment library)
Multi-cloud + OSS → Chaos Toolkit
AWS-heavy + simple needs → AWS FIS
Enterprise + audit/compliance → Gremlin
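
The decision rules above can be read as a small lookup. The following sketch encodes them as written; the function name `choose_tool`, its flag names, and the order in which the rules are checked are assumptions, since the text does not say which rule wins when several apply.

```python
# Hypothetical sketch of the decision rules; precedence is an ASSUMPTION.
def choose_tool(*, kubernetes_only=False, multi_cloud=False, aws_heavy=False,
                needs_audit=False, oss_required=False):
    """Map stack constraints to the tool suggested by the decision rules."""
    if needs_audit and not oss_required:
        return "Gremlin"  # Enterprise + audit/compliance
    if oss_required and kubernetes_only:
        return "Chaos Mesh or Litmus"  # k8s-only stack + OSS
    if oss_required and multi_cloud:
        return "Chaos Toolkit"  # Multi-cloud + OSS
    if aws_heavy:
        return "AWS FIS"  # AWS-heavy + simple needs
    return "No rule matched: see references/tooling_landscape.md"


print(choose_tool(kubernetes_only=True, oss_required=True))
print(choose_tool(aws_heavy=True))
```

Treat the result as a starting point for the trade-off discussion, not a final answer.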

See `references/tooling_landscape.md` for trade-offs.

Workflows

Workflow 1: Design and run a single experiment

1. State a hypothesis: "When [fault], steady-state metric X stays within Y."
2. Identify the steady-state metric — must be measurable BEFORE the experiment.
3. Run blast_radius_calculator.py — confirm GREEN before proceeding.
4. Run experiment_designer.py to produce the plan.
5. Get a peer review of the plan; confirm abort criteria are concrete.
6. Notify the on-call team in #incidents (or whatever channel).
7. Run the experiment with monitoring open.
8. If abort criteria are hit, abort immediately; record what happened.
9. Run experiment_postmortem.py to capture learnings.
10. File follow-up actions; link to next experiment.
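
Steps 7 and 8 depend on abort criteria that a machine can evaluate, not a judgment call made mid-experiment. As a minimal sketch, assuming metrics arrive as periodic samples, the criterion `p99 > 1000ms OR error_rate > baseline + 1pp` could be checked like this. The `Sample` type, `should_abort`, `run_experiment`, and the sample values are all hypothetical; a real setup would read from the monitoring system and trigger the rollback procedure on abort.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring sample (hypothetical shape)."""
    p99_ms: float
    error_rate: float  # fraction of requests, e.g. 0.002 means 0.2%


def should_abort(sample, baseline_error_rate,
                 p99_limit_ms=1000.0, error_margin_pp=1.0):
    """Evaluate 'p99 > 1000ms OR error_rate > baseline + 1pp'."""
    error_limit = baseline_error_rate + error_margin_pp / 100.0
    return sample.p99_ms > p99_limit_ms or sample.error_rate > error_limit


def run_experiment(samples, baseline_error_rate):
    """Stop at the first sample that trips the abort criteria."""
    for index, sample in enumerate(samples):
        if should_abort(sample, baseline_error_rate):
            return ("ABORTED", index)
    return ("COMPLETED", len(samples))


# Made-up samples: the third one breaches the p99 limit.
samples = [
    Sample(p99_ms=420, error_rate=0.002),
    Sample(p99_ms=610, error_rate=0.004),
    Sample(p99_ms=1150, error_rate=0.005),
    Sample(p99_ms=480, error_rate=0.002),
]
print(run_experiment(samples, baseline_error_rate=0.002))
```

The loop stops at the first breach and never looks at later samples, which matches the "abort immediately" rule in step 8.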

Workflow 2: Game Day exercise

1. Pick a scenario (e.g., "primary database fails over").
2. Identify all dependent services that should keep working.
3. Build a multi-experiment plan covering each layer.
4. Schedule with stakeholders; on-call coverage required.
5. Run with a facilitator who manages the scenario.
6. Capture observations in a shared doc as they happen.
7. Single combined postmortem covering all observations.
8. Track follow-up actions in a board with owners.

Workflow 3: Continuous chaos (game days → daily)

1. Start: weekly Game Day in staging.
2. Move to: weekly Game Day in production with limited blast radius.
3. Mature to: continuous chaos via scheduled experiments (Litmus chaos schedule, Gremlin scenarios).
4. Wire to deployment: every prod deploy triggers a baseline chaos sweep.
5. Track: experiments per week, weaknesses discovered, MTTR trend.

Composition with other skills

This skill explicitly composes with three others in this library:

| Skill | Composition |
|---|---|
| `feature-flags-architect` | Kill switches defined there are the abort triggers here |
| `kubernetes-operator` | Operators are common chaos targets (test reconcile under fault) |
| `incident-response` | Chaos experiments that escalate become incidents |

Anti-patterns

No hypothesis — "let's break things" is sabotage, not engineering
No steady-state metric — without a baseline, you can't tell if X broke
No blast radius bound — full-prod experiment without limits = outage
No abort criteria — see above; this is mandatory
No on-call coverage — chaos without monitoring is unmonitored production
Chaos in staging only — staging never has prod failure modes
Chaos in dev — useless; dev has different failure modes from prod
One-off chaos — single experiment is a press release; learning requires recurrence
Blame-laden postmortem — record causes, not blame; teams stop running chaos otherwise

References

`references/chaos_principles.md`: the 4 principles, history, when to start
`references/experiment_design.md`: hypothesis structure, steady-state metrics, abort criteria
`references/attack_taxonomy.md`: 7 attack types with examples and tooling
`references/tooling_landscape.md`: Chaos Toolkit / Mesh / Litmus / Gremlin / FIS / DIY

Slash command

`/chaos-experiment`: interactive experiment design wizard that runs all 3 tools.

Asset templates

`assets/experiment_template.md`: fill-in plan template
`assets/postmortem_template.md`: structured postmortem template

Verifiable success

A team using this skill should achieve:

100% of chaos experiments have a written hypothesis, abort criteria, and blast-radius calculation
Blast radius for any single experiment never exceeds 10% of error budget
Mean time between chaos experiments <14 days (continuous, not one-off)
Each experiment produces ≥1 follow-up action that gets shipped
No chaos experiment escalates to a customer-impacting incident in trailing 90 days
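
The cadence target is easy to check from a list of run dates. The sketch below shows one way to compute the mean gap between experiments; the `mean_gap_days` helper and the dates are made up for illustration.

```python
from datetime import date


def mean_gap_days(run_dates):
    """Mean number of days between consecutive experiments, or None if <2 runs."""
    ordered = sorted(run_dates)
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical run dates; gaps are 11, 14, and 10 days.
runs = [date(2024, 1, 8), date(2024, 1, 19), date(2024, 2, 2), date(2024, 2, 12)]

gap = mean_gap_days(runs)
print(f"mean gap: {gap:.1f} days, target met: {gap < 14}")
```

A mean hides outliers, so a team may also want to watch the longest single gap alongside it.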
