chaos-engineering

Chaos Engineering

Design experiments that surface real weaknesses in production systems — without becoming outages. Most "chaos engineering" attempts skip steady-state measurement, define no abort criteria, and have no blast-radius bound. This skill enforces the discipline that makes chaos experiments safe and useful.

When to use

  • Planning a chaos experiment (what to break, where, when, how to abort)
  • Calculating blast radius before running the experiment
  • Reviewing an existing experiment plan for safety
  • Choosing a chaos tool (Chaos Toolkit / Chaos Mesh / Litmus / Gremlin / AWS FIS)
  • Writing a chaos experiment postmortem
  • Running a Game Day exercise

When NOT to use

  • General incident response (use incident-response)
  • Threat hunting / red-team (use red-team, threat-detection)
  • Performance load testing (different goal — chaos is about failure modes, not capacity)
  • Production debugging (chaos discovers weaknesses preemptively, not after-the-fact)

Core principle: chaos without abort criteria is an outage

The 4 Principles of Chaos Engineering (Netflix, 2016):
  1. Build a hypothesis around steady-state behavior. Not "what breaks?" but "X holds; will it still hold under fault Y?"
  2. Vary real-world events. Inject realistic failures: kill nodes, slow networks, lose cache, throttle dependencies.
  3. Run experiments in production. Staging never has the same failure modes. Start small.
  4. Automate experiments to run continuously. One-off chaos is a press release; continuous chaos is engineering.
Add a fifth: Define abort criteria up front. A chaos experiment with no abort criteria is an outage by another name.
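
The first principle becomes easier to review when the hypothesis is written as a predicate over a measured metric. Below is a minimal Python sketch, assuming latency samples in milliseconds have already been collected. The function names, the sample data, and the 500 ms threshold (borrowed from the quick-start example) are illustrative assumptions, not part of this skill's scripts.

```python
import statistics


def p99(samples_ms):
    """Return the 99th-percentile latency (ms) of the given samples."""
    # quantiles(n=100) yields 99 cut points; index 98 is the 99th percentile.
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[98]


def hypothesis_holds(samples_ms, threshold_ms=500.0):
    """True if the steady-state claim 'p99 stays below threshold_ms' holds."""
    return p99(samples_ms) < threshold_ms


# Hypothetical data: 95 fast requests and 5 slow ones at baseline.
baseline = [120.0] * 95 + [250.0] * 5
# The same traffic with a +200 ms latency fault applied to every request.
under_fault = [s + 200.0 for s in baseline]

print("baseline holds:", hypothesis_holds(baseline))
print("holds under fault:", hypothesis_holds(under_fault))
```

Measuring the same predicate before and during the fault is what separates "X holds; will it still hold under fault Y?" from "what breaks?".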

Quick start

bash
SKILL=engineering/chaos-engineering/skills/chaos-engineering

# 1. Design an experiment
python "$SKILL/scripts/experiment_designer.py" --target "checkout-svc" --hypothesis "p99 latency stays <500ms" --attack latency --duration-min 15

# 2. Calculate blast radius
python "$SKILL/scripts/blast_radius_calculator.py" --traffic-share 0.05 --user-pop 1000000 --duration-min 15

# 3. Generate postmortem after the experiment
python "$SKILL/scripts/experiment_postmortem.py" --plan experiment.json --result-log results.txt

The 3 Python tools

All three are stdlib-only. Run any of them with --help to see its options.

experiment_designer.py

Generates a structured experiment plan from inputs. Enforces the required sections (hypothesis, steady-state metric, blast radius, abort criteria, rollback).
bash
python scripts/experiment_designer.py \
  --target "checkout-svc" \
  --hypothesis "p99 latency stays <500ms when payment-svc is slow" \
  --attack latency \
  --magnitude "+200ms" \
  --duration-min 15 \
  --blast-radius "5% of US traffic" \
  --abort-if "p99 > 1000ms OR error_rate > baseline + 1pp"
Outputs a markdown plan with: hypothesis, steady-state, attack, magnitude, duration, blast radius, abort criteria, rollback procedure, monitoring dashboards, and learning question.
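
The "required sections" rule can be pictured as a small validation step. The sketch below is not the script's actual code; the dictionary keys and the function name are assumptions chosen to mirror the sections listed above.

```python
# Hypothetical key names mirroring the required sections of a plan.
REQUIRED_SECTIONS = (
    "hypothesis",
    "steady_state_metric",
    "blast_radius",
    "abort_criteria",
    "rollback",
)


def missing_sections(plan):
    """Return the required sections that are absent or blank in a plan dict."""
    missing = []
    for name in REQUIRED_SECTIONS:
        value = plan.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


plan = {
    "hypothesis": "p99 latency stays <500ms when payment-svc is slow",
    "steady_state_metric": "checkout-svc p99 latency",
    "blast_radius": "5% of US traffic",
    "abort_criteria": "",  # left blank on purpose
    "rollback": "remove the latency fault",
}
print(missing_sections(plan))
```

A plan that returns a non-empty list here should not proceed to review, since a blank abort criterion is exactly the failure this skill is meant to prevent.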

blast_radius_calculator.py

Computes the blast radius of a planned experiment. Given traffic share + user population + duration, calculates expected affected users, expected error budget burn, and a risk score.
bash
python scripts/blast_radius_calculator.py \
  --traffic-share 0.05 \
  --user-pop 1000000 \
  --duration-min 15 \
  --baseline-availability 0.999 \
  --expected-impact-availability 0.95
Outputs:
  • Expected affected users
  • Error budget consumed (in minutes of error budget)
  • Risk score: GREEN / YELLOW / RED
  • Recommendation: PROCEED / REDUCE / ABORT
GREEN = <1% error budget; YELLOW = 1-10%; RED = >10%.
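
One plausible way to arrive at these numbers is sketched below. The script's real formula is not shown in this document, so the following are assumptions: a 30-day error budget window, burn measured as the extra unavailability weighted by traffic share and duration, and a one-to-one mapping from risk colour to recommendation. Only the GREEN / YELLOW / RED thresholds come from the text above.

```python
def blast_radius(traffic_share, user_pop, duration_min,
                 baseline_availability, impact_availability,
                 budget_window_days=30):
    """Estimate affected users and error budget burn for one experiment."""
    if not 0.0 <= traffic_share <= 1.0:
        raise ValueError("traffic_share must be between 0 and 1")
    if baseline_availability >= 1.0:
        raise ValueError("a 100% availability target leaves no error budget")

    affected_users = round(traffic_share * user_pop)

    # Total error budget for the window, in minutes of full unavailability.
    budget_min = (1.0 - baseline_availability) * budget_window_days * 24 * 60

    # Extra unavailability caused by the fault, scaled to the exposed traffic.
    extra_unavailability = max(0.0, baseline_availability - impact_availability)
    burn_min = traffic_share * extra_unavailability * duration_min
    burn_fraction = burn_min / budget_min

    # Thresholds from the text: GREEN <1%, YELLOW 1-10%, RED >10%.
    if burn_fraction < 0.01:
        risk, recommendation = "GREEN", "PROCEED"
    elif burn_fraction <= 0.10:
        risk, recommendation = "YELLOW", "REDUCE"
    else:
        risk, recommendation = "RED", "ABORT"

    return {
        "affected_users": affected_users,
        "burn_min": burn_min,
        "burn_fraction": burn_fraction,
        "risk": risk,
        "recommendation": recommendation,
    }


example = blast_radius(0.05, 1_000_000, 15, 0.999, 0.95)
print(example)
```

Under these assumptions the example command exposes 50,000 users and burns well under 1% of a monthly budget, which is why a 5% traffic slice is a reasonable first step.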

experiment_postmortem.py

Produces a structured postmortem from an experiment plan + results. Catches the common postmortem failure modes: no learning recorded, no follow-up actions, blame-laden language.
bash
python scripts/experiment_postmortem.py --plan experiment.json --result-log results.txt
Outputs markdown with: summary, hypothesis (was it confirmed/refuted?), what we learned, what surprised us, follow-up actions with owners, and link to next experiment.
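
The three failure modes named above lend themselves to simple checks. This is a rough sketch rather than the script's implementation; the field names and the blame phrase list are hypothetical, and a real phrase list would need tuning per team.

```python
import re

# Hypothetical phrases that tend to signal blame rather than cause.
BLAME_PATTERNS = (
    r"\bfault of\b",
    r"\bshould have known\b",
    r"\bcareless",
    r"\bblame",
)


def postmortem_problems(postmortem):
    """Return a list of common postmortem failure modes found in a draft."""
    problems = []

    learnings = postmortem.get("learnings") or []
    if not learnings:
        problems.append("no learning recorded")

    actions = postmortem.get("follow_up_actions") or []
    if not actions:
        problems.append("no follow-up actions")
    elif any(not action.get("owner") for action in actions):
        problems.append("follow-up action without an owner")

    # Scan the free-text parts of the draft for blame-laden wording.
    text = " ".join([postmortem.get("summary", "")] + list(learnings))
    if any(re.search(p, text, re.IGNORECASE) for p in BLAME_PATTERNS):
        problems.append("blame-laden language")

    return problems


draft = {
    "summary": "The outage was the fault of the payments team.",
    "learnings": [],
    "follow_up_actions": [],
}
print(postmortem_problems(draft))
```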

The 7 attack types (taxonomy)

Different attacks reveal different weaknesses. See references/attack_taxonomy.md for full detail.

| Attack | What it tests | Tooling |
| --- | --- | --- |
| Latency | Timeouts, retries, circuit breakers | tc, Chaos Mesh NetworkChaos |
| Error | Error handling, fallback paths | Chaos Mesh HTTPChaos, Toxiproxy |
| Resource (CPU, memory, disk) | Saturation handling, autoscaling | Chaos Mesh StressChaos, stress-ng |
| Network partition | Split-brain, consensus, failover | Chaos Mesh NetworkChaos partition |
| Dependency failure | Graceful degradation, fallback | Service mesh fault injection |
| Time | Clock skew, NTP issues | libfaketime, Chaos Mesh TimeChaos |
| Infrastructure (kill instance) | Auto-recovery, failover | AWS FIS, Chaos Monkey |

Pick the attack that matches the hypothesis. "What happens if X is slow?" → latency. "What happens if X loses network?" → partition.

Tooling chooser

| Tool | Best for | Pricing | Stack |
| --- | --- | --- | --- |
| Chaos Toolkit | Lightweight, language-agnostic, JSON experiments | OSS | Any |
| Chaos Mesh | Kubernetes-native, rich CRDs, in-cluster | OSS | Kubernetes |
| Litmus | Kubernetes, Argo-integrated, large library | OSS + Enterprise | Kubernetes |
| Gremlin | Enterprise SaaS, multi-cloud, audit | Paid | Any |
| AWS FIS | AWS-native, IAM-integrated, EC2/ECS/EKS | Paid (AWS) | AWS |
| Custom | Niche needs, single-cloud, low budget | None | Any |

Decision rules:
  • k8s-only stack + OSS → Chaos Mesh or Litmus (Litmus has bigger experiment library)
  • Multi-cloud + OSS → Chaos Toolkit
  • AWS-heavy + simple needs → AWS FIS
  • Enterprise + audit/compliance → Gremlin
See references/tooling_landscape.md for trade-offs.
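
The decision rules above can be written down as a lookup so that a team records why a tool was picked. In this sketch the function and parameter names are assumptions; it returns every tool whose rule matches and does not rank them, because the rules in this document do not define a precedence.

```python
def choose_tools(kubernetes_only=False, multi_cloud=False, aws_heavy=False,
                 simple_needs=False, enterprise_audit=False, oss=False):
    """Return the tools suggested by the decision rules, in document order."""
    candidates = []
    if kubernetes_only and oss:
        # Litmus is noted as having the bigger experiment library.
        candidates += ["Chaos Mesh", "Litmus"]
    if multi_cloud and oss:
        candidates.append("Chaos Toolkit")
    if aws_heavy and simple_needs:
        candidates.append("AWS FIS")
    if enterprise_audit:
        candidates.append("Gremlin")
    return candidates


print(choose_tools(kubernetes_only=True, oss=True))
print(choose_tools(aws_heavy=True, simple_needs=True))
```

An empty list means none of the rules applies, which is the cue to read the trade-off notes rather than default to a tool.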

Workflows

Workflow 1: Design and run a single experiment

1. State a hypothesis: "When [fault], steady-state metric X stays within Y."
2. Identify the steady-state metric — must be measurable BEFORE the experiment.
3. Run blast_radius_calculator.py — confirm GREEN before proceeding.
4. Run experiment_designer.py to produce the plan.
5. Get a peer review of the plan; confirm abort criteria are concrete.
6. Notify the on-call team in #incidents (or whatever channel).
7. Run the experiment with monitoring open.
8. If abort criteria are hit, abort immediately; record what happened.
9. Run experiment_postmortem.py to capture learnings.
10. File follow-up actions; link to next experiment.
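
Steps 7 and 8 amount to a guarded loop: inject the fault, poll the steady-state metric, and stop the moment an abort criterion trips. The sketch below assumes the caller supplies hypothetical start_fault, stop_fault, and read_metrics callables, since the actual injection depends on the chaos tool in use. The thresholds mirror the example abort rule "p99 > 1000ms OR error_rate > baseline + 1pp".

```python
import time


def should_abort(metrics, baseline_error_rate,
                 p99_limit_ms=1000.0, error_margin=0.01):
    """True if either abort criterion is met (error rates as fractions)."""
    return (metrics["p99_ms"] > p99_limit_ms
            or metrics["error_rate"] > baseline_error_rate + error_margin)


def run_experiment(start_fault, stop_fault, read_metrics,
                   baseline_error_rate, duration_s, poll_s=1.0,
                   sleep=time.sleep):
    """Run a fault for duration_s seconds, aborting early if criteria trip."""
    start_fault()
    last_metrics = None
    elapsed = 0.0
    try:
        while elapsed < duration_s:
            last_metrics = read_metrics()
            if should_abort(last_metrics, baseline_error_rate):
                return {"aborted": True, "elapsed_s": elapsed,
                        "last_metrics": last_metrics}
            sleep(poll_s)
            elapsed += poll_s
        return {"aborted": False, "elapsed_s": elapsed,
                "last_metrics": last_metrics}
    finally:
        # Rollback always runs, whether the experiment finished or aborted.
        stop_fault()
```

Putting the rollback in a finally block means the fault is removed even if metric collection itself raises, which is the case most likely to turn an experiment into an outage.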

Workflow 2: Game Day exercise

1. Pick a scenario (e.g., "primary database fails over").
2. Identify all dependent services that should keep working.
3. Build a multi-experiment plan covering each layer.
4. Schedule with stakeholders; on-call coverage required.
5. Run with a facilitator who manages the scenario.
6. Capture observations in a shared doc as they happen.
7. Single combined postmortem covering all observations.
8. Track follow-up actions in a board with owners.

Workflow 3: Continuous chaos (game days → daily)

1. Start: weekly Game Day in staging.
2. Move to: weekly Game Day in production with limited blast radius.
3. Mature to: continuous chaos via scheduled experiments (Litmus chaos schedule, Gremlin scenarios).
4. Wire to deployment: every prod deploy triggers a baseline chaos sweep.
5. Track: experiments per week, weaknesses discovered, MTTR trend.

Composition with other skills

This skill explicitly composes with three others in this library:

| Skill | Composition |
| --- | --- |
| feature-flags-architect | Kill switches defined there are the abort triggers here |
| kubernetes-operator | Operators are common chaos targets (test reconcile under fault) |
| incident-response | Chaos experiments that escalate become incidents |

Anti-patterns

  • No hypothesis — "let's break things" is sabotage, not engineering
  • No steady-state metric — without a baseline, you can't tell if X broke
  • No blast radius bound — full-prod experiment without limits = outage
  • No abort criteria — see above; this is mandatory
  • No on-call coverage — chaos without monitoring is unmonitored production
  • Chaos in staging only — staging never has prod failure modes
  • Chaos in dev — useless; dev has different failure modes from prod
  • One-off chaos — single experiment is a press release; learning requires recurrence
  • Blame-laden postmortem — record causes, not blame; teams stop running chaos otherwise

References

  • references/chaos_principles.md: the 4 principles, history, when to start
  • references/experiment_design.md: hypothesis structure, steady-state metrics, abort criteria
  • references/attack_taxonomy.md: 7 attack types with examples and tooling
  • references/tooling_landscape.md: Chaos Toolkit / Mesh / Litmus / Gremlin / FIS / DIY

Slash command

/chaos-experiment: interactive experiment design wizard that runs all 3 tools.

Asset templates

  • assets/experiment_template.md: fill-in plan template
  • assets/postmortem_template.md: structured postmortem template

Verifiable success

A team using this skill should achieve:
  • 100% of chaos experiments have a written hypothesis, abort criteria, and blast-radius calculation
  • Blast radius for any single experiment never exceeds 10% of error budget
  • Mean time between chaos experiments <14 days (continuous, not one-off)
  • Each experiment produces ≥1 follow-up action that gets shipped
  • No chaos experiment escalates to a customer-impacting incident in trailing 90 days
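
The cadence target is the easiest of these to check mechanically. As a small sketch, assuming the team keeps a list of experiment run dates (the function names and the dates below are made up for illustration):

```python
from datetime import date


def mean_days_between(run_dates):
    """Mean gap in days between consecutive experiments, or None if < 2 runs."""
    ordered = sorted(run_dates)
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).days
            for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def meets_cadence(run_dates, max_mean_gap_days=14):
    """True if experiments recur with a mean gap under the target."""
    mean_gap = mean_days_between(run_dates)
    return mean_gap is not None and mean_gap < max_mean_gap_days


# Hypothetical run log.
runs = [date(2024, 1, 1), date(2024, 1, 8), date(2024, 1, 22)]
print(mean_days_between(runs), meets_cadence(runs))
```

A single run returns None and fails the cadence check, which matches the point that one-off chaos does not count as continuous.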