chaos-engineering
Chaos Engineering
Design experiments that surface real weaknesses in production systems — without becoming outages. Most "chaos engineering" attempts skip steady-state measurement, define no abort criteria, and have no blast-radius bound. This skill enforces the discipline that makes chaos experiments safe and useful.
When to use
- Planning a chaos experiment (what to break, where, when, how to abort)
- Calculating blast radius before running the experiment
- Reviewing an existing experiment plan for safety
- Choosing a chaos tool (Chaos Toolkit / Chaos Mesh / Litmus / Gremlin / AWS FIS)
- Writing a chaos experiment postmortem
- Running a Game Day exercise
When NOT to use
- General incident response (use `incident-response`)
- Threat hunting / red-team (use `threat-detection`, red-team)
- Performance load testing (different goal: chaos is about failure modes, not capacity)
- Production debugging (chaos discovers weaknesses preemptively, not after-the-fact)
Core principle: chaos without abort criteria is an outage
The 4 Principles of Chaos Engineering (Netflix, 2016):
- Build a hypothesis around steady-state behavior. Not "what breaks?" but "X holds; will it still hold under fault Y?"
- Vary real-world events. Inject realistic failures: kill nodes, slow networks, lose cache, throttle dependencies.
- Run experiments in production. Staging never has the same failure modes. Start small.
- Automate experiments to run continuously. One-off chaos is a press release; continuous chaos is engineering.
Add a fifth: Define abort criteria up front. A chaos experiment with no abort criteria is an outage by another name.
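One way to make the hypothesis and the abort criteria concrete is to write both as checks over the same metrics. The sketch below is illustrative only: the `Metrics` fields, the sample readings, and the function names are assumptions, not part of this skill's scripts. The thresholds mirror the example hypothesis and abort condition used elsewhere in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """A point-in-time reading of the steady-state metrics (hypothetical fields)."""
    p99_latency_ms: float
    error_rate: float  # fraction of failing requests, e.g. 0.002 = 0.2%


def hypothesis_holds(m: Metrics) -> bool:
    # Hypothesis: "p99 latency stays <500ms" while the fault is injected.
    return m.p99_latency_ms < 500


def should_abort(m: Metrics, baseline: Metrics) -> bool:
    # Abort criteria: "p99 > 1000ms OR error_rate > baseline + 1pp".
    # One percentage point is 0.01 when error rates are stored as fractions.
    return m.p99_latency_ms > 1000 or m.error_rate > baseline.error_rate + 0.01


# Made-up readings for illustration.
baseline = Metrics(p99_latency_ms=320, error_rate=0.002)
during_fault = Metrics(p99_latency_ms=640, error_rate=0.004)

print(hypothesis_holds(during_fault))        # False: the hypothesis is refuted
print(should_abort(during_fault, baseline))  # False: still inside the abort bounds
```

The gap between the two checks is where the learning happens: the hypothesis can be refuted well before the experiment becomes dangerous enough to abort.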
Quick start
```bash
SKILL=engineering/chaos-engineering/skills/chaos-engineering

# 1. Design an experiment
python "$SKILL/scripts/experiment_designer.py" --target "checkout-svc" --hypothesis "p99 latency stays <500ms" --attack latency --duration-min 15

# 2. Calculate blast radius
python "$SKILL/scripts/blast_radius_calculator.py" --traffic-share 0.05 --user-pop 1000000 --duration-min 15

# 3. Generate postmortem after the experiment
python "$SKILL/scripts/experiment_postmortem.py" --plan experiment.json --result-log results.txt
```
The 3 Python tools
All stdlib-only. Run with `--help`.
experiment_designer.py
Generates a structured experiment plan from inputs. Enforces the required sections (hypothesis, steady-state metric, blast radius, abort criteria, rollback).
```bash
python scripts/experiment_designer.py \
  --target "checkout-svc" \
  --hypothesis "p99 latency stays <500ms when payment-svc is slow" \
  --attack latency \
  --magnitude "+200ms" \
  --duration-min 15 \
  --blast-radius "5% of US traffic" \
  --abort-if "p99 > 1000ms OR error_rate > baseline + 1pp"
```
Outputs a markdown plan with: hypothesis, steady-state, attack, magnitude, duration, blast radius, abort criteria, rollback procedure, monitoring dashboards, and learning question.
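The "required sections" rule can be pictured as a simple completeness check on the plan. This is a minimal sketch, not the script's actual implementation: the dictionary keys and the `missing_sections` helper are hypothetical names chosen for illustration.

```python
# Hypothetical key names for the five required sections.
REQUIRED_SECTIONS = (
    "hypothesis",
    "steady_state_metric",
    "blast_radius",
    "abort_criteria",
    "rollback",
)


def missing_sections(plan: dict) -> list[str]:
    """Return the required sections that are absent or blank in the plan."""
    return [
        key for key in REQUIRED_SECTIONS
        if not str(plan.get(key) or "").strip()
    ]


plan = {
    "hypothesis": "p99 latency stays <500ms when payment-svc is slow",
    "steady_state_metric": "checkout-svc p99 latency",
    "blast_radius": "5% of US traffic",
    "abort_criteria": "",  # blank on purpose: the check should flag it
}

print(missing_sections(plan))  # ['abort_criteria', 'rollback']
```

A plan with any missing section should be rejected before review rather than patched during the experiment.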
blast_radius_calculator.py
Computes the blast radius of a planned experiment. Given traffic share + user population + duration, calculates expected affected users, expected error budget burn, and a risk score.
```bash
python scripts/blast_radius_calculator.py \
  --traffic-share 0.05 \
  --user-pop 1000000 \
  --duration-min 15 \
  --baseline-availability 0.999 \
  --expected-impact-availability 0.95
```
Outputs:
- Expected affected users
- Error budget consumed (in minutes of error budget)
- Risk score: GREEN / YELLOW / RED
- Recommendation: PROCEED / REDUCE / ABORT
GREEN = <1% error budget; YELLOW = 1-10%; RED = >10%.
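The arithmetic behind these outputs can be sketched as follows. The formulas are assumptions, not taken from the script: the error budget is computed over a 30-day window, and budget burn is modelled as duration × traffic share × (1 − expected availability during the experiment). Only the GREEN / YELLOW / RED thresholds come from the text above.

```python
def blast_radius(traffic_share, user_pop, duration_min,
                 baseline_availability, expected_impact_availability,
                 window_days=30):
    """Estimate affected users and error-budget burn for one experiment.

    Assumed model (not the script's confirmed formula): the budget is the
    allowed downtime over `window_days`, and the burn is the equivalent
    minutes of full outage caused by the experiment.
    """
    affected_users = round(traffic_share * user_pop)
    budget_min = (1 - baseline_availability) * window_days * 24 * 60
    burned_min = duration_min * traffic_share * (1 - expected_impact_availability)
    burn_pct = 100 * burned_min / budget_min
    if burn_pct < 1:
        risk = "GREEN"
    elif burn_pct <= 10:
        risk = "YELLOW"
    else:
        risk = "RED"
    return affected_users, burned_min, burn_pct, risk


users, burned, pct, risk = blast_radius(0.05, 1_000_000, 15, 0.999, 0.95)
print(f"{users} users, {burned:.4f} budget-minutes, {pct:.2f}% of budget, {risk}")
```

With the example inputs, 5% of 1,000,000 users is 50,000 affected users, and 15 × 0.05 × 0.05 = 0.0375 minutes is burned out of a 43.2-minute monthly budget, which lands in GREEN under this model.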
experiment_postmortem.py
Produces a structured postmortem from an experiment plan + results. Catches the common postmortem failure modes: no learning recorded, no follow-up actions, blame-laden language.
```bash
python scripts/experiment_postmortem.py --plan experiment.json --result-log results.txt
```
Outputs markdown with: summary, hypothesis (was it confirmed/refuted?), what we learned, what surprised us, follow-up actions with owners, and link to next experiment.
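The three failure modes lend themselves to mechanical checks. The sketch below shows one possible shape; the function name, the input structure, and especially the blame phrase list are assumptions for illustration and do not describe how the script detects them.

```python
import re

# Hypothetical phrase list; a real check would be tuned by the team.
BLAME_PATTERNS = re.compile(
    r"\b(should have|failed to|careless|negligen\w*|fault of)\b",
    re.IGNORECASE,
)


def postmortem_problems(learnings: list[str], actions: list[dict], text: str) -> list[str]:
    """Return the postmortem failure modes found in a draft."""
    problems = []
    if not learnings:
        problems.append("no learning recorded")
    if not actions:
        problems.append("no follow-up actions")
    elif any(not a.get("owner") for a in actions):
        problems.append("follow-up action without an owner")
    if BLAME_PATTERNS.search(text):
        problems.append("blame-laden language")
    return problems


draft_problems = postmortem_problems(
    learnings=[],
    actions=[{"title": "Add timeout to payment client"}],
    text="The on-call engineer failed to notice the alert.",
)
print(draft_problems)
```

A keyword match is only a prompt for a human rewrite; it cannot judge tone on its own.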
The 7 attack types (taxonomy)
Different attacks reveal different weaknesses. See `references/attack_taxonomy.md` for full detail.
| Attack | What it tests | Tooling |
|---|---|---|
| Latency | Timeouts, retries, circuit breakers | tc, Chaos Mesh |
| Error | Error handling, fallback paths | Chaos Mesh |
| Resource (CPU, memory, disk) | Saturation handling, autoscaling | Chaos Mesh |
| Network partition | Split-brain, consensus, failover | Chaos Mesh |
| Dependency failure | Graceful degradation, fallback | Service mesh fault injection |
| Time | Clock skew, NTP issues | libfaketime, Chaos Mesh |
| Infrastructure (kill instance) | Auto-recovery, failover | AWS FIS, Chaos Monkey |
Pick the attack that matches the hypothesis. "What happens if X is slow?" → latency. "What happens if X loses network?" → partition.
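To see what a latency attack actually exercises, the sketch below injects delay into a dependency call in-process and checks whether the caller's timeout and fallback path engage. It is a toy stand-in for the tools in the table, not a replacement: `payment_status`, the delay, and the timeout values are all made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def with_latency(func, delay_s):
    """Wrap func so every call is delayed by delay_s seconds (the injected fault)."""
    def slowed(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return slowed


def call_with_timeout(func, timeout_s, fallback):
    """Call func in a worker thread; return fallback if it exceeds timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; let the worker finish in the background.
        pool.shutdown(wait=False)


def payment_status():  # hypothetical dependency call
    return "authorized"


slow_payment = with_latency(payment_status, delay_s=0.5)

print(call_with_timeout(payment_status, timeout_s=0.1, fallback="degraded"))
print(call_with_timeout(slow_payment, timeout_s=0.1, fallback="degraded"))
```

The first call returns the real result; the second exceeds the timeout and returns the fallback, which is the behavior a latency hypothesis would assert.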
Tooling chooser
| Tool | Best for | Pricing | Stack |
|---|---|---|---|
| Chaos Toolkit | Lightweight, language-agnostic, JSON experiments | OSS | Any |
| Chaos Mesh | Kubernetes-native, rich CRDs, in-cluster | OSS | Kubernetes |
| Litmus | Kubernetes, Argo-integrated, large library | OSS + Enterprise | Kubernetes |
| Gremlin | Enterprise SaaS, multi-cloud, audit | Paid | Any |
| AWS FIS | AWS-native, IAM-integrated, EC2/ECS/EKS | Paid (AWS) | AWS |
| Custom | Niche needs, single-cloud, low budget | None | Any |
Decision rules:
- k8s-only stack + OSS → Chaos Mesh or Litmus (Litmus has bigger experiment library)
- Multi-cloud + OSS → Chaos Toolkit
- AWS-heavy + simple needs → AWS FIS
- Enterprise + audit/compliance → Gremlin
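The decision rules can be encoded as a small lookup, applied in the order listed. This is a sketch under assumptions: the parameter names are hypothetical, and returning an empty list when no rule matches is a choice made here, not something the rules specify.

```python
def choose_tools(kubernetes_only=False, multi_cloud=False, aws_heavy=False,
                 needs_audit=False, oss_only=False):
    """Return candidate tools by applying the decision rules in listed order."""
    if kubernetes_only and oss_only:
        return ["Chaos Mesh", "Litmus"]
    if multi_cloud and oss_only:
        return ["Chaos Toolkit"]
    if aws_heavy:
        return ["AWS FIS"]
    if needs_audit:
        return ["Gremlin"]
    return []  # no rule matched; compare trade-offs manually


print(choose_tools(kubernetes_only=True, oss_only=True))  # ['Chaos Mesh', 'Litmus']
print(choose_tools(needs_audit=True))                     # ['Gremlin']
```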
See `references/tooling_landscape.md` for trade-offs.
Workflows
Workflow 1: Design and run a single experiment
1. State a hypothesis: "When [fault], steady-state metric X stays within Y."
2. Identify the steady-state metric — must be measurable BEFORE the experiment.
3. Run blast_radius_calculator.py — confirm GREEN before proceeding.
4. Run experiment_designer.py to produce the plan.
5. Get a peer review of the plan; confirm abort criteria are concrete.
6. Notify the on-call team in #incidents (or whatever channel).
7. Run the experiment with monitoring open.
8. If abort criteria are hit, abort immediately; record what happened.
9. Run experiment_postmortem.py to capture learnings.
10. File follow-up actions; link to next experiment.
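Steps 7 and 8 amount to a loop that polls the abort check while the fault is active and always rolls back. A minimal sketch, assuming the fault injection, rollback, and abort check are supplied as callables (all hypothetical names, not part of this skill's scripts):

```python
import time
from typing import Callable


def run_experiment(inject: Callable[[], None],
                   rollback: Callable[[], None],
                   abort_hit: Callable[[], bool],
                   duration_s: float,
                   poll_s: float = 1.0) -> str:
    """Inject a fault, poll the abort check, and always roll back."""
    try:
        inject()
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            if abort_hit():
                return "aborted"
            time.sleep(poll_s)
        return "completed"
    finally:
        rollback()  # runs on completion, abort, and unexpected exceptions


# Simulated run: the third abort check reports that the criteria were hit.
events = []
readings = iter([False, False, True])
outcome = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    abort_hit=lambda: next(readings),
    duration_s=5,
    poll_s=0.01,
)
print(outcome, events)  # aborted ['inject', 'rollback']
```

Putting the rollback in `finally` means the rollback procedure must be safe to run even when injection only partly succeeded.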
Workflow 2: Game Day exercise
1. Pick a scenario (e.g., "primary database fails over").
2. Identify all dependent services that should keep working.
3. Build a multi-experiment plan covering each layer.
4. Schedule with stakeholders; on-call coverage required.
5. Run with a facilitator who manages the scenario.
6. Capture observations in a shared doc as they happen.
7. Single combined postmortem covering all observations.
8. Track follow-up actions in a board with owners.
Workflow 3: Continuous chaos (game days → daily)
1. Start: weekly Game Day in staging.
2. Move to: weekly Game Day in production with limited blast radius.
3. Mature to: continuous chaos via scheduled experiments (Litmus chaos schedule, Gremlin scenarios).
4. Wire to deployment: every prod deploy triggers a baseline chaos sweep.
5. Track: experiments per week, weaknesses discovered, MTTR trend.
Composition with other skills
This skill explicitly composes with three others in this library:
| Skill | Composition |
|---|---|
| | Kill switches defined there are the abort triggers here |
| | Operators are common chaos targets (test reconcile under fault) |
| | Chaos experiments that escalate become incidents |
Anti-patterns
- No hypothesis — "let's break things" is sabotage, not engineering
- No steady-state metric — without a baseline, you can't tell if X broke
- No blast radius bound — full-prod experiment without limits = outage
- No abort criteria — see above; this is mandatory
- No on-call coverage — chaos without monitoring is unmonitored production
- Chaos in staging only — staging never has prod failure modes
- Chaos in dev — useless; dev has different failure modes from prod
- One-off chaos — single experiment is a press release; learning requires recurrence
- Blame-laden postmortem — record causes, not blame; teams stop running chaos otherwise
References
- `references/chaos_principles.md`: the 4 principles, history, when to start
- `references/experiment_design.md`: hypothesis structure, steady-state metrics, abort criteria
- `references/attack_taxonomy.md`: 7 attack types with examples and tooling
- `references/tooling_landscape.md`: Chaos Toolkit / Mesh / Litmus / Gremlin / FIS / DIY
Slash command
/chaos-experiment
Asset templates
- `assets/experiment_template.md`: fill-in plan template
- `assets/postmortem_template.md`: structured postmortem template
Verifiable success
A team using this skill should achieve:
- 100% of chaos experiments have a written hypothesis, abort criteria, and blast-radius calculation
- Blast radius for any single experiment never exceeds 10% of error budget
- Mean time between chaos experiments <14 days (continuous, not one-off)
- Each experiment produces ≥1 follow-up action that gets shipped
- No chaos experiment escalates to a customer-impacting incident in trailing 90 days
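The cadence target can be checked from a list of run dates. A minimal sketch, assuming experiment dates are recorded somewhere the team can export; the dates below are made-up sample data and `mean_gap_days` is a hypothetical helper.

```python
from datetime import date


def mean_gap_days(run_dates: list[date]) -> float:
    """Mean number of days between consecutive experiments."""
    ordered = sorted(run_dates)
    if len(ordered) < 2:
        raise ValueError("need at least two experiments to compute a gap")
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Illustrative dates only.
runs = [date(2024, 3, 4), date(2024, 3, 15), date(2024, 3, 27), date(2024, 4, 9)]
gap = mean_gap_days(runs)
print(f"{gap:.1f} days between experiments, target met: {gap < 14}")
```

Here the gaps are 11, 12, and 13 days, so the mean is 12.0 and the <14-day target is met.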