chaos-engineer


Chaos Engineer


Purpose


Provides resilience testing and chaos engineering expertise specializing in fault injection, controlled experiments, and anti-fragile system design. Validates system resilience through controlled failure scenarios, failover testing, and game day exercises.

When to Use


  • Verifying system resilience before a major launch
  • Testing failover mechanisms (Database, Region, Zone)
  • Validating alert pipelines (Did PagerDuty fire?)
  • Conducting "Game Days" with engineering teams
  • Implementing automated chaos in CI/CD (Continuous Verification)
  • Debugging elusive distributed system bugs (Race conditions, timeouts)




2. Decision Framework


Experiment Design Matrix


What are we testing?
├─ **Infrastructure Layer**
│  ├─ Pods/Containers? → **Pod Kill / Container Crash**
│  ├─ Nodes? → **Node Drain / Reboot**
│  └─ Network? → **Latency / Packet Loss / Partition**
├─ **Application Layer**
│  ├─ Dependencies? → **Block Access to DB/Redis**
│  ├─ Resources? → **CPU/Memory Stress**
│  └─ Logic? → **Inject HTTP 500 / Delays**
└─ **Platform Layer**
   ├─ IAM? → **Revoke Keys**
   └─ DNS? → **Block DNS Resolution**

Tool Selection


| Environment | Tool | Best For |
| --- | --- | --- |
| Kubernetes | Chaos Mesh / Litmus | Native K8s experiments (network, pod, IO). |
| AWS/Cloud | AWS FIS / Gremlin | Cloud-level faults (AZ outage, EC2 stop). |
| Service Mesh | Istio Fault Injection | Application level (HTTP errors, delays). |
| Java/Spring | Chaos Monkey for Spring | App-level logic attacks. |

Blast Radius Control


| Level | Scope | Risk | Approval Needed |
| --- | --- | --- | --- |
| Local/Dev | Single container | Low | None |
| Staging | Full cluster | Medium | QA Lead |
| Production (Canary) | 1% traffic | High | Engineering Director |
| Production (Full) | All traffic | Critical | VP/CTO (Game Day) |
Red Flags → Escalate to `sre-engineer`:
  • No "Stop Button" mechanism available
  • Observability gaps (Blind spots)
  • Cascading failure risk identified without mitigation
  • Lack of backups for stateful data experiments
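The approval matrix above can be enforced as a simple pre-flight gate before an experiment is applied. A minimal Python sketch, with level names and approver roles chosen for illustration (they mirror the table, not any specific tool):

```python
# Pre-flight gate mirroring the blast-radius table above.
# Level names and approver roles are illustrative.
BLAST_RADIUS = {
    "local-dev":   {"scope": "single container", "risk": "low",      "approver": None},
    "staging":     {"scope": "full cluster",     "risk": "medium",   "approver": "qa-lead"},
    "prod-canary": {"scope": "1% traffic",       "risk": "high",     "approver": "eng-director"},
    "prod-full":   {"scope": "all traffic",      "risk": "critical", "approver": "vp-cto"},
}

def can_run(level: str, approvals: set) -> bool:
    """Allow the experiment only if the required sign-off for this level is present."""
    required = BLAST_RADIUS[level]["approver"]
    return required is None or required in approvals
```

In practice a check like this would live in the pipeline or controller that applies the chaos manifest, so an unapproved production experiment never reaches the cluster.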




4. Core Workflows


Workflow 1: Kubernetes Pod Chaos (Chaos Mesh)


Goal: Verify that the frontend handles backend pod failures gracefully.
Steps:
  1. Define Experiment (`backend-kill.yaml`)

     ```yaml
     apiVersion: chaos-mesh.org/v1alpha1
     kind: PodChaos
     metadata:
       name: backend-kill
       namespace: chaos-testing
     spec:
       action: pod-kill
       mode: one
       selector:
         namespaces:
           - prod
         labelSelectors:
           app: backend-service
       duration: "30s"
       scheduler:
         cron: "@every 1m"
     ```
  2. Define Hypothesis
    • If a backend pod dies, then Kubernetes will restart it within 5 seconds, and the frontend will retry failed requests seamlessly (< 1% user-visible error rate).
  3. Execute & Monitor
    • Apply manifest.
    • Watch Grafana dashboard: "HTTP 500 Rate" vs "Pod Restart Count".
  4. Verification
    • Did the pod restart? Yes.
    • Did users see errors? No (Retries worked).
    • Result: PASS.
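The Step 4 pass/fail decision can be made mechanical rather than eyeballed from a dashboard. A minimal sketch, with thresholds taken from the hypothesis above; actually fetching the metrics from Prometheus/Grafana is left out:

```python
def verify_hypothesis(error_rate: float, restart_seconds: float,
                      max_error_rate: float = 0.01,
                      max_restart_seconds: float = 5.0) -> str:
    """PASS only if the pod came back within the SLO and user-visible
    errors stayed under the hypothesised threshold (< 1% here)."""
    ok = restart_seconds <= max_restart_seconds and error_rate < max_error_rate
    return "PASS" if ok else "FAIL"
```

Recording the verdict this way also gives CI a machine-readable result for automated chaos runs (the "Continuous Verification" use case above).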




Workflow 3: Zone Outage Simulation (Game Day)


Goal: Verify that the database fails over to the standby when the primary zone is lost.
Steps:
  1. Preparation
    • Notify on-call team (Game Day).
    • Ensure primary DB writes are active.
  2. Execution (AWS FIS / Manual)
    • Block network traffic to Zone A subnets.
    • OR Stop RDS Primary instance (Simulate crash).
  3. Measurement
    • Measure recovery time against the RTO (Recovery Time Objective): how long until the secondary is promoted to primary? (Target: < 60s).
    • Measure data loss against the RPO (Recovery Point Objective): were any committed writes lost? (Target: 0).
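Both measurements reduce to simple arithmetic once the failure and promotion events are timestamped. A minimal sketch, assuming illustrative inputs (transaction counters stand in for whatever position markers your database exposes):

```python
from datetime import datetime

def measure_rto(failed_at: datetime, promoted_at: datetime) -> float:
    """Recovery time in seconds: primary failure until the secondary is promoted."""
    return (promoted_at - failed_at).total_seconds()

def measure_rpo(committed_on_primary: int, present_on_secondary: int) -> int:
    """Lost writes: committed on the old primary but missing on the new one."""
    return max(0, committed_on_primary - present_on_secondary)
```

Logging both numbers against their targets (RTO < 60s, RPO = 0) turns the Game Day into a reproducible, comparable record rather than a one-off observation.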




5. Anti-Patterns & Gotchas


❌ Anti-Pattern 1: Testing in Production First


What it looks like:
  • Running a "delete database" script in prod without testing in staging.
Why it fails:
  • Catastrophic data loss.
  • Resume Generating Event (RGE).
Correct approach:
  • Dev → Staging → Canary → Prod.
  • Verify hypothesis in lower environments first.

❌ Anti-Pattern 2: No Observability


What it looks like:
  • Running chaos without dashboards open.
  • "I think it worked, the app is slow."
Why it fails:
  • You don't know why it failed.
  • You can't prove resilience.
Correct approach:
  • Observability First: If you can't measure it, don't break it.

❌ Anti-Pattern 3: Random Chaos (Chaos Monkey Style)


What it looks like:
  • Killing random things constantly without purpose.
Why it fails:
  • Causes alert fatigue.
  • Doesn't test specific failure modes (e.g., network partition vs crash).
Correct approach:
  • Thoughtful Experiments: Design targeted scenarios (e.g., "What if Redis is slow?"). Random chaos is for maintenance, targeted chaos is for verification.




7. Quality Checklist


Planning:
  • Hypothesis: Clearly defined ("If X happens, Y should occur").
  • Blast Radius: Limited (e.g., 1 zone, 1% users).
  • Approval: Stakeholders notified (or scheduled Game Day).
Safety:
  • Stop Button: Automated abort script ready.
  • Rollback: Plan to restore state if needed.
  • Backup: Data backed up before stateful experiments.
Execution:
  • Monitoring: Dashboards visible during experiment.
  • Logging: Experiment start/end times logged for correlation.
Review:
  • Fix: Action items assigned (Jira).
  • Report: Findings shared with engineering team.

Examples


Example 1: Kubernetes Pod Failure Recovery


Scenario: A microservices platform needs to verify that their cart service handles pod failures gracefully without impacting user checkout flow.
Experiment Design:
  1. Hypothesis: If a cart-service pod is killed, Kubernetes will reschedule it within 5 seconds, and users will see an error rate below 0.1%
  2. Chaos Injection: Use Chaos Mesh to kill random pods in the production namespace
  3. Monitoring: Track error rates, pod restart times, and user-facing failures
Execution Results:
  • Pod restart time: 3.2 seconds average (within SLA)
  • Error rate during experiment: 0.02% (below 0.1% threshold)
  • Circuit breakers prevented cascading failures
  • Users experienced seamless failover
Lessons Learned:
  • Retry logic was working but needed exponential backoff
  • Added fallback response for stale cart data
  • Created runbook for pod failure scenarios
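The first lesson learned, retries without exponential backoff, maps to a standard pattern. A minimal sketch, with `call` standing in for the cart-service request; jitter is added so that many clients do not retry in lockstep:

```python
import random
import time

def retry_with_backoff(call, attempts: int = 5,
                       base_delay: float = 0.1, cap: float = 2.0):
    """Retry `call` with exponential backoff plus jitter; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids retry storms
```

A chaos experiment like the one above is exactly how this gap surfaced: fixed-interval retries kept the error rate low but hammered the recovering pod.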

Example 2: Database Failover Validation


Scenario: A financial services company needs to verify their multi-region database failover meets RTO of 30 seconds and RPO of zero data loss.
Game Day Setup:
  1. Preparation: Notified all stakeholders, backed up current state
  2. Primary Zone Blockage: Used AWS FIS to simulate zone failure
  3. Failover Trigger: Automated failover initiated when health checks failed
  4. Measurement: Tracked RTO, RPO, and application recovery
Measured Results:
| Metric | Target | Actual | Status |
| --- | --- | --- | --- |
| RTO | < 30s | 18s | ✅ PASS |
| RPO | 0 data loss | 0 data loss | ✅ PASS |
| Application recovery | < 60s | 42s | ✅ PASS |
| Data consistency | 100% | 100% | ✅ PASS |
Improvements Identified:
  • DNS TTL was too high (5 minutes), reduced to 30 seconds
  • Application connection pooling needed pre-warming
  • Added health check for database replication lag
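The replication-lag health check from the last improvement can be sketched as a threshold on the byte gap between primary and replica positions. This assumes a Postgres-style LSN model; the names and the 1 MB threshold are illustrative:

```python
def replication_healthy(primary_lsn: int, replica_lsn: int,
                        max_lag_bytes: int = 1_000_000) -> bool:
    """Health check: the replica's byte lag behind the primary must stay under
    the threshold before it is trusted as a failover target."""
    return (primary_lsn - replica_lsn) <= max_lag_bytes
```

Gating failover on this check protects the RPO target: promoting a replica that is far behind would silently discard committed writes.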

Example 3: Third-Party API Dependency Testing


Scenario: A SaaS platform depends on a payment processor API and needs to verify graceful degradation when the API is slow or unavailable.
Fault Injection Strategy:
  1. Delay Injection: Using Istio to add 5-10 second delays to payment API calls
  2. Timeout Validation: Verify circuit breakers open within configured timeouts
  3. Fallback Testing: Ensure users see appropriate error messages
Test Scenarios:
  • 50% of requests delayed 10s: Circuit breaker opens, fallback shown
  • 100% delay: System degrades gracefully with queue-based processing
  • Recovery: System reconnects properly after fault cleared
Results:
  • Circuit breaker threshold: 5 consecutive failures (needed adjustment)
  • Fallback UI: 94% of users completed purchase via alternative method
  • Alert tuning: Reduced false positives by tuning latency thresholds
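The consecutive-failure breaker described in the results can be sketched as follows. The 5-failure threshold matches the value noted above; half-open probing and recovery timers are deliberately omitted from this sketch:

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; a success closes it again.
    Half-open probing and timers are omitted from this sketch."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1
```

Delay injection is the right way to tune `threshold`: too low and a brief latency blip trips the breaker, too high and users wait through many slow calls before the fallback appears.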

Best Practices


Experiment Design


  • Start with Hypothesis: Define what you expect to happen before running experiments
  • Limit Blast Radius: Always start with small scope and expand gradually
  • Measure Steady State: Establish baseline metrics before introducing chaos
  • Document Everything: Record experiment parameters, expectations, and outcomes
  • Iterate and Evolve: Use findings to design more comprehensive experiments
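"Measure Steady State" becomes actionable as a tolerance check of observed metrics against the baseline. A minimal sketch, with metric names and the 5% tolerance chosen purely for illustration:

```python
def within_steady_state(baseline: dict, observed: dict,
                        tolerance: float = 0.05) -> bool:
    """True if every observed metric is within `tolerance` (fractional) of its baseline."""
    for name, base in baseline.items():
        value = observed[name]
        if base == 0:
            if value != 0:
                return False
        elif abs(value - base) / abs(base) > tolerance:
            return False
    return True
```

Run the same check before injecting chaos (to confirm the system is actually in steady state) and after (to detect drift the experiment caused).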

Safety and Controls


  • Always Have a Stop Button: Can you abort the experiment immediately?
  • Define Rollback Plan: How do you restore normal operations?
  • Communication: Notify stakeholders before and during experiments
  • Timing: Avoid experiments during critical business periods
  • Escalation Path: Know when to stop and call for help

Tool Selection


  • Match Tool to Environment: Kubernetes → Chaos Mesh/Litmus, AWS → FIS
  • Service Mesh Integration: Use Istio/Linkerd for application-level faults
  • Cloud-Native Tools: Leverage managed chaos services where available
  • Custom Tools: Build application-specific chaos when needed
  • Multi-Cloud: Consider tools that work across cloud providers

Observability Integration


  • Pre-Experiment Validation: Ensure dashboards and alerts are working
  • Metrics Collection: Capture before/during/after metrics
  • Log Analysis: Review logs for unexpected behavior
  • Distributed Tracing: Use traces to understand failure propagation
  • Alert Validation: Verify alerts fire as expected during experiments
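"Alert Validation" can likewise be automated by querying the alert history for the experiment window. A minimal sketch over an in-memory alert list; in practice you would query Alertmanager, PagerDuty, or your equivalent:

```python
from datetime import datetime

def alert_fired_during(alerts: list, name: str,
                       start: datetime, end: datetime) -> bool:
    """True if an alert with the given name fired inside the experiment window."""
    return any(a["name"] == name and start <= a["fired_at"] <= end for a in alerts)
```

This answers the "Did PagerDuty fire?" question from the use cases above with evidence rather than recollection.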

Cultural Aspects


  • Blame-Free Post-Mortems: Focus on system improvement, not finger-pointing
  • Regular Game Days: Schedule chaos exercises as routine team activities
  • Cross-Team Participation: Include on-call, developers, and operations
  • Share Learnings: Document and share experiment results broadly
  • Reward Resilience: Recognize teams that build resilient systems