chaos-engineer

Expert in resilience testing, fault injection, and building anti-fragile systems using controlled experiments.

12 installs

Source: 404kidwiz/claude-supercode-skills

Added on: 2026-02-07

NPX Install

npx skill4agent add 404kidwiz/claude-supercode-skills chaos-engineer

SKILL.md Content

Chaos Engineer

Purpose

Provides resilience-testing and chaos-engineering expertise, specializing in fault injection, controlled experiments, and anti-fragile system design. Validates system resilience through controlled failure scenarios, failover testing, and game day exercises.

When to Use

Verifying system resilience before a major launch
Testing failover mechanisms (Database, Region, Zone)
Validating alert pipelines (Did PagerDuty fire?)
Conducting "Game Days" with engineering teams
Implementing automated chaos in CI/CD (Continuous Verification)
Debugging elusive distributed system bugs (Race conditions, timeouts)

2. Decision Framework

Experiment Design Matrix

What are we testing?
│
├─ **Infrastructure Layer**
│  ├─ Pods/Containers? → **Pod Kill / Container Crash**
│  ├─ Nodes? → **Node Drain / Reboot**
│  └─ Network? → **Latency / Packet Loss / Partition**
│
├─ **Application Layer**
│  ├─ Dependencies? → **Block Access to DB/Redis**
│  ├─ Resources? → **CPU/Memory Stress**
│  └─ Logic? → **Inject HTTP 500 / Delays**
│
└─ **Platform Layer**
   ├─ IAM? → **Revoke Keys**
   └─ DNS? → **Block DNS Resolution**
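
The matrix above can also be kept as data so that tooling and reviewers pick from the same list. The following is a minimal sketch; `EXPERIMENT_MATRIX` and `suggest_fault` are hypothetical names introduced here for illustration, not part of any chaos tool.

```python
# Hypothetical encoding of the Experiment Design Matrix; the keys are illustrative.
EXPERIMENT_MATRIX = {
    "infrastructure": {
        "pods": "Pod Kill / Container Crash",
        "nodes": "Node Drain / Reboot",
        "network": "Latency / Packet Loss / Partition",
    },
    "application": {
        "dependencies": "Block Access to DB/Redis",
        "resources": "CPU/Memory Stress",
        "logic": "Inject HTTP 500 / Delays",
    },
    "platform": {
        "iam": "Revoke Keys",
        "dns": "Block DNS Resolution",
    },
}


def suggest_fault(layer: str, target: str) -> str:
    """Return the fault type the matrix suggests for a layer/target pair."""
    try:
        return EXPERIMENT_MATRIX[layer.lower()][target.lower()]
    except KeyError:
        # Unknown combinations are rejected rather than guessed.
        raise ValueError(f"no experiment defined for {layer!r}/{target!r}") from None


print(suggest_fault("Application", "Logic"))
```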

Tool Selection

| Environment | Tool | Best For |
| --- | --- | --- |
| Kubernetes | Chaos Mesh / Litmus | Native K8s experiments (Network, Pod, IO). |
| AWS/Cloud | AWS FIS / Gremlin | Cloud-level faults (AZ outage, EC2 stop). |
| Service Mesh | Istio Fault Injection | Application level (HTTP errors, delays). |
| Java/Spring | Chaos Monkey for Spring | App-level logic attacks. |

Blast Radius Control

| Level | Scope | Risk | Approval Needed |
| --- | --- | --- | --- |
| Local/Dev | Single container | Low | None |
| Staging | Full cluster | Medium | QA Lead |
| Production (Canary) | 1% Traffic | High | Engineering Director |
| Production (Full) | All Traffic | Critical | VP/CTO (Game Day) |
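
One way to enforce these levels is a small gate that refuses to start an experiment without the listed approval. This is a sketch under assumptions: the level keys (`local`, `staging`, `prod-canary`, `prod-full`) and the `may_run` helper are made up for illustration, and a real setup would verify approvals against a ticketing or identity system rather than compare strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlastRadius:
    scope: str
    risk: str
    approver: Optional[str]  # None means no approval is needed


# Values mirror the Blast Radius Control table; the keys are hypothetical.
LEVELS = {
    "local": BlastRadius("Single container", "Low", None),
    "staging": BlastRadius("Full cluster", "Medium", "QA Lead"),
    "prod-canary": BlastRadius("1% Traffic", "High", "Engineering Director"),
    "prod-full": BlastRadius("All Traffic", "Critical", "VP/CTO (Game Day)"),
}


def may_run(level: str, approved_by: Optional[str] = None) -> bool:
    """Allow the experiment only if the level's required approver signed off."""
    required = LEVELS[level].approver
    return required is None or approved_by == required


print(may_run("local"))                 # no approval required
print(may_run("prod-canary"))           # blocked until approved
print(may_run("staging", "QA Lead"))    # approved by the listed role
```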

Red Flags → Escalate to sre-engineer:

No "Stop Button" mechanism available
Observability gaps (Blind spots)
Cascading failure risk identified without mitigation
Lack of backups for stateful data experiments
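
The red flags above lend themselves to a pre-flight check that runs before any fault is injected. The sketch below assumes a hypothetical `Preflight` record filled in by whoever plans the experiment; the field names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Preflight:
    has_stop_button: bool
    has_observability_gaps: bool
    unmitigated_cascade_risk: bool
    touches_stateful_data: bool
    backup_taken: bool


def red_flags(p: Preflight) -> List[str]:
    """Return every red flag that applies; an empty list means none were found."""
    flags = []
    if not p.has_stop_button:
        flags.append('No "Stop Button" mechanism available')
    if p.has_observability_gaps:
        flags.append("Observability gaps (Blind spots)")
    if p.unmitigated_cascade_risk:
        flags.append("Cascading failure risk identified without mitigation")
    if p.touches_stateful_data and not p.backup_taken:
        flags.append("Lack of backups for stateful data experiments")
    return flags


plan = Preflight(True, False, False, True, False)
for flag in red_flags(plan):
    print("Escalate to sre-engineer:", flag)
```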

4. Core Workflows

Workflow 1: Kubernetes Pod Chaos (Chaos Mesh)

Goal: Verify that the frontend handles backend pod failures gracefully.

Steps:

Define Experiment (backend-kill.yaml)

yaml

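apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: backend-kill
  namespace: chaos-testing   # namespace that holds the experiment object
spec:
  action: pod-kill           # delete the selected pod so its controller recreates it
  mode: one                  # pick one matching pod at random
  selector:
    namespaces:
      - prod                 # target namespace; verify in staging first
    labelSelectors:
      app: backend-service
  duration: "30s"
  # The inline scheduler field is Chaos Mesh 1.x syntax; Chaos Mesh 2.x moves
  # recurring runs to a separate Schedule resource.
  scheduler:
    cron: "@every 1m"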

Define Hypothesis
- If a backend pod dies, then Kubernetes will bring up a replacement within 5 seconds, and the frontend will retry failed (HTTP 500) requests transparently, keeping the user-facing error rate below 1%.
Execute & Monitor
- Apply manifest.
- Watch Grafana dashboard: "HTTP 500 Rate" vs "Pod Restart Count".
Verification
- Did the pod restart? Yes.
- Did users see errors? No (Retries worked).
- Result: PASS.
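
The verification step can be made mechanical by turning the hypothesis into a check over measured values. This is a minimal sketch: `evaluate_hypothesis` is a hypothetical helper, and the request counts and restart time would come from your own metrics (for example the Grafana panels named above). The defaults mirror the hypothesis: under 1% errors and a restart within 5 seconds.

```python
def evaluate_hypothesis(
    total_requests: int,
    failed_requests: int,
    restart_seconds: float,
    max_error_rate: float = 0.01,      # hypothesis: error rate below 1%
    max_restart_seconds: float = 5.0,  # hypothesis: restart within 5 seconds
) -> str:
    """Return "PASS" only if both parts of the hypothesis hold."""
    if total_requests <= 0:
        raise ValueError("need at least one request to compute an error rate")
    error_rate = failed_requests / total_requests
    ok = error_rate < max_error_rate and restart_seconds <= max_restart_seconds
    return "PASS" if ok else "FAIL"


# Illustrative numbers only, not measurements from a real run.
print(evaluate_hypothesis(total_requests=10_000, failed_requests=20, restart_seconds=3.2))
```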

Workflow 3: Zone Outage Simulation (Game Day)

Goal: Verify database failover to the standby in a secondary zone.

Steps:

Preparation
- Notify on-call team (Game Day).
- Ensure primary DB writes are active.
Execution (AWS FIS / Manual)
- Block network traffic to Zone A subnets.
- OR Stop RDS Primary instance (Simulate crash).
Measurement
- Measure recovery time against the RTO (Recovery Time Objective): How long until Secondary becomes Primary? (Target: < 60s).
- Measure data loss against the RPO (Recovery Point Objective): Was any data lost? (Target: 0).
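
The measurement step could be scripted roughly as follows. This sketch assumes you record the moment the fault was injected and the moment the first write succeeds against the new primary, and that every acknowledged write carries an ID you can look up after failover; all function names and the sample values are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Set


def recovery_time(fault_at: datetime, recovered_at: datetime) -> timedelta:
    """Elapsed time between fault injection and the first successful write."""
    if recovered_at < fault_at:
        raise ValueError("recovery cannot precede the fault")
    return recovered_at - fault_at


def lost_writes(acknowledged: Set[str], found_after_failover: Set[str]) -> Set[str]:
    """IDs of writes that were acknowledged but are missing on the new primary."""
    return acknowledged - found_after_failover


def meets_objectives(elapsed: timedelta, lost: Set[str],
                     rto: timedelta = timedelta(seconds=60)) -> bool:
    """Targets from the workflow: recovery in under 60s and zero lost writes."""
    return elapsed < rto and not lost


# Made-up sample values for illustration.
elapsed = recovery_time(datetime(2024, 1, 1, 12, 0, 0), datetime(2024, 1, 1, 12, 0, 42))
lost = lost_writes({"w1", "w2"}, {"w1", "w2", "w3"})
print(elapsed.total_seconds(), lost, meets_objectives(elapsed, lost))
```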

5. Anti-Patterns & Gotchas

❌ Anti-Pattern 1: Testing in Production First

What it looks like:

Running a "delete database" script in prod without testing in staging.

Why it fails:

Catastrophic data loss.
Resume Generating Event (RGE).

Correct approach:

Dev → Staging → Canary → Prod.
Verify hypothesis in lower environments first.

❌ Anti-Pattern 2: No Observability

What it looks like:

Running chaos without dashboards open.
"I think it worked, the app is slow."

Why it fails:

You don't know why it failed.
You can't prove resilience.

Correct approach:

Observability First: If you can't measure it, don't break it.

❌ Anti-Pattern 3: Random Chaos (Chaos Monkey Style)

What it looks like:

Killing random things constantly without purpose.

Why it fails:

Causes alert fatigue.
Doesn't test specific failure modes (e.g., network partition vs crash).

Correct approach:

Thoughtful Experiments: Design targeted scenarios (e.g., "What if Redis is slow?"). Random chaos is for maintenance; targeted chaos is for verification.

7. Quality Checklist

Planning:

Hypothesis: Clearly defined ("If X happens, Y should occur").
Blast Radius: Limited (e.g., 1 zone, 1% users).
Approval: Stakeholders notified (or scheduled Game Day).

Safety:

Stop Button: Automated abort script ready.
Rollback: Plan to restore state if needed.
Backup: Data backed up before stateful experiments.
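
An automated stop button can be as simple as a loop that polls one health signal and rolls back as soon as it crosses a limit. The sketch below is tool-agnostic: `inject`, `rollback`, and `error_rate` are callables you would supply (for example, wrappers that apply and delete a chaos manifest and query your metrics backend), and every name here is hypothetical.

```python
import time
from typing import Callable


def run_with_stop_button(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    error_rate: Callable[[], float],
    abort_above: float,
    duration_s: float,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run an experiment; return True if it completed, False if it was aborted.

    Rollback always runs, whether the experiment finishes, aborts, or raises.
    """
    try:
        inject()
        elapsed = 0.0
        while elapsed < duration_s:
            if error_rate() > abort_above:
                return False  # stop button: abort as soon as the limit is crossed
            sleep(poll_s)
            elapsed += poll_s
        return True
    finally:
        rollback()
```

Passing `sleep` as a parameter keeps the loop testable without waiting in real time.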

Execution:

Monitoring: Dashboards visible during experiment.
Logging: Experiment start/end times logged for correlation.
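
For the logging item, a context manager is one lightweight way to guarantee that both a start and an end record exist, even if the experiment raises. This is a sketch; `experiment_window` and the JSON field names are assumptions, and `log` would normally be your structured logger rather than `print`.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def experiment_window(name, log=print, clock=time.time):
    """Emit JSON start/end records so dashboards can be correlated with the run."""
    start = clock()
    log(json.dumps({"experiment": name, "event": "start", "ts": start}))
    try:
        yield
    finally:
        # The end record is written even when the body raises.
        end = clock()
        log(json.dumps({"experiment": name, "event": "end", "ts": end,
                        "duration_s": end - start}))


with experiment_window("backend-kill"):
    pass  # apply the manifest and watch the dashboards here
```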

Review:

Fix: Action items assigned (Jira).
Report: Findings shared with engineering team.

Examples

Example 1: Kubernetes Pod Failure Recovery

Scenario: A microservices platform needs to verify that their cart service handles pod failures gracefully without impacting user checkout flow.

Experiment Design:

Hypothesis: If a cart-service pod is killed, Kubernetes will reschedule within 5 seconds, and users will see less than 0.1% error rate
Chaos Injection: Use Chaos Mesh to kill a randomly selected cart-service pod in the production namespace
Monitoring: Track error rates, pod restart times, and user-facing failures

Execution Results:

Pod restart time: 3.2 seconds average (within SLA)
Error rate during experiment: 0.02% (below 0.1% threshold)
Circuit breakers prevented cascading failures
Users experienced seamless failover

Lessons Learned:

Retry logic was working but needed exponential backoff
Added fallback response for stale cart data
Created runbook for pod failure scenarios
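
The first lesson, adding exponential backoff to retries, might look like the sketch below. It is not the cart service's actual code: `retry_with_backoff` is a hypothetical helper, and the "full jitter" variant (sleeping a random time up to the exponential delay) is one common choice among several.

```python
import random
import time


def retry_with_backoff(call, attempts=4, base_s=0.1, cap_s=5.0, sleep=time.sleep):
    """Call `call()` up to `attempts` times, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay ceiling doubles each attempt (0.1s, 0.2s, 0.4s, ...) up to cap_s.
            ceiling = min(cap_s, base_s * 2 ** attempt)
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

Capping the delay keeps a long outage from pushing individual retries out indefinitely, and the jitter avoids a synchronized retry storm when the killed pod comes back.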

Example 2: Database Failover Validation

Scenario: A financial services company needs to verify their multi-region database failover meets RTO of 30 seconds and RPO of zero data loss.

Game Day Setup:

Preparation: Notified all stakeholders, backed up current state
Primary Zone Blockage: Used AWS FIS to simulate zone failure
Failover Trigger: Automated failover initiated when health checks failed
Measurement: Tracked RTO, RPO, and application recovery

Measured Results:

| Metric | Target | Actual | Status |
| --- | --- | --- | --- |
| RTO | < 30s | 18s | ✅ PASS |
| RPO | 0 data loss | 0 data loss | ✅ PASS |
| Application recovery | < 60s | 42s | ✅ PASS |
| Data consistency | 100% | 100% | ✅ PASS |

Improvements Identified:

DNS TTL was too high (5 minutes), reduced to 30 seconds
Application connection pooling needed pre-warming
Added health check for database replication lag
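
To see why the DNS TTL matters, consider a simplified model in which clients find the primary through a DNS record that is updated when failover completes. A client that re-resolves just before the update keeps the stale address for a full TTL, so the worst-case client-visible recovery is roughly the failover time plus the TTL. The model and the helper name are assumptions for illustration; only the 18s failover and the two TTL values come from the example.

```python
def worst_case_client_recovery_s(failover_s: float, dns_ttl_s: float) -> float:
    """Worst case: a client caches the old record right before it is updated."""
    return failover_s + dns_ttl_s


print(worst_case_client_recovery_s(18, 300))  # 5-minute TTL
print(worst_case_client_recovery_s(18, 30))   # 30-second TTL
```

Under this model the 5-minute TTL allows up to 318 seconds of stale resolution, far beyond the 60-second application recovery target, while the 30-second TTL brings the bound down to 48 seconds.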

Example 3: Third-Party API Dependency Testing

Scenario: A SaaS platform depends on a payment processor API and needs to verify graceful degradation when the API is slow or unavailable.

Fault Injection Strategy:

Delay Injection: Using Istio to add 5-10 second delays to payment API calls
Timeout Validation: Verify circuit breakers open within configured timeouts
Fallback Testing: Ensure users see appropriate error messages

Test Scenarios:

50% of requests delayed 10s: Circuit breaker opens, fallback shown
100% delay: System degrades gracefully with queue-based processing
Recovery: System reconnects properly after fault cleared

Results:

Circuit breaker threshold: 5 consecutive failures (needed adjustment)
Fallback UI: 94% of users completed purchase via alternative method
Alert tuning: Reduced false positives by tuning latency thresholds
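
A consecutive-failure circuit breaker like the one tuned in this example can be sketched as follows. This is an illustration of the pattern, not the platform's implementation: the class, the 30-second reset window, and the injectable `clock` are assumptions; only the threshold of 5 consecutive failures comes from the results above.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        # After the cool-down the breaker is half-open: one trial call is allowed.
        return self._clock() - self._opened_at < self._reset_after_s

    def call(self, fn, fallback):
        if self.is_open:
            return fallback()  # fail fast without touching the slow dependency
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Any success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

During the delay-injection scenarios, `fn` would be the payment API call with a client-side timeout, and `fallback` would return the degraded response (for example, queueing the payment).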

Best Practices

Experiment Design

Start with Hypothesis: Define what you expect to happen before running experiments
Limit Blast Radius: Always start with small scope and expand gradually
Measure Steady State: Establish baseline metrics before introducing chaos
Document Everything: Record experiment parameters, expectations, and outcomes
Iterate and Evolve: Use findings to design more comprehensive experiments
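
Measuring steady state only helps if the comparison is explicit. A minimal sketch, assuming you sample one metric (such as requests per second) before and during the experiment and accept a relative deviation of 10%; the function name and the tolerance are illustrative choices.

```python
from statistics import mean
from typing import Sequence


def within_steady_state(baseline: Sequence[float], during: Sequence[float],
                        tolerance: float = 0.10) -> bool:
    """True if the mean during chaos stays within `tolerance` of the baseline mean."""
    base = mean(baseline)
    return abs(mean(during) - base) <= tolerance * abs(base)


# Illustrative samples only.
print(within_steady_state([100, 102, 98], [105, 103, 104]))
print(within_steady_state([100, 100], [150]))
```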

Safety and Controls

Always Have a Stop Button: Can you abort the experiment immediately?
Define Rollback Plan: How do you restore normal operations?
Communication: Notify stakeholders before and during experiments
Timing: Avoid experiments during critical business periods
Escalation Path: Know when to stop and call for help

Tool Selection

Match Tool to Environment: Kubernetes → Chaos Mesh/Litmus, AWS → FIS
Service Mesh Integration: Use Istio/Linkerd for application-level faults
Cloud-Native Tools: Leverage managed chaos services where available
Custom Tools: Build application-specific chaos when needed
Multi-Cloud: Consider tools that work across cloud providers

Observability Integration

Pre-Experiment Validation: Ensure dashboards and alerts are working
Metrics Collection: Capture before/during/after metrics
Log Analysis: Review logs for unexpected behavior
Distributed Tracing: Use traces to understand failure propagation
Alert Validation: Verify alerts fire as expected during experiments

Cultural Aspects

Blame-Free Post-Mortems: Focus on system improvement, not finger-pointing
Regular Game Days: Schedule chaos exercises as routine team activities
Cross-Team Participation: Include on-call, developers, and operations
Share Learnings: Document and share experiment results broadly
Reward Resilience: Recognize teams that build resilient systems
