Expert in resilience testing, fault injection, and building anti-fragile systems using controlled experiments.
`npx skill4agent add 404kidwiz/claude-supercode-skills chaos-engineer`

What are we testing?
│
├─ **Infrastructure Layer**
│ ├─ Pods/Containers? → **Pod Kill / Container Crash**
│ ├─ Nodes? → **Node Drain / Reboot**
│ └─ Network? → **Latency / Packet Loss / Partition**
│
├─ **Application Layer**
│ ├─ Dependencies? → **Block Access to DB/Redis**
│ ├─ Resources? → **CPU/Memory Stress**
│ └─ Logic? → **Inject HTTP 500 / Delays**
│
└─ **Platform Layer**
  ├─ IAM? → **Revoke Keys**
  └─ DNS? → **Block DNS Resolution**

| Environment | Tool | Best For |
|---|---|---|
| Kubernetes | Chaos Mesh / Litmus | Native K8s experiments (Network, Pod, IO). |
| AWS/Cloud | AWS FIS / Gremlin | Cloud-level faults (AZ outage, EC2 stop). |
| Service Mesh | Istio Fault Injection | Application level (HTTP errors, delays). |
| Java/Spring | Chaos Monkey for Spring | App-level logic attacks. |

| Level | Scope | Risk | Approval Needed |
|---|---|---|---|
| Local/Dev | Single container | Low | None |
| Staging | Full cluster | Medium | QA Lead |
| Production (Canary) | 1% Traffic | High | Engineering Director |
| Production (Full) | All Traffic | Critical | VP/CTO (Game Day) |
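
As one possible illustration of the Application Layer branch (**Inject HTTP 500 / Delays**) using the Istio row of the tool table, the sketch below limits the fault to 1% of requests, mirroring the Production (Canary) level. The resource name `backend-fault`, the host `backend-service`, the `prod` namespace, and the 2s delay are assumptions chosen for illustration; they do not come from the source.

```yaml
# Hypothetical Istio VirtualService: names, namespace, and values are assumed.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: backend-fault
  namespace: prod
spec:
  hosts:
    - backend-service
  http:
    - fault:
        # Delay 1% of requests by a fixed 2 seconds.
        delay:
          percentage:
            value: 1.0
          fixedDelay: 2s
        # Abort 1% of requests with HTTP 500.
        abort:
          percentage:
            value: 1.0
          httpStatus: 500
      route:
        - destination:
            host: backend-service
```

Keeping both percentages at 1.0 holds the blast radius at the canary level; raising them would move the experiment toward the Production (Full) row and its approval requirement.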
sre-engineer

backend-kill.yaml

    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: backend-kill
      namespace: chaos-testing
    spec:
      action: pod-kill        # kill a matching pod
      mode: one               # target a single pod per run
      selector:
        namespaces:
          - prod
        labelSelectors:
          app: backend-service
      duration: "30s"
      scheduler:              # recurring-run field from Chaos Mesh 1.x
        cron: "@every 1m"

| Metric | Target | Actual | Status |
|---|---|---|---|
| RTO | < 30s | 18s | ✅ PASS |
| RPO | 0 data loss | 0 data loss | ✅ PASS |
| Application recovery | < 60s | 42s | ✅ PASS |
| Data consistency | 100% | 100% | ✅ PASS |
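
The `backend-kill` manifest earlier in this section relies on the `scheduler` field, which belongs to Chaos Mesh 1.x. Chaos Mesh 2.x moved recurring runs into a separate `Schedule` resource, so a roughly equivalent experiment might be sketched as follows. The name `backend-kill-schedule` and the `historyLimit` and `concurrencyPolicy` values are assumptions, not taken from the source.

```yaml
# Hypothetical Chaos Mesh 2.x equivalent of backend-kill.yaml.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: backend-kill-schedule   # assumed name
  namespace: chaos-testing
spec:
  schedule: "@every 1m"         # same cadence as the original cron
  historyLimit: 5               # assumed: keep the last 5 runs
  concurrencyPolicy: Forbid     # assumed: do not overlap runs
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - prod
      labelSelectors:
        app: backend-service
```

Which form applies depends on the Chaos Mesh version installed in the cluster, so check that before choosing between the two manifests.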