What are we testing?
│
├─ **Infrastructure Layer**
│ ├─ Pods/Containers? → **Pod Kill / Container Crash**
│ ├─ Nodes? → **Node Drain / Reboot**
│ └─ Network? → **Latency / Packet Loss / Partition**
│
├─ **Application Layer**
│ ├─ Dependencies? → **Block Access to DB/Redis**
│ ├─ Resources? → **CPU/Memory Stress**
│ └─ Logic? → **Inject HTTP 500 / Delays**
│
└─ **Platform Layer**
  ├─ IAM? → **Revoke Keys**
  └─ DNS? → **Block DNS Resolution**

| Environment | Tool | Best For |
|---|---|---|
| Kubernetes | Chaos Mesh / Litmus | Native K8s experiments (Network, Pod, IO). |
| AWS/Cloud | AWS FIS / Gremlin | Cloud-level faults (AZ outage, EC2 stop). |
| Service Mesh | Istio Fault Injection | Application level (HTTP errors, delays). |
| Java/Spring | Chaos Monkey for Spring | App-level logic attacks. |
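
As an illustration of the Kubernetes row and the "Network → Latency" branch of the decision tree, a Chaos Mesh `NetworkChaos` resource could look roughly like the sketch below. The `staging` namespace, the `backend-delay` name, and the latency values are assumptions chosen for the example, not values taken from this document.

```yaml
# Sketch only: adds network latency to one backend pod.
# Namespace, name, and timing values are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: backend-delay
  namespace: chaos-testing
spec:
  action: delay              # inject latency rather than loss or partition
  mode: one                  # affect a single matching pod
  selector:
    namespaces:
      - staging              # assumed non-production target
    labelSelectors:
      app: backend-service
  delay:
    latency: "100ms"
    jitter: "10ms"
    correlation: "25"
  duration: "60s"            # the fault is lifted after one minute
```

Starting with `mode: one` and a short `duration` keeps the experiment small; both can be widened once the service is shown to tolerate the added latency.
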
| Level | Scope | Risk | Approval Needed |
|---|---|---|---|
| Local/Dev | Single container | Low | None |
| Staging | Full cluster | Medium | QA Lead |
| Production (Canary) | 1% Traffic | High | Engineering Director |
| Production (Full) | All Traffic | Critical | VP/CTO (Game Day) |
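
The "Production (Canary), 1% Traffic" row pairs naturally with the "Inject HTTP 500 / Delays" fault from the decision tree. A minimal sketch using Istio's `VirtualService` fault injection follows; the host name `backend-service`, the `prod` namespace, the resource name, and the 2-second delay are assumptions for illustration.

```yaml
# Sketch only: fail or slow roughly 1% of requests to one service.
# Host, namespace, and resource name are hypothetical.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: backend-service-fault
  namespace: prod
spec:
  hosts:
    - backend-service
  http:
    - fault:
        abort:
          percentage:
            value: 1.0       # percent of requests, not a fraction
          httpStatus: 500
        delay:
          percentage:
            value: 1.0
          fixedDelay: 2s
      route:
        - destination:
            host: backend-service
```

Deleting the `VirtualService`, or removing its `fault` block, ends the experiment, which makes rollback a single `kubectl` operation.
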
backend-kill.yaml

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: backend-kill
  namespace: chaos-testing
spec:
  action: pod-kill        # delete the selected pod outright
  mode: one               # pick one matching pod at random
  selector:
    namespaces:
      - prod
    labelSelectors:
      app: backend-service
  duration: "30s"
  # `scheduler` is the Chaos Mesh 1.x way to repeat an experiment;
  # 2.x uses a separate Schedule resource instead.
  scheduler:
    cron: "@every 1m"
```

| Metric | Target | Actual | Status |
|---|---|---|---|
| RTO | < 30s | 18s | ✅ PASS |
| RPO | 0 data loss | 0 data loss | ✅ PASS |
| Application recovery | < 60s | 42s | ✅ PASS |
| Data consistency | 100% | 100% | ✅ PASS |