Expert in SRE practices, incident management, root cause analysis, and automated remediation.
`npx skill4agent add neversight/skills_feed devops-incident-responder`

| Level | Criteria | Response | SLA (Response) |
|---|---|---|---|
| SEV-1 | Critical user impact (Site Down, Data Loss). | Wake up everyone. CEO notified. | 15 mins |
| SEV-2 | Major feature broken (Checkout fails). | Wake up on-call. | 30 mins |
| SEV-3 | Minor issue (Internal tool slow). | Handle next business day. | 8 business hours |
| SEV-4 | Trivial bug / Cosmetic. | Backlog. | N/A |
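
The severity table above can also be kept as data so that tooling can look up the response target for a level. The following is a minimal Python sketch, not part of the skill itself: the names `SeverityPolicy`, `SEVERITY_POLICY`, and `response_deadline` are hypothetical, and the function deliberately returns `None` for SEV-3 because "8 business hours" cannot be computed without a business calendar, which is assumed to live elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    criteria: str
    response: str
    sla: Optional[timedelta]      # None means the table lists no response SLA
    business_hours: bool = False  # True when the SLA counts business hours only


# Values copied from the severity table; the structure is an assumption.
SEVERITY_POLICY = {
    "SEV-1": SeverityPolicy("Critical user impact (site down, data loss)",
                            "Wake up everyone; CEO notified",
                            timedelta(minutes=15)),
    "SEV-2": SeverityPolicy("Major feature broken (checkout fails)",
                            "Wake up on-call",
                            timedelta(minutes=30)),
    "SEV-3": SeverityPolicy("Minor issue (internal tool slow)",
                            "Handle next business day",
                            timedelta(hours=8), business_hours=True),
    "SEV-4": SeverityPolicy("Trivial bug / cosmetic", "Backlog", None),
}


def response_deadline(level: str, opened_at: datetime) -> Optional[datetime]:
    """Return the wall-clock response deadline for an incident.

    Returns None when the level has no SLA or when the SLA is expressed in
    business hours, since that needs a calendar this sketch does not model.
    """
    policy = SEVERITY_POLICY[level]
    if policy.sla is None or policy.business_hours:
        return None
    return opened_at + policy.sla
```

For example, `response_deadline("SEV-1", opened_at)` yields a time 15 minutes after `opened_at`, while an unknown level raises `KeyError` rather than silently defaulting to a severity.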
`security-engineer`, `netstat`, `DiskSpaceLow`, `docker system prune -f`, `journalctl --vacuum-time=1d`

    # Istio DestinationRule
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: reviews
    spec:
      host: reviews
      trafficPolicy:
        connectionPool:
          http:
            http1MaxPendingRequests: 1
            maxRequestsPerConnection: 1
        outlierDetection:
          consecutive5xxErrors: 1
          interval: 1s
          baseEjectionTime: 3m
          maxEjectionPercent: 100

# Runbook: High Database CPU
**Severity:** SEV-2
**Trigger:** RDS CPU > 90% for 5 mins
## 1. Triage
- Check [Database Dashboard](link).
- Is it a specific query? (See "Top SQL" panel).
## 2. Mitigation Actions
- **Option A (Bad Query):** Kill the session.
`SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE ...`
- **Option B (Traffic Spike):** Scale Read Replicas (Terraform apply).
- **Option C (Maintenance):** Stop non-essential cron jobs.
## 3. Escalation
- If CPU remains > 95% for 15 mins, page @database-team.
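
The trigger ("> 90% for 5 mins") and the escalation rule ("> 95% for 15 mins") are both sustained-threshold checks. The sketch below shows one way such a check might be evaluated over a list of CPU samples. It is an illustration under stated assumptions, not how RDS or any alerting system implements it: the function names are hypothetical, samples are assumed to arrive at a regular interval (gaps are not detected), and the escalation window is measured from the start of the breach rather than from the moment mitigation began.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

Sample = Tuple[datetime, float]  # (timestamp, CPU utilisation in percent)


def breach_duration(samples: Iterable[Sample], threshold: float,
                    now: datetime) -> timedelta:
    """Length of the trailing run of samples strictly above `threshold`.

    Walks backwards from the newest sample and stops at the first one that
    is at or below the threshold. Samples newer than `now` are ignored.
    """
    streak_start = None
    for ts, value in sorted(samples, reverse=True):  # newest first
        if ts > now:
            continue
        if value <= threshold:
            break
        streak_start = ts
    return timedelta(0) if streak_start is None else now - streak_start


def runbook_action(samples: Iterable[Sample], now: datetime) -> str:
    """Map CPU history to a runbook step: 'escalate', 'trigger', or 'ok'."""
    samples = list(samples)  # allow two passes over a generator
    if breach_duration(samples, 95.0, now) >= timedelta(minutes=15):
        return "escalate"  # section 3: page @database-team
    if breach_duration(samples, 90.0, now) >= timedelta(minutes=5):
        return "trigger"   # runbook trigger: start triage
    return "ok"
```

Checking the stricter escalation condition first means a long-running spike is reported as `escalate` rather than being masked by the lower trigger threshold.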