instantly-incident-runbook
Instantly Incident Runbook
Overview
Rapid incident response procedures for Instantly-related outages.
Prerequisites
- Access to Instantly dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Complete outage | < 15 min | Instantly API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |
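The response-time targets in the table can be encoded so that alerting or paging scripts apply them consistently. This is a minimal sketch: the function name `response_minutes` is hypothetical, and P4 is returned as a label because "next business day" is not a fixed number of minutes.

```bash
# Map a severity level to its response-time target, following the table above.
# Hypothetical helper for illustration; it is not part of any existing tool.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;                  # complete outage: < 15 min
    P2) echo 60 ;;                  # degraded service: < 1 hour
    P3) echo 240 ;;                 # minor impact: < 4 hours
    P4) echo "next-business-day" ;; # no user impact
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_minutes P2` prints `60`, and an unrecognized level returns a non-zero exit status.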
Quick Triage
```bash
# 1. Check Instantly status
curl -s https://status.instantly.com | jq
# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.instantly'
# 3. Check error rate (last 5 min); the query is passed with --data-urlencode
#    so the shell and curl do not misread the parentheses and brackets
curl -sG localhost:9090/api/v1/query --data-urlencode 'query=rate(instantly_errors_total[5m])'
# 4. Recent error logs
kubectl logs -l app=instantly-integration --since=5m | grep -i error | tail -20
```
Decision Tree
Instantly API returning errors?
├─ YES: Is status.instantly.com showing incident?
│   ├─ YES → Wait for Instantly to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
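The decision tree can be walked mechanically once the three questions have been answered. The sketch below assumes the answers are passed as `yes`/`no` strings; the function name `triage_decision` is hypothetical, and it only prints the suggested action without changing anything.

```bash
# Walk the decision tree above. Arguments are "yes"/"no" answers:
#   $1 = Is the Instantly API returning errors?
#   $2 = Is status.instantly.com showing an incident?
#   $3 = Is our service healthy?
# Hypothetical helper for illustration only.
triage_decision() {
  local api_errors="$1" status_incident="$2" service_healthy="$3"
  if [ "$api_errors" = "yes" ]; then
    if [ "$status_incident" = "yes" ]; then
      echo "Wait for Instantly to resolve. Enable fallback."
    else
      echo "Our integration issue. Check credentials, config."
    fi
  elif [ "$service_healthy" = "yes" ]; then
    echo "Likely resolved or intermittent. Monitor."
  else
    echo "Our infrastructure issue. Check pods, memory, network."
  fi
}
```

For example, `triage_decision yes no yes` prints `Our integration issue. Check credentials, config.`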
Immediate Actions by Error Type
401/403 - Authentication
```bash
# Verify API key is set (note: this prints the decoded key to the terminal)
kubectl get secret instantly-secrets -o jsonpath='{.data.api-key}' | base64 -d
# Check if key was rotated
# → Verify in Instantly dashboard
# Remediation: Update secret and restart pods
kubectl create secret generic instantly-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/instantly-integration
```
429 - Rate Limited
```bash
# Check rate limit headers
curl -v https://api.instantly.com 2>&1 | grep -i rate
# Enable request queuing
kubectl set env deployment/instantly-integration RATE_LIMIT_MODE=queue
# Long-term: Contact Instantly for limit increase
```
500/503 - Instantly Errors
```bash
# Enable graceful degradation
kubectl set env deployment/instantly-integration INSTANTLY_FALLBACK=true
# Notify users of degraded service
# Update status page
# Monitor Instantly status for resolution
```
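The three error classes above can be selected from an observed HTTP status code. This is a minimal sketch, assuming the status code has already been extracted from a failing request; the function name `remediation_for_status` and the label text are hypothetical, and the function only prints which action applies.

```bash
# Map an HTTP status code from the Instantly API to the matching
# remediation described above. Hypothetical helper; it changes nothing.
remediation_for_status() {
  case "$1" in
    401|403) echo "auth: verify API key, update secret, restart pods" ;;
    429)     echo "rate-limit: enable request queuing (RATE_LIMIT_MODE=queue)" ;;
    500|503) echo "upstream: enable fallback (INSTANTLY_FALLBACK=true)" ;;
    *)       echo "unclassified: follow the decision tree" ;;
  esac
}
```

Codes outside the three documented classes fall through to the decision tree instead of guessing at a fix.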
Communication Templates
Internal (Slack)
🔴 P1 INCIDENT: Instantly Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
External (Status Page)
Instantly Integration Issue
We're experiencing issues with our Instantly integration.
Some users may experience [specific impact].
We're actively investigating and will provide updates.
Last updated: [timestamp]
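Filling in the internal Slack template by hand is easy to get wrong under pressure, so it may help to render it from arguments. The sketch below is an assumption-laden illustration: the function name `render_internal_update` and its argument order are hypothetical, and it only prints the message text; posting it to Slack is left to whatever tooling is already in place.

```bash
# Fill in the internal Slack template from positional arguments.
#   $1 severity, $2 status, $3 impact, $4 current action,
#   $5 next update time, $6 incident commander handle
# Hypothetical helper; the field layout follows the internal template above.
render_internal_update() {
  local severity="$1" status="$2" impact="$3" action="$4" next_update="$5" commander="$6"
  printf '🔴 %s INCIDENT: Instantly Integration\n' "$severity"
  printf 'Status: %s\n' "$status"
  printf 'Impact: %s\n' "$impact"
  printf 'Current action: %s\n' "$action"
  printf 'Next update: %s\n' "$next_update"
  printf 'Incident commander: @%s\n' "$commander"
}
```

A call such as `render_internal_update P1 INVESTIGATING "Sends failing" "Checking API key" "14:30 UTC" alice` produces the six template lines in order.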
Post-Incident
Evidence Collection
```bash
# Generate debug bundle
./scripts/instantly-debug-bundle.sh
# Export relevant logs
kubectl logs -l app=instantly-integration --since=1h > incident-logs.txt
# Capture metrics for the last 2 hours. query_range needs start, end, and step;
# step=60s is an assumed resolution, adjust as needed
now=$(date +%s)
curl -sG localhost:9090/api/v1/query_range \
  --data-urlencode 'query=instantly_errors_total' \
  --data-urlencode "start=$(( now - 7200 ))" \
  --data-urlencode "end=${now}" \
  --data-urlencode 'step=60s' > metrics.json
```
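Prometheus's `query_range` endpoint takes `start` and `end` as timestamps along with a `step`, so a relative lookback such as "2 hours" has to be converted first. The helper below is a small sketch of that conversion; the name `query_window` is hypothetical, and the optional second argument exists so the window can be pinned to the incident's end time instead of "now".

```bash
# Print "start end" Unix timestamps for a lookback window.
#   $1 = lookback in seconds
#   $2 = window end as a Unix timestamp (defaults to the current time)
# Hypothetical helper for illustration.
query_window() {
  local lookback="$1" end="${2:-$(date +%s)}"
  echo "$(( end - lookback )) ${end}"
}

# Example use (not executed here):
#   read -r start end <<< "$(query_window 7200)"
#   curl -sG localhost:9090/api/v1/query_range \
#     --data-urlencode 'query=instantly_errors_total' \
#     --data-urlencode "start=${start}" --data-urlencode "end=${end}" \
#     --data-urlencode 'step=60s' > metrics.json
```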
Postmortem Template
```markdown
## Incident: Instantly [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: N
- Revenue impact: $X

### Action Items
- [Preventive measure] - Owner - Due date
```
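The template's Duration field ("X hours Y minutes") can be derived from the incident's start and end times once both are known. This is a minimal sketch, assuming both times are available as Unix timestamps; the function name `format_duration` is hypothetical, and it does not adjust wording for singular values.

```bash
# Format the elapsed time between two Unix timestamps as
# "X hours Y minutes", matching the Duration field of the template.
#   $1 = incident start, $2 = incident end (seconds since the epoch)
# Hypothetical helper; partial minutes are rounded down.
format_duration() {
  local start="$1" end="$2"
  local total_minutes=$(( (end - start) / 60 ))
  echo "$(( total_minutes / 60 )) hours $(( total_minutes % 60 )) minutes"
}
```

For example, `format_duration 1700000000 1700005400` prints `1 hours 30 minutes`.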
Instructions
Step 1: Quick Triage
Run the triage commands to identify the issue source.
Step 2: Follow Decision Tree
Determine if the issue is Instantly-side or internal.
Step 3: Execute Immediate Actions
Apply the appropriate remediation for the error type.
Step 4: Communicate Status
Update internal and external stakeholders.
Output
- Issue identified and categorized
- Remediation applied
- Stakeholders notified
- Evidence collected for postmortem
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |
Examples
One-Line Health Check
```bash
# jq -e exits non-zero on null or empty input, so a failed curl also prints UNHEALTHY
curl -sf https://api.yourapp.com/health | jq -e '.services.instantly.status' || echo "UNHEALTHY"
```
Resources
Next Steps
For data handling, see instantly-data-handling.