instantly-incident-runbook

Instantly Incident Runbook

Overview

Rapid incident response procedures for Instantly-related outages.

Prerequisites

  • Access to Instantly dashboard and status page
  • kubectl access to production cluster
  • Prometheus/Grafana access
  • Communication channels (Slack, PagerDuty)
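
Before an incident, it can help to confirm that the command-line tools used throughout this runbook are installed. The following is a minimal sketch, not part of the original runbook: the function name `missing_tools` is hypothetical, and the tool list (`kubectl`, `curl`, `jq`) is inferred from the commands that appear later. It only checks that the binaries are on `PATH`; it does not verify cluster or dashboard access.

```bash
# Print the name of every tool passed as an argument that is not on PATH.
# Returns 0 when everything is present, 1 when at least one tool is missing.
missing_tools() {
  local tool rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example: missing_tools kubectl curl jq
```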

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | Instantly API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |
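
If paging or alert scripts need the response targets, the severity table can be encoded as a small lookup. This is an illustrative sketch only; the function name `response_time_for` is hypothetical, and the values are copied from the table above.

```bash
# Map a severity level to its target response time from the severity table.
response_time_for() {
  case "$1" in
    P1) echo "15 min" ;;
    P2) echo "1 hour" ;;
    P3) echo "4 hours" ;;
    P4) echo "next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Example: response_time_for P2
```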

Quick Triage

```bash
# 1. Check Instantly status (status.instantly.com)

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.instantly'

# 3. Check error rate (last 5 min)
# -G with --data-urlencode keeps the PromQL parentheses and brackets away
# from the shell and from curl's URL globbing.
curl -sG localhost:9090/api/v1/query --data-urlencode 'query=rate(instantly_errors_total[5m])'

# 4. Recent error logs
kubectl logs -l app=instantly-integration --since=5m | grep -i error | tail -20
```
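
To turn the error-rate query into a yes/no signal, the returned value can be compared against a threshold. The sketch below is an assumption-laden example rather than the runbook's own tooling: `rate_exceeds` is a hypothetical helper, and the 0.5 errors/s threshold in the usage comment is a placeholder to be replaced with a value that fits your traffic.

```bash
# Return 0 (true) when the observed error rate is strictly above the limit.
# awk performs the floating-point comparison, which plain shell cannot do.
rate_exceeds() {
  awk -v rate="$1" -v limit="$2" 'BEGIN { exit !(rate + 0 > limit + 0) }'
}

# Usage sketch (needs jq and a reachable Prometheus; 0.5 is a placeholder):
#   rate=$(curl -sG localhost:9090/api/v1/query \
#            --data-urlencode 'query=sum(rate(instantly_errors_total[5m]))' \
#          | jq -r '.data.result[0].value[1] // "0"')
#   rate_exceeds "$rate" 0.5 && echo "error rate above threshold"
```

The `jq` path follows the shape of a Prometheus instant-query response, where each result carries a `[timestamp, "value"]` pair.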

Decision Tree

Instantly API returning errors?
├─ YES: Is status.instantly.com showing incident?
│   ├─ YES → Wait for Instantly to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
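
The decision tree can also be expressed as a function, which is convenient when the three answers come from scripted checks. This is a sketch under stated assumptions: `triage_decision` is a hypothetical name, and the caller supplies each answer as `yes` or `no`.

```bash
# Encode the decision tree. Each argument is "yes" or "no":
#   $1: is the Instantly API returning errors?
#   $2: is status.instantly.com showing an incident?
#   $3: is our own service healthy?
triage_decision() {
  local api_errors="$1" status_incident="$2" service_healthy="$3"
  if [ "$api_errors" = "yes" ]; then
    if [ "$status_incident" = "yes" ]; then
      echo "Wait for Instantly to resolve. Enable fallback."
    else
      echo "Our integration issue. Check credentials, config."
    fi
  elif [ "$service_healthy" = "yes" ]; then
    echo "Likely resolved or intermittent. Monitor."
  else
    echo "Our infrastructure issue. Check pods, memory, network."
  fi
}

# Example: triage_decision yes no yes
```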

Immediate Actions by Error Type

401/403 - Authentication

```bash
# Verify API key is set
# Note: this prints the key in plaintext; avoid running it in a shared terminal.
kubectl get secret instantly-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated
# → Verify in Instantly dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic instantly-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/instantly-integration
```
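
Because the verification command above decodes the key to the terminal, a masked view may be preferable during a shared-screen incident call. The helper below is a hedged sketch: `mask_secret` is a hypothetical name, and it assumes the key is a single line of ASCII text.

```bash
# Read a secret from stdin and print only its length and last four
# characters, so the full key never appears on screen or in scrollback.
mask_secret() {
  local key len
  IFS= read -r key || true   # read still fills $key when there is no trailing newline
  len=${#key}
  if [ "$len" -le 4 ]; then
    echo "length=$len (too short to show a suffix)"
  else
    echo "length=$len suffix=...$(printf '%s' "$key" | tail -c 4)"
  fi
}

# Usage:
#   kubectl get secret instantly-secrets -o jsonpath='{.data.api-key}' \
#     | base64 -d | mask_secret
```

Comparing the suffix with the key shown in the Instantly dashboard is usually enough to tell whether the key was rotated.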

429 - Rate Limited

```bash
# Check rate limit headers
curl -v https://api.instantly.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/instantly-integration RATE_LIMIT_MODE=queue

# Long-term: Contact Instantly for limit increase
```
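
When a 429 response includes the standard HTTP `Retry-After` header, its value indicates how long to pause before retrying. Whether Instantly's API sends this header is an assumption here, so the sketch below falls back to a default. The function name `retry_after_seconds` and the 30-second default are hypothetical.

```bash
# Read raw HTTP response headers on stdin and print the Retry-After value in
# seconds. Falls back to the default passed as $1 (30 if omitted) when the
# header is absent or is not a plain integer (for example an HTTP date).
retry_after_seconds() {
  local default="${1:-30}" value
  value=$(tr -d '\r' | awk -F': *' 'tolower($1) == "retry-after" { print $2; exit }')
  case "$value" in
    ''|*[!0-9]*) echo "$default" ;;
    *)           echo "$value" ;;
  esac
}

# Usage (-D - dumps the response headers to stdout):
#   curl -s -o /dev/null -D - https://api.instantly.com | retry_after_seconds
```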

500/503 - Instantly Errors

```bash
# Enable graceful degradation
kubectl set env deployment/instantly-integration INSTANTLY_FALLBACK=true

# Notify users of degraded service
# Update status page

# Monitor Instantly status for resolution
```
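
The three error-type sections above can be summarized as a lookup from HTTP status code to first action. This is an illustrative sketch; `remediation_for_status` is a hypothetical name and the wording of each action is condensed from the sections above.

```bash
# Map an HTTP status code from the Instantly API to the first remediation step.
remediation_for_status() {
  case "$1" in
    401|403) echo "auth: verify the API key, update the secret, restart pods" ;;
    429)     echo "rate-limit: enable request queuing" ;;
    500|503) echo "upstream: enable fallback, update the status page" ;;
    *)       echo "unlisted: no runbook entry for HTTP $1"; return 1 ;;
  esac
}

# Once Instantly recovers, the fallback flag can be removed again with the
# trailing-dash form of kubectl set env (this assumes the integration treats
# an unset INSTANTLY_FALLBACK as "fallback disabled"):
#   kubectl set env deployment/instantly-integration INSTANTLY_FALLBACK-
```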

Communication Templates

Internal (Slack)

🔴 P1 INCIDENT: Instantly Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
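
Filling in the internal template by hand is error-prone under pressure, so it may be worth scripting. The following is a minimal sketch that only renders the text; `render_incident_update` is a hypothetical name, and posting the result to Slack is left to whatever tooling your team already uses.

```bash
# Render the internal Slack update from the template above.
#   $1 severity, $2 status, $3 impact, $4 current action,
#   $5 next update time, $6 incident commander handle
render_incident_update() {
  printf '🔴 %s INCIDENT: Instantly Integration\n' "$1"
  printf 'Status: %s\n' "$2"
  printf 'Impact: %s\n' "$3"
  printf 'Current action: %s\n' "$4"
  printf 'Next update: %s\n' "$5"
  printf 'Incident commander: @%s\n' "$6"
}

# Example:
#   render_incident_update P1 INVESTIGATING "Campaign sends failing" \
#     "Checking API credentials" "14:30 UTC" alice
```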

External (Status Page)

Instantly Integration Issue

We're experiencing issues with our Instantly integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]

Post-Incident

Evidence Collection

```bash
# Generate debug bundle
./scripts/instantly-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=instantly-integration --since=1h > incident-logs.txt

# Capture metrics (last 2 hours)
# query_range requires start, end, and step; `date -d` is GNU date syntax.
curl -sG localhost:9090/api/v1/query_range \
  --data-urlencode 'query=instantly_errors_total' \
  --data-urlencode "start=$(date -d '2 hours ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60s' > metrics.json
```
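
Once the logs and metrics are exported, packing them into a single timestamped archive makes them easier to attach to the postmortem. The sketch below is not the `instantly-debug-bundle.sh` script referenced above, whose contents are not shown here; `bundle_evidence` and the archive naming scheme are hypothetical.

```bash
# Pack evidence files into a timestamped tarball and print the archive name.
# Missing files are skipped with a warning instead of aborting the bundle.
bundle_evidence() {
  local archive f n
  archive="incident-evidence-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
  n=$#
  # Rotate through the arguments, keeping only the files that exist.
  while [ "$n" -gt 0 ]; do
    f="$1"; shift; n=$((n - 1))
    if [ -f "$f" ]; then
      set -- "$@" "$f"
    else
      echo "skipping missing file: $f" >&2
    fi
  done
  if [ "$#" -eq 0 ]; then
    echo "no evidence files found" >&2
    return 1
  fi
  tar -czf "$archive" "$@" && echo "$archive"
}

# Example: bundle_evidence incident-logs.txt metrics.json
```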

Postmortem Template

```markdown
## Incident: Instantly [Error Type]

Date: YYYY-MM-DD
Duration: X hours Y minutes
Severity: P[1-4]

### Summary

[1-2 sentence description]

### Timeline

- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause

[Technical explanation]

### Impact

- Users affected: N
- Revenue impact: $X

### Action Items

- [Preventive measure] - Owner - Due date
```
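
The Duration field in the template can be derived from the first and last timeline entries. A small sketch, assuming both timestamps are in 24-hour `HH:MM` form and the incident lasted less than 24 hours; `incident_duration` is a hypothetical helper.

```bash
# Convert two HH:MM timestamps into "X hours Y minutes".
# If the end time is earlier than the start, the incident is assumed to have
# crossed midnight exactly once.
incident_duration() {
  local start_h start_m end_h end_m total
  start_h=${1%%:*}; start_m=${1##*:}
  end_h=${2%%:*};   end_m=${2##*:}
  # Strip one leading zero so values like "09" are not parsed as octal.
  start_h=${start_h#0}; start_m=${start_m#0}
  end_h=${end_h#0};     end_m=${end_m#0}
  total=$(( (end_h * 60 + end_m) - (start_h * 60 + start_m) ))
  if [ "$total" -lt 0 ]; then
    total=$((total + 1440))
  fi
  echo "$((total / 60)) hours $((total % 60)) minutes"
}

# Example: incident_duration 09:05 11:47
```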

Instructions

Step 1: Quick Triage

Run the triage commands to identify the source of the issue.

Step 2: Follow Decision Tree

Determine whether the issue is on Instantly's side or internal.

Step 3: Execute Immediate Actions

Apply the remediation that matches the error type.

Step 4: Communicate Status

Update internal and external stakeholders using the communication templates.

Output

  • Issue identified and categorized
  • Remediation applied
  • Stakeholders notified
  • Evidence collected for postmortem

Error Handling

| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |

Examples

One-Line Health Check

```bash
# jq -e exits non-zero when the status is null, false, or missing (including
# when curl fails and jq receives no input), so UNHEALTHY is printed in those cases.
curl -sf https://api.yourapp.com/health | jq -e '.services.instantly.status' || echo "UNHEALTHY"
```
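
While waiting for a fix to take effect, the health check can be repeated on an interval instead of being rerun by hand. This is a sketch under stated assumptions: `wait_until_healthy` is a hypothetical helper, and the attempt count and delay in the usage comment are placeholders.

```bash
# Run a check command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the check succeeds, 1 if every attempt fails.
wait_until_healthy() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage sketch (10 attempts, 30 seconds apart):
#   check() { curl -sf https://api.yourapp.com/health \
#               | jq -e '.services.instantly.status' >/dev/null; }
#   wait_until_healthy 10 30 check || echo "UNHEALTHY"
```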

Next Steps

For data handling, see instantly-data-handling.