instantly-incident-runbook

Instantly Incident Runbook

Overview

Rapid incident response procedures for Instantly-related outages.

Prerequisites

Access to Instantly dashboard and status page
kubectl access to production cluster
Prometheus/Grafana access
Communication channels (Slack, PagerDuty)
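
Before an incident, it can help to confirm that the command-line tools this runbook relies on are installed. The sketch below is a hypothetical helper (`missing_tools` is not part of the original runbook) that lists any tool not found on `PATH`. It checks presence only, not cluster or dashboard access.

```bash
# Print the name of every required tool that is not on PATH.
# Prints nothing when all tools are available.
# (missing_tools is an illustrative helper, not part of the runbook.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: the tools used by the commands in this runbook
# missing_tools kubectl curl jq
```

An empty result means every listed tool was found; anything printed needs installing before the triage commands will work.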

Severity Levels

Level	Definition	Response Time	Examples
P1	Complete outage	< 15 min	Instantly API unreachable
P2	Degraded service	< 1 hour	High latency, partial failures
P3	Minor impact	< 4 hours	Webhook delays, non-critical errors
P4	No user impact	Next business day	Monitoring gaps
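
The response times in the table can be turned into a concrete deadline once the detection time is known. This is a minimal sketch: the function names are hypothetical, and P4 is treated as having no fixed window because the table only says "Next business day".

```bash
# Map a severity level to its response window in minutes (values from the
# Severity Levels table). P4 has no fixed window, so nothing is printed
# and the function returns 1.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    *)  return 1 ;;
  esac
}

# Print the epoch second by which a response is due.
#   $1  severity (P1, P2, P3)
#   $2  epoch second at which the incident was detected
response_deadline() {
  minutes=$(response_minutes "$1") || return 1
  echo $(( $2 + minutes * 60 ))
}

# Example: response_deadline P1 "$(date +%s)"
```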

Quick Triage

bash

# 1. Check Instantly status
curl -s https://status.instantly.com | jq

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.instantly'

# 3. Check error rate (last 5 min); -G with --data-urlencode keeps the shell
#    and curl from misreading the parentheses and brackets in the query
curl -s -G localhost:9090/api/v1/query --data-urlencode 'query=rate(instantly_errors_total[5m])'

# 4. Recent error logs
kubectl logs -l app=instantly-integration --since=5m | grep -i error | tail -20
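
During triage it is useful to run every check even when an earlier one fails. The sketch below wraps a command and reports only whether it succeeded. `run_step` is a hypothetical helper, and it discards the command's output, so it complements the commands above without replacing them.

```bash
# Run a command quietly and print a one-line pass/fail marker.
#   $1   label shown in the report
#   $2.. command to run
# Returns 1 when the command fails, 0 otherwise.
run_step() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    printf '[ok]   %s\n' "$label"
  else
    printf '[FAIL] %s\n' "$label"
    return 1
  fi
}

# Example, using the endpoints from the triage commands:
# run_step "Instantly status page reachable" curl -sf https://status.instantly.com
# run_step "Integration health endpoint"     curl -sf https://api.yourapp.com/health
```

A failed step points to where to look first; rerun the corresponding triage command directly to see its full output.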

Decision Tree

Instantly API returning errors?
├─ YES: Is status.instantly.com showing incident?
│   ├─ YES → Wait for Instantly to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
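
The tree above can be expressed as a small function, which makes the branches easy to check. This is a sketch only: `next_step` is a hypothetical name, and the three yes/no answers still have to come from a person or from the triage commands.

```bash
# Print the recommended action from the decision tree.
# Each argument is "yes" or "no":
#   $1  Is the Instantly API returning errors?
#   $2  Is status.instantly.com showing an incident?
#   $3  Is our service healthy?
next_step() {
  if [ "$1" = yes ]; then
    if [ "$2" = yes ]; then
      echo "Wait for Instantly to resolve. Enable fallback."
    else
      echo "Our integration issue. Check credentials, config."
    fi
  elif [ "$3" = yes ]; then
    echo "Likely resolved or intermittent. Monitor."
  else
    echo "Our infrastructure issue. Check pods, memory, network."
  fi
}

# Example: next_step yes no yes
```

Note that the second answer matters only when the API is returning errors, and the third only when it is not, exactly as in the tree.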

Immediate Actions by Error Type

401/403 - Authentication

bash

# Verify API key is set
kubectl get secret instantly-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated
# → Verify in Instantly dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic instantly-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/instantly-integration
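
Decoding the secret prints the full API key to the terminal, which may end up in scrollback or a shared screen. One option, sketched here with a hypothetical `mask_key` helper, is to show only the last four characters and the length, which is usually enough to compare against the key shown in the Instantly dashboard.

```bash
# Print a masked form of a secret: the last four characters and its length.
# Secrets of four characters or fewer are not revealed at all.
mask_key() {
  key=$1
  len=${#key}
  if [ "$len" -le 4 ]; then
    printf '(%s chars)\n' "$len"
  else
    tail4=$(printf '%s' "$key" | cut -c $((len - 3))-)
    printf '****%s (%s chars)\n' "$tail4" "$len"
  fi
}

# Example, using the secret name and key from this runbook:
# mask_key "$(kubectl get secret instantly-secrets -o jsonpath='{.data.api-key}' | base64 -d)"
```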

429 - Rate Limited

bash

# Check rate limit headers
curl -v https://api.instantly.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/instantly-integration RATE_LIMIT_MODE=queue

# Long-term: Contact Instantly for limit increase
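
When a 429 response carries the standard HTTP `Retry-After` header, its value says how long to wait before retrying. Whether Instantly sends this header is an assumption, not something this runbook confirms. The sketch below uses a hypothetical `retry_after_seconds` helper that reads response headers on stdin and falls back to a default when the header is missing or is given as a date rather than a number of seconds.

```bash
# Read HTTP response headers on stdin and print the Retry-After value in
# seconds. Prints the default ($1, or 30 if omitted) when the header is
# absent or is not a plain number of seconds.
retry_after_seconds() {
  default=${1:-30}
  value=$(tr -d '\r' | awk -F': *' '
    tolower($1) == "retry-after" { v = $2; sub(/[ \t]+$/, "", v); print v; exit }')
  case "$value" in
    ''|*[!0-9]*) echo "$default" ;;
    *)           echo "$value" ;;
  esac
}

# Example: dump only the response headers and extract the wait time
# curl -s -D - -o /dev/null https://api.instantly.com | retry_after_seconds
```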

500/503 - Instantly Errors

bash

# Enable graceful degradation
kubectl set env deployment/instantly-integration INSTANTLY_FALLBACK=true

# Notify users of degraded service
# Update status page
# Monitor Instantly status for resolution
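
The three error-type sections above are keyed by HTTP status code, so picking the right one can be reduced to a lookup. A minimal sketch, with `runbook_section` as a hypothetical helper name:

```bash
# Print the runbook section that covers a given HTTP status code.
runbook_section() {
  case "$1" in
    401|403) echo "401/403 - Authentication" ;;
    429)     echo "429 - Rate Limited" ;;
    500|503) echo "500/503 - Instantly Errors" ;;
    *)       echo "No specific section; follow the decision tree" ;;
  esac
}

# Example: runbook_section 429
```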

Communication Templates

Internal (Slack)

🔴 P1 INCIDENT: Instantly Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
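
Filling in the internal template by hand is error-prone under pressure. The sketch below renders it from arguments; `render_incident_update` is a hypothetical helper, and the commented example of posting through a Slack incoming webhook assumes a `SLACK_WEBHOOK_URL` variable that this runbook does not define.

```bash
# Render the internal incident update from the template above.
# Arguments: severity, status, impact, current action, next update, commander
render_incident_update() {
  printf '🔴 %s INCIDENT: Instantly Integration\n' "$1"
  printf 'Status: %s\n' "$2"
  printf 'Impact: %s\n' "$3"
  printf 'Current action: %s\n' "$4"
  printf 'Next update: %s\n' "$5"
  printf 'Incident commander: @%s\n' "$6"
}

# Example: print the message for review before posting it
# render_incident_update P1 INVESTIGATING "Email sync failing" \
#   "Checking credentials" "15:30 UTC" alice
#
# Possible way to post it (assumes SLACK_WEBHOOK_URL is set and jq is installed):
# render_incident_update ... | jq -Rs '{text: .}' |
#   curl -s -X POST -H 'Content-Type: application/json' --data @- "$SLACK_WEBHOOK_URL"
```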

External (Status Page)

Instantly Integration Issue

We're experiencing issues with our Instantly integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]

Post-Incident

Evidence Collection

bash

# Generate debug bundle
./scripts/instantly-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=instantly-integration --since=1h > incident-logs.txt

# Capture metrics for the last 2 hours; query_range needs explicit start, end
# and step values (the 60-second step here is an arbitrary choice)
now=$(date +%s)
curl -s -G localhost:9090/api/v1/query_range --data-urlencode 'query=instantly_errors_total' -d "start=$((now - 7200))" -d "end=$now" -d 'step=60' > metrics.json

Postmortem Template

markdown

Incident: Instantly [Error Type]

Date: YYYY-MM-DD
Duration: X hours Y minutes
Severity: P[1-4]

Summary

[1-2 sentence description]

Timeline

HH:MM - [Event]
HH:MM - [Event]

Root Cause

[Technical explanation]

Impact

Users affected: N
Revenue impact: $X

Action Items

[Preventive measure] - Owner - Due date
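
The Duration field of the template can be derived from the first and last Timeline entries once they are expressed as epoch seconds. A minimal sketch, with `format_duration` as a hypothetical helper:

```bash
# Print the elapsed time between two epoch timestamps in the template's
# "X hours Y minutes" form.
#   $1  incident start (epoch seconds)
#   $2  incident end (epoch seconds)
format_duration() {
  elapsed=$(( $2 - $1 ))
  echo "$(( elapsed / 3600 )) hours $(( elapsed % 3600 / 60 )) minutes"
}

# Example: an incident that lasted 90 minutes
# format_duration 1700000000 1700005400
```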

Instructions

Step 1: Quick Triage

Run the triage commands to identify the issue source.

Step 2: Follow Decision Tree

Determine if the issue is Instantly-side or internal.

Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

Step 4: Communicate Status

Update internal and external stakeholders.

Output

Issue identified and categorized
Remediation applied
Stakeholders notified
Evidence collected for postmortem

Error Handling

Issue	Cause	Solution
Can't reach status page	Network issue	Use mobile or VPN
kubectl fails	Auth expired	Re-authenticate
Metrics unavailable	Prometheus down	Check backup metrics
Secret rotation fails	Permission denied	Escalate to admin

Examples

One-Line Health Check

bash

curl -sf https://api.yourapp.com/health | jq -e '.services.instantly.status' || echo "UNHEALTHY"

Resources

Next Steps

For data handling, see instantly-data-handling.