posthog-incident-runbook
PostHog Incident Runbook
Overview
Rapid incident response procedures for PostHog-related outages.
Prerequisites
- Access to PostHog dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)
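Before an incident, it can help to confirm that the command-line tools used throughout this runbook are installed. The following is a minimal sketch; the function name `missing_tools` is hypothetical, and it only checks that each tool is on `PATH`, not that your credentials or cluster access are valid.

```bash
# Hypothetical preflight helper: print the names of any required CLI tools
# that are not found on PATH (prints an empty line if all are present).
missing_tools() {
  local missing=""
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  # Strip the leading space left by the first append.
  echo "${missing# }"
}

# Tools invoked by the commands in this runbook.
missing_tools curl jq kubectl base64
```

An empty result means every listed tool was found; any names printed need to be installed before the triage commands will run.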
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Complete outage | < 15 min | PostHog API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |
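If the response-time targets above need to be used from a script (for example, to compute an acknowledgement deadline), the table can be encoded as a small lookup. This is a sketch only; the function name `response_target_minutes` and the `next-business-day` token for P4 are assumptions, not part of any existing tooling.

```bash
# Hypothetical helper: map a severity level from the table above
# to its response-time target in minutes.
response_target_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    P4) echo "next-business-day" ;;  # no fixed minute target
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_target_minutes P2
```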
Quick Triage
```bash
# 1. Check PostHog status
# (jq requires a JSON response; if this URL serves HTML, check the page in a browser)
curl -s https://status.posthog.com | jq

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.posthog'

# 3. Check error rate (last 5 min)
# The query is passed with --data-urlencode so the shell does not interpret ( ) [ ]
curl -sG localhost:9090/api/v1/query --data-urlencode 'query=rate(posthog_errors_total[5m])'

# 4. Recent error logs
kubectl logs -l app=posthog-integration --since=5m | grep -i error | tail -20
```
Decision Tree
PostHog API returning errors?
├─ YES: Is status.posthog.com showing an incident?
│   ├─ YES → Wait for PostHog to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
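The decision tree above can also be written as a small function, which is convenient when triage answers come from scripts rather than a person. This is a minimal sketch; the function name `triage_decision` and the `yes`/`no` argument convention are hypothetical.

```bash
# Hypothetical helper encoding the decision tree above.
# Each argument is "yes" or "no":
#   $1  Is the PostHog API returning errors?
#   $2  Is status.posthog.com showing an incident?
#   $3  Is our own service healthy?
triage_decision() {
  local api_errors="$1"
  local posthog_incident="$2"
  local service_healthy="$3"
  if [ "$api_errors" = "yes" ]; then
    if [ "$posthog_incident" = "yes" ]; then
      echo "Wait for PostHog to resolve. Enable fallback."
    else
      echo "Our integration issue. Check credentials, config."
    fi
  elif [ "$service_healthy" = "yes" ]; then
    echo "Likely resolved or intermittent. Monitor."
  else
    echo "Our infrastructure issue. Check pods, memory, network."
  fi
}

# Example: API errors, no PostHog incident reported, our service looks healthy.
triage_decision yes no yes
```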
Immediate Actions by Error Type
401/403 - Authentication
```bash
# Verify API key is set
kubectl get secret posthog-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated
# → Verify in PostHog dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic posthog-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/posthog-integration
```
429 - Rate Limited
```bash
# Check rate limit headers
curl -v https://api.posthog.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/posthog-integration RATE_LIMIT_MODE=queue

# Long-term: Contact PostHog for limit increase
```
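The runbook does not describe how queued requests are retried. As an illustration only, a common way to space out retries after a 429 is exponential backoff with a cap. The sketch below assumes a hypothetical helper named `backoff_delay`; the base and cap values are placeholders, not settings of the integration.

```bash
# Hypothetical helper: exponential backoff delay in seconds for a retry attempt
# (attempt 0 is the first retry), doubling from a base and capped at a maximum.
#   $1  attempt number, starting at 0
#   $2  base delay in seconds (default 1)
#   $3  maximum delay in seconds (default 60)
backoff_delay() {
  local attempt="$1"
  local base="${2:-1}"
  local cap="${3:-60}"
  local delay=$(( base * (1 << attempt) ))
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}

for attempt in 0 1 2 3; do
  echo "attempt $attempt: wait $(backoff_delay "$attempt")s"
done
```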
500/503 - PostHog Errors
```bash
# Enable graceful degradation
kubectl set env deployment/posthog-integration POSTHOG_FALLBACK=true

# Notify users of degraded service
# Update status page

# Monitor PostHog status for resolution
```
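The three error classes in this section can be summarized as a lookup from the observed HTTP status code to the first action to take. This is a sketch under the assumption that you already have the status code in hand; the function name `remediation_for_status` is hypothetical, and it only prints a suggestion rather than running any command.

```bash
# Hypothetical helper: map an HTTP status code to the matching
# immediate action from this section.
remediation_for_status() {
  case "$1" in
    401|403) echo "auth: verify API key, update secret, restart pods" ;;
    429)     echo "rate-limit: set RATE_LIMIT_MODE=queue" ;;
    500|503) echo "posthog-error: set POSTHOG_FALLBACK=true, update status page" ;;
    *)       echo "unclassified: continue with the decision tree" ;;
  esac
}

remediation_for_status 429
```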
Communication Templates
Internal (Slack)
🔴 P1 INCIDENT: PostHog Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
External (Status Page)
PostHog Integration Issue
We're experiencing issues with our PostHog integration.
Some users may experience [specific impact].
We're actively investigating and will provide updates.
Last updated: [timestamp]
Post-Incident
Evidence Collection
```bash
# Generate debug bundle
./scripts/posthog-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=posthog-integration --since=1h > incident-logs.txt

# Capture metrics for the last 2 hours
# (query_range needs start, end, and step; the 60 s step is an assumption)
end=$(date +%s)
curl -sG localhost:9090/api/v1/query_range \
  --data-urlencode 'query=posthog_errors_total' \
  --data-urlencode "start=$((end - 7200))" \
  --data-urlencode "end=$end" \
  --data-urlencode 'step=60' > metrics.json
```
Postmortem Template
```markdown
## Incident: PostHog [Error Type]

Date: YYYY-MM-DD
Duration: X hours Y minutes
Severity: P[1-4]

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: N
- Revenue impact: $X

### Action Items
- [Preventive measure] - Owner - Due date
```
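To fill in the Duration field of the template, the elapsed time can be computed from the incident's start and end times. The sketch below assumes both are available as Unix epoch seconds; the function name `format_duration` is hypothetical.

```bash
# Hypothetical helper: format the time between two Unix timestamps
# in the "X hours Y minutes" form used by the postmortem template.
#   $1  incident start (epoch seconds)
#   $2  incident end (epoch seconds)
format_duration() {
  local start="$1"
  local end="$2"
  local total=$(( end - start ))
  echo "$(( total / 3600 )) hours $(( (total % 3600) / 60 )) minutes"
}

# Example: an incident lasting 5400 seconds.
format_duration 1700000000 1700005400
```

Leftover seconds are truncated rather than rounded, so a 59-second incident reports as 0 hours 0 minutes.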
Instructions
Step 1: Quick Triage
Run the triage commands to identify the issue source.
Step 2: Follow Decision Tree
Determine if the issue is PostHog-side or internal.
Step 3: Execute Immediate Actions
Apply the appropriate remediation for the error type.
Step 4: Communicate Status
Update internal and external stakeholders.
Output
- Issue identified and categorized
- Remediation applied
- Stakeholders notified
- Evidence collected for postmortem
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |
Examples
One-Line Health Check
```bash
# jq -e exits non-zero on empty input or a null/false status,
# so a failed curl also prints UNHEALTHY
curl -sf https://api.yourapp.com/health | jq -e '.services.posthog.status' || echo "UNHEALTHY"
```
Resources
Next Steps
For data handling, see posthog-data-handling.