posthog-incident-runbook

PostHog Incident Runbook

Overview

Rapid incident response procedures for PostHog-related outages.

Prerequisites

  • Access to PostHog dashboard and status page
  • kubectl access to production cluster
  • Prometheus/Grafana access
  • Communication channels (Slack, PagerDuty)

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | PostHog API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |

Quick Triage

bash
# 1. Check PostHog status (see status.posthog.com)

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.posthog'

# 3. Check error rate (last 5 min); the query is URL-encoded so the shell does not interpret ( ) [ ]
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=rate(posthog_errors_total[5m])'

# 4. Recent error logs (--tail=-1 because a label selector otherwise limits output to 10 lines per pod)
kubectl logs -l app=posthog-integration --since=5m --tail=-1 | grep -i error | tail -20
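
The error-rate check in step 3 returns a number that still has to be judged against some limit. The sketch below shows one way to make that comparison scriptable. The function name `rate_exceeds` and the 0.1 errors-per-second threshold in the usage note are assumptions for illustration, not values from this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the observed error rate
# is strictly greater than the limit. Both values are errors per second.
# awk does the comparison because shell arithmetic is integer-only.
# An empty rate (no matching series) is treated as 0.
rate_exceeds() {
  awk -v rate="$1" -v limit="$2" 'BEGIN { exit !(rate + 0 > limit + 0) }'
}
```

One possible use, assuming `jq` is installed and the query returns at most one series: `rate=$(curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=rate(posthog_errors_total[5m])' | jq -r '.data.result[0].value[1] // "0"')`, followed by `rate_exceeds "$rate" 0.1 && echo "error rate above threshold"`.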

Decision Tree

PostHog API returning errors?
├─ YES: Is status.posthog.com showing incident?
│   ├─ YES → Wait for PostHog to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
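
The tree above can also be written as a small function, which may help when the triage answers are gathered by a script rather than by hand. This is a minimal sketch: the function name `triage_decision` and the short labels before each colon are hypothetical, while the recommended actions are taken from the tree.

```bash
#!/usr/bin/env bash
# Hypothetical encoding of the decision tree. Each argument is "yes" or "no".
#   $1 = PostHog API returning errors?
#   $2 = status.posthog.com showing an incident? (only consulted when $1 is yes)
#   $3 = our service healthy?                    (only consulted when $1 is no)
triage_decision() {
  local api_errors="$1" status_incident="$2" service_healthy="$3"
  if [ "$api_errors" = "yes" ]; then
    if [ "$status_incident" = "yes" ]; then
      echo "upstream: wait for PostHog to resolve; enable fallback"
    else
      echo "integration: check credentials and config"
    fi
  elif [ "$service_healthy" = "yes" ]; then
    echo "monitor: likely resolved or intermittent"
  else
    echo "infrastructure: check pods, memory, network"
  fi
}
```

For example, `triage_decision yes no no` prints the integration branch, because the second question is the only one that matters once the API is known to be returning errors.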

Immediate Actions by Error Type

401/403 - Authentication

bash
# Verify API key is set (prints the decoded key to the terminal)
kubectl get secret posthog-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated
# → Verify in PostHog dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic posthog-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/posthog-integration

429 - Rate Limited

bash
# Check rate limit headers
curl -v https://api.posthog.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/posthog-integration RATE_LIMIT_MODE=queue

# Long-term: Contact PostHog for limit increase

500/503 - PostHog Errors

bash
# Enable graceful degradation
kubectl set env deployment/posthog-integration POSTHOG_FALLBACK=true

# Notify users of degraded service
# Update status page
# Monitor PostHog status for resolution
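
The three error-type sections amount to a lookup from HTTP status code to first action. A minimal sketch of that lookup follows. The function name `action_for_status` and the wording of each message are hypothetical, and the environment variables are the ones used in the commands above.

```bash
#!/usr/bin/env bash
# Hypothetical dispatcher: print the first remediation step for a status code.
# Returns non-zero for codes this runbook does not cover.
action_for_status() {
  case "$1" in
    401|403) echo "auth: verify the API key, update the secret, restart pods" ;;
    429)     echo "rate-limit: set RATE_LIMIT_MODE=queue" ;;
    500|503) echo "upstream: set POSTHOG_FALLBACK=true, update the status page" ;;
    *)       echo "unclassified: follow the decision tree"; return 1 ;;
  esac
}
```

Calling `action_for_status 429` prints the rate-limit action. Any other code, such as 404, prints the fallback message and returns 1 so that a calling script can branch on it.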

Communication Templates

Internal (Slack)

🔴 P1 INCIDENT: PostHog Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
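
Filling the template by hand under pressure invites typos, so it can be rendered from arguments instead. The sketch below only formats the text. The function name `slack_incident_message` and its argument order are assumptions, and posting the result to Slack is left out because this runbook does not say how messages are sent.

```bash
#!/usr/bin/env bash
# Hypothetical helper: render the internal Slack template.
# Arguments: severity, status, impact, current action, next update, commander handle.
slack_incident_message() {
  local severity="$1" status="$2" impact="$3" action="$4" next_update="$5" commander="$6"
  printf '🔴 %s INCIDENT: PostHog Integration\n' "$severity"
  printf 'Status: %s\n' "$status"
  printf 'Impact: %s\n' "$impact"
  printf 'Current action: %s\n' "$action"
  printf 'Next update: %s\n' "$next_update"
  printf 'Incident commander: @%s\n' "$commander"
}
```

A call such as `slack_incident_message P1 INVESTIGATING "Event ingestion delayed" "Enabling fallback" "14:30 UTC" alice` prints the six template lines ready to paste.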

External (Status Page)

PostHog Integration Issue

We're experiencing issues with our PostHog integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]

Post-Incident

Evidence Collection

bash
# Generate debug bundle
./scripts/posthog-debug-bundle.sh

# Export relevant logs (--tail=-1 lifts the 10-line-per-pod default that applies with a label selector)
kubectl logs -l app=posthog-integration --since=1h --tail=-1 > incident-logs.txt

# Capture metrics for the last 2 hours. query_range needs start, end, and step;
# step=60 is an arbitrary one-minute resolution, and date -d is GNU date syntax.
curl -s -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=posthog_errors_total' \
  --data-urlencode "start=$(date -u -d '2 hours ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=60' > metrics.json

Postmortem Template

markdown
Incident: PostHog [Error Type]

Date: YYYY-MM-DD
Duration: X hours Y minutes
Severity: P[1-4]

Summary

[1-2 sentence description]

Timeline

  • HH:MM - [Event]
  • HH:MM - [Event]

Root Cause

[Technical explanation]

Impact

  • Users affected: N
  • Revenue impact: $X

Action Items

  • [Preventive measure] - Owner - Due date
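
The Duration field of the template can be derived from the first and last Timeline entries. The sketch below computes it from two zero-padded HH:MM timestamps. The function name `incident_duration` is hypothetical, and the code assumes the incident lasted less than 24 hours, so an end time earlier than the start is read as crossing midnight once.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "X hours Y minutes" between two HH:MM times.
# Expects zero-padded 24-hour input such as 09:15.
incident_duration() {
  local start="$1" end="$2"
  local s_h="${start%%:*}" s_m="${start##*:}" e_h="${end%%:*}" e_m="${end##*:}"
  # Strip one leading zero so "08" and "09" are not rejected as invalid octal
  s_h="${s_h#0}"; s_m="${s_m#0}"; e_h="${e_h#0}"; e_m="${e_m#0}"
  local total=$(( (e_h * 60 + e_m) - (s_h * 60 + s_m) ))
  # A negative difference means the incident ran past midnight
  if [ "$total" -lt 0 ]; then
    total=$(( total + 24 * 60 ))
  fi
  echo "$(( total / 60 )) hours $(( total % 60 )) minutes"
}
```

For instance, `incident_duration 09:15 11:40` covers 145 minutes and prints `2 hours 25 minutes`.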

Instructions

Step 1: Quick Triage

Run the triage commands to identify the issue source.

Step 2: Follow Decision Tree

Determine if the issue is PostHog-side or internal.

Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

Step 4: Communicate Status

Update internal and external stakeholders.

Output

  • Issue identified and categorized
  • Remediation applied
  • Stakeholders notified
  • Evidence collected for postmortem

Error Handling

| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |

Examples

One-Line Health Check

bash
# jq -e exits non-zero on null, false, or empty input, so a failed curl also prints UNHEALTHY
curl -sf https://api.yourapp.com/health | jq -e '.services.posthog.status' || echo "UNHEALTHY"

Resources

Next Steps

For data handling, see posthog-data-handling.