posthog-incident-runbook

PostHog Incident Runbook

Overview

Rapid incident response procedures for PostHog-related outages.

Prerequisites

Access to PostHog dashboard and status page
kubectl access to production cluster
Prometheus/Grafana access
Communication channels (Slack, PagerDuty)

Severity Levels

Level	Definition	Response Time	Examples
P1	Complete outage	< 15 min	PostHog API unreachable
P2	Degraded service	< 1 hour	High latency, partial failures
P3	Minor impact	< 4 hours	Webhook delays, non-critical errors
P4	No user impact	Next business day	Monitoring gaps
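
The table can be expressed as a lookup when a script needs to stamp an alert with its response target. This is a minimal sketch; the function name `response_target` is hypothetical and not part of any existing tooling.

```bash
# response_target: hypothetical helper that maps a severity level
# to the response time listed in the table above.
response_target() {
  case "$1" in
    P1) echo "< 15 min" ;;
    P2) echo "< 1 hour" ;;
    P3) echo "< 4 hours" ;;
    P4) echo "Next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_target P2` prints `< 1 hour`, and an unrecognized level returns a non-zero exit status.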

Quick Triage

bash

# 1. Check PostHog status
curl -s https://status.posthog.com | jq

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.posthog'

# 3. Check error rate (last 5 min)
# -G with --data-urlencode keeps the shell and curl from misreading ( ) and [ ] in the query
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(posthog_errors_total[5m])'

# 4. Recent error logs
kubectl logs -l app=posthog-integration --since=5m | grep -i error | tail -20

Decision Tree

PostHog API returning errors?
├─ YES: Is status.posthog.com showing incident?
│   ├─ YES → Wait for PostHog to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
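
The tree above can be encoded as a small function so that the same branching is applied consistently during an incident. This is a sketch only: `triage_decision` is a hypothetical name, and the three yes/no answers are assumed to come from the triage commands or from the responder.

```bash
# triage_decision: hypothetical helper encoding the decision tree above.
# Each argument is "yes" or "no":
#   $1  Is the PostHog API returning errors?
#   $2  Is status.posthog.com showing an incident?
#   $3  Is our service healthy?
triage_decision() {
  local api_errors=$1 posthog_incident=$2 service_healthy=$3
  if [ "$api_errors" = "yes" ]; then
    if [ "$posthog_incident" = "yes" ]; then
      echo "Wait for PostHog to resolve. Enable fallback."
    else
      echo "Our integration issue. Check credentials, config."
    fi
  elif [ "$service_healthy" = "yes" ]; then
    echo "Likely resolved or intermittent. Monitor."
  else
    echo "Our infrastructure issue. Check pods, memory, network."
  fi
}
```

For example, `triage_decision yes no yes` prints `Our integration issue. Check credentials, config.` The second argument is ignored when the API is not returning errors, matching the NO branch of the tree.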

Immediate Actions by Error Type

401/403 - Authentication

bash

# Verify API key is set
kubectl get secret posthog-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated
# → Verify in PostHog dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic posthog-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/posthog-integration

429 - Rate Limited

bash

# Check rate limit headers
curl -v https://api.posthog.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/posthog-integration RATE_LIMIT_MODE=queue

# Long-term: Contact PostHog for limit increase

500/503 - PostHog Errors

bash

# Enable graceful degradation
kubectl set env deployment/posthog-integration POSTHOG_FALLBACK=true

# Notify users of degraded service
# Update status page
# Monitor PostHog status for resolution
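
To pick the right subsection quickly, the status codes covered above can be mapped to their remediation category. This is a rough sketch; `remediation_for` and its category labels are hypothetical names chosen for illustration.

```bash
# remediation_for: hypothetical helper that maps an HTTP status code
# returned by PostHog to the matching subsection of this runbook.
remediation_for() {
  case "$1" in
    401|403) echo "authentication" ;;
    429)     echo "rate-limited" ;;
    500|503) echo "posthog-error" ;;
    *)       echo "unclassified" ;;
  esac
}
```

One possible way to obtain the code is `curl -s -o /dev/null -w '%{http_code}' <url>`, passing the result to `remediation_for`. Codes not covered by this runbook fall through to `unclassified` rather than guessing at a fix.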

Communication Templates

Internal (Slack)

🔴 P1 INCIDENT: PostHog Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
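
Filling the template by hand under pressure invites omissions, so it may help to render it from arguments. The sketch below assumes a hypothetical `render_incident_message` helper; it only prints the text and leaves posting to Slack to whatever tooling is already in place.

```bash
# render_incident_message: hypothetical helper that fills the internal
# template above. Arguments, in order:
#   severity, status, impact, current action, next update, commander handle
render_incident_message() {
  printf '🔴 %s INCIDENT: PostHog Integration\n' "$1"
  printf 'Status: %s\n' "$2"
  printf 'Impact: %s\n' "$3"
  printf 'Current action: %s\n' "$4"
  printf 'Next update: %s\n' "$5"
  printf 'Incident commander: @%s\n' "$6"
}
```

A call might look like `render_incident_message P1 INVESTIGATING "Events not captured" "Checking API key" "14:30 UTC" alice`. The red marker is kept fixed here to match the P1 template; a lower severity would likely warrant a different one.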

External (Status Page)

PostHog Integration Issue

We're experiencing issues with our PostHog integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]

Post-Incident

Evidence Collection

bash

# Generate debug bundle
./scripts/posthog-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=posthog-integration --since=1h > incident-logs.txt

# Capture metrics for the last 2 hours
# query_range needs explicit start, end, and step; `date -d` assumes GNU date
curl -s -G 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=posthog_errors_total' \
  --data-urlencode "start=$(date -u -d '2 hours ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=60' > metrics.json

Postmortem Template

markdown

Incident: PostHog [Error Type]

Date: YYYY-MM-DD
Duration: X hours Y minutes
Severity: P[1-4]

Summary

[1-2 sentence description]

Timeline

HH:MM - [Event]
HH:MM - [Event]

Root Cause

[Technical explanation]

Impact

Users affected: N
Revenue impact: $X

Action Items

[Preventive measure] - Owner - Due date

Instructions

Step 1: Quick Triage

Run the triage commands to identify the issue source.

Step 2: Follow Decision Tree

Determine if the issue is PostHog-side or internal.

Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

Step 4: Communicate Status

Update internal and external stakeholders.

Output

Issue identified and categorized
Remediation applied
Stakeholders notified
Evidence collected for postmortem

Error Handling

Issue	Cause	Solution
Can't reach status page	Network issue	Use mobile or VPN
kubectl fails	Auth expired	Re-authenticate
Metrics unavailable	Prometheus down	Check backup metrics
Secret rotation fails	Permission denied	Escalate to admin

Examples

One-Line Health Check

bash

# jq -e exits non-zero on null, false, or empty input, so a failed curl also prints UNHEALTHY
curl -sf https://api.yourapp.com/health | jq -e '.services.posthog.status' || echo "UNHEALTHY"

Resources

Next Steps

For data handling, see posthog-data-handling.