firecrawl-incident-runbook

FireCrawl Incident Runbook

Overview

Rapid incident response procedures for FireCrawl-related outages.

Prerequisites

  • Access to FireCrawl dashboard and status page
  • kubectl access to production cluster
  • Prometheus/Grafana access
  • Communication channels (Slack, PagerDuty)

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | FireCrawl API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |

Quick Triage

bash
# 1. Check FireCrawl status (status.firecrawl.com)

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.firecrawl'

# 3. Check error rate (last 5 min)
# -G with --data-urlencode keeps the shell from interpreting the parentheses and brackets in the query
curl -sG localhost:9090/api/v1/query --data-urlencode 'query=rate(firecrawl_errors_total[5m])'

# 4. Recent error logs
kubectl logs -l app=firecrawl-integration --since=5m | grep -i error | tail -20

Decision Tree

FireCrawl API returning errors?
├─ YES: Is status.firecrawl.com showing incident?
│   ├─ YES → Wait for FireCrawl to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
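The tree can also be written as a small shell function, which may help when scripting the first triage pass. This is a minimal sketch: the function name `triage_decision` and its yes/no arguments are hypothetical and not part of the original runbook.

```shell
#!/bin/sh
# Hypothetical helper mirroring the decision tree. Each argument is "yes" or "no":
#   $1 = FireCrawl API returning errors?
#   $2 = status.firecrawl.com showing an incident?
#   $3 = our service healthy?
triage_decision() {
  if [ "$1" = "yes" ]; then
    if [ "$2" = "yes" ]; then
      echo "Wait for FireCrawl to resolve. Enable fallback."
    else
      echo "Our integration issue. Check credentials, config."
    fi
  elif [ "$3" = "yes" ]; then
    echo "Likely resolved or intermittent. Monitor."
  else
    echo "Our infrastructure issue. Check pods, memory, network."
  fi
}
```

For example, `triage_decision yes no yes` prints the "Our integration issue" branch.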

Immediate Actions by Error Type

401/403 - Authentication

bash
# Verify API key is set
kubectl get secret firecrawl-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated
# → Verify in FireCrawl dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic firecrawl-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/firecrawl-integration
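Decoding the secret prints the full API key to the terminal, which may end up in scrollback or a shared screen during an incident. As a hedged alternative, the sketch below masks everything except the last four characters. The helper `mask_secret` is hypothetical and not part of the original runbook.

```shell
#!/bin/sh
# Hypothetical helper: read a secret from stdin and print a masked form.
# Prints "(empty)" when nothing was read, so an unset key is still obvious.
mask_secret() {
  secret=$(cat)
  if [ -z "$secret" ]; then
    echo "(empty)"
  elif [ "${#secret}" -le 4 ]; then
    # Too short to reveal any part safely
    echo "****"
  else
    echo "****$(printf '%s' "$secret" | tail -c 4)"
  fi
}
```

It can be appended to the verification command: `kubectl get secret firecrawl-secrets -o jsonpath='{.data.api-key}' | base64 -d | mask_secret`.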

429 - Rate Limited

bash
# Check rate limit headers
curl -v https://api.firecrawl.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/firecrawl-integration RATE_LIMIT_MODE=queue

# Long-term: Contact FireCrawl for limit increase

500/503 - FireCrawl Errors

bash
# Enable graceful degradation
kubectl set env deployment/firecrawl-integration FIRECRAWL_FALLBACK=true

# Notify users of degraded service
# Update status page
# Monitor FireCrawl status for resolution
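The three error classes above can be folded into one lookup, so that a status code seen in the logs points straight at the matching actions. This is a sketch only: the function name `action_for_status` is hypothetical, and the summaries simply restate the steps listed in this section.

```shell
#!/bin/sh
# Hypothetical helper: map an HTTP status code to the runbook actions for it.
action_for_status() {
  case "$1" in
    401|403) echo "Authentication: verify API key, check rotation, update secret and restart pods" ;;
    429)     echo "Rate limited: check rate limit headers, enable request queuing" ;;
    500|503) echo "FireCrawl error: enable graceful degradation, notify users, update status page" ;;
    *)       echo "No runbook entry for HTTP $1" ;;
  esac
}
```

For example, `action_for_status 429` prints the rate-limit summary, and any code outside the three classes reports that no entry exists.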

Communication Templates

Internal (Slack)

🔴 P1 INCIDENT: FireCrawl Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
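To avoid retyping the template under pressure, it can be filled from arguments. The sketch below only renders the text and leaves posting it to Slack to whatever tooling is already in place. The function name `render_incident_update` and its argument order are assumptions, not part of the original runbook.

```shell
#!/bin/sh
# Hypothetical helper: fill in the internal Slack template.
# Args: severity, status, impact, current action, next update, incident commander
render_incident_update() {
  cat <<EOF
🔴 $1 INCIDENT: FireCrawl Integration
Status: $2
Impact: $3
Current action: $4
Next update: $5
Incident commander: @$6
EOF
}
```

For example: `render_incident_update P1 INVESTIGATING "Scrapes failing" "Checking credentials" "14:30 UTC" alice`.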

External (Status Page)

FireCrawl Integration Issue

We're experiencing issues with our FireCrawl integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]

Post-Incident

Evidence Collection

bash
# Generate debug bundle
./scripts/firecrawl-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=firecrawl-integration --since=1h > incident-logs.txt

# Capture metrics for the last 2 hours
# query_range needs explicit start, end, and step; step=60 is an assumed resolution
curl -sG localhost:9090/api/v1/query_range \
  --data-urlencode 'query=firecrawl_errors_total' \
  --data-urlencode "start=$(( $(date +%s) - 7200 ))" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60' > metrics.json

Postmortem Template

markdown
Incident: FireCrawl [Error Type]

Date: YYYY-MM-DD
Duration: X hours Y minutes
Severity: P[1-4]

Summary

[1-2 sentence description]

Timeline

  • HH:MM - [Event]
  • HH:MM - [Event]

Root Cause

[Technical explanation]

Impact

  • Users affected: N
  • Revenue impact: $X

Action Items

  • [Preventive measure] - Owner - Due date
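The Duration field of the template can be derived from the first and last timeline entries rather than estimated by hand. A minimal sketch, assuming both times are available as Unix timestamps; the function name `incident_duration` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: format an incident duration for the postmortem template.
# $1 = start, $2 = end, both Unix timestamps (seconds); partial minutes are dropped.
incident_duration() {
  total_min=$(( ($2 - $1) / 60 ))
  echo "$((total_min / 60)) hours $((total_min % 60)) minutes"
}
```

For example, `incident_duration 1700000000 1700005400` covers 5400 seconds and prints `1 hours 30 minutes`.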

Instructions

Step 1: Quick Triage

Run the triage commands to identify the issue source.

Step 2: Follow Decision Tree

Determine if the issue is FireCrawl-side or internal.

Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

Step 4: Communicate Status

Update internal and external stakeholders.

Output

  • Issue identified and categorized
  • Remediation applied
  • Stakeholders notified
  • Evidence collected for postmortem

Error Handling

| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |

Examples

One-Line Health Check

bash
# pipefail makes a curl failure fail the whole pipeline; jq -e exits non-zero when the status is null or false
(set -o pipefail; curl -sf https://api.yourapp.com/health | jq -e '.services.firecrawl.status') || echo "UNHEALTHY"

Resources

Next Steps

For data handling, see firecrawl-data-handling.