firecrawl-incident-runbook

FireCrawl Incident Runbook

FireCrawl事件响应手册

Overview

概述

Rapid incident response procedures for FireCrawl-related outages.

针对FireCrawl相关故障的快速事件响应流程。

Prerequisites

前置条件

Access to FireCrawl dashboard and status page
kubectl access to production cluster
Prometheus/Grafana access
Communication channels (Slack, PagerDuty)

拥有FireCrawl控制台和状态页面的访问权限
拥有生产集群的kubectl访问权限
拥有Prometheus/Grafana访问权限
可用的沟通渠道（Slack、PagerDuty）

Severity Levels

严重级别

Level	Definition	Response Time	Examples
P1	Complete outage	< 15 min	FireCrawl API unreachable
P2	Degraded service	< 1 hour	High latency, partial failures
P3	Minor impact	< 4 hours	Webhook delays, non-critical errors
P4	No user impact	Next business day	Monitoring gaps

级别	定义	响应时间	示例
P1	完全故障	< 15分钟	FireCrawl API无法访问
P2	服务降级	< 1小时	高延迟、部分功能故障
P3	轻微影响	< 4小时	Webhook延迟、非关键错误
P4	无用户影响	下一个工作日	监控缺口
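
The response-time targets in the table can be encoded so that paging or reporting scripts apply them consistently. The following is a minimal sketch: the function name `response_target_minutes` is hypothetical, and representing P4 as the string `next-business-day` is an assumption made for illustration.

```bash
# Hypothetical helper: map a severity level from the table above to its
# response-time target in minutes. P4 has no fixed minute budget, so it is
# returned as a label instead of a number.
response_target_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    P4) echo next-business-day ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_target_minutes P2` prints `60`.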

Quick Triage

快速分类排查

1. Check FireCrawl status

1. 检查FireCrawl状态

curl -s https://status.firecrawl.com | jq

2. Check our integration health

2. 检查集成健康状态

curl -s https://api.yourapp.com/health | jq '.services.firecrawl'

3. Check error rate (last 5 min)

3. 检查错误率（最近5分钟）

curl -sG localhost:9090/api/v1/query --data-urlencode 'query=rate(firecrawl_errors_total[5m])'

4. Recent error logs

4. 近期错误日志

kubectl logs -l app=firecrawl-integration --since=5m --tail=-1 | grep -i error | tail -20
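
During an incident it helps if one failing check does not stop the remaining ones. The four triage steps could be wrapped roughly as follows. This is a sketch, not a tested production script: `run_step` and `quick_triage` are hypothetical names, and the `jq` and `grep` filters are left out to keep each step a single command.

```bash
# Run one triage step: print a header, run the command, and report failure
# without aborting, so later steps still execute.
run_step() {
  step_label=$1
  shift
  echo "== $step_label"
  if "$@"; then
    return 0
  fi
  echo "!! step failed: $step_label" >&2
  return 1
}

# The four checks from this section, run back to back.
# Hostnames and labels are the ones used in the commands above.
quick_triage() {
  failures=0
  run_step "FireCrawl status" curl -s https://status.firecrawl.com \
    || failures=$((failures + 1))
  run_step "Integration health" curl -s https://api.yourapp.com/health \
    || failures=$((failures + 1))
  run_step "Error rate (5m)" curl -sG localhost:9090/api/v1/query \
    --data-urlencode 'query=rate(firecrawl_errors_total[5m])' \
    || failures=$((failures + 1))
  run_step "Recent pod logs" kubectl logs -l app=firecrawl-integration \
    --since=5m --tail=20 \
    || failures=$((failures + 1))
  echo "steps failed: $failures"
}
```

Calling `quick_triage` prints each step under its own header and ends with a count of the steps that failed.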

Decision Tree

决策树

FireCrawl API returning errors?
├─ YES: Is status.firecrawl.com showing incident?
│   ├─ YES → Wait for FireCrawl to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.

FireCrawl API返回错误？
├─ 是：status.firecrawl.com是否显示事件？
│   ├─ 是 → 等待FireCrawl修复。启用 fallback 机制。
│   └─ 否 → 内部集成问题。检查凭证、配置。
└─ 否：内部服务是否健康？
    ├─ 是 → 问题已解决或为间歇性故障。持续监控。
    └─ 否 → 内部基础设施问题。检查Pod、内存、网络。
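
The decision tree can also be written as a small function, which is useful when the answers come from automated checks. A minimal sketch, assuming each answer is passed as the literal string `yes` or `no`; the function name and the output labels are hypothetical.

```bash
# Encode the decision tree. Each argument is "yes" or "no".
#   $1  FireCrawl API returning errors?
#   $2  status.firecrawl.com showing an incident?  (used only when $1 = yes)
#   $3  our service healthy?                       (used only when $1 = no)
triage_decision() {
  api_errors=$1
  status_incident=$2
  service_healthy=$3
  if [ "$api_errors" = yes ]; then
    if [ "$status_incident" = yes ]; then
      echo "wait-for-firecrawl: enable fallback"
    else
      echo "integration-issue: check credentials and config"
    fi
  elif [ "$service_healthy" = yes ]; then
    echo "monitor: likely resolved or intermittent"
  else
    echo "infrastructure-issue: check pods, memory, network"
  fi
}
```

For example, `triage_decision yes no yes` prints `integration-issue: check credentials and config`.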

Immediate Actions by Error Type

按错误类型执行即时操作

401/403 - Authentication

401/403 - 认证错误

Verify API key is set

验证API密钥是否已配置

kubectl get secret firecrawl-secrets -o jsonpath='{.data.api-key}' | base64 -d

Check if key was rotated

检查密钥是否已轮换

→ Verify in FireCrawl dashboard

→ 在FireCrawl控制台中验证

Remediation: Update secret and restart pods

修复措施：更新密钥并重启Pod

kubectl create secret generic firecrawl-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/firecrawl-integration
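
Decoding the secret prints the full API key to the terminal, which can end up in scrollback or a shared screen. One option is to show only a short suffix and compare that against the dashboard. The helper below is a sketch; `mask_secret` is a hypothetical name and the 4-character suffix is an arbitrary choice.

```bash
# Print a masked form of a secret: a fixed "****" prefix plus the last
# 4 characters. Secrets of 4 characters or fewer are fully masked.
mask_secret() {
  secret=$1
  if [ "${#secret}" -le 4 ]; then
    echo '****'
    return 0
  fi
  # Keep only the last 4 bytes (assumes an ASCII key).
  suffix=$(printf '%s' "$secret" | tail -c 4)
  printf '****%s\n' "$suffix"
}
```

Usage with the secret from this section: `mask_secret "$(kubectl get secret firecrawl-secrets -o jsonpath='{.data.api-key}' | base64 -d)"`.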

429 - Rate Limited

429 - 速率限制

Check rate limit headers

检查速率限制响应头

curl -v https://api.firecrawl.com 2>&1 | grep -i rate

Enable request queuing

启用请求排队机制

kubectl set env deployment/firecrawl-integration RATE_LIMIT_MODE=queue

Long-term: Contact FireCrawl for limit increase

长期方案：联系FireCrawl提升限制
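
For ad hoc commands run by hand while rate limits are in effect, retrying with exponential backoff avoids adding to the load. The sketch below is a generic shell pattern and not something FireCrawl documents; `retry_with_backoff` is a hypothetical helper, and if the API returns a `Retry-After` header, honoring that value is likely the better choice.

```bash
# Retry a command with exponential backoff.
#   $1  maximum number of attempts
#   $2  initial delay in whole seconds (doubled after each failure)
#   remaining arguments: the command to run
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 2 curl -sf https://api.yourapp.com/health` waits 2, 4, 8, and 16 seconds between its five attempts.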

500/503 - FireCrawl Errors

500/503 - FireCrawl内部错误

Enable graceful degradation

启用优雅降级机制

kubectl set env deployment/firecrawl-integration FIRECRAWL_FALLBACK=true

Notify users of degraded service

通知用户服务降级

Update status page

更新状态页面

Monitor FireCrawl status for resolution

监控FireCrawl状态直至问题解决
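
The three error classes in this section can be summarized as a lookup from HTTP status code to the actions listed for it. A minimal sketch; the function name and the wording of the returned labels are hypothetical.

```bash
# Map an HTTP status code returned by FireCrawl to the actions this
# runbook lists for it. Unlisted codes fall back to the decision tree.
action_for_status() {
  case "$1" in
    401|403) echo "auth: verify API key, update secret, restart pods" ;;
    429)     echo "rate-limit: enable request queuing, request a limit increase" ;;
    500|503) echo "upstream: enable fallback, notify users, monitor status" ;;
    *)       echo "unclassified: follow the decision tree" ;;
  esac
}
```

For example, `action_for_status 429` prints the rate-limit actions.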

Communication Templates

沟通模板

Internal (Slack)

内部（Slack）

🔴 P1 INCIDENT: FireCrawl Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]

🔴 P1事件：FireCrawl集成
状态：调查中
影响：[描述用户影响]
当前操作：[正在执行的动作]
下次更新时间：[具体时间]
事件负责人：@[姓名]

External (Status Page)

外部（状态页面）

FireCrawl Integration Issue

We're experiencing issues with our FireCrawl integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]

FireCrawl集成问题

我们的FireCrawl集成目前出现问题。
部分用户可能会遇到[具体影响]。

我们正在积极调查，将及时更新进展。

最后更新时间：[时间戳]

Post-Incident

事后处理

Evidence Collection

证据收集

Generate debug bundle

生成调试包

./scripts/firecrawl-debug-bundle.sh

Export relevant logs

导出相关日志

kubectl logs -l app=firecrawl-integration --since=1h --tail=-1 > incident-logs.txt

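Capture metrics (last 2 hours, 60s step; requires GNU date)

采集指标数据（最近2小时，步长60秒；需要GNU date）

curl -sG localhost:9090/api/v1/query_range \
  --data-urlencode 'query=firecrawl_errors_total' \
  --data-urlencode "start=$(date -u -d '2 hours ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=60s' > metrics.json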
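
Keeping the collected files together makes them easier to attach to the postmortem. The sketch below creates a timestamped directory and archives it; `new_evidence_dir` and `archive_evidence` are hypothetical helpers, and the `incident-<UTC timestamp>` naming is an assumption.

```bash
# Create a timestamped evidence directory and print its path.
#   $1  base directory (defaults to the current directory)
new_evidence_dir() {
  base=${1:-.}
  dir="$base/incident-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$dir" || return 1
  echo "$dir"
}

# Bundle an evidence directory into a gzip-compressed tarball next to it
# and print the archive path.
archive_evidence() {
  dir=$1
  tar -czf "$dir.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")" \
    && echo "$dir.tar.gz"
}
```

A typical sequence is `d=$(new_evidence_dir /tmp)`, then writing `incident-logs.txt` and `metrics.json` into `"$d"`, then `archive_evidence "$d"`.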

Postmortem Template

事后复盘模板

Incident: FireCrawl [Error Type]

事件：FireCrawl [错误类型]

Date: YYYY-MM-DD
Duration: X hours Y minutes
Severity: P[1-4]

日期：YYYY-MM-DD
持续时间：X小时Y分钟
严重级别：P[1-4]

Summary

摘要

[1-2 sentence description]

[1-2句话描述事件]

Timeline

时间线

HH:MM - [Event]
HH:MM - [Event]

HH:MM - [事件]
HH:MM - [事件]

Root Cause

根本原因

[Technical explanation]

[技术层面的解释]

Impact

影响

Users affected: N
Revenue impact: $X

受影响用户数：N
营收影响：$X

Action Items

行动项

[Preventive measure] - Owner - Due date

[预防措施] - 负责人 - 截止日期
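
The "Duration" field of the template can be derived from the first and last timeline entries. A minimal sketch, assuming both times are strictly in two-digit `HH:MM` form and the incident crosses midnight at most once; `incident_duration` is a hypothetical helper.

```bash
# Compute the duration between two HH:MM times and print it in the
# template's "X hours Y minutes" form.
#   $1  start time, $2  end time (both two-digit HH:MM)
incident_duration() {
  start_h=${1%%:*}; start_m=${1##*:}
  end_h=${2%%:*};   end_m=${2##*:}
  # Strip one leading zero so "08" and "09" are not parsed as octal.
  start=$(( ${start_h#0} * 60 + ${start_m#0} ))
  end=$(( ${end_h#0} * 60 + ${end_m#0} ))
  total=$(( end - start ))
  # A negative difference means the incident ran past midnight.
  if [ "$total" -lt 0 ]; then
    total=$(( total + 1440 ))
  fi
  echo "$(( total / 60 )) hours $(( total % 60 )) minutes"
}
```

For example, `incident_duration 14:05 16:20` prints `2 hours 15 minutes`.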

Instructions

操作步骤

Step 1: Quick Triage

步骤1：快速分类排查

Run the triage commands to identify the issue source.

运行分类排查命令以确定问题来源。

Step 2: Follow Decision Tree

步骤2：遵循决策树

Determine if the issue is FireCrawl-side or internal.

判断问题来自FireCrawl侧还是内部。

Step 3: Execute Immediate Actions

步骤3：执行即时操作

Apply the appropriate remediation for the error type.

针对错误类型应用对应的修复措施。

Step 4: Communicate Status

步骤4：同步状态

Update internal and external stakeholders.

向内部和外部相关方更新事件状态。

Output

输出结果

Issue identified and categorized
Remediation applied
Stakeholders notified
Evidence collected for postmortem

问题已识别并分类
已应用修复措施
已通知相关方
已收集事后复盘所需证据

Error Handling

问题处理

Issue	Cause	Solution
Can't reach status page	Network issue	Use mobile or VPN
kubectl fails	Auth expired	Re-authenticate
Metrics unavailable	Prometheus down	Check backup metrics
Secret rotation fails	Permission denied	Escalate to admin

问题	原因	解决方案
无法访问状态页面	网络问题	使用移动网络或VPN
kubectl执行失败	认证过期	重新认证
指标数据不可用	Prometheus故障	检查备用指标
密钥轮换失败	权限不足	上报给管理员处理

Examples

示例

One-Line Health Check

单行健康检查命令

curl -sf https://api.yourapp.com/health | jq -e '.services.firecrawl.status' || echo "UNHEALTHY"

Resources

参考资源

Next Steps

后续步骤

For data handling, see firecrawl-data-handling.

关于数据处理，请参考 firecrawl-data-handling 文档。