juicebox-incident-runbook
Juicebox Incident Runbook
Overview
Standardized incident response procedures for Juicebox integration issues.
Incident Severity Levels
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P1 | Critical | < 15 min | Complete outage, data loss |
| P2 | High | < 1 hour | Major feature broken, degraded performance |
| P3 | Medium | < 4 hours | Minor feature issue, workaround exists |
| P4 | Low | < 24 hours | Cosmetic, non-blocking |
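The response-time targets in the table can be encoded for paging or alerting scripts. The following is a minimal sketch; `response_minutes` is a hypothetical helper name, not part of any Juicebox tooling, and the values are taken directly from the table above.

```bash
#!/bin/bash
# Hypothetical helper: map a severity label to its response-time target in minutes.
# Values mirror the severity table (P1 < 15 min, P2 < 1 hour, P3 < 4 hours, P4 < 24 hours).
response_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    P4) echo 1440 ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_minutes P2` prints the target for a high-severity incident, and an unrecognized label returns a non-zero exit code so a calling script can fail loudly.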
Quick Diagnostics
Step 1: Immediate Assessment
```bash
#!/bin/bash
# quick-diag.sh - Run immediately when an incident is detected

echo "=== Juicebox Quick Diagnostics ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Check Juicebox status page
echo ""
echo "=== Juicebox Status ==="
curl -s https://status.juicebox.ai/api/status | jq '.status'

# Check our API health
echo ""
echo "=== Our API Health ==="
curl -s http://localhost:8080/health/ready | jq '.'

# Check recent error logs
echo ""
echo "=== Recent Errors (last 5 min) ==="
kubectl logs -l app=juicebox-integration --since=5m | grep -i error | tail -20

# Check metrics
# -G with --data-urlencode sends the query as an encoded GET parameter,
# so curl does not treat the braces and brackets as URL glob patterns.
echo ""
echo "=== Error Rate ==="
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(juicebox_requests_total{status="error"}[5m])' \
  | jq '.data.result[0].value[1]'
```
Step 2: Identify Root Cause
Incident Triage Decision Tree
1. Is the Juicebox status page showing issues?
   - YES → External outage, skip to "External Outage Response"
   - NO → Continue
2. Are we getting authentication errors (401)?
   - YES → Check API key validity, skip to "Auth Issues Response"
   - NO → Continue
3. Are we getting rate limited (429)?
   - YES → Skip to "Rate Limit Response"
   - NO → Continue
4. Are requests timing out?
   - YES → Skip to "Timeout Response"
   - NO → Continue
5. Are we getting unexpected errors?
   - YES → Skip to "Application Error Response"
   - NO → Gather more data
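The decision tree can be expressed as a small lookup so that a diagnostics script prints which procedure to follow. This is only a sketch: `triage` and its two arguments are hypothetical, and it assumes the caller has already determined the status page state and the dominant symptom.

```bash
#!/bin/bash
# Hypothetical helper: pick the response procedure from observed symptoms.
# $1: Juicebox status page state ("ok", or anything else for a reported issue)
# $2: HTTP status code seen on failing requests, "timeout", or empty if nothing is failing
triage() {
  local status_page="$1" symptom="$2"
  # An external outage takes priority over every other symptom.
  if [ "$status_page" != "ok" ]; then
    echo "External Outage Response"
    return
  fi
  case "$symptom" in
    401)     echo "Auth Issues Response" ;;
    429)     echo "Rate Limit Response" ;;
    timeout) echo "Timeout Response" ;;
    "")      echo "Gather more data" ;;
    *)       echo "Application Error Response" ;;
  esac
}
```

The checks run in the same order as the tree, so a status page issue wins even when 401 or 429 responses are also present.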
Response Procedures
External Outage Response
When Juicebox is Down
1. Confirm Outage
   - Check https://status.juicebox.ai
   - Verify with a curl test against the API
2. Enable Fallback Mode
   ```bash
   kubectl set env deployment/juicebox-integration JUICEBOX_FALLBACK=true
   ```
3. Notify Stakeholders
   - Post to #incidents channel
   - Update status page if customer-facing
4. Monitor Recovery
   - Set up alert for Juicebox status change
   - Prepare to disable fallback mode
5. Post-Incident
   - Disable fallback when Juicebox recovers
   - Document timeline and impact
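The "Monitor Recovery" step amounts to polling a health check until it passes. Below is a minimal sketch of such a loop; `wait_for_recovery` is a hypothetical helper, and the check command passed to it is the caller's choice, since the runbook does not define what a recovered response from the status API looks like.

```bash
#!/bin/bash
# Hypothetical helper: poll a check command until it succeeds or attempts run out.
# $1: maximum attempts, $2: seconds between attempts, remaining args: the check command.
wait_for_recovery() {
  local max_attempts="$1" interval="$2"
  shift 2
  local attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if "$@"; then
      echo "recovered after $attempt attempt(s)"
      return 0
    fi
    sleep "$interval"
  done
  echo "still failing after $max_attempts attempt(s)" >&2
  return 1
}

# Example (assumption: a successful HTTP response from the health endpoint means recovery):
#   wait_for_recovery 60 30 curl -sf -o /dev/null https://api.juicebox.ai/v1/health
```

Only once the loop returns success should fallback mode be disabled, and it is worth confirming recovery by hand before doing so.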
Auth Issues Response
When Authentication Fails
1. Verify API Key
   ```bash
   # Mask key for logging
   echo "Key prefix: ${JUICEBOX_API_KEY:0:10}..."
   # Test auth
   curl -H "Authorization: Bearer $JUICEBOX_API_KEY" \
     https://api.juicebox.ai/v1/auth/me
   ```
2. Check Key Status in Dashboard
   - Log into https://app.juicebox.ai
   - Verify key is active and not revoked
3. Rotate Key if Compromised
   - Generate new key in dashboard
   - Update secret manager
   - Restart pods
   ```bash
   kubectl rollout restart deployment/juicebox-integration
   ```
4. Verify Recovery
   - Check health endpoint
   - Monitor error rate
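The key-masking step is easy to get wrong under pressure, so it can help to wrap it in a function that never prints the full secret. A minimal sketch follows; `mask_key` is a hypothetical name, and the 10-character prefix matches the length used in the procedure above.

```bash
#!/bin/bash
# Hypothetical helper: print only the first 10 characters of a secret for logging.
# An empty or unset key is reported explicitly instead of printing a bare "...".
mask_key() {
  local key="$1"
  if [ -z "$key" ]; then
    echo "(unset)"
  else
    echo "${key:0:10}..."
  fi
}
```

Calling `mask_key "$JUICEBOX_API_KEY"` also makes the "key is not set in this shell" case visible, which is a common cause of 401 responses during an incident.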
Rate Limit Response
When Rate Limited
1. Check Current Usage
   ```bash
   curl -H "Authorization: Bearer $JUICEBOX_API_KEY" \
     https://api.juicebox.ai/v1/usage
   ```
2. Immediate Mitigation
   - Enable aggressive caching
   - Reduce request rate
   ```bash
   kubectl set env deployment/juicebox-integration JUICEBOX_RATE_LIMIT=10
   ```
3. If Quota Exhausted
   - Contact Juicebox support for temporary increase
   - Implement request queuing
4. Long-term Fix
   - Review usage patterns
   - Implement better caching
   - Consider plan upgrade
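One common way to reduce request rate while a 429 condition lasts is to space out retries with exponential backoff. The runbook does not prescribe this technique, so the following is only an illustrative sketch; `backoff_delay` and its default base and cap values are assumptions.

```bash
#!/bin/bash
# Hypothetical helper: exponential backoff delay in seconds for a retry attempt.
# $1: attempt number starting at 1; $2: base delay (default 1); $3: cap (default 60).
backoff_delay() {
  local attempt="$1" base="${2:-1}" cap="${3:-60}"
  # Avoid integer overflow from very large shifts; such attempts are capped anyway.
  if (( attempt > 20 )); then
    echo "$cap"
    return
  fi
  local delay=$(( base * (1 << (attempt - 1)) ))
  if (( delay > cap )); then
    delay="$cap"
  fi
  echo "$delay"
}
```

A retry loop would call `sleep "$(backoff_delay "$attempt")"` between attempts; adding random jitter is a frequent refinement that this sketch leaves out.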
Timeout Response
When Requests Time Out
1. Check Network
   ```bash
   # DNS resolution
   nslookup api.juicebox.ai
   # Connectivity
   curl -v --connect-timeout 5 https://api.juicebox.ai/v1/health
   ```
2. Check Load
   - Review query complexity
   - Check for unusually large requests
3. Increase Timeout
   ```bash
   kubectl set env deployment/juicebox-integration JUICEBOX_TIMEOUT=60000
   ```
4. Implement Circuit Breaker
   - Enable circuit breaker if timeouts repeat
   - Serve cached/fallback data
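To make the circuit breaker step concrete, here is a minimal sketch of the pattern in shell. Every name in it (`cb_call`, `cb_reset`, `CB_THRESHOLD`) is hypothetical, the threshold of three consecutive failures is an arbitrary assumption, and it omits the half-open state that a production breaker would normally have.

```bash
#!/bin/bash
# Minimal circuit breaker sketch; state lives in shell variables of the current process.
CB_THRESHOLD=3     # consecutive failures before the circuit opens (assumed value)
CB_FAILURES=0
CB_STATE="closed"

# Run a command through the breaker.
# Returns 0 on success, 1 on failure, 2 when the circuit is open and the call was skipped.
cb_call() {
  if [ "$CB_STATE" = "open" ]; then
    return 2
  fi
  if "$@"; then
    CB_FAILURES=0
    return 0
  fi
  CB_FAILURES=$((CB_FAILURES + 1))
  if (( CB_FAILURES >= CB_THRESHOLD )); then
    CB_STATE="open"
  fi
  return 1
}

# Close the circuit again, for example once the upstream is confirmed healthy.
cb_reset() {
  CB_FAILURES=0
  CB_STATE="closed"
}
```

When `cb_call` returns 2, the caller would serve cached or fallback data instead of contacting the upstream. Note that `cb_call` must run in the current shell, not in a subshell, for the state to persist between calls.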
Incident Communication Template
Incident Report Template
Incident ID: INC-YYYY-MM-DD-XXX
Status: Investigating | Identified | Monitoring | Resolved
Severity: P1 | P2 | P3 | P4
Start Time: YYYY-MM-DD HH:MM UTC
End Time: (when resolved)
Summary
[Brief description of the incident]
Impact
- Users affected: [number/percentage]
- Features affected: [list]
- Duration: [time]
Timeline
- HH:MM - Incident detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Incident resolved
Root Cause
[Description of what caused the incident]
Resolution
[What was done to fix it]
Action Items
- Action 1 (Owner, Due Date)
- Action 2 (Owner, Due Date)
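Incident IDs in the `INC-YYYY-MM-DD-XXX` format can be generated rather than typed by hand. The sketch below assumes that `XXX` is a zero-padded sequence number for the day, which the template does not state explicitly; `incident_id` is a hypothetical helper.

```bash
#!/bin/bash
# Hypothetical helper: build an incident ID in the INC-YYYY-MM-DD-XXX format.
# $1: sequence number for the day; $2: date as YYYY-MM-DD (defaults to today in UTC).
incident_id() {
  local seq="$1" day="${2:-$(date -u +%Y-%m-%d)}"
  # 10# forces base 10 so inputs such as "08" are not parsed as invalid octal.
  printf 'INC-%s-%03d\n' "$day" "$((10#$seq))"
}
```

Using UTC for the default date keeps the ID consistent with the UTC start and end times in the template.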
On-Call Checklist
On-Call Handoff Checklist
Before Shift
- Access to all monitoring dashboards
- VPN/access to production systems
- Runbook bookmarked
- Escalation contacts available
During Shift
- Check dashboards every 30 min
- Respond to alerts within SLA
- Document all incidents
- Escalate P1/P2 immediately
End of Shift
- Hand off open incidents
- Update incident log
- Brief incoming on-call
Output
- Diagnostic scripts
- Response procedures
- Communication templates
- On-call checklists
Resources
Next Steps
After an incident, see juicebox-data-handling for data management.