gamma-incident-runbook
Gamma Incident Runbook
Overview
Systematic procedures for responding to and resolving Gamma integration incidents.
Prerequisites
- Access to monitoring dashboards
- Access to application logs
- On-call responsibilities defined
- Communication channels established
Incident Severity Levels
| Level | Description | Response Time | Escalation |
|---|---|---|---|
| P1 | Complete outage, no presentations | < 15 min | Immediate |
| P2 | Degraded, slow or partial failures | < 30 min | 1 hour |
| P3 | Minor issues, workaround available | < 2 hours | 4 hours |
| P4 | Cosmetic or non-urgent | < 24 hours | None |
Quick Diagnostics
Step 1: Check Gamma Status
```bash
# Check Gamma status page
curl -s https://status.gamma.app/api/v2/status.json | jq '.status'

# Check our integration health
curl -s https://your-app.com/health/gamma | jq '.'

# Quick connectivity test
curl -w "\nTime: %{time_total}s\n" \
  -H "Authorization: Bearer $GAMMA_API_KEY" \
  https://api.gamma.app/v1/ping
```
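The HTTP status returned by the connectivity test usually points at one of the scenarios later in this runbook. The following is a minimal sketch of that routing, assuming a hypothetical helper named `scenario_for_status`; the mapping mirrors the scenario titles in this runbook and is not an official Gamma error taxonomy.

```bash
# Hypothetical helper: map an HTTP status code to the matching runbook scenario.
scenario_for_status() {
  case "$1" in
    2??)     echo "healthy" ;;
    401|403) echo "scenario-4-authentication" ;;
    429)     echo "scenario-2-rate-limit" ;;
    5??)     echo "scenario-1-5xx" ;;
    *)       echo "unclassified" ;;
  esac
}

# Example (requires network access); %{http_code} is curl's write-out
# variable for the response status:
# code=$(curl -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Bearer $GAMMA_API_KEY" https://api.gamma.app/v1/ping)
# scenario_for_status "$code"
```

Anything that lands in `unclassified` (including curl's `000` when no response arrives) still needs a manual look; slow but successful responses are covered by the latency scenario rather than by status code.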
Step 2: Review Key Metrics
```bash
# -G with --data-urlencode keeps curl from treating the {} and [] in the
# PromQL expression as URL globbing patterns, and encodes the query safely.

# Check error rate (Prometheus)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(gamma_requests_total{status=~"5.."}[5m])' | jq '.data.result'

# Check latency P95
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95,rate(gamma_request_duration_seconds_bucket[5m]))' | jq '.data.result'

# Check rate limit
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=gamma_rate_limit_remaining' | jq '.data.result'
```
Step 3: Review Recent Logs
```bash
# Last 100 error logs
grep -i "gamma.*error" /var/log/app/gamma-*.log | tail -100

# Rate limit hits
grep "429" /var/log/app/gamma-*.log | wc -l

# Timeout errors
grep -i "timeout" /var/log/app/gamma-*.log | tail -50
```
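During triage it can help to see the three log checks as one line of counts per file. A minimal sketch, assuming a hypothetical helper named `summarize_gamma_log` and the same search patterns as the commands above; the log path in the usage comment is only an example.

```bash
# Hypothetical helper: summarize Gamma-related failures in a single log file.
# Usage: summarize_gamma_log /var/log/app/gamma-app.log   (example path)
summarize_gamma_log() {
  local file="$1"
  local errors rate_limits timeouts
  # grep -c prints 0 but exits non-zero when nothing matches,
  # so "|| true" keeps the helper usable under "set -e".
  errors=$(grep -ci "gamma.*error" "$file" || true)
  rate_limits=$(grep -c "429" "$file" || true)
  timeouts=$(grep -ci "timeout" "$file" || true)
  printf 'errors=%s rate_limited=%s timeouts=%s\n' \
    "$errors" "$rate_limits" "$timeouts"
}
```

The counts are per matching line, so a single line that mentions both an error and a timeout is counted in both columns.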
Incident Response Procedures
Scenario 1: API Returning 5xx Errors
Symptoms:
- High error rate in monitoring
- Users reporting failed presentations
- 500/502/503 responses from Gamma
Actions:
1. Verify Gamma status: https://status.gamma.app
2. If Gamma outage confirmed:
   - Enable degraded mode / show maintenance message
   - Monitor status page for updates
   - No further action needed on our side
3. If Gamma is operational:
   ```bash
   # Check our request patterns
   grep "5[0-9][0-9]" /var/log/app/gamma-*.log | \
     awk '{print $1}' | sort | uniq -c | sort -rn

   # Look for malformed requests
   grep -B5 "500" /var/log/app/gamma-*.log | grep "request"
   ```
4. Roll back recent deployments if the issue correlates with a release
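The branch between "Gamma outage" and "Gamma is operational" can be sketched as a small decision helper. This assumes the status endpoint follows the Atlassian Statuspage v2 format, where `.status.indicator` takes values such as `none`, `minor`, `major`, or `critical`; that format, and the helper name `triage_5xx`, are assumptions rather than something this runbook confirms.

```bash
# Hypothetical helper: pick the Scenario 1 branch from a status indicator.
# Assumption: Statuspage-style indicator values (none, minor, major, critical).
triage_5xx() {
  case "$1" in
    none)                 echo "investigate-our-side" ;;
    minor|major|critical) echo "enable-degraded-mode" ;;
    *)                    echo "check-status-page-manually" ;;
  esac
}

# Example (requires network access and jq):
# indicator=$(curl -s https://status.gamma.app/api/v2/status.json | jq -r '.status.indicator')
# triage_5xx "$indicator"
```

Unknown or empty values fall through to a manual check, so a changed response format fails safe instead of silently reporting "operational".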
Scenario 2: Rate Limit Exceeded (429)
Symptoms:
- 429 responses in logs
- Rate limit metrics at zero
- Slow or queued requests
Actions:
1. Immediate mitigation:
   ```bash
   # Enable request throttling
   curl -X POST http://localhost:8080/admin/throttle \
     -H "Content-Type: application/json" \
     -d '{"gamma": {"rps": 10}}'
   ```
2. Check for runaway processes:
   ```bash
   # Find high-volume clients
   grep "gamma" /var/log/app/*.log | \
     awk '{print $5}' | sort | uniq -c | sort -rn | head -20
   ```
3. Enable circuit breaker:
   ```bash
   curl -X POST http://localhost:8080/admin/circuit-breaker \
     -H "Content-Type: application/json" \
     -d '{"service": "gamma", "state": "open"}'
   ```
4. Long-term: Review rate limit tier with Gamma
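Ad hoc scripts that call Gamma during an incident should back off rather than add to the 429 pressure. A minimal sketch of retry with exponential backoff; the helper name `retry_with_backoff` and its arguments are hypothetical and not part of any existing tooling.

```bash
# Hypothetical helper: run a command, doubling the wait after each failure.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example: curl --fail turns HTTP errors such as 429 into a non-zero exit code.
# retry_with_backoff 5 1 curl --fail -s -o /dev/null \
#   -H "Authorization: Bearer $GAMMA_API_KEY" https://api.gamma.app/v1/ping
```

With five attempts and an initial delay of one second, the waits are 1, 2, 4, and 8 seconds before the helper gives up.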
Scenario 3: High Latency
Symptoms:
- Slow presentation creation
- Timeouts in logs
- P95 latency > 10s
Actions:
1. Check Gamma latency vs our latency:
   ```bash
   # Direct Gamma latency
   for i in {1..5}; do
     curl -w "%{time_total}\n" -o /dev/null -s \
       -H "Authorization: Bearer $GAMMA_API_KEY" \
       https://api.gamma.app/v1/ping
   done
   ```
2. If Gamma is slow:
   - Increase timeouts temporarily
   - Enable async mode for non-critical operations
   - Queue heavy operations
3. If our infrastructure is slow:
   - Check CPU/memory on app servers
   - Review connection pool settings
   - Check network connectivity
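Five raw timings are easier to compare against the application's own latency once they are reduced to a mean and a maximum. A minimal sketch, assuming a hypothetical helper named `latency_stats` that reads one sample in seconds per line, which is the format the latency loop above prints.

```bash
# Hypothetical helper: read latency samples (seconds, one per line) on stdin
# and print the sample count, mean, and maximum.
latency_stats() {
  awk '
    { sum += $1; if (NR == 1 || $1 > max) max = $1 }
    END {
      if (NR == 0) { print "no samples"; exit 1 }
      printf "n=%d mean=%.3f max=%.3f\n", NR, sum / NR, max
    }'
}

# Example: pipe the output of the latency loop into the helper.
# for i in 1 2 3 4 5; do
#   curl -w "%{time_total}\n" -o /dev/null -s \
#     -H "Authorization: Bearer $GAMMA_API_KEY" https://api.gamma.app/v1/ping
# done | latency_stats
```

If the direct maximum stays well under the 10s symptom threshold while the application's P95 exceeds it, the slowdown is more likely on our side.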
Scenario 4: Authentication Failures (401/403)
Symptoms:
- All requests failing with 401
- "Invalid API key" errors
- Sudden authentication failures
Actions:
1. Verify API key:
   ```bash
   # Test key directly
   curl -H "Authorization: Bearer $GAMMA_API_KEY" \
     https://api.gamma.app/v1/ping

   # Check key format (prints only the first 20 characters)
   echo "$GAMMA_API_KEY" | head -c 20
   ```
2. If key is invalid:
   - Check if key was rotated
   - Deploy backup key: GAMMA_API_KEY_SECONDARY
   - Generate new key in Gamma dashboard
3. Notify team and update secrets
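When the key check is run in a shared terminal or pasted into an incident channel, it may be preferable to reveal less of the secret. A minimal sketch, assuming a hypothetical helper named `describe_key` that prints only a short prefix and the key length; the four-character prefix is an arbitrary choice, not a Gamma convention.

```bash
# Hypothetical helper: describe an API key without revealing most of it.
# Prints the first 4 characters and the total length.
describe_key() {
  local key="$1"
  if [ -z "$key" ]; then
    echo "key is empty or unset" >&2
    return 1
  fi
  printf 'prefix=%.4s length=%d\n' "$key" "${#key}"
}

# Example: describe_key "$GAMMA_API_KEY"
```

An empty result or an unexpected length is often enough to spot a missing environment variable or a truncated secret before testing the key against the API.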
Communication Templates
Internal Notification
INCIDENT: Gamma Integration Issue
Severity: P[X]
Status: Investigating / Identified / Mitigating / Resolved
Impact: [Description of user impact]
Start Time: [ISO timestamp]
Summary: [Brief description]
Current Actions:
- [Action 1]
- [Action 2]
Next Update: [Time]
User-Facing Message
We're currently experiencing issues with presentation generation.
Our team is actively working to resolve this.
Workaround: [If available]
Status updates: [Link to status page]
ETA: [If known]
Post-Incident Checklist
- Incident timeline documented
- Root cause identified
- User impact quantified
- Fix verified in production
- Monitoring gaps identified
- Preventive measures documented
- Post-mortem scheduled (for P1/P2)
Resources
- Gamma Status
- Gamma Support
- Internal Runbook Wiki
- On-Call Schedule
Next Steps
Proceed to gamma-data-handling for data management.