gamma-incident-runbook

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Gamma Incident Runbook

Gamma事件响应手册

Overview

概述

Systematic procedures for responding to and resolving Gamma integration incidents.
针对Gamma集成事件的响应与解决系统性流程。

Prerequisites

前置条件

  • Access to monitoring dashboards
  • Access to application logs
  • On-call responsibilities defined
  • Communication channels established
  • 有权限访问监控仪表盘
  • 有权限访问应用日志
  • 已定义值班职责
  • 已建立沟通渠道

Incident Severity Levels

事件严重级别

LevelDescriptionResponse TimeEscalation
P1Complete outage, no presentations< 15 minImmediate
P2Degraded, slow or partial failures< 30 min1 hour
P3Minor issues, workaround available< 2 hours4 hours
P4Cosmetic or non-urgent< 24 hoursNone
级别描述响应时间升级流程
P1完全服务中断,无法展示内容< 15分钟立即升级
P2服务降级,出现缓慢或部分功能故障< 30分钟1小时内升级
P3轻微问题,存在可用的临时解决方案< 2小时4小时内升级
P4界面显示问题或非紧急事项< 24小时无需升级

Quick Diagnostics

快速诊断

Step 1: Check Gamma Status

步骤1:检查Gamma状态

bash
undefined
bash
undefined

Check Gamma status page

Check Gamma status page

Check our integration health

Check our integration health

Quick connectivity test

Quick connectivity test

curl -w "\nTime: %{time_total}s\n"
-H "Authorization: Bearer $GAMMA_API_KEY"
https://api.gamma.app/v1/ping
undefined
curl -w "\nTime: %{time_total}s\n"
-H "Authorization: Bearer $GAMMA_API_KEY"
https://api.gamma.app/v1/ping
undefined

Step 2: Review Key Metrics

步骤2:查看关键指标

bash
undefined
bash
undefined

Check error rate (Prometheus)

Check error rate (Prometheus)

Check latency P95

Check latency P95

Check rate limit

Check rate limit

Step 3: Review Recent Logs

步骤3:查看近期日志

bash
undefined
bash
undefined

Last 100 error logs

Last 100 error logs

grep -i "gamma.error" /var/log/app/gamma-.log | tail -100
grep -i "gamma.error" /var/log/app/gamma-.log | tail -100

Rate limit hits

Rate limit hits

grep "429" /var/log/app/gamma-*.log | wc -l
grep "429" /var/log/app/gamma-*.log | wc -l

Timeout errors

Timeout errors

grep -i "timeout" /var/log/app/gamma-*.log | tail -50
undefined
grep -i "timeout" /var/log/app/gamma-*.log | tail -50
undefined

Incident Response Procedures

事件响应流程

Scenario 1: API Returning 5xx Errors

场景1:API返回5xx错误

Symptoms:
  • High error rate in monitoring
  • Users reporting failed presentations
  • 500/502/503 responses from Gamma
Actions:
  1. Verify Gamma status: https://status.gamma.app
  2. If Gamma outage confirmed:
    • Enable degraded mode / show maintenance message
    • Monitor status page for updates
    • No action needed on our side
  3. If Gamma is operational:
    bash
    # Check our request patterns
    grep "5[0-9][0-9]" /var/log/app/gamma-*.log | \
      awk '{print $1}' | sort | uniq -c | sort -rn
    
    # Look for malformed requests
    grep -B5 "500" /var/log/app/gamma-*.log | grep "request"
  4. Rollback recent deployments if issue correlates
症状:
  • 监控中显示错误率较高
  • 用户反馈内容展示失败
  • Gamma返回500/502/503响应
行动:
  1. 验证Gamma状态:https://status.gamma.app
  2. 若确认Gamma服务中断:
    • 启用降级模式 / 显示维护提示
    • 监控状态页面获取更新
    • 我方无需额外操作
  3. 若Gamma服务正常:
    bash
    # Check our request patterns
    grep "5[0-9][0-9]" /var/log/app/gamma-*.log | \
      awk '{print $1}' | sort | uniq -c | sort -rn
    
    # Look for malformed requests
    grep -B5 "500" /var/log/app/gamma-*.log | grep "request"
  4. 若问题与近期部署相关,回滚最近的部署

Scenario 2: Rate Limit Exceeded (429)

场景2:超出速率限制(429)

Symptoms:
  • 429 responses in logs
  • Rate limit metrics at zero
  • Slow or queued requests
Actions:
  1. Immediate mitigation:
    bash
    # Enable request throttling
    curl -X POST http://localhost:8080/admin/throttle \
      -d '{"gamma": {"rps": 10}}'
  2. Check for runaway processes:
    bash
    # Find high-volume clients
    grep "gamma" /var/log/app/*.log | \
      awk '{print $5}' | sort | uniq -c | sort -rn | head -20
  3. Enable circuit breaker:
    bash
    curl -X POST http://localhost:8080/admin/circuit-breaker \
      -d '{"service": "gamma", "state": "open"}'
  4. Long-term: Review rate limit tier with Gamma
症状:
  • 日志中出现429响应
  • 速率限制指标为0
  • 请求缓慢或处于排队状态
行动:
  1. 立即缓解:
    bash
    # Enable request throttling
    curl -X POST http://localhost:8080/admin/throttle \
      -d '{"gamma": {"rps": 10}}'
  2. 检查是否存在异常进程:
    bash
    # Find high-volume clients
    grep "gamma" /var/log/app/*.log | \
      awk '{print $5}' | sort | uniq -c | sort -rn | head -20
  3. 启用断路器:
    bash
    curl -X POST http://localhost:8080/admin/circuit-breaker \
      -d '{"service": "gamma", "state": "open"}'
  4. 长期方案:与Gamma协商调整速率限制档位

Scenario 3: High Latency

场景3:高延迟

Symptoms:
  • Slow presentation creation
  • Timeouts in logs
  • P95 latency > 10s
Actions:
  1. Check Gamma latency vs our latency:
    bash
    # Direct Gamma latency
    for i in {1..5}; do
      curl -w "%{time_total}\n" -o /dev/null -s \
        -H "Authorization: Bearer $GAMMA_API_KEY" \
        https://api.gamma.app/v1/ping
    done
  2. If Gamma is slow:
    • Increase timeouts temporarily
    • Enable async mode for non-critical operations
    • Queue heavy operations
  3. If our infrastructure is slow:
    • Check CPU/memory on app servers
    • Review connection pool settings
    • Check network connectivity
症状:
  • 内容创建缓慢
  • 日志中出现超时
  • P95延迟 > 10秒
行动:
  1. 对比Gamma延迟与我方系统延迟:
    bash
    # Direct Gamma latency
    for i in {1..5}; do
      curl -w "%{time_total}\n" -o /dev/null -s \
        -H "Authorization: Bearer $GAMMA_API_KEY" \
        https://api.gamma.app/v1/ping
    done
  2. 若Gamma服务缓慢:
    • 临时增加超时时间
    • 对非关键操作启用异步模式
    • 对重型操作进行排队处理
  3. 若我方基础设施缓慢:
    • 检查应用服务器的CPU/内存使用情况
    • 查看连接池配置
    • 检查网络连通性

Scenario 4: Authentication Failures (401/403)

场景4:认证失败(401/403)

Symptoms:
  • All requests failing with 401
  • "Invalid API key" errors
  • Sudden authentication failures
Actions:
  1. Verify API key:
    bash
    # Test key directly
    curl -H "Authorization: Bearer $GAMMA_API_KEY" \
      https://api.gamma.app/v1/ping
    
    # Check key format
    echo $GAMMA_API_KEY | head -c 20
  2. If key is invalid:
    • Check if key was rotated
    • Deploy backup key:
      GAMMA_API_KEY_SECONDARY
    • Generate new key in Gamma dashboard
  3. Notify team and update secrets
症状:
  • 所有请求均返回401失败
  • 日志中出现“Invalid API key”错误
  • 突然出现认证失败
行动:
  1. 验证API密钥:
    bash
    # Test key directly
    curl -H "Authorization: Bearer $GAMMA_API_KEY" \
      https://api.gamma.app/v1/ping
    
    # Check key format
    echo $GAMMA_API_KEY | head -c 20
  2. 若密钥无效:
    • 检查密钥是否已被轮换
    • 部署备用密钥:
      GAMMA_API_KEY_SECONDARY
    • 在Gamma控制台生成新密钥
  3. 通知团队并更新密钥信息

Communication Templates

沟通模板

Internal Notification

内部通知

INCIDENT: Gamma Integration Issue

Severity: P[X]
Status: Investigating / Identified / Mitigating / Resolved
Impact: [Description of user impact]
Start Time: [ISO timestamp]

Summary: [Brief description]

Current Actions:
- [Action 1]
- [Action 2]

Next Update: [Time]
事件:Gamma集成问题

严重级别:P[X]
状态:排查中 / 已定位 / 缓解中 / 已解决
影响:[用户影响描述]
开始时间:[ISO时间戳]

摘要:[简要描述]

当前行动:
- [行动1]
- [行动2]

下次更新时间:[时间]

User-Facing Message

用户端提示

We're currently experiencing issues with presentation generation.
Our team is actively working to resolve this.

Workaround: [If available]
Status updates: [Link to status page]
ETA: [If known]
目前我们的内容生成功能遇到了一些问题。
我们的团队正在积极解决。

临时解决方案:[若有可用方案]
状态更新:[状态页面链接]
预计恢复时间:[若已知]

Post-Incident Checklist

事后复盘清单

  • Incident timeline documented
  • Root cause identified
  • User impact quantified
  • Fix verified in production
  • Monitoring gaps identified
  • Preventive measures documented
  • Post-mortem scheduled (for P1/P2)
  • 已记录事件时间线
  • 已定位根本原因
  • 已量化用户影响
  • 已在生产环境验证修复效果
  • 已识别监控盲区
  • 已记录预防措施
  • 已安排事后复盘会议(针对P1/P2事件)

Resources

参考资源

Next Steps

后续步骤

Proceed to
gamma-data-handling
for data management.
请查看
gamma-data-handling
了解数据管理相关内容。