Runbook Generator

Expert in creating comprehensive, standardized runbooks for operational procedures, incident response, and system maintenance tasks.

Runbook Structure

yaml
runbook_template:
  metadata:
    title: "Runbook title"
    version: "1.0"
    last_updated: "2024-01-15"
    owner: "Team/Person"
    reviewers: ["Name 1", "Name 2"]

  overview:
    purpose: "What this runbook accomplishes"
    scope: "Systems/services affected"
    audience: "Who should use this"

  prerequisites:
    access:
      - "AWS Console access"
      - "SSH key for production servers"
      - "Database credentials"
    tools:
      - "kubectl configured"
      - "AWS CLI installed"
      - "jq for JSON parsing"
    knowledge:
      - "Basic Kubernetes concepts"
      - "Understanding of service architecture"

  execution:
    estimated_time: "15-30 minutes"
    risk_level: "Medium"
    requires_change_ticket: true
    requires_approval: true
    can_be_automated: true

  steps: []  # Detailed steps below

  verification: []  # How to confirm success

  rollback: []  # How to undo changes

  troubleshooting: []  # Common issues

  contacts:
    primary_oncall: "PagerDuty"
    escalation: "Engineering Manager"
    subject_experts: ["DBA Team", "Platform Team"]
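
A draft runbook can be checked against this template before review. The sketch below assumes the runbook is stored as a YAML file whose sections are keys indented by exactly two spaces, as in the template above. The function name `missing_sections` is hypothetical, and the `grep`-based check is only an approximation; a real YAML parser would be more robust.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list template sections missing from a runbook YAML file.
# Assumes each section is a key indented by exactly two spaces, as in the template.
missing_sections() {
  local file="$1" section
  for section in metadata overview prerequisites execution steps \
                 verification rollback troubleshooting contacts; do
    # -q: no output, -E: extended regex; match "  <section>:" at line start
    grep -qE "^  ${section}:" "$file" || echo "$section"
  done
}
```

Calling `missing_sections runbook.yaml` prints one missing section name per line and prints nothing when every section is present.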

Standard Runbook Template

[Runbook Title]

Version: 1.0
Last Updated: YYYY-MM-DD
Owner: Team Name
Risk Level: Low | Medium | High | Critical

Overview

Purpose

Brief description of what this runbook accomplishes.

When to Use

  • Trigger condition 1
  • Trigger condition 2
  • Alert: "Alert Name" fires

Scope

Systems and services affected:
  • Service A
  • Database B
  • External dependency C

Prerequisites

Required Access

  • Production AWS Console
  • Kubernetes cluster access
  • Database read/write permissions

Required Tools

bash
# Verify kubectl
kubectl version --client

# Verify AWS CLI
aws sts get-caller-identity

# Verify database connectivity
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT 1"

Required Knowledge

  • Kubernetes pod management
  • Service architecture overview
  • Incident response process

Pre-Execution Checklist

  • Change ticket created: CHG-XXXXX
  • Approval obtained from: [Name]
  • Backup verified (if applicable)
  • Stakeholders notified
  • Maintenance window scheduled (if applicable)

Execution Steps

Step 1: [Action Name]

Purpose: Why this step is necessary
Command:
bash
kubectl get pods -n production -l app=myservice
Expected Output:
NAME                        READY   STATUS    RESTARTS   AGE
myservice-abc123-xyz        1/1     Running   0          2d
myservice-def456-uvw        1/1     Running   0          2d
Verification: Confirm all pods show STATUS=Running
If unexpected: See Troubleshooting section
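
The verification in this step can also be done mechanically rather than by eye. The following is a minimal sketch, assuming the default `kubectl get pods` table layout where STATUS is the third column. The helper name `not_running_pods` is hypothetical.

```bash
# Hypothetical helper: read `kubectl get pods` table output on stdin and
# print the names of pods whose STATUS column is not "Running".
not_running_pods() {
  # NR > 1 skips the header row; column 3 is STATUS in the default output
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}
```

A possible use is `kubectl get pods -n production -l app=myservice | not_running_pods`. Empty output means every listed pod reports Running.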

Step 2: [Next Action]

Purpose: Description
Command:
bash
# Command with explanation
kubectl scale deployment myservice --replicas=3 -n production
Expected Output:
deployment.apps/myservice scaled
Verification:
bash
# Verify new replicas are running
kubectl get pods -n production -l app=myservice -w
Wait for: All 3 pods to show Running status (typically 2-5 minutes)

Post-Execution Verification

Verify Service Health

bash
# Check deployment status
kubectl rollout status deployment/myservice -n production

# Check service endpoints
kubectl get endpoints myservice -n production

# Verify application health

Expected:
json
{
  "status": "healthy",
  "version": "1.2.3",
  "uptime": "2h30m"
}

Verify Metrics

  • Error rate returned to normal (<0.1%)
  • Latency within SLA (<200ms p99)
  • No new alerts firing

Rollback Procedure

When to Rollback

  • Error rate exceeds 1%
  • Latency exceeds 500ms p99
  • Critical functionality broken
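
The two numeric triggers above can be encoded so the rollback decision does not depend on judgment under pressure. This is a sketch only: the function name `rollback_decision` is hypothetical, the metric values are assumed to be fetched separately from whatever monitoring system is in use, and "critical functionality broken" still needs a human call.

```bash
# Hypothetical helper: decide whether the numeric rollback triggers are met.
# $1 = error rate in percent, $2 = p99 latency in milliseconds.
# Prints "rollback" if either threshold is exceeded, otherwise "ok".
rollback_decision() {
  awk -v err="$1" -v p99="$2" 'BEGIN {
    # "+ 0" forces numeric comparison; thresholds are 1% and 500 ms
    if (err + 0 > 1 || p99 + 0 > 500) print "rollback"; else print "ok"
  }'
}
```

For example, `rollback_decision 1.5 180` prints `rollback`, while `rollback_decision 0.05 180` prints `ok`. Values exactly at a threshold do not trigger a rollback, matching the "exceeds" wording.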

Rollback Steps

bash
# Rollback to previous deployment
kubectl rollout undo deployment/myservice -n production

# Verify rollback
kubectl rollout status deployment/myservice -n production

# Confirm previous version
kubectl get deployment myservice -n production -o jsonpath='{.spec.template.spec.containers[0].image}'

Troubleshooting

| Symptom | Likely Cause | Resolution |
|---------|--------------|------------|
| Pods stuck in Pending | Resource constraints | Check node capacity: `kubectl describe nodes` |
| CrashLoopBackOff | Application error | Check logs: `kubectl logs -f <pod>` |
| ImagePullBackOff | Registry auth issue | Verify secret: `kubectl get secret regcred` |
| Connection refused | Service not ready | Wait for readiness probe, check endpoints |

Common Issues

Issue: Deployment times out
bash
# Check pod events
kubectl describe pod <pod-name> -n production

# Check resource limits
kubectl top pods -n production

Issue: Database connection failures
bash
# Verify database connectivity
kubectl exec -it <pod> -n production -- psql -h $DB_HOST -c "SELECT 1"

# Check connection pool
kubectl logs <pod> -n production | grep -i "connection"

Emergency Contacts

| Role | Contact | When to Engage |
|------|---------|----------------|
| On-call Engineer | PagerDuty | Any issue |
| Database Team | #dba-oncall | Database issues |
| Platform Team | #platform-oncall | Infrastructure issues |
| Engineering Manager | [Name] | Escalation |

Change Log

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2024-01-15 | Author | Initial version |

Related Documentation

  • Service Architecture
  • Incident Response Process
  • Monitoring Dashboard

Runbook Types

Incident Response Runbook

yaml
incident_runbook:
  sections:
    detection:
      alert_name: "High Error Rate - Payment Service"
      threshold: "Error rate > 5% for 5 minutes"
      severity: "P1"

    immediate_actions:
      - step: "Acknowledge alert"
        command: "In PagerDuty, acknowledge incident"
        time: "< 5 min"

      - step: "Assess impact"
        command: |
          # Check error rate (--data-urlencode keeps curl from globbing on [5m])
          curl -sG "https://metrics.example.com/api/v1/query" --data-urlencode 'query=rate(http_errors[5m])'
        time: "< 2 min"

      - step: "Notify stakeholders"
        action: "Post in #incident-channel"
        template: |
          🚨 INCIDENT: Payment Service High Errors
          Severity: P1
          Status: Investigating
          Impact: Payment processing affected
          IC: @oncall

    investigation:
      - "Check recent deployments"
      - "Review error logs"
      - "Check dependent services"
      - "Review infrastructure metrics"

    mitigation:
      options:
        - name: "Rollback deployment"
          when: "Error started after deploy"
          command: "kubectl rollout undo deployment/payment -n prod"

        - name: "Scale up"
          when: "Load-related errors"
          command: "kubectl scale deployment/payment --replicas=10 -n prod"

        - name: "Enable circuit breaker"
          when: "Downstream dependency failing"
          command: "Toggle feature flag: payment.circuit_breaker=true"

    resolution:
      checklist:
        - "[ ] Error rate < 0.1%"
        - "[ ] No P1 alerts"
        - "[ ] Stakeholders notified"
        - "[ ] Incident documented"

Deployment Runbook

yaml
deployment_runbook:
  pre_deployment:
    checklist:
      - "[ ] Code review approved"
      - "[ ] CI/CD pipeline passed"
      - "[ ] Staging tested"
      - "[ ] Change ticket approved"
      - "[ ] Rollback plan documented"

    verification:
      - step: "Verify staging health"
        command: |
          curl -s https://staging.example.com/health

      - step: "Check deployment queue"
        command: |
          kubectl get pods -n staging -l app=myservice

  deployment:
    - step: "Apply deployment"
      command: |
        kubectl apply -f k8s/production/deployment.yaml

    - step: "Monitor rollout"
      command: |
        kubectl rollout status deployment/myservice -n production --timeout=10m

    - step: "Verify new version"
      command: |
        kubectl get deployment myservice -n production \
          -o jsonpath='{.spec.template.spec.containers[0].image}'

  post_deployment:
    - step: "Smoke test"
      command: |
        ./scripts/smoke-test.sh production

    - step: "Monitor metrics"
      duration: "15 minutes"
      watch:
        - "Error rate"
        - "Latency p99"
        - "Request rate"

    - step: "Update ticket"
      action: "Mark CHG ticket as completed"

Maintenance Runbook

yaml
maintenance_runbook:
  log_rotation:
    schedule: "Weekly, Sunday 02:00 UTC"

    steps:
      - step: "Connect to server"
        command: |
          ssh admin@logs.example.com

      - step: "Rotate logs"
        command: |
          sudo logrotate -f /etc/logrotate.d/application

      - step: "Verify rotation"
        command: |
          ls -la /var/log/application/
          # Should see rotated files with date suffix

      - step: "Clean old logs"
        command: |
          # Remove logs older than 30 days
          find /var/log/application/ -name "*.log.*" -mtime +30 -delete

      - step: "Verify disk space"
        command: |
          df -h /var/log
          # Should show > 20% free

  database_maintenance:
    schedule: "Monthly, first Sunday 03:00 UTC"

    steps:
      - step: "Check table sizes"
        command: |
          psql -c "
            SELECT tablename,
                   pg_size_pretty(pg_total_relation_size(tablename::text))
            FROM pg_tables
            WHERE schemaname = 'public'
            ORDER BY pg_total_relation_size(tablename::text) DESC
            LIMIT 10;
          "

      - step: "Run VACUUM ANALYZE"
        command: |
          psql -c "VACUUM ANALYZE;"

      - step: "Reindex if needed"
        command: |
          psql -c "REINDEX DATABASE mydb;"
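
Because the "Clean old logs" step deletes files irreversibly, a dry run that lists the candidates first is a cheap safeguard. The sketch below reuses the same `find` predicates with `-print` in place of `-delete`. The function name `list_old_logs` and its default of 30 days are illustrative assumptions.

```bash
# Hypothetical helper: list the rotated logs the cleanup step would delete,
# without deleting anything. $1 = log directory, $2 = age in days (default 30).
list_old_logs() {
  local dir="$1" days="${2:-30}"
  # Same predicates as the cleanup step, with -print in place of -delete
  find "$dir" -name "*.log.*" -mtime +"$days" -print
}
```

Running `list_old_logs /var/log/application/` and reviewing the output before the real cleanup confirms that only rotated files (those with a suffix after `.log`) are matched.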

Writing Guidelines

yaml
principles:
  clarity:
    - "Use active voice"
    - "Be explicit, never assume"
    - "One action per step"

  completeness:
    - "Include all commands"
    - "Show expected output"
    - "Document verification"

  safety:
    - "Test in non-prod first"
    - "Include rollback steps"
    - "Document risks"

formatting:
  commands:
    - "Use code blocks with language"
    - "Include full paths"
    - "Add comments for complex commands"

  steps:
    - "Number sequentially"
    - "Include purpose"
    - "Show expected result"
    - "Note time estimate"

  variables:
    format: "$VARIABLE_NAME or <placeholder>"
    document: "List all variables at start"
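
The guideline to list all variables at the start pairs well with a guard that fails early when one is unset. This is a minimal sketch: the function name `missing_vars` is hypothetical, and the variable names passed to it are expected to come from the runbook author, not from untrusted input, since the lookup uses `eval`.

```bash
# Hypothetical helper: print the names of required variables that are unset or empty.
# Pass variable names as arguments, e.g. DB_HOST DB_USER.
missing_vars() {
  local _name _value
  for _name in "$@"; do
    # Indirect lookup via eval; ":-" keeps this safe under `set -u`
    eval "_value=\${${_name}:-}"
    [ -n "$_value" ] || echo "$_name"
  done
}
```

A runbook script could start with `missing_vars DB_HOST DB_USER` and stop if anything is printed.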

Quality Checklist

yaml
validation:
  structure:
    - "[ ] Clear title and metadata"
    - "[ ] Prerequisites listed"
    - "[ ] Steps numbered and clear"
    - "[ ] Expected outputs included"
    - "[ ] Verification steps present"
    - "[ ] Rollback documented"
    - "[ ] Troubleshooting section"
    - "[ ] Contacts listed"

  testing:
    - "[ ] All commands tested"
    - "[ ] Outputs verified"
    - "[ ] Rollback tested"
    - "[ ] Time estimates accurate"

  maintenance:
    - "[ ] Version number updated"
    - "[ ] Change log maintained"
    - "[ ] Quarterly review scheduled"
    - "[ ] Owner assigned"

Best Practices

  1. Test everything: every command must be verified
  2. Show expected output: the user should know what they will see
  3. Include rollback: always have a rollback plan
  4. Keep updated: review every quarter
  5. Version control: keep runbooks in git
  6. Automate when possible: automate repetitive procedures