runbook-generator

Runbook Generator

Expert in creating comprehensive, standardized runbooks for operational procedures, incident response, and system maintenance tasks.

Runbook Structure

yaml

runbook_template:
  metadata:
    title: "Runbook title"
    version: "1.0"
    last_updated: "2024-01-15"
    owner: "Team/Person"
    reviewers: ["Name 1", "Name 2"]

  overview:
    purpose: "What this runbook accomplishes"
    scope: "Systems/services affected"
    audience: "Who should use this"

  prerequisites:
    access:
      - "AWS Console access"
      - "SSH key for production servers"
      - "Database credentials"
    tools:
      - "kubectl configured"
      - "AWS CLI installed"
      - "jq for JSON parsing"
    knowledge:
      - "Basic Kubernetes concepts"
      - "Understanding of service architecture"

  execution:
    estimated_time: "15-30 minutes"
    risk_level: "Medium"
    requires_change_ticket: true
    requires_approval: true
    can_be_automated: true

  steps: []  # Detailed steps below

  verification: []  # How to confirm success

  rollback: []  # How to undo changes

  troubleshooting: []  # Common issues

  contacts:
    primary_oncall: "PagerDuty"
    escalation: "Engineering Manager"
    subject_experts: ["DBA Team", "Platform Team"]
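
One way to check a drafted runbook against this template is a small script that reports missing sections. The sketch below is an illustration under stated assumptions, not part of the template: the function name `missing_sections` is hypothetical, and it relies on plain-text matching of keys indented by exactly two spaces rather than on real YAML parsing.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list template sections missing from a runbook YAML file.
# Assumes each section is a key indented by exactly two spaces, as in the template.
REQUIRED_SECTIONS="metadata overview prerequisites execution steps verification rollback troubleshooting contacts"

missing_sections() {
  local file="$1" section
  for section in $REQUIRED_SECTIONS; do
    # Plain-text match is enough for a sketch; a YAML parser would be more robust.
    grep -Eq "^  ${section}:" "$file" || printf '%s\n' "$section"
  done
}
```

Running `missing_sections runbook.yaml` (a placeholder file name) prints one missing section per line, and prints nothing when all nine sections are present.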

Standard Runbook Template

[Runbook Title]

Version: 1.0
Last Updated: YYYY-MM-DD
Owner: Team Name
Risk Level: Low | Medium | High | Critical

Overview

Purpose

Brief description of what this runbook accomplishes.

When to Use

Trigger condition 1
Trigger condition 2
Alert: "Alert Name" fires

Scope

Systems and services affected:

Service A
Database B
External dependency C

Prerequisites

Required Access

Required Tools

bash

# Verify kubectl
kubectl version --client

# Verify AWS CLI
aws sts get-caller-identity

# Verify database connectivity
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT 1"

Required Knowledge

Kubernetes pod management
Service architecture overview
Incident response process
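
The tool checks listed under Required Tools can be bundled into a preflight step that runs before anything else. A minimal sketch, assuming a hypothetical helper named `missing_tools`; it only confirms that each binary is on `PATH` and does not validate credentials or cluster access.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every tool from the argument list that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    # command -v succeeds when the shell can resolve the name to something runnable.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For the prerequisites in this template, a call could look like `missing_tools kubectl aws jq psql`; empty output means every tool was found.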

Pre-Execution Checklist

Execution Steps

Step 1: [Action Name]

Purpose: Why this step is necessary

Command:

bash

kubectl get pods -n production -l app=myservice

Expected Output:

NAME                        READY   STATUS    RESTARTS   AGE
myservice-abc123-xyz        1/1     Running   0          2d
myservice-def456-uvw        1/1     Running   0          2d

Verification: Confirm all pods show STATUS=Running

If unexpected: See Troubleshooting section

Step 2: [Next Action]

Purpose: Description

Command:

bash

# Command with explanation
kubectl scale deployment myservice --replicas=3 -n production

Expected Output:

deployment.apps/myservice scaled

Verification:

bash

# Verify new replicas are running
kubectl get pods -n production -l app=myservice -w

Wait for: All 3 pods to show Running status (typically 2-5 minutes)

Post-Execution Verification

Verify Service Health

bash

# Check deployment status
kubectl rollout status deployment/myservice -n production

# Check service endpoints
kubectl get endpoints myservice -n production

# Verify application health
curl -s https://api.example.com/health | jq .

Expected:

json

{
  "status": "healthy",
  "version": "1.2.3",
  "uptime": "2h30m"
}

Verify Metrics

Error rate returned to normal (<0.1%)
Latency within SLA (<200ms p99)
No new alerts firing

Rollback Procedure

When to Rollback

Error rate exceeds 1%
Latency exceeds 500ms p99
Critical functionality broken
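
The numeric thresholds under Verify Metrics and When to Rollback can be combined into one decision helper. This is a sketch under stated assumptions: the function name `classify_release` is hypothetical, the inputs are an error rate in percent and a p99 latency in milliseconds, and the in-between `watch` outcome is added here for illustration. The third rollback condition (critical functionality broken) still needs a human judgment.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map two metrics to a suggested action.
# $1 = error rate in percent, $2 = p99 latency in milliseconds.
classify_release() {
  awk -v err="$1" -v p99="$2" 'BEGIN {
    if (err > 1 || p99 > 500)        print "rollback"   # rollback thresholds
    else if (err < 0.1 && p99 < 200) print "healthy"    # post-execution targets
    else                             print "watch"      # in between: keep monitoring
  }'
}
```

For example, `classify_release 0.5 300` falls between the two sets of thresholds and prints `watch`.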

Rollback Steps

bash

# Rollback to previous deployment
kubectl rollout undo deployment/myservice -n production

# Verify rollback
kubectl rollout status deployment/myservice -n production

# Confirm previous version
kubectl get deployment myservice -n production -o jsonpath='{.spec.template.spec.containers[0].image}'

Troubleshooting

Symptom	Likely Cause	Resolution
Pods stuck in Pending	Resource constraints	Check node capacity: `kubectl describe nodes`
CrashLoopBackOff	Application error	Check logs: `kubectl logs -f <pod>`
ImagePullBackOff	Registry auth issue	Verify secret: `kubectl get secret regcred`
Connection refused	Service not ready	Wait for readiness probe, check endpoints

Common Issues

Issue: Deployment times out

bash

# Check pod events
kubectl describe pod <pod-name> -n production

# Check resource usage against limits
kubectl top pods -n production

Issue: Database connection failures

bash

# Verify database connectivity
kubectl exec -it <pod> -n production -- psql -h "$DB_HOST" -c "SELECT 1"

# Check connection pool
kubectl logs <pod> -n production | grep -i "connection"

Emergency Contacts

Role	Contact	When to Engage
On-call Engineer	PagerDuty	Any issue
Database Team	#dba-oncall	Database issues
Platform Team	#platform-oncall	Infrastructure issues
Engineering Manager	[Name]	Escalation

Change Log

Version	Date	Author	Changes
1.0	2024-01-15	Author	Initial version

Runbook Types

Incident Response Runbook

yaml

incident_runbook:
  sections:
    detection:
      alert_name: "High Error Rate - Payment Service"
      threshold: "Error rate > 5% for 5 minutes"
      severity: "P1"

    immediate_actions:
      - step: "Acknowledge alert"
        command: "In PagerDuty, acknowledge incident"
        time: "< 5 min"

      - step: "Assess impact"
        command: |
          # Check error rate
          curl -s "https://metrics.example.com/api/v1/query?query=rate(http_errors[5m])"
        time: "< 2 min"

      - step: "Notify stakeholders"
        action: "Post in #incident-channel"
        template: |
          🚨 INCIDENT: Payment Service High Errors
          Severity: P1
          Status: Investigating
          Impact: Payment processing affected
          IC: @oncall

    investigation:
      - "Check recent deployments"
      - "Review error logs"
      - "Check dependent services"
      - "Review infrastructure metrics"

    mitigation:
      options:
        - name: "Rollback deployment"
          when: "Error started after deploy"
          command: "kubectl rollout undo deployment/payment -n prod"

        - name: "Scale up"
          when: "Load-related errors"
          command: "kubectl scale deployment/payment --replicas=10 -n prod"

        - name: "Enable circuit breaker"
          when: "Downstream dependency failing"
          command: "Toggle feature flag: payment.circuit_breaker=true"

    resolution:
      checklist:
        - "[ ] Error rate < 0.1%"
        - "[ ] No P1 alerts"
        - "[ ] Stakeholders notified"
        - "[ ] Incident documented"
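
The mitigation options pair a condition with a command, which maps naturally onto a lookup. A minimal sketch, assuming hypothetical cause labels (`deploy`, `load`, `dependency`) that are not part of the runbook; the function only prints the suggested action and never executes it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the mitigation suggested for a suspected cause.
# The cause labels are illustrative; the commands are copied from the runbook.
suggest_mitigation() {
  case "$1" in
    deploy)     echo "kubectl rollout undo deployment/payment -n prod" ;;
    load)       echo "kubectl scale deployment/payment --replicas=10 -n prod" ;;
    dependency) echo "Toggle feature flag: payment.circuit_breaker=true" ;;
    *)          echo "No predefined mitigation; continue investigation" >&2
                return 1 ;;
  esac
}
```

Printing rather than running the command keeps the responder in control, which matters for a P1 incident where the wrong mitigation can widen the impact.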

Deployment Runbook

yaml

deployment_runbook:
  pre_deployment:
    checklist:
      - "[ ] Code review approved"
      - "[ ] CI/CD pipeline passed"
      - "[ ] Staging tested"
      - "[ ] Change ticket approved"
      - "[ ] Rollback plan documented"

    verification:
      - step: "Verify staging health"
        command: |
          curl -s https://staging.example.com/health

      - step: "Check deployment queue"
        command: |
          kubectl get pods -n staging -l app=myservice

  deployment:
    - step: "Apply deployment"
      command: |
        kubectl apply -f k8s/production/deployment.yaml

    - step: "Monitor rollout"
      command: |
        kubectl rollout status deployment/myservice -n production --timeout=10m

    - step: "Verify new version"
      command: |
        kubectl get deployment myservice -n production \
          -o jsonpath='{.spec.template.spec.containers[0].image}'

  post_deployment:
    - step: "Smoke test"
      command: |
        ./scripts/smoke-test.sh production

    - step: "Monitor metrics"
      duration: "15 minutes"
      watch:
        - "Error rate"
        - "Latency p99"
        - "Request rate"

    - step: "Update ticket"
      action: "Mark CHG ticket as completed"
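
The deployment phase is an ordered list in which a failed step should stop everything after it. The sketch below shows that control flow with a hypothetical `run_steps` function; it runs each argument through `bash -c`, so it should only ever be given trusted commands.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run commands in order and stop at the first failure.
run_steps() {
  local step
  for step in "$@"; do
    printf '>> %s\n' "$step"            # echo the step before running it
    if ! bash -c "$step"; then
      printf 'FAILED: %s\n' "$step" >&2
      return 1                          # later steps are not attempted
    fi
  done
}
```

With the commands from this runbook, `run_steps "kubectl apply -f k8s/production/deployment.yaml" "kubectl rollout status deployment/myservice -n production --timeout=10m"` would skip the rollout check if the apply fails.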

Maintenance Runbook

yaml

maintenance_runbook:
  log_rotation:
    schedule: "Weekly, Sunday 02:00 UTC"

    steps:
      - step: "Connect to server"
        command: |
          ssh admin@logs.example.com

      - step: "Rotate logs"
        command: |
          sudo logrotate -f /etc/logrotate.d/application

      - step: "Verify rotation"
        command: |
          ls -la /var/log/application/
          # Should see rotated files with date suffix

      - step: "Clean old logs"
        command: |
          # Remove logs older than 30 days
          find /var/log/application/ -name "*.log.*" -mtime +30 -delete

      - step: "Verify disk space"
        command: |
          df -h /var/log
          # Should show > 20% free

  database_maintenance:
    schedule: "Monthly, first Sunday 03:00 UTC"

    steps:
      - step: "Check table sizes"
        command: |
          psql -c "
            SELECT tablename,
                   pg_size_pretty(pg_total_relation_size(tablename::text))
            FROM pg_tables
            WHERE schemaname = 'public'
            ORDER BY pg_total_relation_size(tablename::text) DESC
            LIMIT 10;
          "

      - step: "Run VACUUM ANALYZE"
        command: |
          psql -c "VACUUM ANALYZE;"

      - step: "Reindex if needed"
        command: |
          psql -c "REINDEX DATABASE mydb;"
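
Because the "Clean old logs" step deletes files, it can help to list the matches first and add `-delete` only after reviewing them. A minimal sketch, assuming a hypothetical `list_old_logs` helper that takes the directory and the age in days as arguments:

```bash
#!/usr/bin/env bash
# Hypothetical helper: dry run for the log cleanup step.
# Prints rotated log files older than N days (default 30) without deleting anything.
list_old_logs() {
  local dir="$1" days="${2:-30}"
  find "$dir" -type f -name '*.log.*' -mtime +"$days" -print
}
```

Calling `list_old_logs /var/log/application/ 30` shows what the runbook's `find ... -delete` command would remove, apart from the added `-type f` filter that restricts the listing to regular files.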

Writing Guidelines

yaml

principles:
  clarity:
    - "Use active voice"
    - "Be explicit, never assume"
    - "One action per step"

  completeness:
    - "Include all commands"
    - "Show expected output"
    - "Document verification"

  safety:
    - "Test in non-prod first"
    - "Include rollback steps"
    - "Document risks"

formatting:
  commands:
    - "Use code blocks with language"
    - "Include full paths"
    - "Add comments for complex commands"

  steps:
    - "Number sequentially"
    - "Include purpose"
    - "Show expected result"
    - "Note time estimate"

  variables:
    format: "$VARIABLE_NAME or <placeholder>"
    document: "List all variables at start"
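
The guideline to list all variables at the start can be partly automated by scanning a runbook for the two formats named above. A rough sketch, assuming a hypothetical `list_variables` helper and assuming variables are upper-case (`$DB_HOST`) and placeholders are lower-case with hyphens (`<pod-name>`):

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the distinct $VARIABLES and <placeholders> used in a file.
# Relies on grep -o, which GNU and BSD grep provide.
list_variables() {
  LC_ALL=C grep -oE '\$[A-Z_][A-Z0-9_]*|<[a-z][a-z0-9-]*>' "$1" | LC_ALL=C sort -u
}
```

The output is one name per line, deduplicated, which can be pasted into the prerequisites section and annotated by hand.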

Quality Checklist

yaml

validation:
  structure:
    - "[ ] Clear title and metadata"
    - "[ ] Prerequisites listed"
    - "[ ] Steps numbered and clear"
    - "[ ] Expected outputs included"
    - "[ ] Verification steps present"
    - "[ ] Rollback documented"
    - "[ ] Troubleshooting section"
    - "[ ] Contacts listed"

  testing:
    - "[ ] All commands tested"
    - "[ ] Outputs verified"
    - "[ ] Rollback tested"
    - "[ ] Time estimates accurate"

  maintenance:
    - "[ ] Version number updated"
    - "[ ] Change log maintained"
    - "[ ] Quarterly review scheduled"
    - "[ ] Owner assigned"
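
If the checklist is kept in a file alongside the runbook, the number of open items can be counted before sign-off. A minimal sketch, assuming a hypothetical `unchecked_count` helper and assuming completed items are marked `[x]`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: count checklist lines that still contain an empty box "[ ]".
unchecked_count() {
  # -F treats the pattern as a literal string; -c prints the number of matching lines.
  # grep exits non-zero when the count is 0, so the status is neutralised with || true.
  grep -c -F -- '[ ]' "$1" || true
}
```

A result of `0` means every item has been ticked; any other number is the count of items still open.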

Best Practices

Test everything: every command should be verified
Show expected output: the user should know what they will see
Include rollback: always have a rollback plan
Keep updated: review every quarter
Version control: keep runbooks in git
Automate when possible: automate repetitive procedures