runbook-generator
Runbook Generator
Expert in creating comprehensive, standardized runbooks for operational procedures, incident response, and system maintenance tasks.
Runbook Structure
```yaml
runbook_template:
  metadata:
    title: "Runbook title"
    version: "1.0"
    last_updated: "2024-01-15"
    owner: "Team/Person"
    reviewers: ["Name 1", "Name 2"]

  overview:
    purpose: "What this runbook accomplishes"
    scope: "Systems/services affected"
    audience: "Who should use this"

  prerequisites:
    access:
      - "AWS Console access"
      - "SSH key for production servers"
      - "Database credentials"
    tools:
      - "kubectl configured"
      - "AWS CLI installed"
      - "jq for JSON parsing"
    knowledge:
      - "Basic Kubernetes concepts"
      - "Understanding of service architecture"

  execution:
    estimated_time: "15-30 minutes"
    risk_level: "Medium"
    requires_change_ticket: true
    requires_approval: true
    can_be_automated: true

  steps: []            # Detailed steps below
  verification: []     # How to confirm success
  rollback: []         # How to undo changes
  troubleshooting: []  # Common issues

  contacts:
    primary_oncall: "PagerDuty"
    escalation: "Engineering Manager"
    subject_experts: ["DBA Team", "Platform Team"]
```
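One way to put this structure to work is a completeness check on a draft runbook. The sketch below assumes the YAML has already been loaded into a plain dictionary by whatever parser is in use; the function name `missing_sections` and the choice of required keys are illustrative assumptions drawn from the template's top-level sections, not part of the template itself.

```python
# Hypothetical completeness check. REQUIRED_SECTIONS is an assumption based
# on the top-level keys of the runbook template.
REQUIRED_SECTIONS = (
    "metadata", "overview", "prerequisites", "execution",
    "steps", "verification", "rollback", "troubleshooting", "contacts",
)


def missing_sections(runbook: dict) -> list[str]:
    """Return required sections that are absent or still empty."""
    missing = []
    for name in REQUIRED_SECTIONS:
        value = runbook.get(name)
        # Empty lists, dicts, and strings count as missing: the bare template
        # ships with `steps: []` until someone fills it in.
        if value is None or value == [] or value == {} or value == "":
            missing.append(name)
    return missing


draft = {
    "metadata": {"title": "Restart payment service", "version": "1.0"},
    "overview": {"purpose": "Recover from stuck workers"},
    "prerequisites": {"tools": ["kubectl configured"]},
    "execution": {"risk_level": "Medium"},
    "steps": [],
    "contacts": {"primary_oncall": "PagerDuty"},
}
print(missing_sections(draft))
```

A check like this could run in review or CI so that a runbook without rollback or verification steps is flagged before it is published.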
Standard Runbook Template
[Runbook Title]
Version: 1.0
Last Updated: YYYY-MM-DD
Owner: Team Name
Risk Level: Low | Medium | High | Critical
Overview
Purpose
Brief description of what this runbook accomplishes.
When to Use
- Trigger condition 1
- Trigger condition 2
- Alert: "Alert Name" fires
Scope
Systems and services affected:
- Service A
- Database B
- External dependency C
Prerequisites
Required Access
- Production AWS Console
- Kubernetes cluster access
- Database read/write permissions
Required Tools
```bash
# Verify kubectl
kubectl version --client

# Verify AWS CLI
aws sts get-caller-identity

# Verify database connectivity
psql -h $DB_HOST -U $DB_USER -c "SELECT 1"
```
Required Knowledge
- Kubernetes pod management
- Service architecture overview
- Incident response process
Pre-Execution Checklist
- Change ticket created: CHG-XXXXX
- Approval obtained from: [Name]
- Backup verified (if applicable)
- Stakeholders notified
- Maintenance window scheduled (if applicable)
Execution Steps
Step 1: [Action Name]
**Purpose:** Why this step is necessary

**Command:**
```bash
kubectl get pods -n production -l app=myservice
```

**Expected Output:**
```
NAME                   READY   STATUS    RESTARTS   AGE
myservice-abc123-xyz   1/1     Running   0          2d
myservice-def456-uvw   1/1     Running   0          2d
```

**Verification:** Confirm all pods show STATUS=Running

**If unexpected:** See Troubleshooting section
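The verification in Step 1 ("all pods show STATUS=Running") can also be checked mechanically. The following is a minimal sketch that parses the tabular text printed by `kubectl get pods`; the function name `pods_not_running` and the sample pod in CrashLoopBackOff are assumptions for illustration. In practice, structured output such as `kubectl get pods -o json` is likely a more robust input than column splitting.

```python
def pods_not_running(kubectl_output: str) -> list[str]:
    """Return the names of pods whose STATUS column is not 'Running'."""
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    if not lines:
        return []
    # Locate columns by header name rather than hard-coding positions.
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    bad = []
    for line in lines[1:]:
        cols = line.split()
        if cols[status_idx] != "Running":
            bad.append(cols[name_idx])
    return bad


sample = """\
NAME                   READY   STATUS             RESTARTS   AGE
myservice-abc123-xyz   1/1     Running            0          2d
myservice-def456-uvw   0/1     CrashLoopBackOff   5          10m
"""
print(pods_not_running(sample))  # ['myservice-def456-uvw']
```

An empty result means the step's verification passed; any names returned point at the pods to investigate in the Troubleshooting section.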
Step 2: [Next Action]
**Purpose:** Description

**Command:**
```bash
# Command with explanation
kubectl scale deployment myservice --replicas=3 -n production
```

**Expected Output:**
```
deployment.apps/myservice scaled
```

**Verification:**
```bash
# Verify new replicas are running
kubectl get pods -n production -l app=myservice -w
```

**Wait for:** All 3 pods to show Running status (typically 2-5 minutes)

---

Post-Execution Verification
Verify Service Health
```bash
# Check deployment status
kubectl rollout status deployment/myservice -n production

# Check service endpoints
kubectl get endpoints myservice -n production

# Verify application health
curl -s https://api.example.com/health | jq .
```

**Expected:**
```json
{
  "status": "healthy",
  "version": "1.2.3",
  "uptime": "2h30m"
}
```
Verify Metrics
- Error rate returned to normal (<0.1%)
- Latency within SLA (<200ms p99)
- No new alerts firing
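The health response and the three metric checks can be combined into a single pass/fail report. This is a sketch under stated assumptions: the function name `verify_post_execution` and its parameters are hypothetical, and the metric values are assumed to come from whatever monitoring system is in use. Only the thresholds (<0.1% errors, <200ms p99, no new alerts) come from the template.

```python
import json


def verify_post_execution(health_json: str, error_rate_pct: float,
                          p99_latency_ms: float, new_alerts: int) -> list[str]:
    """Return a list of failed checks; an empty list means all checks passed."""
    failures = []
    health = json.loads(health_json)
    if health.get("status") != "healthy":
        failures.append(f"health status is {health.get('status')!r}")
    if error_rate_pct >= 0.1:
        failures.append(f"error rate {error_rate_pct}% is not below 0.1%")
    if p99_latency_ms >= 200:
        failures.append(f"p99 latency {p99_latency_ms}ms is not below 200ms")
    if new_alerts > 0:
        failures.append(f"{new_alerts} new alert(s) firing")
    return failures


response = '{"status": "healthy", "version": "1.2.3", "uptime": "2h30m"}'
print(verify_post_execution(response, error_rate_pct=0.05,
                            p99_latency_ms=150, new_alerts=0))  # []
```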
Rollback Procedure
When to Rollback
- Error rate exceeds 1%
- Latency exceeds 500ms p99
- Critical functionality broken
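The rollback criteria above are easy to encode so that the decision is explicit rather than a judgment call under pressure. A minimal sketch, assuming the observed values are gathered elsewhere; the `Observation` class and `rollback_reasons` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    error_rate_pct: float            # current error rate, in percent
    p99_latency_ms: float            # current p99 latency, in milliseconds
    critical_functionality_ok: bool  # result of a smoke test or manual check


def rollback_reasons(obs: Observation) -> list[str]:
    """Return every rollback trigger that currently applies."""
    reasons = []
    if obs.error_rate_pct > 1.0:
        reasons.append("error rate exceeds 1%")
    if obs.p99_latency_ms > 500:
        reasons.append("p99 latency exceeds 500ms")
    if not obs.critical_functionality_ok:
        reasons.append("critical functionality broken")
    return reasons


print(rollback_reasons(Observation(2.5, 320.0, True)))  # ['error rate exceeds 1%']
```

Any non-empty result would mean proceeding to the rollback steps; recording the returned reasons in the change ticket keeps the decision auditable.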
Rollback Steps
```bash
# Rollback to previous deployment
kubectl rollout undo deployment/myservice -n production

# Verify rollback
kubectl rollout status deployment/myservice -n production

# Confirm previous version
kubectl get deployment myservice -n production -o jsonpath='{.spec.template.spec.containers[0].image}'
```
Troubleshooting
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Pods stuck in Pending | Resource constraints | Check node capacity: |
| CrashLoopBackOff | Application error | Check logs: |
| ImagePullBackOff | Registry auth issue | Verify secret: |
| Connection refused | Service not ready | Wait for readiness probe, check endpoints |
Common Issues
**Issue: Deployment times out**
```bash
# Check pod events
kubectl describe pod <pod-name> -n production

# Check resource usage against limits
kubectl top pods -n production
```

**Issue: Database connection failures**
```bash
# Verify database connectivity
kubectl exec -it <pod> -n production -- psql -h $DB_HOST -c "SELECT 1"

# Check connection pool
kubectl logs <pod> -n production | grep -i "connection"
```
Emergency Contacts
| Role | Contact | When to Engage |
|---|---|---|
| On-call Engineer | PagerDuty | Any issue |
| Database Team | #dba-oncall | Database issues |
| Platform Team | #platform-oncall | Infrastructure issues |
| Engineering Manager | [Name] | Escalation |
Change Log
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-01-15 | Author | Initial version |
Related Documentation
- Service Architecture
- Incident Response Process
- Monitoring Dashboard
Runbook Types
Incident Response Runbook
```yaml
incident_runbook:
  sections:
    detection:
      alert_name: "High Error Rate - Payment Service"
      threshold: "Error rate > 5% for 5 minutes"
      severity: "P1"

    immediate_actions:
      - step: "Acknowledge alert"
        command: "In PagerDuty, acknowledge incident"
        time: "< 5 min"

      - step: "Assess impact"
        command: |
          # Check error rate
          curl -s "https://metrics.example.com/api/v1/query?query=rate(http_errors[5m])"
        time: "< 2 min"

      - step: "Notify stakeholders"
        action: "Post in #incident-channel"
        template: |
          🚨 INCIDENT: Payment Service High Errors
          Severity: P1
          Status: Investigating
          Impact: Payment processing affected
          IC: @oncall

    investigation:
      - "Check recent deployments"
      - "Review error logs"
      - "Check dependent services"
      - "Review infrastructure metrics"

    mitigation:
      options:
        - name: "Rollback deployment"
          when: "Error started after deploy"
          command: "kubectl rollout undo deployment/payment -n prod"

        - name: "Scale up"
          when: "Load-related errors"
          command: "kubectl scale deployment/payment --replicas=10 -n prod"

        - name: "Enable circuit breaker"
          when: "Downstream dependency failing"
          command: "Toggle feature flag: payment.circuit_breaker=true"

    resolution:
      checklist:
        - "[ ] Error rate < 0.1%"
        - "[ ] No P1 alerts"
        - "[ ] Stakeholders notified"
        - "[ ] Incident documented"
```
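The detection threshold in this runbook ("Error rate > 5% for 5 minutes") describes a sustained condition, not a single spike. The sketch below shows one plausible reading of it, assuming one error-rate sample per minute; the function name `alert_should_fire` and the sampling interval are assumptions, and a real alerting system may evaluate the window differently.

```python
def alert_should_fire(error_rates_pct: list[float],
                      threshold_pct: float = 5.0,
                      window: int = 5) -> bool:
    """True if the most recent `window` samples all exceed the threshold.

    Assumes `error_rates_pct` holds one sample per minute, oldest first.
    """
    if len(error_rates_pct) < window:
        return False  # not enough history to call the condition sustained
    return all(rate > threshold_pct for rate in error_rates_pct[-window:])


print(alert_should_fire([1.0, 6.0, 7.5, 8.0, 9.0, 6.2]))  # True
print(alert_should_fire([6.0, 6.0, 6.0, 6.0, 4.0]))       # False
```

Requiring every sample in the window to breach the threshold avoids paging on a brief spike, at the cost of a detection delay equal to the window length.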
Deployment Runbook
```yaml
deployment_runbook:
  pre_deployment:
    checklist:
      - "[ ] Code review approved"
      - "[ ] CI/CD pipeline passed"
      - "[ ] Staging tested"
      - "[ ] Change ticket approved"
      - "[ ] Rollback plan documented"

    verification:
      - step: "Verify staging health"
        command: |
          curl -s https://staging.example.com/health

      - step: "Check deployment queue"
        command: |
          kubectl get pods -n staging -l app=myservice

  deployment:
    - step: "Apply deployment"
      command: |
        kubectl apply -f k8s/production/deployment.yaml

    - step: "Monitor rollout"
      command: |
        kubectl rollout status deployment/myservice -n production --timeout=10m

    - step: "Verify new version"
      command: |
        kubectl get deployment myservice -n production \
          -o jsonpath='{.spec.template.spec.containers[0].image}'

  post_deployment:
    - step: "Smoke test"
      command: |
        ./scripts/smoke-test.sh production

    - step: "Monitor metrics"
      duration: "15 minutes"
      watch:
        - "Error rate"
        - "Latency p99"
        - "Request rate"

    - step: "Update ticket"
      action: "Mark CHG ticket as completed"
```
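The "Verify new version" step prints the container image reference, which still has to be compared with the version that was meant to ship. A minimal sketch of that comparison follows; the helper names `image_tag` and `deployed_matches` and the registry host `registry.example.com` are hypothetical, and the parsing covers only the common `name:tag` and `name@digest` forms.

```python
from typing import Optional


def image_tag(image_ref: str) -> Optional[str]:
    """Extract the tag from a container image reference, or None if untagged."""
    # Drop a digest suffix such as @sha256:... if present.
    name = image_ref.split("@", 1)[0]
    # A tag's colon appears after the last slash; an earlier colon belongs
    # to a registry port (for example registry:5000/app).
    last_segment = name.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return None
    return last_segment.rsplit(":", 1)[1]


def deployed_matches(image_ref: str, expected_tag: str) -> bool:
    """True if the deployed image carries the expected tag."""
    return image_tag(image_ref) == expected_tag


print(deployed_matches("registry.example.com/myservice:1.2.3", "1.2.3"))  # True
```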
Maintenance Runbook
```yaml
maintenance_runbook:
  log_rotation:
    schedule: "Weekly, Sunday 02:00 UTC"
    steps:
      - step: "Connect to server"
        command: |
          ssh admin@logs.example.com

      - step: "Rotate logs"
        command: |
          sudo logrotate -f /etc/logrotate.d/application

      - step: "Verify rotation"
        command: |
          ls -la /var/log/application/
          # Should see rotated files with date suffix

      - step: "Clean old logs"
        command: |
          # Remove logs older than 30 days
          find /var/log/application/ -name "*.log.*" -mtime +30 -delete

      - step: "Verify disk space"
        command: |
          df -h /var/log
          # Should show > 20% free

  database_maintenance:
    schedule: "Monthly, first Sunday 03:00 UTC"
    steps:
      - step: "Check table sizes"
        command: |
          psql -c "
            SELECT tablename,
                   pg_size_pretty(pg_total_relation_size(tablename::text))
            FROM pg_tables
            WHERE schemaname = 'public'
            ORDER BY pg_total_relation_size(tablename::text) DESC
            LIMIT 10;
          "

      - step: "Run VACUUM ANALYZE"
        command: |
          psql -c "VACUUM ANALYZE;"

      - step: "Reindex if needed"
        command: |
          psql -c "REINDEX DATABASE mydb;"
```
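Because the "Clean old logs" step deletes files, a dry run that only lists the candidates can be a useful safety check beforehand. The sketch below approximates the `find` selection in Python without deleting anything. The function name `old_rotated_logs` is hypothetical, it does not recurse into subdirectories the way `find` does, and its age cutoff is a plain "older than N days" comparison that can differ slightly from how `find -mtime +30` rounds ages.

```python
import fnmatch
import os
import time
from typing import Optional


def old_rotated_logs(directory: str, max_age_days: int = 30,
                     now: Optional[float] = None) -> list[str]:
    """List rotated logs (*.log.*) last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    matches = []
    for entry in os.scandir(directory):
        if (entry.is_file()
                and fnmatch.fnmatch(entry.name, "*.log.*")
                and entry.stat().st_mtime < cutoff):
            matches.append(entry.path)
    return sorted(matches)
```

Calling `old_rotated_logs("/var/log/application")` on the log server would return the paths that the cleanup step is expected to remove, which can be reviewed before running the destructive command.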
Writing Guidelines
```yaml
principles:
  clarity:
    - "Use active voice"
    - "Be explicit, never assume"
    - "One action per step"

  completeness:
    - "Include all commands"
    - "Show expected output"
    - "Document verification"

  safety:
    - "Test in non-prod first"
    - "Include rollback steps"
    - "Document risks"

formatting:
  commands:
    - "Use code blocks with language"
    - "Include full paths"
    - "Add comments for complex commands"

  steps:
    - "Number sequentially"
    - "Include purpose"
    - "Show expected result"
    - "Note time estimate"

  variables:
    format: "$VARIABLE_NAME or <placeholder>"
    document: "List all variables at start"
```
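The guideline to "list all variables at start" can be supported by scanning a runbook's commands for the two variable formats named above. This is a minimal sketch: the function name `variables_in` and the exact patterns (upper-case names after `$`, lower-case names inside angle brackets) are assumptions about how the convention is applied, not a definitive grammar.

```python
import re

# $VARIABLE_NAME: upper-case letters, digits, and underscores after a dollar sign.
# <placeholder>: lower-case words joined by hyphens or underscores.
_VARIABLE = re.compile(r"\$[A-Z_][A-Z0-9_]*|<[a-z][a-z0-9_-]*>")


def variables_in(commands: list[str]) -> list[str]:
    """Collect the distinct variables used across commands, in first-seen order."""
    seen = {}
    for command in commands:
        for match in _VARIABLE.finditer(command):
            seen.setdefault(match.group(0), None)
    return list(seen)


commands = [
    'kubectl exec -it <pod> -n production -- psql -h $DB_HOST -c "SELECT 1"',
    "kubectl describe pod <pod-name> -n production",
    'psql -h $DB_HOST -U $DB_USER -c "SELECT 1"',
]
print(variables_in(commands))  # ['<pod>', '$DB_HOST', '<pod-name>', '$DB_USER']
```

The resulting list can be pasted into a variables section at the top of the runbook, where each entry gets a description and an example value.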
Quality Checklist
```yaml
validation:
  structure:
    - "[ ] Clear title and metadata"
    - "[ ] Prerequisites listed"
    - "[ ] Steps numbered and clear"
    - "[ ] Expected outputs included"
    - "[ ] Verification steps present"
    - "[ ] Rollback documented"
    - "[ ] Troubleshooting section"
    - "[ ] Contacts listed"

  testing:
    - "[ ] All commands tested"
    - "[ ] Outputs verified"
    - "[ ] Rollback tested"
    - "[ ] Time estimates accurate"

  maintenance:
    - "[ ] Version number updated"
    - "[ ] Change log maintained"
    - "[ ] Quarterly review scheduled"
    - "[ ] Owner assigned"
```
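A checklist in this `[ ]` / `[x]` form lends itself to a quick automated tally of what is still open. The sketch below assumes that a completed item is marked `[x]` or `[X]`, which the checklist itself does not state; the function name `unchecked_items` is hypothetical.

```python
import re

# Matches lines such as:  - "[ ] Rollback tested"   or   - [x] Owner assigned
_ITEM = re.compile(r'^\s*(?:-\s*)?"?\[([ xX])\]\s*(.+?)"?\s*$')


def unchecked_items(checklist_text: str) -> list[str]:
    """Return the descriptions of checklist items that are still unchecked."""
    items = []
    for line in checklist_text.splitlines():
        match = _ITEM.match(line)
        if match and match.group(1) == " ":
            items.append(match.group(2))
    return items


review = """\
- "[x] Clear title and metadata"
- "[ ] Rollback tested"
- "[X] Owner assigned"
- "[ ] Quarterly review scheduled"
"""
print(unchecked_items(review))  # ['Rollback tested', 'Quarterly review scheduled']
```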
Best Practices
- Test everything: every command must be verified
- Show expected output: the user should know what they will see
- Include rollback: always have a rollback plan
- Keep updated: review every quarter
- Version control: keep runbooks in git
- Automate when possible: automate recurring procedures