
Runbook Creation

Overview

Create comprehensive operational runbooks that provide step-by-step procedures for common operational tasks, incident response, and system maintenance.

When to Use

  • Incident response procedures
  • Standard operating procedures (SOPs)
  • On-call playbooks
  • System maintenance guides
  • Disaster recovery procedures
  • Deployment runbooks
  • Escalation procedures
  • Service restoration guides

Incident Response Runbook Template



Incident Response Runbook


Quick Reference


Severity Levels:
  • P0 (Critical): Complete outage, data loss, security breach
  • P1 (High): Major feature down, significant user impact
  • P2 (Medium): Minor feature degradation, limited user impact
  • P3 (Low): Cosmetic issues, minimal user impact
Response Times:
  • P0: Immediate (24/7)
  • P1: 15 minutes (business hours), 1 hour (after hours)
  • P2: 4 hours (business hours)
  • P3: Next business day
Escalation Contacts:
  • On-call Engineer: PagerDuty rotation
  • Engineering Manager: +1-555-0100
  • VP Engineering: +1-555-0101
  • CTO: +1-555-0102

Service Down


Symptoms


  • Health check endpoint returning 500 errors
  • Users unable to access application
  • Load balancer showing all instances unhealthy
  • Alerts: `service_down`, `health_check_failed`

Severity: P0 (Critical)


Initial Response (5 minutes)


  1. Acknowledge the incident

     ```bash
     # Acknowledge in PagerDuty
     # Post in #incidents Slack channel
     ```

  2. Create incident channel
     Create Slack channel: #incident-YYYY-MM-DD-service-down
     Post incident details and status updates there

  3. Assess impact

     ```bash
     # Check service status
     kubectl get pods -n production

     # Check recent deployments
     kubectl rollout history deployment/api -n production

     # Check logs
     kubectl logs -f deployment/api -n production --tail=100
     ```
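The assessment commands above can be bundled into a first-pass triage helper. A minimal sketch, assuming the `api` deployment and `production` namespace used throughout this runbook:

```shell
# Print pod status, rollout history, and recent logs in one pass.
section() { printf '=== %s ===\n' "$1"; }

triage() {
  # Namespace and deployment default to the values used in this runbook.
  local ns="${1:-production}" dep="${2:-api}"
  section "Pod status"
  kubectl get pods -n "$ns" -l "app=$dep"
  section "Rollout history"
  kubectl rollout history "deployment/$dep" -n "$ns"
  section "Last 100 log lines"
  kubectl logs "deployment/$dep" -n "$ns" --tail=100
}

# Usage: triage production api
```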

Investigation Steps


Check Application Health



1. Check pod status

```bash
kubectl get pods -n production -l app=api
```

Expected output: all pods `Running`

```
NAME                   READY   STATUS    RESTARTS   AGE
api-7d8c9f5b6d-4xk2p   1/1     Running   0          2h
api-7d8c9f5b6d-7nm8r   1/1     Running   0          2h
```

2. Check pod logs for errors

```bash
kubectl logs -f deployment/api -n production --tail=100 | grep -i error
```

3. Check application endpoints
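A quick way to check the endpoints is to probe their HTTP status codes directly. A sketch; the hostname and paths are illustrative placeholders, not from the original:

```shell
# Print the HTTP status code for a URL (000 means no connection at all).
check_endpoint() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Usage (URLs are illustrative):
# check_endpoint https://api.example.com/health   # expect 200
# check_endpoint https://api.example.com/api/...  # expect 2xx/4xx, not 5xx
```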

4. Check database connectivity

```bash
kubectl exec -it deployment/api -n production -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
```

Check Infrastructure



1. Check load balancer

```bash
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table
```

2. Check DNS resolution

```bash
dig api.example.com
nslookup api.example.com
```

3. Check SSL certificates

```bash
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates
```

4. Check network connectivity

```bash
# A successful TCP connect in the -v output confirms reachability;
# the TLS handshake against the database port is expected to fail.
kubectl exec -it deployment/api -n production -- \
  curl -v https://database.example.com:5432
```

Check Database



1. Check database connections

```bash
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"
```

2. Check for locks

```bash
psql "$DATABASE_URL" -c "
  SELECT pid, usename, pg_blocking_pids(pid) AS blocked_by, query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0"
```

3. Check database size

```bash
psql "$DATABASE_URL" -c "
  SELECT pg_size_pretty(pg_database_size(current_database()))"
```

4. Check long-running queries

```bash
psql "$DATABASE_URL" -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC
  LIMIT 10"
```

Resolution Steps


Option 1: Restart Pods (Quick Fix)


```bash
# Restart all pods (rolling restart)
kubectl rollout restart deployment/api -n production

# Watch restart progress
kubectl rollout status deployment/api -n production

# Verify pods are healthy
kubectl get pods -n production -l app=api
```

Option 2: Scale Up (If Overload)


```bash
# Check current replicas
kubectl get deployment api -n production

# Scale up
kubectl scale deployment/api -n production --replicas=10

# Watch scaling
kubectl get pods -n production -l app=api -w
```

Option 3: Rollback (If Bad Deploy)


```bash
# Check deployment history
kubectl rollout history deployment/api -n production

# Rollback to the previous version
kubectl rollout undo deployment/api -n production

# Rollback to a specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# Verify rollback
kubectl rollout status deployment/api -n production
```

Option 4: Database Connection Reset


```bash
# If the database connection pool is exhausted, reload the app process
kubectl exec -it deployment/api -n production -- kill -HUP 1

# Or terminate the app's idle database connections
psql "$DATABASE_URL" -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE application_name = 'api' AND state = 'idle'"
```

Verification



1. Check health endpoint

   Expected: `{"status": "healthy"}`

2. Check API endpoints

   Expected: valid JSON response
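The two checks above can be scripted. A sketch, with `api.example.com` and the `/api/v1/status` path standing in for real values, and `python3 -m json.tool` used as a dependency-free JSON validator:

```shell
# Return success iff stdin is valid JSON (Python stdlib, no jq needed).
is_json() { python3 -m json.tool >/dev/null 2>&1; }

verify_service() {
  local base="${1:-https://api.example.com}"   # illustrative hostname
  # 1. Health endpoint should report healthy
  curl -sf "$base/health" | grep -q '"status": *"healthy"' || return 1
  # 2. A representative API endpoint should return valid JSON
  curl -sf "$base/api/v1/status" | is_json || return 1   # illustrative path
  echo "verification passed"
}
```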

3. Check metrics

   Verify:
   - Error rate < 1%
   - Response time < 500ms
   - All pods healthy

4. Check logs for errors

```bash
kubectl logs deployment/api -n production --tail=100 | grep -i error
```

   Expected: no new errors

Communication


Initial Update (within 5 minutes):
🚨 INCIDENT: Service Down

Status: Investigating
Severity: P0
Impact: All users unable to access application
Start Time: 2025-01-15 14:30 UTC

We are investigating reports of users unable to access the application.
Our team is working to identify the root cause.

Next update in 15 minutes.
Progress Update (every 15 minutes):
🔍 UPDATE: Service Down

Status: Identified
Root Cause: Database connection pool exhausted
Action: Restarting application pods
ETA: 5 minutes

We have identified the issue and are implementing a fix.
Resolution Update:
✅ RESOLVED: Service Down

Status: Resolved
Resolution: Restarted application pods, reset database connections
Duration: 23 minutes

The service is now fully operational. We are monitoring closely
and will conduct a post-mortem to prevent future occurrences.
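Updates like these can be posted programmatically. A sketch using a Slack incoming webhook; `SLACK_WEBHOOK_URL` is a placeholder for your channel's webhook:

```shell
# Build the JSON payload for a status update (\n is a JSON-escaped newline).
build_payload() { printf '{"text":"*%s*\\n%s"}' "$1" "$2"; }

# Post an update to the incident channel via a Slack incoming webhook.
post_update() {
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "$(build_payload "$1" "$2")" "$SLACK_WEBHOOK_URL"
}

# Usage:
# post_update "INCIDENT: Service Down" "Status: Investigating / Severity: P0"
```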

Post-Incident


  1. Create post-mortem document
    • Timeline of events
    • Root cause analysis
    • Action items to prevent recurrence
  2. Update monitoring
    • Add alerts for this scenario
    • Improve detection time
  3. Update runbook
    • Document any new findings
    • Add shortcuts for faster resolution


Database Issues


High Connection Count


Symptoms:
  • Database rejecting new connections
  • Error: "too many connections"
  • Alert: `db_connections_high`

Quick Fix:

```bash
# 1. Check connection count per application
psql "$DATABASE_URL" -c "
  SELECT count(*), application_name
  FROM pg_stat_activity
  GROUP BY application_name"

# 2. Kill idle connections older than 10 minutes
psql "$DATABASE_URL" -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
    AND query_start < now() - interval '10 minutes'"

# 3. Restart connection pools (rolling restart of the app)
kubectl rollout restart deployment/api -n production
```

Slow Queries


Symptoms:
  • API response times > 5 seconds
  • Database CPU at 100%
  • Alert: `slow_query_detected`

Investigation:
```sql
-- Find slow queries
SELECT
  pid,
  now() - query_start AS duration,
  query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;

-- Check for possible missing indexes (heavy sequential scans)
SELECT
  schemaname,
  tablename,
  seq_scan,
  seq_tup_read,
  idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC
LIMIT 10;

-- Kill a long-running query (if needed)
SELECT pg_terminate_backend(12345);  -- Replace with the actual PID
```

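Before terminating a query, it is often worth confirming its plan. A sketch wrapping `EXPLAIN (ANALYZE, BUFFERS)` in psql; the example query is illustrative:

```shell
# Show the execution plan (with real timings) for a suspect query.
# Caution: EXPLAIN ANALYZE actually executes the query.
explain_query() {
  psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) $1"
}

# Usage (query is illustrative):
# explain_query "SELECT * FROM orders WHERE customer_id = 42"
# A Seq Scan on a large table usually corroborates the missing-index check.
```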

High CPU/Memory Usage


Symptoms

  • Pods being OOMKilled
  • Response times increasing
  • Alerts: `high_memory_usage`, `high_cpu_usage`

Investigation


```bash
# 1. Check pod resource usage
kubectl top pods -n production

# 2. Check resource limits
kubectl describe pod <pod-name> -n production | grep -A 5 Limits

# 3. Check for memory leaks
kubectl logs deployment/api -n production | grep -i "out of memory"

# 4. Profile the application (if needed)
kubectl exec -it <pod-name> -n production -- sh
# Then run a profiler: node --inspect, py-spy, etc.
```

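For Python services, the `py-spy` mention above can look like this in practice. A sketch; it assumes `py-spy` is installed in the image and the main process is PID 1:

```shell
# Attach py-spy to the main process (PID 1) inside a pod for a
# live, top-like view of the hottest functions.
profile_python_pod() {
  local pod="$1"
  kubectl exec -it "$pod" -n production -- py-spy top --pid 1
}

# One-off stack snapshot instead of a live view:
# kubectl exec -it <pod-name> -n production -- py-spy dump --pid 1
```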

Resolution


```bash
# Option 1: Increase resources
kubectl set resources deployment/api -n production \
  --limits=cpu=2000m,memory=4Gi \
  --requests=cpu=1000m,memory=2Gi

# Option 2: Scale horizontally
kubectl scale deployment/api -n production --replicas=6

# Option 3: Restart problematic pods
kubectl delete pod <pod-name> -n production
```

---

Rollback Procedures


Application Rollback


```bash
# 1. List deployment history
kubectl rollout history deployment/api -n production

# 2. Check a specific revision
kubectl rollout history deployment/api -n production --revision=5

# 3. Rollback to the previous revision
kubectl rollout undo deployment/api -n production

# 4. Rollback to a specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# 5. Verify the rollback
kubectl rollout status deployment/api -n production
kubectl get pods -n production
```

Database Rollback


```bash
# 1. Check migration status
npm run db:migrate:status

# 2. Rollback the last migration
npm run db:migrate:undo

# 3. Rollback to a specific migration (note the extra -- so npm
#    forwards the flag to the underlying script)
npm run db:migrate:undo -- --to 20250115120000-migration-name

# 4. Verify database state
psql "$DATABASE_URL" -c "\dt"
```

---

Escalation Path


  1. Level 1 - On-call Engineer (You)
    • Initial response and investigation
    • Attempt standard fixes from runbook
  2. Level 2 - Senior Engineers
    • Escalate if not resolved in 30 minutes
    • Escalate if issue is complex/unclear
    • Contact via PagerDuty or Slack
  3. Level 3 - Engineering Manager
    • Escalate if not resolved in 1 hour
    • Escalate if cross-team coordination needed
  4. Level 4 - VP Engineering / CTO
    • Escalate for P0 incidents > 2 hours
    • Escalate for security breaches
    • Escalate for data loss


Useful Commands


```bash
# Kubernetes
kubectl get pods -n production
kubectl logs -f <pod-name> -n production
kubectl describe pod <pod-name> -n production
kubectl exec -it <pod-name> -n production -- sh
kubectl top pods -n production

# Database
psql "$DATABASE_URL" -c "SELECT version()"
psql "$DATABASE_URL" -c "SELECT * FROM pg_stat_activity"

# AWS
aws ecs list-tasks --cluster production
aws rds describe-db-instances
aws cloudwatch get-metric-statistics ...
```

Monitoring URLs


Best Practices


✅ DO


  • Include quick reference section at top
  • Provide exact commands to run
  • Document expected outputs
  • Include verification steps
  • Add communication templates
  • Define severity levels clearly
  • Document escalation paths
  • Include useful links and contacts
  • Keep runbooks up-to-date
  • Test runbooks regularly
  • Include screenshots/diagrams
  • Document common gotchas
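A lightweight way to enforce this structure is to start every new runbook entry from the same skeleton. A sketch; the section names follow this document, and the output path is up to you:

```shell
# Write a skeleton runbook entry that follows the recommended structure.
new_runbook() {
  cat > "$1" <<'EOF'
# <Scenario name>

## Symptoms
<alerts, error messages, user reports>

## Severity
<P0-P3, and why>

## Investigation
<exact commands, with expected output noted>

## Resolution
<options, quickest first>

## Verification
<how to confirm recovery>

## Communication
<templates and channels>
EOF
}

# Usage: new_runbook runbooks/service-down.md
```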

❌ DON'T


  • Use vague instructions
  • Skip verification steps
  • Forget to document prerequisites
  • Assume knowledge of tools
  • Skip communication guidelines
  • Forget to update after incidents

Resources
