DevOps: Infrastructure & Deployment
Ship reliably. Monitor everything. Fix fast.
Deployment Checklist
Before any deploy:
- All tests pass in CI (not just locally)
- Environment variables set in target environment
- Database migrations tested against production-like data
- Rollback plan documented (even if it's "revert this commit")
- Health check endpoint exists and returns 200
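The environment-variable item in the checklist above can be automated before a deploy. The following is a minimal sketch, not a prescribed tool: `findMissingEnv` and the variable names `DATABASE_URL` and `SENTRY_DSN` are hypothetical, chosen only for illustration.

```typescript
// Hypothetical helper; names are illustrative, not tied to any platform.
type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function findMissingEnv(required: string[], env: Env): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Example with assumed variable names; SENTRY_DSN is present but empty.
const missing = findMissingEnv(
  ["DATABASE_URL", "SENTRY_DSN"],
  { DATABASE_URL: "postgres://example", SENTRY_DSN: "" },
);
if (missing.length > 0) {
  console.log(`Missing environment variables: ${missing.join(", ")}`);
}
```

In a real deploy script you would likely pass the target environment's variables (for Node, `process.env`) and exit non-zero when the list is not empty, so that CI stops before the deploy step.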
CI/CD Pipeline
push → lint → typecheck → test → build → deploy staging → smoke test → deploy prod

| Stage | If it fails | Action |
|---|---|---|
| Lint/Types | Block merge | Fix locally |
| Tests | Block merge | Fix or update tests |
| Build | Block merge | Fix build errors |
| Staging deploy | Block prod | Debug in staging |
| Smoke test | Block prod | Rollback staging, investigate |
| Prod deploy | Alert on-call | Rollback immediately |
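The stage order and the failure rules in the table above can be sketched as a small gate function. This is an illustration under assumptions, not a real CI configuration: `runStage` is a hypothetical callback standing in for whatever your CI system actually executes.

```typescript
type Stage =
  | "lint" | "typecheck" | "test" | "build"
  | "deploy staging" | "smoke test" | "deploy prod";

const PIPELINE: Stage[] = [
  "lint", "typecheck", "test", "build",
  "deploy staging", "smoke test", "deploy prod",
];

// What a failure at each stage blocks, following the table above.
function onFailure(stage: Stage): string {
  switch (stage) {
    case "lint":
    case "typecheck":
    case "test":
    case "build":
      return "block merge";
    case "deploy staging":
    case "smoke test":
      return "block prod";
    case "deploy prod":
      return "alert on-call and roll back";
  }
}

interface PipelineResult {
  passed: Stage[];
  failed?: Stage;
  consequence?: string;
}

// Runs stages in order and stops at the first failure.
// `runStage` is a hypothetical callback that returns true on success.
function runPipeline(runStage: (stage: Stage) => boolean): PipelineResult {
  const passed: Stage[] = [];
  for (const stage of PIPELINE) {
    if (!runStage(stage)) {
      return { passed, failed: stage, consequence: onFailure(stage) };
    }
    passed.push(stage);
  }
  return { passed };
}

// Example: simulate a smoke test failure.
const result = runPipeline((stage) => stage !== "smoke test");
console.log(result.failed, "->", result.consequence); // smoke test -> block prod
```

The point of stopping at the first failure is that no later stage, and in particular no production deploy, runs on top of a broken earlier one.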
Docker
```dockerfile
# Multi-stage build: keep images small

# Stage 1: install all dependencies (including dev) and compile
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --include=dev
COPY . .
RUN npm run build
# Drop devDependencies so only runtime packages reach the final image
RUN npm prune --omit=dev

# Stage 2: runtime image with only the build output and production dependencies
FROM node:22-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/server.js"]
```
**Rules:**
- Always pin base image versions (not `latest`)
- Use `.dockerignore` — never ship `node_modules`, `.git`, `.env`
- One process per container
- Health check in Dockerfile: `HEALTHCHECK CMD wget -qO- http://localhost:3000/health || exit 1` (Alpine-based Node images typically ship BusyBox `wget` rather than `curl`; install `curl` in the image if you prefer `curl -f`)
Monitoring
| What | Tool Options | Alert When |
|---|---|---|
| Uptime | UptimeRobot, Checkly | Down > 30 seconds |
| Errors | Sentry, Datadog | Error rate > 1% |
| Latency | Grafana, Datadog | p95 > 2 seconds |
| Resources | Cloud provider metrics | CPU > 80%, memory > 85% |
| Logs | Datadog, Axiom, CloudWatch | Error patterns, keywords |
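To make the numeric thresholds in the table concrete, here is a rough sketch of how they might be evaluated over one measurement window. The `Metrics` shape and its field names are assumptions, and the p95 uses the simple nearest-rank method; monitoring tools such as Grafana or Datadog compute percentiles in their own ways.

```typescript
// Thresholds come from the table above; the metric field names are assumptions.
interface Metrics {
  totalRequests: number;
  failedRequests: number;
  latenciesMs: number[]; // one entry per request in the window
  cpuPercent: number;
  memoryPercent: number;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((p * sorted.length) / 100), 1);
  return sorted[rank - 1] ?? 0;
}

// Returns a label for every threshold that is currently exceeded.
function firingAlerts(m: Metrics): string[] {
  const alerts: string[] = [];
  const errorRate = m.totalRequests === 0 ? 0 : m.failedRequests / m.totalRequests;
  if (errorRate > 0.01) alerts.push("error rate > 1%");
  if (percentile(m.latenciesMs, 95) > 2000) alerts.push("p95 latency > 2 seconds");
  if (m.cpuPercent > 80) alerts.push("CPU > 80%");
  if (m.memoryPercent > 85) alerts.push("memory > 85%");
  return alerts;
}

// Example window: 2.5% errors, one slow request, high memory.
const firing = firingAlerts({
  totalRequests: 1000,
  failedRequests: 25,
  latenciesMs: [120, 180, 250, 300, 2400],
  cpuPercent: 62,
  memoryPercent: 90,
});
console.log(firing);
```

With only five latency samples the slowest request alone decides the p95, which is one reason to evaluate alerts over windows with enough traffic.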
**Rules:**
- Every alert must have a runbook (even a one-liner)
- If an alert fires and needs no action, delete it — alert fatigue kills
- Log structured JSON, not printf strings
- Include request ID in every log line for tracing
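The last two rules can be combined in one small logging helper. This is a sketch under assumptions: the field names (`ts`, `level`, `requestId`, `msg`) and the ID `req-123` are illustrative, and in practice you might reach for an existing structured logger instead.

```typescript
type Level = "debug" | "info" | "warn" | "error";

interface LogFields {
  [key: string]: unknown;
}

// Builds one JSON log line. Core fields are written last so that
// caller-supplied fields cannot overwrite them.
function formatLog(
  level: Level,
  requestId: string,
  msg: string,
  fields: LogFields = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({ ...fields, ts: now.toISOString(), level, requestId, msg });
}

// Binds a request ID once so every line for that request carries it.
function loggerFor(requestId: string) {
  return {
    info: (msg: string, fields?: LogFields) =>
      console.log(formatLog("info", requestId, msg, fields)),
    error: (msg: string, fields?: LogFields) =>
      console.error(formatLog("error", requestId, msg, fields)),
  };
}

const log = loggerFor("req-123"); // hypothetical request ID
log.info("order created", { orderId: 42, durationMs: 18 });
```

Because every line is a JSON object with the same `requestId` key, a log tool can filter one request's full trail with a single field query instead of a text search.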
Infrastructure Defaults
| Decision | Default | Why |
|---|---|---|
| Hosting | Vercel / Railway / Fly.io | Zero-config, scales |
| Database | Managed Postgres (Supabase, Neon, RDS) | Don't manage your own DB |
| Cache | Upstash Redis | Serverless, no ops |
| Queue | Inngest, Trigger.dev, or SQS | Managed, retries built-in |
| Storage | S3 / R2 / Supabase Storage | Cheap, reliable |
| DNS | Cloudflare | Fast, free tier |
| Secrets | Environment variables via platform | Never in code or git |
Incident Response
- Detect — alert fires or user report
- Acknowledge — someone owns it (within 5 min)
- Mitigate — rollback, feature flag off, or scale up (fix the bleeding)
- Investigate — root cause after bleeding stops
- Fix — proper fix with tests
- Postmortem — blameless, focus on systems not people
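The phases above run in a fixed order, and acknowledgement has a 5-minute deadline. One way to model that is a tiny state machine; this is a sketch only, with a hypothetical `Incident` shape and helper names, not the behavior of any incident tool.

```typescript
const PHASES = [
  "detect", "acknowledge", "mitigate", "investigate", "fix", "postmortem",
] as const;
type Phase = (typeof PHASES)[number];

const ACK_DEADLINE_MS = 5 * 60 * 1000; // "within 5 min" from the list above

interface Incident {
  phase: Phase;
  detectedAt: number; // epoch milliseconds
  acknowledgedAt?: number;
  owner?: string;
}

// Moves to the next phase; phases cannot be skipped or reordered.
function advance(incident: Incident, now: number, owner?: string): Incident {
  const index = PHASES.indexOf(incident.phase);
  const next: Phase | undefined = PHASES[index + 1];
  if (next === undefined) throw new Error("incident is already at postmortem");
  if (next === "acknowledge") {
    // Acknowledging means a named person owns the incident.
    if (!owner) throw new Error("acknowledging requires an owner");
    return { ...incident, phase: next, acknowledgedAt: now, owner };
  }
  return { ...incident, phase: next };
}

// True when the incident was not acknowledged within the 5-minute window.
function ackOverdue(incident: Incident, now: number): boolean {
  const ackTime = incident.acknowledgedAt ?? now;
  return ackTime - incident.detectedAt > ACK_DEADLINE_MS;
}

// Example: detected at t=0, acknowledged by "alice" after 4 minutes.
let incident: Incident = { phase: "detect", detectedAt: 0 };
incident = advance(incident, 4 * 60 * 1000, "alice");
console.log(incident.phase, ackOverdue(incident, 10 * 60 * 1000)); // acknowledge false
```

Forcing mitigation to come before investigation in the type of the workflow mirrors the rule in the list: stop the bleeding first, look for the root cause afterwards.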