devops

DevOps — Infrastructure & Deployment

Ship reliably. Monitor everything. Fix fast.

Deployment Checklist

Before any deploy:

- All tests pass in CI (not just locally)
- Environment variables set in target environment
- Database migrations tested against production-like data
- Rollback plan documented (even if it's "revert this commit")
- Health check endpoint exists and returns 200
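
The last checklist item is the easiest to automate. Below is a minimal TypeScript sketch, assuming a hypothetical `checkHealth` helper and probe names that are not part of the original checklist. It runs each dependency probe and reports 200 only when all of them succeed, 503 otherwise.

```typescript
// Hypothetical helper: the names and shapes here are illustrative assumptions.
// A probe resolves when its dependency is healthy and rejects otherwise.
type Probe = () => Promise<void>;

interface HealthResult {
  status: 200 | 503;
  body: { ok: boolean; failed: string[] };
}

// Run every probe in turn and collect the names of the ones that fail.
async function checkHealth(probes: Record<string, Probe>): Promise<HealthResult> {
  const failed: string[] = [];
  for (const [name, probe] of Object.entries(probes)) {
    try {
      await probe();
    } catch {
      failed.push(name);
    }
  }
  const ok = failed.length === 0;
  return { status: ok ? 200 : 503, body: { ok, failed } };
}
```

A handler for the health check path would call `checkHealth` with whatever probes the service needs (for example a database ping) and send back the returned status and body. Keeping the logic separate from the HTTP layer makes it usable from a smoke test as well.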

CI/CD Pipeline

push → lint → typecheck → test → build → deploy staging → smoke test → deploy prod

| Stage | On failure | Action |
| --- | --- | --- |
| Lint/Types | Block merge | Fix locally |
| Tests | Block merge | Fix or update tests |
| Build | Block merge | Fix build errors |
| Staging deploy | Block prod | Debug in staging |
| Smoke test | Block prod | Roll back staging, investigate |
| Prod deploy | Alert on-call | Roll back immediately |
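
The gating rules in the table can be sketched as a small runner. This is an illustrative TypeScript model, not the configuration of any particular CI system. The stage names come from the pipeline above, while `onFailure` and `runPipeline` are hypothetical names.

```typescript
// Stage names follow the pipeline above; the runner itself is a hypothetical sketch.
type Stage =
  | "lint" | "typecheck" | "test" | "build"
  | "deploy staging" | "smoke test" | "deploy prod";

const STAGES: Stage[] = [
  "lint", "typecheck", "test", "build",
  "deploy staging", "smoke test", "deploy prod",
];

// What a failure at each stage blocks or triggers, per the table.
function onFailure(stage: Stage): string {
  switch (stage) {
    case "lint":
    case "typecheck":
    case "test":
    case "build":
      return "block merge";
    case "deploy staging":
    case "smoke test":
      return "block prod";
    case "deploy prod":
      return "alert on-call and roll back";
  }
}

// Run stages in order and stop at the first failure, so later stages never run.
function runPipeline(
  run: (stage: Stage) => boolean,
): { failedAt: Stage | null; consequence: string | null } {
  for (const stage of STAGES) {
    if (!run(stage)) {
      return { failedAt: stage, consequence: onFailure(stage) };
    }
  }
  return { failedAt: null, consequence: null };
}
```

The point of the ordering is that cheap checks run first, and nothing reaches production unless every earlier stage, including the staging smoke test, has passed.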

Docker

```dockerfile
# Multi-stage build: keep images small
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Install dev dependencies too; the build step needs them
RUN npm ci --include=dev
COPY . .
RUN npm run build
# Drop dev dependencies before node_modules is copied into the runtime image
RUN npm prune --omit=dev

FROM node:22-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/server.js"]
```

**Rules:**
- Always pin base image versions (not `latest`)
- Use `.dockerignore` — never ship `node_modules`, `.git`, `.env`
- One process per container
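- Health check in Dockerfile: `HEALTHCHECK CMD wget -qO- http://localhost:3000/health || exit 1` (the `node:22-alpine` image ships BusyBox `wget` but not `curl`; install `curl` with `apk add --no-cache curl` if you prefer `curl -f`)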

Monitoring

| What | Tool Options | Alert When |
| --- | --- | --- |
| Uptime | UptimeRobot, Checkly | Down > 30 seconds |
| Errors | Sentry, Datadog | Error rate > 1% |
| Latency | Grafana, Datadog | p95 > 2 seconds |
| Resources | Cloud provider metrics | CPU > 80%, memory > 85% |
| Logs | Datadog, Axiom, CloudWatch | Error patterns, keywords |
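
In practice these thresholds live in the monitoring tool's own alert configuration. As a plain illustration of what the numeric rows mean, here is a TypeScript sketch. The `Metrics` shape and `firingAlerts` name are assumptions made for this example.

```typescript
// Thresholds are copied from the table; the Metrics shape is an illustrative assumption.
interface Metrics {
  downSeconds: number;   // continuous downtime in seconds
  errorRate: number;     // fraction of failed requests, 0..1
  p95LatencyMs: number;  // 95th percentile latency in milliseconds
  cpuPercent: number;
  memoryPercent: number;
}

// Return the names of the table rows whose alert condition is met.
function firingAlerts(m: Metrics): string[] {
  const alerts: string[] = [];
  if (m.downSeconds > 30) alerts.push("uptime");
  if (m.errorRate > 0.01) alerts.push("errors");
  if (m.p95LatencyMs > 2000) alerts.push("latency");
  if (m.cpuPercent > 80 || m.memoryPercent > 85) alerts.push("resources");
  return alerts;
}
```

The log row is left out because it matches patterns and keywords rather than a numeric threshold.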

**Rules:**
- Every alert must have a runbook (even a one-liner)
- If an alert fires and needs no action, delete it; alert fatigue kills
- Log structured JSON, not printf strings
- Include request ID in every log line for tracing
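
The last two rules can be sketched together. This is a minimal TypeScript example that assumes a hypothetical `formatLog` helper; the field names (`ts`, `level`, `msg`, `requestId`) are an illustrative choice, not a standard.

```typescript
// Hypothetical logger: field names are assumptions chosen for this example.
type Level = "debug" | "info" | "warn" | "error";

// Build one JSON log line. Extra fields are spread first so callers
// cannot overwrite the core fields by accident.
function formatLog(
  level: Level,
  msg: string,
  requestId: string,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    ...fields,
    ts: new Date().toISOString(),
    level,
    msg,
    requestId,
  });
}

// Write the line to stdout, where the platform's log collector can pick it up.
function log(level: Level, msg: string, requestId: string, fields?: Record<string, unknown>): void {
  console.log(formatLog(level, msg, requestId, fields));
}
```

Because every line is a JSON object carrying the same `requestId`, a log tool can filter on that one field to reconstruct a single request across services, which is not possible with free-form strings.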

Infrastructure Defaults

| Decision | Default | Why |
| --- | --- | --- |
| Hosting | Vercel / Railway / Fly.io | Zero-config, scales |
| Database | Managed Postgres (Supabase, Neon, RDS) | Don't manage your own DB |
| Cache | Upstash Redis | Serverless, no ops |
| Queue | Inngest, Trigger.dev, or SQS | Managed, retries built-in |
| Storage | S3 / R2 / Supabase Storage | Cheap, reliable |
| DNS | Cloudflare | Fast, free tier |
| Secrets | Environment variables via platform | Never in code or git |
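
Reading secrets from the environment works best when the service fails fast if one is missing, rather than failing on the first request that needs it. Here is a minimal TypeScript sketch, assuming a hypothetical `requireEnv` helper. The variable names in the usage note are placeholders.

```typescript
// Hypothetical helper for validating configuration at startup.
type Env = Record<string, string | undefined>;

// Return the required variables, or throw one error that lists every missing name.
function requireEnv<K extends string>(env: Env, names: readonly K[]): Record<K, string> {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const result = {} as Record<K, string>;
  for (const name of names) {
    result[name] = env[name] as string;
  }
  return result;
}
```

At startup a Node service might call `requireEnv(process.env, ["DATABASE_URL"])`, where `DATABASE_URL` stands in for whatever the platform injects. This also covers the checklist item about environment variables being set in the target environment, since a misconfigured deploy stops before it serves traffic.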

Incident Response

1. **Detect**: an alert fires or a user reports a problem
2. **Acknowledge**: someone owns it (within 5 min)
3. **Mitigate**: roll back, turn the feature flag off, or scale up (stop the bleeding)
4. **Investigate**: find the root cause after the bleeding stops
5. **Fix**: proper fix with tests
6. **Postmortem**: blameless, focus on systems not people
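
The sequence and the 5-minute acknowledgement window can be modeled in a few lines, for instance as the basis of an escalation check. This TypeScript sketch uses hypothetical names (`nextStep`, `isAckOverdue`) and is not tied to any paging tool.

```typescript
// Steps in the order listed above; helper names are illustrative assumptions.
const STEPS = ["detect", "acknowledge", "mitigate", "investigate", "fix", "postmortem"] as const;
type Step = (typeof STEPS)[number];

// Acknowledgement window from the list above: 5 minutes.
const ACK_DEADLINE_MS = 5 * 60 * 1000;

// Return the step that follows the current one, or null after the postmortem.
function nextStep(current: Step): Step | null {
  const i = STEPS.indexOf(current);
  return i < STEPS.length - 1 ? STEPS[i + 1] : null;
}

// True when the incident was acknowledged late, or is still
// unacknowledged past the deadline.
function isAckOverdue(detectedAt: Date, acknowledgedAt: Date | null, now: Date): boolean {
  const end = acknowledgedAt ?? now;
  return end.getTime() - detectedAt.getTime() > ACK_DEADLINE_MS;
}
```

Mitigation comes before investigation in the order on purpose: the model has no path from `acknowledge` straight to `investigate`.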
