devops

DevOps — Infrastructure & Deployment

Ship reliably. Monitor everything. Fix fast.

Deployment Checklist

Before any deploy:
  1. All tests pass in CI (not just locally)
  2. Environment variables set in target environment
  3. Database migrations tested against production-like data
  4. Rollback plan documented (even if it's "revert this commit")
  5. Health check endpoint exists and returns 200
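Item 2 is easy to automate as a pre-deploy gate. The following is a minimal sketch, assuming the target environment's variables can be read into a plain object; the function name `missingEnvVars` and the variable names are hypothetical, not part of any platform API.

```typescript
// Hypothetical pre-deploy gate: all names here are illustrative assumptions.
type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function missingEnvVars(required: string[], env: Env): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Example input: DATABASE_URL, SESSION_SECRET and REDIS_URL are assumed names.
const targetEnv: Env = { DATABASE_URL: "postgres://example", SESSION_SECRET: "" };
const missing = missingEnvVars(["DATABASE_URL", "SESSION_SECRET", "REDIS_URL"], targetEnv);

// Prints "missing: SESSION_SECRET, REDIS_URL"
console.log(missing.length === 0 ? "ok to deploy" : `missing: ${missing.join(", ")}`);
```

In a real pipeline the second argument would be the environment of the deploy target, and a non-empty result would fail the job before the deploy step runs.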

CI/CD Pipeline

push → lint → typecheck → test → build → deploy staging → smoke test → deploy prod
| Stage | On failure | Action |
|---|---|---|
| Lint/Types | Block merge | Fix locally |
| Tests | Block merge | Fix or update tests |
| Build | Block merge | Fix build errors |
| Staging deploy | Block prod | Debug in staging |
| Smoke test | Block prod | Rollback staging, investigate |
| Prod deploy | Alert on-call | Rollback immediately |
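The stage order and failure policy above can be sketched as a small fail-fast runner. This is an illustration only: `runPipeline` and `POLICY` are hypothetical names, the stages run synchronously for brevity, and a real CI system would express the same thing in its own configuration format.

```typescript
// Stage order taken from the pipeline line above.
const STAGES = [
  "lint", "typecheck", "test", "build",
  "deploy staging", "smoke test", "deploy prod",
] as const;
type Stage = (typeof STAGES)[number];

// Failure policy taken from the table above.
const POLICY: Record<Stage, { onFail: string; action: string }> = {
  "lint": { onFail: "Block merge", action: "Fix locally" },
  "typecheck": { onFail: "Block merge", action: "Fix locally" },
  "test": { onFail: "Block merge", action: "Fix or update tests" },
  "build": { onFail: "Block merge", action: "Fix build errors" },
  "deploy staging": { onFail: "Block prod", action: "Debug in staging" },
  "smoke test": { onFail: "Block prod", action: "Rollback staging, investigate" },
  "deploy prod": { onFail: "Alert on-call", action: "Rollback immediately" },
};

type Outcome =
  | { ok: true }
  | { ok: false; stage: Stage; onFail: string; action: string };

// Runs stages in order and stops at the first failure, so later stages never run.
function runPipeline(runStage: (stage: Stage) => boolean): Outcome {
  for (const stage of STAGES) {
    if (!runStage(stage)) {
      return { ok: false, stage, ...POLICY[stage] };
    }
  }
  return { ok: true };
}

// Example: simulate a failing smoke test; prod deploy is never attempted.
console.log(runPipeline((stage) => stage !== "smoke test"));
```

The point of the sketch is the early return: a failure at any stage blocks everything to its right in the pipeline.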

Docker

```dockerfile
# Multi-stage build: keep images small

# Stage 1: install all dependencies (including dev) and build
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build

# Stage 2: runtime image with only the build output and node_modules
FROM node:22-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/server.js"]
```

**Rules:**
- Always pin base image versions (not `latest`)
- Use `.dockerignore` — never ship `node_modules`, `.git`, `.env`
- One process per container
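- Health check in Dockerfile: `HEALTHCHECK CMD curl -f http://localhost:3000/health || exit 1` (Alpine-based images do not include `curl` by default, so install it or use BusyBox `wget`)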
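On the application side, the `/health` endpoint that the health check probes could look roughly like the sketch below. It is framework-free on purpose: `handleHealth` and the check names are hypothetical, and wiring the function into an HTTP server on port 3000 is left to whatever framework the service uses.

```typescript
type Check = () => boolean;
type HealthResponse = { status: number; body: string };

// Returns 200 when every dependency check passes and 503 otherwise.
// `curl -f` exits non-zero on 503, so the container is marked unhealthy.
function handleHealth(path: string, checks: Record<string, Check>): HealthResponse {
  if (path !== "/health") {
    return { status: 404, body: JSON.stringify({ error: "not found" }) };
  }
  const results: Record<string, boolean> = {};
  for (const name of Object.keys(checks)) {
    try {
      results[name] = checks[name]();
    } catch (_err) {
      // A check that throws counts as a failed check.
      results[name] = false;
    }
  }
  const healthy = Object.keys(results).every((name) => results[name]);
  return {
    status: healthy ? 200 : 503,
    body: JSON.stringify({ healthy, checks: results }),
  };
}

// Example with two assumed dependency checks.
const response = handleHealth("/health", {
  database: () => true,
  cache: () => { throw new Error("connection refused"); },
});
console.log(response.status, response.body);
```

Keeping the handler a pure function makes it testable without starting a server; keep the checks cheap, since the probe runs on a schedule.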

Monitoring

| What | Tool Options | Alert When |
|---|---|---|
| Uptime | UptimeRobot, Checkly | Down > 30 seconds |
| Errors | Sentry, Datadog | Error rate > 1% |
| Latency | Grafana, Datadog | p95 > 2 seconds |
| Resources | Cloud provider metrics | CPU > 80%, memory > 85% |
| Logs | Datadog, Axiom, CloudWatch | Error patterns, keywords |
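The numeric thresholds in the table can be written down as a single function, which helps when reviewing or testing alert rules. This is a sketch under assumptions: the `Metrics` shape and the `firingAlerts` name are hypothetical, and rates are expressed as fractions (0.01 for 1%). The tools listed above each have their own rule syntax.

```typescript
// Hypothetical snapshot of the metrics named in the table; rates are fractions.
type Metrics = {
  downSeconds: number;
  errorRate: number;
  p95LatencySeconds: number;
  cpu: number;
  memory: number;
};

// Returns the names of the alerts that should fire for this snapshot.
function firingAlerts(m: Metrics): string[] {
  const alerts: string[] = [];
  if (m.downSeconds > 30) alerts.push("uptime");
  if (m.errorRate > 0.01) alerts.push("errors");
  if (m.p95LatencySeconds > 2) alerts.push("latency");
  if (m.cpu > 0.8 || m.memory > 0.85) alerts.push("resources");
  return alerts;
}

// Prints [ 'errors', 'resources' ]
console.log(firingAlerts({
  downSeconds: 0,
  errorRate: 0.02,
  p95LatencySeconds: 1.2,
  cpu: 0.9,
  memory: 0.5,
}));
```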
Rules:
  • Every alert must have a runbook (even a one-liner)
  • If an alert fires and needs no action, delete it — alert fatigue kills
  • Log structured JSON, not printf strings
  • Include request ID in every log line for tracing
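The last two rules can be illustrated with a tiny formatter that emits one JSON object per log line and always carries the request ID. This is a minimal sketch, not a replacement for a logging library; `formatLog` and the field names are assumptions.

```typescript
type LogFields = Record<string, string | number | boolean>;

// Builds one structured log line. Core keys are written last so that
// extra fields cannot overwrite time, level, requestId or message.
function formatLog(
  level: "info" | "warn" | "error",
  message: string,
  requestId: string,
  fields: LogFields = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...fields,
    time: now.toISOString(),
    level,
    requestId,
    message,
  });
}

// Example: "req-123" and orderId are made-up values.
console.log(formatLog("error", "payment failed", "req-123", { orderId: 42 }));
```

Because every line is valid JSON with a `requestId` key, a log tool can filter on that key to reconstruct a single request across services.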

Infrastructure Defaults

| Decision | Default | Why |
|---|---|---|
| Hosting | Vercel / Railway / Fly.io | Zero-config, scales |
| Database | Managed Postgres (Supabase, Neon, RDS) | Don't manage your own DB |
| Cache | Upstash Redis | Serverless, no ops |
| Queue | Inngest, Trigger.dev, or SQS | Managed, retries built-in |
| Storage | S3 / R2 / Supabase Storage | Cheap, reliable |
| DNS | Cloudflare | Fast, free tier |
| Secrets | Environment variables via platform | Never in code or git |

Incident Response

  1. Detect — alert fires or user report
  2. Acknowledge — someone owns it (within 5 min)
  3. Mitigate — rollback, feature flag off, or scale up (fix the bleeding)
  4. Investigate — root cause after bleeding stops
  5. Fix — proper fix with tests
  6. Postmortem — blameless, focus on systems not people
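The six steps form a simple linear lifecycle, and the 5-minute acknowledgement window is the one hard number in it. A rough sketch of both follows; the `Incident` shape and the function names are hypothetical, and timestamps are milliseconds since the epoch.

```typescript
// Step order taken from the list above.
const INCIDENT_STEPS = [
  "detect", "acknowledge", "mitigate", "investigate", "fix", "postmortem",
] as const;
type IncidentStep = (typeof INCIDENT_STEPS)[number];

// Acknowledgement window from step 2: five minutes.
const ACK_DEADLINE_MS = 5 * 60 * 1000;

type Incident = { detectedAt: number; acknowledgedAt?: number };

// Returns the step that follows `step`, or null after the postmortem.
function nextStep(step: IncidentStep): IncidentStep | null {
  const index = INCIDENT_STEPS.indexOf(step);
  return INCIDENT_STEPS[index + 1] ?? null;
}

// True when the incident was, or still is, unacknowledged past the deadline.
function ackOverdue(incident: Incident, now: number): boolean {
  const ackTime = incident.acknowledgedAt ?? now;
  return ackTime - incident.detectedAt > ACK_DEADLINE_MS;
}

// Example: detected at t=0, still unacknowledged six minutes later.
console.log(nextStep("mitigate"));                             // investigate
console.log(ackOverdue({ detectedAt: 0 }, 6 * 60 * 1000));     // true
```

A check like `ackOverdue` is the kind of condition an on-call tool would use to escalate to a second responder.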