senior-devops
Senior DevOps Engineer
Overview
Design, build, and maintain production infrastructure and deployment pipelines. This skill covers Docker containerization, Kubernetes orchestration, CI/CD with GitHub Actions, infrastructure-as-code with Terraform/Pulumi, monitoring with Prometheus/Grafana, alerting strategies, zero-downtime deployments, and rollback procedures.
Phase 1: Infrastructure Design
- Define deployment topology (single server, cluster, multi-region)
- Choose containerization strategy (Docker, Buildpacks)
- Select orchestration platform (Kubernetes, ECS, Cloud Run)
- Plan networking (load balancers, DNS, TLS)
- Design secret management approach
STOP — Present infrastructure design to user for approval before implementation.
Infrastructure Decision Table
| Scale | Topology | Orchestration | Recommended |
|---|---|---|---|
| Hobby / MVP | Single server | Docker Compose | Railway, Fly.io |
| Startup (< 100k users) | Small cluster | ECS, Cloud Run | AWS ECS, GCP Cloud Run |
| Growth (100k - 1M users) | Multi-AZ cluster | Kubernetes | EKS, GKE |
| Enterprise (1M+ users) | Multi-region | Kubernetes + service mesh | EKS/GKE + Istio |
| Compliance-heavy | Dedicated/private cloud | Kubernetes | Self-managed K8s |
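The table above can be read as a simple lookup. The following is a minimal sketch of that lookup, not a sizing tool: the `recommend` function and `Recommendation` type are hypothetical names, and the 1,000-user boundary between "Hobby / MVP" and "Startup" is an assumption, since the table gives no lower bound for the Startup row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    topology: str
    orchestration: str
    platforms: str


def recommend(users: int, compliance_heavy: bool = False) -> Recommendation:
    """Map an expected user count to a row of the decision table."""
    if compliance_heavy:
        return Recommendation("Dedicated/private cloud", "Kubernetes", "Self-managed K8s")
    if users >= 1_000_000:
        return Recommendation("Multi-region", "Kubernetes + service mesh", "EKS/GKE + Istio")
    if users >= 100_000:
        return Recommendation("Multi-AZ cluster", "Kubernetes", "EKS, GKE")
    # Assumption: below roughly 1,000 users the project counts as Hobby / MVP.
    if users >= 1_000:
        return Recommendation("Small cluster", "ECS, Cloud Run", "AWS ECS, GCP Cloud Run")
    return Recommendation("Single server", "Docker Compose", "Railway, Fly.io")
```

For example, `recommend(250_000)` returns the Growth row (multi-AZ cluster on Kubernetes). Treat the result as a starting point for the design review, not a replacement for it.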
Phase 2: Pipeline Implementation
- Build CI pipeline (lint, test, build, security scan)
- Build CD pipeline (deploy to staging, production)
- Configure environment-specific settings
- Set up artifact registry (container images, packages)
- Implement deployment strategy (blue-green, canary, rolling)
STOP — Validate pipeline config syntax and present for review.
Phase 3: Observability
- Deploy monitoring stack (Prometheus, Grafana)
- Configure alerting rules and escalation
- Set up log aggregation
- Implement distributed tracing
- Create runbooks for common incidents
STOP — Verify monitoring covers all critical services before declaring complete.
Dockerfile Best Practices
```dockerfile
# 1. Use specific version tags (not :latest)
FROM node:20-alpine AS base

# 2. Set working directory
WORKDIR /app

# 3. Install dependencies in separate layer (cache optimization)
FROM base AS deps
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile --prod

FROM base AS build-deps
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile

# 4. Build in separate stage
FROM build-deps AS builder
COPY . .
RUN pnpm build

# 5. Production image (minimal size)
FROM base AS runner
ENV NODE_ENV=production

# 6. Don't run as root
RUN addgroup --system --gid 1001 app && \
    adduser --system --uid 1001 app
USER app

# 7. Copy only what's needed
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist

# 8. Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s \
  CMD wget -qO- http://localhost:3000/health || exit 1

# 9. Expose port and set entrypoint
EXPOSE 3000
CMD ["node", "dist/server.js"]
```
Key Dockerfile Rules
| Rule | Why |
|---|---|
| Multi-stage builds | Minimize image size |
| `.dockerignore` file | Exclude node_modules, .git, tests |
| Non-root user | Security hardening |
| Specific base image versions | Reproducible builds |
| Layer ordering (deps before src) | Cache efficiency |
| HEALTHCHECK instruction | Container health monitoring |
| No secrets in build args/layers | Prevent credential leaks |
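Several of these rules can be checked mechanically before a review. The sketch below is one possible way to do that, assuming a plain-text Dockerfile; `lint_dockerfile` is a hypothetical helper, not part of Docker or of any existing linter. It is deliberately simplified: it ignores `FROM --platform=...` flags and registry hosts with ports, and it looks at the last `USER` in the file rather than tracking users per stage.

```python
import re


def lint_dockerfile(text: str) -> list[str]:
    """Return a list of rule violations found in a Dockerfile's text."""
    problems = []
    lines = [ln.strip() for ln in text.splitlines()]
    instructions = [ln for ln in lines if ln and not ln.startswith("#")]

    stage_names = set()
    for ln in instructions:
        m = re.match(r"FROM\s+(\S+)(?:\s+AS\s+(\S+))?", ln, re.IGNORECASE)
        if not m:
            continue
        image, alias = m.group(1), m.group(2)
        # A FROM that names an earlier stage is not a registry image.
        if image.lower() not in stage_names:
            tag = image.rsplit(":", 1)[1] if ":" in image else None
            if tag is None or tag == "latest":
                problems.append(f"unpinned base image: {image}")
        if alias:
            stage_names.add(alias.lower())

    users = [ln.split(None, 1)[1] for ln in instructions if ln.upper().startswith("USER ")]
    if not users or users[-1] in ("root", "0"):
        problems.append("no non-root USER instruction")
    if not any(ln.upper().startswith("HEALTHCHECK") for ln in instructions):
        problems.append("no HEALTHCHECK instruction")
    return problems
```

A check like this could run as an early CI step; an empty list means none of the three rules it covers were violated.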
Docker Compose Patterns
```yaml
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      target: runner
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/app
      - REDIS_URL=redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  db:
    image: postgres:16-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  cache:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
```
GitHub Actions Workflow
```yaml
name: CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v3
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint
      - run: pnpm typecheck
      - run: pnpm test -- --coverage

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx audit-ci --moderate
      # Pin this action to a release tag or commit SHA rather than a branch.
      - uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          severity: HIGH,CRITICAL

  build-and-push:
    needs: [lint-and-test, security-scan]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    # GITHUB_TOKEN needs packages: write to push images to ghcr.io.
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        run: echo "Deploying ${{ github.sha }}"
```
Terraform / Pulumi Patterns
Terraform Structure
```
modules/
  vpc/
    main.tf, variables.tf, outputs.tf
  ecs/
    main.tf, variables.tf, outputs.tf
environments/
  staging/
    main.tf, terraform.tfvars
  production/
    main.tf, terraform.tfvars
```
Key IaC Rules
| Rule | Why |
|---|---|
| Remote state backend (S3 + DynamoDB) | Shared state, locking |
| State locking | Prevent concurrent modifications |
| Environment-specific variable files | Separation of concerns |
| Module versioning | Reproducible shared infra |
| Run `terraform plan` in CI | Catch issues before apply |
| Drift detection on schedule | Detect manual changes |
| Tag all resources | Ownership, cost allocation |
Monitoring (Prometheus + Grafana)
USE Method (Resources)
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | cpu_usage_percent | cpu_throttled | — |
| Memory | memory_usage_bytes | oom_kills | — |
| Disk | disk_usage_percent | io_wait | disk_errors |
| Network | bytes_total | queue_length | errors_total |
RED Method (Services)
- Rate: requests per second
- Errors: failed requests per second
- Duration: latency distribution (p50, p95, p99)
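To make the three RED signals concrete, here is a minimal sketch that derives them from a batch of request records. The `Request` type, the `red_metrics` function, and the choice to count any status of 500 or above as an error are assumptions for illustration; a real service would normally export counters and histograms and let Prometheus do this arithmetic.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    status: int


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration percentiles over one time window."""
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_s,
        "errors_rps": errors / window_s,
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }
```

Reporting p95 and p99 alongside p50 matters because a healthy median can hide a slow tail that a subset of users hits on every request.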
Alerting Rules
```yaml
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        # Sum both sides so the ratio is 5xx requests over all requests.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: HighLatency
        # Aggregate buckets by le to get a service-wide p95.
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
```
Alerting Best Practices
| Practice | Why |
|---|---|
| Alert on symptoms, not causes | Reduces noise, focuses on impact |
| Every alert has a runbook link | Enables fast response |
| Tiered severity | critical=page, warning=ticket, info=log |
| Aggregate before alerting | Avoid flapping |
| Review and prune quarterly | Prevent alert fatigue |
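The severity tiers and the runbook rule from the table can be enforced in code wherever alerts are dispatched. This is a rough sketch under that assumption: `Alert`, `ROUTES`, and `route` are hypothetical names, and in a Prometheus setup the same mapping would more likely live in the Alertmanager routing configuration.

```python
from dataclasses import dataclass

# Tiered severity from the table: critical=page, warning=ticket, info=log.
ROUTES = {"critical": "page", "warning": "ticket", "info": "log"}


@dataclass
class Alert:
    name: str
    severity: str
    runbook_url: str = ""


def route(alert: Alert) -> str:
    """Return the delivery channel for an alert, rejecting alerts that break the rules."""
    if not alert.runbook_url:
        raise ValueError(f"{alert.name}: missing runbook link")
    if alert.severity not in ROUTES:
        raise ValueError(f"{alert.name}: unknown severity {alert.severity!r}")
    return ROUTES[alert.severity]
```

Rejecting alerts without a runbook at this point keeps non-actionable alerts from ever reaching a pager.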
Zero-Downtime Deployment Strategies
| Strategy | How It Works | Risk | Rollback Speed |
|---|---|---|---|
| Rolling | Replace instances one at a time | Low | Medium |
| Blue-Green | Switch traffic between two environments | Low | Instant |
| Canary | Route small % to new version, gradually increase | Very Low | Instant |
| Feature Flags | Deploy code dark, enable via flag | Very Low | Instant |
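The canary row can be sketched as a small control loop: raise the traffic share in steps and abort as soon as the new version misbehaves. The `canary_rollout` function, the step percentages, and the `error_rate_at` probe are all assumptions for illustration; the 5% threshold simply mirrors the HighErrorRate alert.

```python
from typing import Callable, Sequence


def canary_rollout(
    error_rate_at: Callable[[int], float],
    steps: Sequence[int] = (5, 25, 50, 100),
    max_error_rate: float = 0.05,
) -> tuple[str, int]:
    """Shift traffic to the new version step by step.

    error_rate_at(percent) is a hypothetical probe that returns the new
    version's observed error rate while it serves `percent` of traffic.
    Returns ("promoted", 100) on success, or ("rolled_back", percent) with
    the step at which the rollout was aborted.
    """
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            # Rollback only means routing all traffic back to the old version.
            return ("rolled_back", percent)
    return ("promoted", steps[-1])
```

Because the old version keeps running until promotion, aborting is a routing change rather than a redeploy, which is why the table lists canary rollback as instant.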
Rollback Procedures
- Automated: health check fails -> automatic rollback
- Manual: `kubectl rollout undo deployment/app`
- Database: forward-only migrations with backward compatibility
- Config: revert via secret manager version
Database Migration Safety
| Rule | Rationale |
|---|---|
| Migrations must be backward compatible | Old code + new schema must work |
| Never rename/drop columns in same deploy | Two-phase change required |
| Two-phase: add column -> deploy -> remove old | Zero-downtime schema evolution |
| Always test rollback of each migration | Ensure reversibility |
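The two-phase rule is easiest to see on a tiny schema. The sketch below uses an in-memory SQLite database; the `users` table and the `name` and `display_name` columns are hypothetical, and a production migration would go through the project's migration tool rather than raw statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Phase 1 (expand): add the new column as nullable so old code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Old code, unaware of display_name, still inserts and reads successfully.
conn.execute("INSERT INTO users (name) VALUES ('Grace Hopper')")
old_rows = conn.execute("SELECT name FROM users ORDER BY id").fetchall()

# Backfill, then deploy new code that reads the new column.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")
new_rows = conn.execute("SELECT display_name FROM users ORDER BY id").fetchall()

# Phase 2 (contract) belongs to a later deploy, once no running code reads `name`:
#   ALTER TABLE users DROP COLUMN name
```

Between the two phases both code versions work against the same schema, so either one can be rolled back without touching the database.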
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Is Wrong | What to Do Instead |
|---|---|---|
| Manual production deployments | No audit trail, error-prone | Automate via CI/CD |
| Shared or hardcoded secrets | Security breach risk | Use secrets manager |
| No rollback plan before deploying | Stuck if deploy fails | Document rollback before every deploy |
| Using `:latest` tag in production images | Non-reproducible builds | Pin specific version tags |
| Running containers as root | Security vulnerability | Use non-root user in Dockerfile |
| Alert fatigue from non-actionable alerts | Real issues get missed | Alert on symptoms, tune thresholds |
| Skipping staging environment | Bugs found in production | Always deploy to staging first |
| Snowflake servers with manual config | Cannot reproduce, cannot scale | Infrastructure as code |
| Monitoring without alerting | Nobody notices problems | Wire alerts to monitoring |
Key Principles
- Infrastructure as code — no manual changes to production
- Immutable infrastructure — replace, do not patch
- Cattle, not pets — servers are disposable
- Shift left security — scan early in pipeline
- Least privilege — minimal permissions everywhere
- Automate everything that runs more than twice
- Test the disaster recovery plan regularly
Documentation Lookup (Context7)
Use `mcp__context7__resolve-library-id`, then `mcp__context7__query-docs`, for up-to-date docs. Returned docs override memorized knowledge.
- `docker`: for Dockerfile syntax, compose configuration, or multi-stage builds
- `kubernetes`: for resource manifests, kubectl commands, or Helm charts
- `terraform`: for provider configuration, resource blocks, or state management
Integration Points
| Skill | Integration |
|---|---|
| | Provides higher-level deploy pipeline orchestration |
| | Security scan stage in CI pipeline |
| | Infrastructure changes are planned like features |
| | Post-deploy verification gate |
| | Merge triggers deployment pipeline |
| | MCP servers need containerization and deployment |
Skill Type
FLEXIBLE — Adapt tooling and patterns to the project's cloud provider, team size, and operational maturity. The principles (IaC, immutability, observability) are constant; the specific tools are interchangeable.