senior-devops

Senior DevOps Engineer

Overview

Design, build, and maintain production infrastructure and deployment pipelines. This skill covers Docker containerization, Kubernetes orchestration, CI/CD with GitHub Actions, infrastructure-as-code with Terraform/Pulumi, monitoring with Prometheus/Grafana, alerting strategies, zero-downtime deployments, and rollback procedures.
Phase 1: Infrastructure Design

  1. Define deployment topology (single server, cluster, multi-region)
  2. Choose containerization strategy (Docker, Buildpacks)
  3. Select orchestration platform (Kubernetes, ECS, Cloud Run)
  4. Plan networking (load balancers, DNS, TLS)
  5. Design secret management approach
STOP — Present infrastructure design to user for approval before implementation.
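
For step 5, a common approach is to keep secrets out of the repository entirely and inject them from the platform's secret store at deploy time. A minimal sketch using GitHub Actions secrets (the `DATABASE_URL` name and `deploy.sh` script are illustrative placeholders):

```yaml
# Sketch: the secret lives in the repo/org secret store, never in source.
- name: Deploy
  env:
    DATABASE_URL: ${{ secrets.DATABASE_URL }}  # hypothetical secret name
  run: ./deploy.sh                             # hypothetical deploy script
```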

Infrastructure Decision Table

| Scale | Topology | Orchestration | Recommended |
|---|---|---|---|
| Hobby / MVP | Single server | Docker Compose | Railway, Fly.io |
| Startup (< 100k users) | Small cluster | ECS, Cloud Run | AWS ECS, GCP Cloud Run |
| Growth (100k - 1M users) | Multi-AZ cluster | Kubernetes | EKS, GKE |
| Enterprise (1M+ users) | Multi-region | Kubernetes + service mesh | EKS/GKE + Istio |
| Compliance-heavy | Dedicated/private cloud | Kubernetes | Self-managed K8s |

Phase 2: Pipeline Implementation

  1. Build CI pipeline (lint, test, build, security scan)
  2. Build CD pipeline (deploy to staging, production)
  3. Configure environment-specific settings
  4. Set up artifact registry (container images, packages)
  5. Implement deployment strategy (blue-green, canary, rolling)
STOP — Validate pipeline config syntax and present for review.

Phase 3: Observability

  1. Deploy monitoring stack (Prometheus, Grafana)
  2. Configure alerting rules and escalation
  3. Set up log aggregation
  4. Implement distributed tracing
  5. Create runbooks for common incidents
STOP — Verify monitoring covers all critical services before declaring complete.
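
Step 1 typically starts with Prometheus scraping each service's metrics endpoint. A minimal scrape-config sketch (the job name and target are placeholders for your actual services, which must expose a `/metrics` endpoint):

```yaml
# prometheus.yml (minimal sketch; target address is illustrative)
scrape_configs:
  - job_name: app
    scrape_interval: 15s
    static_configs:
      - targets: ["app:3000"]
```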

Dockerfile Best Practices

```dockerfile
# 1. Use specific version tags (not :latest)
FROM node:20-alpine AS base

# 2. Set working directory
WORKDIR /app

# 3. Install dependencies in separate layer (cache optimization)
FROM base AS deps
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile --prod

FROM base AS build-deps
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile

# 4. Build in separate stage
FROM build-deps AS builder
COPY . .
RUN pnpm build

# 5. Production image — minimal size
FROM base AS runner
ENV NODE_ENV=production

# 6. Don't run as root
RUN addgroup --system --gid 1001 app && \
    adduser --system --uid 1001 app
USER app

# 7. Copy only what's needed
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist

# 8. Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s \
  CMD wget -qO- http://localhost:3000/health || exit 1

# 9. Expose port and set entrypoint
EXPOSE 3000
CMD ["node", "dist/server.js"]
```

Key Dockerfile Rules

| Rule | Why |
|---|---|
| Multi-stage builds | Minimize image size |
| `.dockerignore` file | Exclude node_modules, .git, tests |
| Non-root user | Security hardening |
| Specific base image versions | Reproducible builds |
| Layer ordering (deps before src) | Cache efficiency |
| `HEALTHCHECK` instruction | Container health monitoring |
| No secrets in build args/layers | Prevent credential leaks |
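
The `.dockerignore` rule above can be as small as this sketch for a typical Node project (adjust entries to your stack; `dist` is excluded because the multi-stage build recreates it inside the image):

```
node_modules
.git
dist
coverage
*.test.*
.env*
```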

Docker Compose Patterns

```yaml
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      target: runner
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/app
      - REDIS_URL=redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  db:
    image: postgres:16-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  cache:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
```

GitHub Actions Workflow

```yaml
name: CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v3
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint
      - run: pnpm typecheck
      - run: pnpm test -- --coverage

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx audit-ci --moderate
      - uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          severity: HIGH,CRITICAL

  build-and-push:
    needs: [lint-and-test, security-scan]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        run: echo "Deploying ${{ github.sha }}"
```

Terraform / Pulumi Patterns

Terraform Structure

```
modules/
  vpc/
    main.tf, variables.tf, outputs.tf
  ecs/
    main.tf, variables.tf, outputs.tf
environments/
  staging/
    main.tf, terraform.tfvars
  production/
    main.tf, terraform.tfvars
```
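
Under this layout, each environment composes the shared modules. A sketch of what `environments/production/main.tf` might contain (the variable names, CIDR value, and `vpc_id` output are illustrative assumptions, not part of the original):

```hcl
module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16" # illustrative value
}

module "ecs" {
  source = "../../modules/ecs"
  vpc_id = module.vpc.vpc_id # assumes the vpc module exports this output
}
```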

Key IaC Rules

| Rule | Why |
|---|---|
| Remote state backend (S3 + DynamoDB) | Shared state, locking |
| State locking | Prevent concurrent modifications |
| Environment-specific variable files | Separation of concerns |
| Module versioning | Reproducible shared infra |
| `terraform plan` in CI | Catch issues before apply |
| Drift detection on schedule | Detect manual changes |
| Tag all resources | Ownership, cost allocation |
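
The remote-state rule maps to a backend block like this sketch (bucket, key, region, and table names are placeholders to substitute for your own):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-tf-state"                   # placeholder bucket
    key            = "production/terraform.tfstate"  # per-environment key
    region         = "us-east-1"
    dynamodb_table = "tf-locks"                      # enables state locking
    encrypt        = true
  }
}
```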

Monitoring (Prometheus + Grafana)

USE Method (Resources)

| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | cpu_usage_percent | cpu_throttled | |
| Memory | memory_usage_bytes | oom_kills | |
| Disk | disk_usage_percent | io_wait | disk_errors |
| Network | bytes_total | queue_length | errors_total |

RED Method (Services)

  • Rate: requests per second
  • Errors: error rate per second
  • Duration: latency distribution (p50, p95, p99)
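
The three RED signals map directly onto PromQL, assuming the common `http_requests_total` counter (with a `status` label) and `http_request_duration_seconds` histogram naming conventions:

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Duration: p95 latency from histogram buckets
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```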

Alerting Rules

```yaml
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
```

Alerting Best Practices

| Practice | Why |
|---|---|
| Alert on symptoms, not causes | Reduces noise, focuses on impact |
| Every alert has a runbook link | Enables fast response |
| Tiered severity | critical=page, warning=ticket, info=log |
| Aggregate before alerting | Avoid flapping |
| Review and prune quarterly | Prevent alert fatigue |
Zero-Downtime Deployment Strategies

| Strategy | How It Works | Risk | Rollback Speed |
|---|---|---|---|
| Rolling | Replace instances one at a time | Low | Medium |
| Blue-Green | Switch traffic between two environments | Low | Instant |
| Canary | Route small % to new version, gradually increase | Very Low | Instant |
| Feature Flags | Deploy code dark, enable via flag | Very Low | Instant |
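
For the rolling strategy, Kubernetes expresses the replacement policy declaratively. A sketch of the relevant Deployment fields (the rest of the manifest is omitted; the values shown are one reasonable choice, not a universal default):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most one extra pod during the roll
      maxUnavailable: 0  # never drop below desired capacity
```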

Rollback Procedures

  1. Automated: health check fails -> automatic rollback
  2. Manual: `kubectl rollout undo deployment/app`
  3. Database: forward-only migrations with backward compatibility
  4. Config: revert via secret manager version
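
The manual path in step 2 usually pairs the undo with a status check, and optionally pins a known-good revision (the deployment name and revision number here are illustrative):

```shell
# Roll back to the previous revision and watch it settle
kubectl rollout undo deployment/app
kubectl rollout status deployment/app

# Or inspect history and pin a specific known-good revision
kubectl rollout history deployment/app
kubectl rollout undo deployment/app --to-revision=3
```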

Database Migration Safety

| Rule | Rationale |
|---|---|
| Migrations must be backward compatible | Old code + new schema must work |
| Never rename/drop columns in same deploy | Two-phase change required |
| Two-phase: add column -> deploy -> remove old | Zero-downtime schema evolution |
| Always test rollback of each migration | Ensure reversibility |
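
The expand-then-contract flow behind the two-phase rule, sketched for a hypothetical column rename (table and column names are illustrative); each step ships in its own deploy so old and new code always see a schema they understand:

```sql
-- Deploy N: add the new column; old code keeps writing old_name
ALTER TABLE users ADD COLUMN new_name text;

-- Deploy N+1: code reads/writes new_name; backfill remaining rows
UPDATE users SET new_name = old_name WHERE new_name IS NULL;

-- Deploy N+2: once nothing references old_name, drop it
ALTER TABLE users DROP COLUMN old_name;
```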

Anti-Patterns / Common Mistakes

| Anti-Pattern | Why It Is Wrong | What to Do Instead |
|---|---|---|
| Manual production deployments | No audit trail, error-prone | Automate via CI/CD |
| Shared or hardcoded secrets | Security breach risk | Use secrets manager |
| No rollback plan before deploying | Stuck if deploy fails | Document rollback before every deploy |
| `latest` tag for production images | Non-reproducible | Pin specific version tags |
| Running containers as root | Security vulnerability | Use non-root user in Dockerfile |
| Alert fatigue from non-actionable alerts | Real issues get missed | Alert on symptoms, tune thresholds |
| Skipping staging environment | Bugs found in production | Always deploy to staging first |
| Snowflake servers with manual config | Cannot reproduce, cannot scale | Infrastructure as code |
| Monitoring without alerting | Nobody notices problems | Wire alerts to monitoring |

Key Principles

  • Infrastructure as code — no manual changes to production
  • Immutable infrastructure — replace, do not patch
  • Cattle, not pets — servers are disposable
  • Shift left security — scan early in pipeline
  • Least privilege — minimal permissions everywhere
  • Automate everything that runs more than twice
  • Test the disaster recovery plan regularly

Documentation Lookup (Context7)

Use `mcp__context7__resolve-library-id` then `mcp__context7__query-docs` for up-to-date docs. Returned docs override memorized knowledge.
  • `docker` — for Dockerfile syntax, compose configuration, or multi-stage builds
  • `kubernetes` — for resource manifests, kubectl commands, or Helm charts
  • `terraform` — for provider configuration, resource blocks, or state management

Integration Points

| Skill | Integration |
|---|---|
| `deployment` | Provides higher-level deploy pipeline orchestration |
| `security-review` | Security scan stage in CI pipeline |
| `planning` | Infrastructure changes are planned like features |
| `verification-before-completion` | Post-deploy verification gate |
| `finishing-a-development-branch` | Merge triggers deployment pipeline |
| `mcp-builder` | MCP servers need containerization and deployment |
Skill Type

FLEXIBLE — Adapt tooling and patterns to the project's cloud provider, team size, and operational maturity. The principles (IaC, immutability, observability) are constant; the specific tools are interchangeable.