senior-devops

Senior DevOps Engineer

Overview

Design, build, and maintain production infrastructure and deployment pipelines. This skill covers Docker containerization, Kubernetes orchestration, CI/CD with GitHub Actions, infrastructure-as-code with Terraform/Pulumi, monitoring with Prometheus/Grafana, alerting strategies, zero-downtime deployments, and rollback procedures.
Phase 1: Infrastructure Design

  1. Define deployment topology (single server, cluster, multi-region)
  2. Choose containerization strategy (Docker, Buildpacks)
  3. Select orchestration platform (Kubernetes, ECS, Cloud Run)
  4. Plan networking (load balancers, DNS, TLS)
  5. Design secret management approach
STOP — Present infrastructure design to user for approval before implementation.
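
For step 5, a common approach is to keep secrets out of the repository entirely and inject them from the platform's secret store at deploy time. A minimal sketch using GitHub Actions secrets (the `DATABASE_URL` name and `deploy.sh` script are illustrative placeholders):

```yaml
# Sketch: the secret lives in the repo/org secret store, never in source.
- name: Deploy
  env:
    DATABASE_URL: ${{ secrets.DATABASE_URL }}  # hypothetical secret name
  run: ./deploy.sh                             # hypothetical deploy script
```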

Infrastructure Decision Table

| Scale | Topology | Orchestration | Recommended |
|---|---|---|---|
| Hobby / MVP | Single server | Docker Compose | Railway, Fly.io |
| Startup (< 100k users) | Small cluster | ECS, Cloud Run | AWS ECS, GCP Cloud Run |
| Growth (100k - 1M users) | Multi-AZ cluster | Kubernetes | EKS, GKE |
| Enterprise (1M+ users) | Multi-region | Kubernetes + service mesh | EKS/GKE + Istio |
| Compliance-heavy | Dedicated/private cloud | Kubernetes | Self-managed K8s |

Phase 2: Pipeline Implementation

  1. Build CI pipeline (lint, test, build, security scan)
  2. Build CD pipeline (deploy to staging, production)
  3. Configure environment-specific settings
  4. Set up artifact registry (container images, packages)
  5. Implement deployment strategy (blue-green, canary, rolling)
STOP — Validate pipeline config syntax and present for review.

Phase 3: Observability

  1. Deploy monitoring stack (Prometheus, Grafana)
  2. Configure alerting rules and escalation
  3. Set up log aggregation
  4. Implement distributed tracing
  5. Create runbooks for common incidents
STOP — Verify monitoring covers all critical services before declaring complete.
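
Step 1 typically starts with Prometheus scraping each service's metrics endpoint. A minimal scrape-config sketch (the job name and target are placeholders for your actual services, which must expose a `/metrics` endpoint):

```yaml
# prometheus.yml (minimal sketch; target address is illustrative)
scrape_configs:
  - job_name: app
    scrape_interval: 15s
    static_configs:
      - targets: ["app:3000"]
```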

Dockerfile Best Practices

```dockerfile
# 1. Use specific version tags (not :latest)
FROM node:20-alpine AS base

# 2. Set working directory
WORKDIR /app

# 3. Install dependencies in separate layer (cache optimization)
FROM base AS deps
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile --prod

FROM base AS build-deps
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile

# 4. Build in separate stage
FROM build-deps AS builder
COPY . .
RUN pnpm build

# 5. Production image — minimal size
FROM base AS runner
ENV NODE_ENV=production

# 6. Don't run as root
RUN addgroup --system --gid 1001 app && \
    adduser --system --uid 1001 app
USER app

# 7. Copy only what's needed
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist

# 8. Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s \
  CMD wget -qO- http://localhost:3000/health || exit 1

# 9. Expose port and set entrypoint
EXPOSE 3000
CMD ["node", "dist/server.js"]
```

Key Dockerfile Rules

| Rule | Why |
|---|---|
| Multi-stage builds | Minimize image size |
| `.dockerignore` file | Exclude node_modules, .git, tests |
| Non-root user | Security hardening |
| Specific base image versions | Reproducible builds |
| Layer ordering (deps before src) | Cache efficiency |
| `HEALTHCHECK` instruction | Container health monitoring |
| No secrets in build args/layers | Prevent credential leaks |
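
The `.dockerignore` rule above can be as small as this sketch for a typical Node project (adjust entries to your stack; `dist` is excluded because the multi-stage build recreates it inside the image):

```
node_modules
.git
dist
coverage
*.test.*
.env*
```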

Docker Compose Patterns

```yaml
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      target: runner
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://postgres:postgres@db:5432/app
      - REDIS_URL=redis://cache:6379
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  db:
    image: postgres:16-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  cache:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
```

GitHub Actions Workflow

```yaml
name: CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v3
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm lint
      - run: pnpm typecheck
      - run: pnpm test -- --coverage

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx audit-ci --moderate
      - uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          severity: HIGH,CRITICAL

  build-and-push:
    needs: [lint-and-test, security-scan]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        run: echo "Deploying ${{ github.sha }}"
```

Terraform / Pulumi Patterns

Terraform Structure

```
modules/
  vpc/
    main.tf, variables.tf, outputs.tf
  ecs/
    main.tf, variables.tf, outputs.tf
environments/
  staging/
    main.tf, terraform.tfvars
  production/
    main.tf, terraform.tfvars
```
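
Under this layout, each environment composes the shared modules. A sketch of what `environments/production/main.tf` might contain (the variable names, CIDR value, and `vpc_id` output are illustrative assumptions, not part of the original):

```hcl
module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16" # illustrative value
}

module "ecs" {
  source = "../../modules/ecs"
  vpc_id = module.vpc.vpc_id # assumes the vpc module exports this output
}
```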

Key IaC Rules

| Rule | Why |
|---|---|
| Remote state backend (S3 + DynamoDB) | Shared state, locking |
| State locking | Prevent concurrent modifications |
| Environment-specific variable files | Separation of concerns |
| Module versioning | Reproducible shared infra |
| `terraform plan` in CI | Catch issues before apply |
| Drift detection on schedule | Detect manual changes |
| Tag all resources | Ownership, cost allocation |
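
The remote-state rule maps to a backend block like this sketch (bucket, key, region, and table names are placeholders to substitute for your own):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-tf-state"                   # placeholder bucket
    key            = "production/terraform.tfstate"  # per-environment key
    region         = "us-east-1"
    dynamodb_table = "tf-locks"                      # enables state locking
    encrypt        = true
  }
}
```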

Monitoring (Prometheus + Grafana)

USE Method (Resources)

| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | cpu_usage_percent | cpu_throttled | |
| Memory | memory_usage_bytes | oom_kills | |
| Disk | disk_usage_percent | io_wait | disk_errors |
| Network | bytes_total | queue_length | errors_total |

RED Method (Services)

  • Rate: requests per second
  • Errors: error rate per second
  • Duration: latency distribution (p50, p95, p99)
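
The three RED signals map directly onto PromQL, assuming the common `http_requests_total` counter (with a `status` label) and `http_request_duration_seconds` histogram naming conventions:

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Duration: p95 latency from histogram buckets
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```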

Alerting Rules

```yaml
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
```

Alerting Best Practices

| Practice | Why |
|---|---|
| Alert on symptoms, not causes | Reduces noise, focuses on impact |
| Every alert has a runbook link | Enables fast response |
| Tiered severity | critical=page, warning=ticket, info=log |
| Aggregate before alerting | Avoid flapping |
| Review and prune quarterly | Prevent alert fatigue |
Zero-Downtime Deployment Strategies

| Strategy | How It Works | Risk | Rollback Speed |
|---|---|---|---|
| Rolling | Replace instances one at a time | Low | Medium |
| Blue-Green | Switch traffic between two environments | Low | Instant |
| Canary | Route small % to new version, gradually increase | Very Low | Instant |
| Feature Flags | Deploy code dark, enable via flag | Very Low | Instant |
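
For the rolling strategy, Kubernetes expresses the replacement policy declaratively. A sketch of the relevant Deployment fields (the rest of the manifest is omitted; the values shown are one reasonable choice, not a universal default):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most one extra pod during the roll
      maxUnavailable: 0  # never drop below desired capacity
```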

Rollback Procedures

  1. Automated: health check fails -> automatic rollback
  2. Manual: `kubectl rollout undo deployment/app`
  3. Database: forward-only migrations with backward compatibility
  4. Config: revert via secret manager version
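
The manual path in step 2 usually pairs the undo with a status check, and optionally pins a known-good revision (the deployment name and revision number here are illustrative):

```shell
# Roll back to the previous revision and watch it settle
kubectl rollout undo deployment/app
kubectl rollout status deployment/app

# Or inspect history and pin a specific known-good revision
kubectl rollout history deployment/app
kubectl rollout undo deployment/app --to-revision=3
```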

Database Migration Safety

| Rule | Rationale |
|---|---|
| Migrations must be backward compatible | Old code + new schema must work |
| Never rename/drop columns in same deploy | Two-phase change required |
| Two-phase: add column -> deploy -> remove old | Zero-downtime schema evolution |
| Always test rollback of each migration | Ensure reversibility |
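
The expand-then-contract flow behind the two-phase rule, sketched for a hypothetical column rename (table and column names are illustrative); each step ships in its own deploy so old and new code always see a schema they understand:

```sql
-- Deploy N: add the new column; old code keeps writing old_name
ALTER TABLE users ADD COLUMN new_name text;

-- Deploy N+1: code reads/writes new_name; backfill remaining rows
UPDATE users SET new_name = old_name WHERE new_name IS NULL;

-- Deploy N+2: once nothing references old_name, drop it
ALTER TABLE users DROP COLUMN old_name;
```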

Anti-Patterns / Common Mistakes

| Anti-Pattern | Why It Is Wrong | What to Do Instead |
|---|---|---|
| Manual production deployments | No audit trail, error-prone | Automate via CI/CD |
| Shared or hardcoded secrets | Security breach risk | Use secrets manager |
| No rollback plan before deploying | Stuck if deploy fails | Document rollback before every deploy |
| `latest` tag for production images | Non-reproducible | Pin specific version tags |
| Running containers as root | Security vulnerability | Use non-root user in Dockerfile |
| Alert fatigue from non-actionable alerts | Real issues get missed | Alert on symptoms, tune thresholds |
| Skipping staging environment | Bugs found in production | Always deploy to staging first |
| Snowflake servers with manual config | Cannot reproduce, cannot scale | Infrastructure as code |
| Monitoring without alerting | Nobody notices problems | Wire alerts to monitoring |

Key Principles

  • Infrastructure as code — no manual changes to production
  • Immutable infrastructure — replace, do not patch
  • Cattle, not pets — servers are disposable
  • Shift left security — scan early in pipeline
  • Least privilege — minimal permissions everywhere
  • Automate everything that runs more than twice
  • Test the disaster recovery plan regularly

Documentation Lookup (Context7)

Use `mcp__context7__resolve-library-id` then `mcp__context7__query-docs` for up-to-date docs. Returned docs override memorized knowledge.
  • `docker` — for Dockerfile syntax, compose configuration, or multi-stage builds
  • `kubernetes` — for resource manifests, kubectl commands, or Helm charts
  • `terraform` — for provider configuration, resource blocks, or state management

Integration Points

| Skill | Integration |
|---|---|
| `deployment` | Provides higher-level deploy pipeline orchestration |
| `security-review` | Security scan stage in CI pipeline |
| `planning` | Infrastructure changes are planned like features |
| `verification-before-completion` | Post-deploy verification gate |
| `finishing-a-development-branch` | Merge triggers deployment pipeline |
| `mcp-builder` | MCP servers need containerization and deployment |
Skill Type

FLEXIBLE — Adapt tooling and patterns to the project's cloud provider, team size, and operational maturity. The principles (IaC, immutability, observability) are constant; the specific tools are interchangeable.