senior-devops


Senior DevOps Engineer


The agent generates CI/CD pipelines, scaffolds Terraform infrastructure, and manages deployments with strategy selection, health checks, and rollback support.


Quick Start


```bash
# Generate CI/CD pipeline from project analysis
python scripts/pipeline_generator.py <project-path> --platform github-actions --verbose

# Scaffold Terraform infrastructure
python scripts/terraform_scaffolder.py <target-path> --provider aws --env production --verbose

# Manage deployment with canary strategy
python scripts/deployment_manager.py <target-path> --strategy canary --verbose
```

Tools Overview


| Tool | Input | Output |
| --- | --- | --- |
| `pipeline_generator.py` | Project path | CI/CD pipeline config (GitHub Actions, GitLab CI, Jenkins, CircleCI) |
| `terraform_scaffolder.py` | Target path + provider | Terraform module structure with state config |
| `deployment_manager.py` | Target path + strategy | Deployment plan with health checks and rollback |

All tools support `--json` for machine-readable output and `--output` / `-o` for file writing.


Workflow 1: Containerize and Deploy


**Step 1 -- Build a production Dockerfile.**
The agent generates multi-stage Dockerfiles following this pattern:

```dockerfile
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
# Build with devDependencies available, then strip them so only
# runtime dependencies ship in the final image
RUN npm run build && npm prune --omit=dev

# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
RUN addgroup -g 1001 appgroup && \
    adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1
CMD ["node", "dist/server.js"]
```

**Validation checkpoint:** Image builds with `docker build -t app:test .` and `docker run --rm app:test` returns healthy.

**Step 2 -- Deploy to Kubernetes.**

The agent creates a Deployment with probes, resource limits, and security context:

```yaml
spec:
  containers:
    - name: app
      image: myapp:1.2.3
      resources:
        requests: { cpu: 250m, memory: 256Mi }
        limits: { cpu: "1", memory: 512Mi }
      livenessProbe:
        httpGet: { path: /healthz, port: 3000 }
        initialDelaySeconds: 15
        periodSeconds: 20
      readinessProbe:
        httpGet: { path: /ready, port: 3000 }
        initialDelaySeconds: 5
        periodSeconds: 10
      startupProbe:
        httpGet: { path: /healthz, port: 3000 }
        failureThreshold: 30
        periodSeconds: 10
```

**Probe decision:**

- `startupProbe`: slow-starting apps (JVM, model loading). Prevents the liveness probe from killing the container during startup.
- `livenessProbe`: detects deadlocks. Keep it simple -- do not check downstream dependencies.
- `readinessProbe`: controls traffic routing. Include dependency checks here.

**Validation checkpoint:** `kubectl get pods -l app=myapp` shows all pods Running and Ready.
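The probe split reduces to one routing rule. A minimal sketch -- illustrative, not the agent's actual handler -- where `dependency_ok` stands in for whatever downstream checks (database, cache) your service needs:

```python
def route_health(path: str, dependency_ok: bool) -> int:
    """Return the HTTP status a health endpoint should serve."""
    if path == "/healthz":
        # Liveness: only proves the process is alive. Never check
        # dependencies here, or a database outage gets every pod killed.
        return 200
    if path == "/ready":
        # Readiness: gates traffic routing, so dependency failures belong here.
        return 200 if dependency_ok else 503
    return 404
```

In a real service this function would be wired to an HTTP framework's router; the point is that `/healthz` stays dependency-free while `/ready` does not.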


Workflow 2: Infrastructure as Code with Terraform


**Step 1 -- Scaffold the module structure.**

```bash
python scripts/terraform_scaffolder.py ./infrastructure --provider aws --env production --verbose
```

The agent produces:

```
infrastructure/
  modules/
    vpc/         # main.tf, variables.tf, outputs.tf
    eks/
    rds/
  environments/
    staging/     # main.tf, terraform.tfvars, backend.tf
    production/
```

**Step 2 -- Configure remote state.**

```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

**Step 3 -- Run drift detection in CI.**

```bash
terraform plan -detailed-exitcode -out=plan.tfplan
```

Exit 0 = clean, Exit 1 = error, Exit 2 = drift detected
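A minimal CI wrapper for this exit-code contract might look like the following sketch; the function names are illustrative, not part of the skill's scripts:

```python
import subprocess

def interpret_plan_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a status."""
    return {0: "clean", 2: "drift"}.get(code, "error")

def drift_status(workdir: str = ".") -> str:
    """Run drift detection in `workdir` and return clean/drift/error."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-out=plan.tfplan"],
        cwd=workdir,
    )
    return interpret_plan_exit(proc.returncode)
```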



**Validation checkpoint:** `terraform plan` shows no unexpected changes. Drift alerts fire within 24 hours.

**Key rules:**
- One state file per environment per component (blast radius control)
- Never store state locally or in git
- Run `terraform plan` in CI, `terraform apply` only after approval
- Use directories for environment separation, modules for shared logic

---


Workflow 3: CI/CD Pipeline Design


```bash
python scripts/pipeline_generator.py /path/to/project --platform github-actions --json
```

The agent generates pipelines following these principles:

1. **Fail fast** -- lint and unit tests before expensive integration tests
2. **Cache aggressively** -- node_modules, Docker layers, pip packages
3. **Immutable artifacts** -- build once, deploy the same artifact everywhere
4. **Gate promotions** -- manual approval or smoke tests before production
5. **Parallel execution** -- independent test suites and security scans run concurrently

Example: GitHub Actions with matrix testing and deployment gates

```yaml
jobs:
  test:
    strategy:
      matrix:
        node-version: [18, 20]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "${{ matrix.node-version }}", cache: npm }
      - run: npm ci && npm run lint && npm test -- --coverage

  build:
    needs: [test, security]
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build
    environment: staging
    steps:
      - run: helm upgrade --install app charts/myapp --set image.tag=${{ github.sha }} --wait

  deploy-production:
    needs: deploy-staging
    environment: production  # requires manual approval
```

**Validation checkpoint:** Pipeline runs in under 15 minutes. All stages produce exit code 0.
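The gating principle -- nothing reaches production without transitively passing `test` -- can be checked mechanically. A sketch, using the `needs` edges from the workflow above:

```python
def transitive_needs(jobs: dict[str, list[str]], job: str) -> set[str]:
    """All jobs that must finish before `job`, following `needs` edges."""
    seen: set[str] = set()
    stack = list(jobs.get(job, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(jobs.get(dep, []))
    return seen

# `needs` edges mirroring the example workflow
needs = {
    "build": ["test", "security"],
    "deploy-staging": ["build"],
    "deploy-production": ["deploy-staging"],
}
assert "test" in transitive_needs(needs, "deploy-production")
```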


Deployment Strategy Selection


| Strategy | Risk | Rollback Speed | Infra Cost | Best For |
| --- | --- | --- | --- | --- |
| Rolling | Medium | Minutes | 1x | Stateless services, internal APIs |
| Blue-Green | Low | Seconds | 2x | Mission-critical, zero-downtime |
| Canary | Low | Seconds | 1.1x | User-facing, gradual validation |
| Feature Flags | Lowest | Instant | 1x | Granular control, A/B testing |

**Canary promotion ladder:**

1. Deploy at 5% traffic. Monitor error rate and latency for 10 min.
2. Promote to 25%. Monitor 10 min.
3. Promote to 50%. Monitor 15 min.
4. Promote to 100%.
5. Automated rollback if error rate exceeds baseline by 2x at any step.


Monitoring Essentials


Every service dashboard includes the Four Golden Signals:

1. **Latency** -- P50, P90, P99 response times
2. **Traffic** -- requests per second by endpoint and status code
3. **Errors** -- 5xx rate, 4xx rate, application error codes
4. **Saturation** -- CPU, memory, connection pool, queue depth

SLO targets (example):

| Service | SLI | SLO | Error Budget |
| --- | --- | --- | --- |
| API Gateway | Successful requests / Total | 99.9% (43.8 min/month downtime) | 0.1% |
| API Latency | Requests < 500ms / Total | P99 < 500ms | 1% |

When the error budget is exhausted, the agent recommends freezing feature deployments until the budget recovers.
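The 43.8 min/month figure follows directly from the SLO arithmetic; a sketch, assuming an average 30.44-day month:

```python
def error_budget_minutes(slo: float, days: float = 30.44) -> float:
    """Allowed downtime per month for an availability SLO (e.g. 0.999)."""
    return (1 - slo) * days * 24 * 60

def budget_remaining(slo: float, bad_minutes: float) -> float:
    """Fraction of the monthly error budget still unspent (0.0 when blown)."""
    return max(0.0, 1 - bad_minutes / error_budget_minutes(slo))

# 99.9% availability leaves roughly 43.8 minutes of downtime per month
assert round(error_budget_minutes(0.999), 1) == 43.8
```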


Anti-Patterns


1. **Monolithic state** -- one Terraform state for everything. Split by component and environment.
2. **`latest` tag in production** -- always use specific image tags.
3. **Secrets in image layers** -- inject at runtime via environment variables or mounted secrets. Verify with `docker history --no-trunc`.
4. **No resource limits** -- every container needs CPU/memory limits to prevent noisy-neighbor resource contention.
5. **Manual deployments** -- automate with approval gates instead.
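The `docker history` check in point 3 can be partially automated. A sketch with example patterns only -- these will not catch every credential format, and the function name is illustrative:

```python
import re

# Example patterns only -- extend for the credential formats you use.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),               # AWS access key ID
    re.compile(r"(?i)(password|secret|token)=\S+"),
]

def flag_secret_layers(history_lines: list[str]) -> list[str]:
    """Return `docker history --no-trunc` lines that look like baked-in secrets."""
    return [line for line in history_lines
            if any(p.search(line) for p in SECRET_PATTERNS)]
```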


Troubleshooting


| Problem | Cause | Solution |
| --- | --- | --- |
| Terraform state lock stuck | Interrupted `terraform apply` left a DynamoDB lock | `terraform force-unlock <LOCK_ID>` after confirming no apply is running |
| Pods in `CrashLoopBackOff` | Failing health checks or missing config/secrets | `kubectl logs <pod>`, verify ConfigMaps/Secrets, increase `startupProbe.failureThreshold` |
| Docker builds slow (10+ min) | Layer cache invalidated by early COPY of changing files | Copy dependency manifests before source; use BuildKit cache mounts |
| Helm upgrade fails with "another operation in progress" | Previous release in pending/failed state | `helm history <release>`, then `helm rollback <release> <last-good>` |
| Canary shows healthy but users report errors | Metrics aggregated across all pods mask canary errors | Use per-revision metric labels; configure Istio/Nginx to tag canary traffic |


References


| Guide | Path | Content |
| --- | --- | --- |
| CI/CD Pipeline Guide | `references/cicd_pipeline_guide.md` | Pipeline patterns, platform comparisons, optimization |
| Infrastructure as Code | `references/infrastructure_as_code.md` | Terraform patterns, module design, state management |
| Deployment Strategies | `references/deployment_strategies.md` | Strategy details, rollback procedures, traffic management |

See also:

- `references/kubernetes_patterns.md` for Helm charts, HPA/VPA/KEDA decisions, network policies, and RBAC patterns.
- `references/cloud_platform_guide.md` for AWS/GCP/Azure service comparison, multi-cloud strategy, and cost optimization.


Integration Points


| Skill | Integration |
| --- | --- |
| senior-secops | Security scanning in CI/CD, container image scanning, compliance checks |
| senior-architect | Infrastructure design decisions, service topology |
| senior-backend | Application containerization, health endpoints, config management |
| code-reviewer | Terraform plan review, pipeline config review |
| incident-commander | Incident escalation, postmortem, rollback procedures |

**Last Updated:** April 2026
**Version:** 2.1.0