senior-devops


Senior DevOps Engineer


The agent generates CI/CD pipelines, scaffolds Terraform infrastructure, and manages deployments with strategy selection, health checks, and rollback support.


Quick Start


```bash
# Generate CI/CD pipeline from project analysis
python scripts/pipeline_generator.py <project-path> --platform github-actions --verbose

# Scaffold Terraform infrastructure
python scripts/terraform_scaffolder.py <target-path> --provider aws --env production --verbose

# Manage deployment with canary strategy
python scripts/deployment_manager.py <target-path> --strategy canary --verbose
```

Tools Overview


| Tool | Input | Output |
| --- | --- | --- |
| `pipeline_generator.py` | Project path | CI/CD pipeline config (GitHub Actions, GitLab CI, Jenkins, CircleCI) |
| `terraform_scaffolder.py` | Target path + provider | Terraform module structure with state config |
| `deployment_manager.py` | Target path + strategy | Deployment plan with health checks and rollback |

All tools support `--json` for machine-readable output and `--output` / `-o` for file writing.


Workflow 1: Containerize and Deploy


**Step 1 -- Build a production Dockerfile.**
The agent generates multi-stage Dockerfiles following this pattern:

```dockerfile
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
# Build with devDependencies available, then strip them so only
# runtime dependencies ship in the final image
RUN npm run build && npm prune --omit=dev

# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
RUN addgroup -g 1001 appgroup && \
    adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1
CMD ["node", "dist/server.js"]
```

**Validation checkpoint:** Image builds with `docker build -t app:test .` and `docker run --rm app:test` returns healthy.

**Step 2 -- Deploy to Kubernetes.**

The agent creates a Deployment with probes, resource limits, and security context:

```yaml
spec:
  containers:
    - name: app
      image: myapp:1.2.3
      resources:
        requests: { cpu: 250m, memory: 256Mi }
        limits: { cpu: "1", memory: 512Mi }
      livenessProbe:
        httpGet: { path: /healthz, port: 3000 }
        initialDelaySeconds: 15
        periodSeconds: 20
      readinessProbe:
        httpGet: { path: /ready, port: 3000 }
        initialDelaySeconds: 5
        periodSeconds: 10
      startupProbe:
        httpGet: { path: /healthz, port: 3000 }
        failureThreshold: 30
        periodSeconds: 10
```

**Probe decision:**

- `startupProbe`: slow-starting apps (JVM, model loading). Prevents the liveness probe from killing the container during startup.
- `livenessProbe`: detects deadlocks. Keep it simple -- do not check downstream dependencies.
- `readinessProbe`: controls traffic routing. Include dependency checks here.

**Validation checkpoint:** `kubectl get pods -l app=myapp` shows all pods Running and Ready.
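The probe split reduces to one routing rule. A minimal sketch -- illustrative, not the agent's actual handler -- where `dependency_ok` stands in for whatever downstream checks (database, cache) your service needs:

```python
def route_health(path: str, dependency_ok: bool) -> int:
    """Return the HTTP status a health endpoint should serve."""
    if path == "/healthz":
        # Liveness: only proves the process is alive. Never check
        # dependencies here, or a database outage gets every pod killed.
        return 200
    if path == "/ready":
        # Readiness: gates traffic routing, so dependency failures belong here.
        return 200 if dependency_ok else 503
    return 404
```

In a real service this function would be wired to an HTTP framework's router; the point is that `/healthz` stays dependency-free while `/ready` does not.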


Workflow 2: Infrastructure as Code with Terraform


**Step 1 -- Scaffold the module structure.**

```bash
python scripts/terraform_scaffolder.py ./infrastructure --provider aws --env production --verbose
```

The agent produces:

```
infrastructure/
  modules/
    vpc/         # main.tf, variables.tf, outputs.tf
    eks/
    rds/
  environments/
    staging/     # main.tf, terraform.tfvars, backend.tf
    production/
```

**Step 2 -- Configure remote state.**

```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

**Step 3 -- Run drift detection in CI.**

```bash
terraform plan -detailed-exitcode -out=plan.tfplan
```

Exit 0 = clean, Exit 1 = error, Exit 2 = drift detected
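A minimal CI wrapper for this exit-code contract might look like the following sketch; the function names are illustrative, not part of the skill's scripts:

```python
import subprocess

def interpret_plan_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a status."""
    return {0: "clean", 2: "drift"}.get(code, "error")

def drift_status(workdir: str = ".") -> str:
    """Run drift detection in `workdir` and return clean/drift/error."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-out=plan.tfplan"],
        cwd=workdir,
    )
    return interpret_plan_exit(proc.returncode)
```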



**Validation checkpoint:** `terraform plan` shows no unexpected changes. Drift alerts fire within 24 hours.

**Key rules:**
- One state file per environment per component (blast radius control)
- Never store state locally or in git
- Run `terraform plan` in CI, `terraform apply` only after approval
- Use directories for environment separation, modules for shared logic

---


Workflow 3: CI/CD Pipeline Design


```bash
python scripts/pipeline_generator.py /path/to/project --platform github-actions --json
```

The agent generates pipelines following these principles:

1. **Fail fast** -- lint and unit tests before expensive integration tests
2. **Cache aggressively** -- node_modules, Docker layers, pip packages
3. **Immutable artifacts** -- build once, deploy the same artifact everywhere
4. **Gate promotions** -- manual approval or smoke tests before production
5. **Parallel execution** -- independent test suites and security scans run concurrently

Example: GitHub Actions with matrix testing and deployment gates

```yaml
jobs:
  test:
    strategy:
      matrix:
        node-version: [18, 20]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "${{ matrix.node-version }}", cache: npm }
      - run: npm ci && npm run lint && npm test -- --coverage

  build:
    needs: [test, security]
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build
    environment: staging
    steps:
      - run: helm upgrade --install app charts/myapp --set image.tag=${{ github.sha }} --wait

  deploy-production:
    needs: deploy-staging
    environment: production  # requires manual approval
```

**Validation checkpoint:** Pipeline runs in under 15 minutes. All stages produce exit code 0.
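The gating principle -- nothing reaches production without transitively passing `test` -- can be checked mechanically. A sketch, using the `needs` edges from the workflow above:

```python
def transitive_needs(jobs: dict[str, list[str]], job: str) -> set[str]:
    """All jobs that must finish before `job`, following `needs` edges."""
    seen: set[str] = set()
    stack = list(jobs.get(job, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(jobs.get(dep, []))
    return seen

# `needs` edges mirroring the example workflow
needs = {
    "build": ["test", "security"],
    "deploy-staging": ["build"],
    "deploy-production": ["deploy-staging"],
}
assert "test" in transitive_needs(needs, "deploy-production")
```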


Deployment Strategy Selection


| Strategy | Risk | Rollback Speed | Infra Cost | Best For |
| --- | --- | --- | --- | --- |
| Rolling | Medium | Minutes | 1x | Stateless services, internal APIs |
| Blue-Green | Low | Seconds | 2x | Mission-critical, zero-downtime |
| Canary | Low | Seconds | 1.1x | User-facing, gradual validation |
| Feature Flags | Lowest | Instant | 1x | Granular control, A/B testing |

**Canary promotion ladder:**

1. Deploy at 5% traffic. Monitor error rate and latency for 10 min.
2. Promote to 25%. Monitor 10 min.
3. Promote to 50%. Monitor 15 min.
4. Promote to 100%.
5. Automated rollback if error rate exceeds baseline by 2x at any step.


Monitoring Essentials


Every service dashboard includes the Four Golden Signals:

1. **Latency** -- P50, P90, P99 response times
2. **Traffic** -- requests per second by endpoint and status code
3. **Errors** -- 5xx rate, 4xx rate, application error codes
4. **Saturation** -- CPU, memory, connection pool, queue depth

SLO targets (example):

| Service | SLI | SLO | Error Budget |
| --- | --- | --- | --- |
| API Gateway | Successful requests / Total | 99.9% (43.8 min/month downtime) | 0.1% |
| API Latency | Requests < 500ms / Total | P99 < 500ms | 1% |

When the error budget is exhausted, the agent recommends freezing feature deployments until the budget recovers.
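The 43.8 min/month figure follows directly from the SLO arithmetic; a sketch, assuming an average 30.44-day month:

```python
def error_budget_minutes(slo: float, days: float = 30.44) -> float:
    """Allowed downtime per month for an availability SLO (e.g. 0.999)."""
    return (1 - slo) * days * 24 * 60

def budget_remaining(slo: float, bad_minutes: float) -> float:
    """Fraction of the monthly error budget still unspent (0.0 when blown)."""
    return max(0.0, 1 - bad_minutes / error_budget_minutes(slo))

# 99.9% availability leaves roughly 43.8 minutes of downtime per month
assert round(error_budget_minutes(0.999), 1) == 43.8
```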


Anti-Patterns


1. **Monolithic state** -- one Terraform state for everything. Split by component and environment.
2. **`latest` tag in production** -- always use specific image tags.
3. **Secrets in image layers** -- inject at runtime via environment variables or mounted secrets. Verify with `docker history --no-trunc`.
4. **No resource limits** -- every container needs CPU/memory limits to prevent noisy-neighbor resource contention.
5. **Manual deployments** -- automate with approval gates instead.
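The `docker history` check in point 3 can be partially automated. A sketch with example patterns only -- these will not catch every credential format, and the function name is illustrative:

```python
import re

# Example patterns only -- extend for the credential formats you use.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),               # AWS access key ID
    re.compile(r"(?i)(password|secret|token)=\S+"),
]

def flag_secret_layers(history_lines: list[str]) -> list[str]:
    """Return `docker history --no-trunc` lines that look like baked-in secrets."""
    return [line for line in history_lines
            if any(p.search(line) for p in SECRET_PATTERNS)]
```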


Troubleshooting


| Problem | Cause | Solution |
| --- | --- | --- |
| Terraform state lock stuck | Interrupted `terraform apply` left a DynamoDB lock | `terraform force-unlock <LOCK_ID>` after confirming no apply is running |
| Pods in `CrashLoopBackOff` | Failing health checks or missing config/secrets | `kubectl logs <pod>`, verify ConfigMaps/Secrets, increase `startupProbe.failureThreshold` |
| Docker builds slow (10+ min) | Layer cache invalidated by early COPY of changing files | Copy dependency manifests before source; use BuildKit cache mounts |
| Helm upgrade fails with "another operation in progress" | Previous release in pending/failed state | `helm history <release>`, then `helm rollback <release> <last-good>` |
| Canary shows healthy but users report errors | Metrics aggregated across all pods mask canary errors | Use per-revision metric labels; configure Istio/Nginx to tag canary traffic |


References


| Guide | Path | Content |
| --- | --- | --- |
| CI/CD Pipeline Guide | `references/cicd_pipeline_guide.md` | Pipeline patterns, platform comparisons, optimization |
| Infrastructure as Code | `references/infrastructure_as_code.md` | Terraform patterns, module design, state management |
| Deployment Strategies | `references/deployment_strategies.md` | Strategy details, rollback procedures, traffic management |

See also:

- `references/kubernetes_patterns.md` for Helm charts, HPA/VPA/KEDA decisions, network policies, and RBAC patterns.
- `references/cloud_platform_guide.md` for AWS/GCP/Azure service comparison, multi-cloud strategy, and cost optimization.


Integration Points


| Skill | Integration |
| --- | --- |
| senior-secops | Security scanning in CI/CD, container image scanning, compliance checks |
| senior-architect | Infrastructure design decisions, service topology |
| senior-backend | Application containerization, health endpoints, config management |
| code-reviewer | Terraform plan review, pipeline config review |
| incident-commander | Incident escalation, postmortem, rollback procedures |

**Last Updated:** April 2026
**Version:** 2.1.0