# Senior DevOps Engineer (senior-devops)
The agent generates CI/CD pipelines, scaffolds Terraform infrastructure, and manages deployments with strategy selection, health checks, and rollback support.
## Quick Start
```bash
# Generate CI/CD pipeline from project analysis
python scripts/pipeline_generator.py <project-path> --platform github-actions --verbose

# Scaffold Terraform infrastructure
python scripts/terraform_scaffolder.py <target-path> --provider aws --env production --verbose

# Manage deployment with canary strategy
python scripts/deployment_manager.py <target-path> --strategy canary --verbose
```
## Tools Overview
| Tool | Input | Output |
|---|---|---|
| `pipeline_generator.py` | Project path | CI/CD pipeline config (GitHub Actions, GitLab CI, Jenkins, CircleCI) |
| `terraform_scaffolder.py` | Target path + provider | Terraform module structure with state config |
| `deployment_manager.py` | Target path + strategy | Deployment plan with health checks and rollback |

All tools support `--json` for machine-readable output and `--output`/`-o` for file writing.
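Since every tool exposes `--json`, a CI step can drive them programmatically. A minimal sketch — the shape of the report is an assumption here, not a documented schema:

```python
import json
import subprocess

def parse_report(stdout: str) -> dict:
    """Parse a tool's --json output; assumes a single JSON object on stdout."""
    return json.loads(stdout)

def run_tool(cmd: list[str]) -> dict:
    """Invoke one of the scaffolding tools with --json and return its parsed report."""
    proc = subprocess.run(cmd + ["--json"], capture_output=True, text=True, check=True)
    return parse_report(proc.stdout)

# Hypothetical usage:
# report = run_tool(["python", "scripts/pipeline_generator.py", ".",
#                    "--platform", "github-actions"])
```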
## Workflow 1: Containerize and Deploy
**Step 1 -- Build a production Dockerfile.**

The agent generates multi-stage Dockerfiles following this pattern:

```dockerfile
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --only=production && npm cache clean --force
COPY . .
RUN npm run build

# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
RUN addgroup -g 1001 appgroup && \
    adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1
CMD ["node", "dist/server.js"]
```
**Validation checkpoint:** Image builds with `docker build -t app:test .` and `docker run --rm app:test` returns healthy.
**Step 2 -- Deploy to Kubernetes.**
The agent creates a Deployment with probes, resource limits, and security context:
```yaml
spec:
  containers:
    - name: app
      image: myapp:1.2.3
      resources:
        requests: { cpu: 250m, memory: 256Mi }
        limits: { cpu: "1", memory: 512Mi }
      livenessProbe:
        httpGet: { path: /healthz, port: 3000 }
        initialDelaySeconds: 15
        periodSeconds: 20
      readinessProbe:
        httpGet: { path: /ready, port: 3000 }
        initialDelaySeconds: 5
        periodSeconds: 10
      startupProbe:
        httpGet: { path: /healthz, port: 3000 }
        failureThreshold: 30
        periodSeconds: 10
```

**Probe decision:**

- startupProbe: Slow-starting apps (JVM, model loading). Prevents liveness from killing during startup.
- livenessProbe: Detects deadlocks. Keep simple -- do not check downstream dependencies.
- readinessProbe: Controls traffic routing. Include dependency checks here.

**Validation checkpoint:** `kubectl get pods -l app=myapp` shows all pods Running and Ready.
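The probe split can be sketched as two handlers: liveness stays dependency-free, readiness gates traffic on dependencies. This is a sketch, not the agent's generated code; `check_dependency` stands in for whatever downstream probe you wire up:

```python
def liveness() -> tuple[int, str]:
    """/healthz: if the process can answer at all, it is alive.
    No downstream checks -- a dead database must not restart the pod."""
    return 200, "ok"

def readiness(check_dependency) -> tuple[int, str]:
    """/ready: fail while dependencies are unavailable so the Service stops
    routing traffic here, without restarting the container."""
    try:
        check_dependency()
        return 200, "ready"
    except Exception as exc:
        return 503, f"not ready: {exc}"
```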
## Workflow 2: Infrastructure as Code with Terraform
**Step 1 -- Scaffold the module structure.**

```bash
python scripts/terraform_scaffolder.py ./infrastructure --provider aws --env production --verbose
```

The agent produces:

```
infrastructure/
  modules/
    vpc/          # main.tf, variables.tf, outputs.tf
    eks/
    rds/
  environments/
    staging/      # main.tf, terraform.tfvars, backend.tf
    production/
```

**Step 2 -- Configure remote state.**

```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

**Step 3 -- Run drift detection in CI.**

```bash
terraform plan -detailed-exitcode -out=plan.tfplan
```

Exit 0 = clean, Exit 1 = error, Exit 2 = drift detected.
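A CI job can map those exit codes directly to an action; a minimal sketch:

```python
import subprocess

def classify_plan(exit_code: int) -> str:
    """Map `terraform plan -detailed-exitcode` results to an action label."""
    return {0: "clean", 1: "error", 2: "drift"}.get(exit_code, "unknown")

def check_drift(workdir: str) -> str:
    """Run plan in -detailed-exitcode mode; non-zero exits are expected here,
    so we read the return code instead of using check=True."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-out=plan.tfplan"],
        cwd=workdir,
    )
    return classify_plan(proc.returncode)
```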
**Validation checkpoint:** `terraform plan` shows no unexpected changes. Drift alerts fire within 24 hours.
**Key rules:**
- One state file per environment per component (blast radius control)
- Never store state locally or in git
- Run `terraform plan` in CI, `terraform apply` only after approval
- Use directories for environment separation, modules for shared logic
---
## Workflow 3: CI/CD Pipeline Design
```bash
python scripts/pipeline_generator.py /path/to/project --platform github-actions --json
```

The agent generates pipelines following these principles:
- Fail fast -- lint and unit tests before expensive integration tests
- Cache aggressively -- node_modules, Docker layers, pip packages
- Immutable artifacts -- build once, deploy the same artifact everywhere
- Gate promotions -- manual approval or smoke tests before production
- Parallel execution -- independent test suites and security scans run concurrently
Example: GitHub Actions with matrix testing and deployment gates
```yaml
jobs:
  test:
    strategy:
      matrix:
        node-version: [18, 20]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: "${{ matrix.node-version }}", cache: npm }
      - run: npm ci && npm run lint && npm test -- --coverage
  build:
    needs: [test, security]
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"
          cache-from: type=gha
          cache-to: type=gha,mode=max
  deploy-staging:
    needs: build
    environment: staging
    steps:
      - run: helm upgrade --install app charts/myapp --set image.tag=${{ github.sha }} --wait
  deploy-production:
    needs: deploy-staging
    environment: production  # requires manual approval
```

**Validation checkpoint:** Pipeline runs in under 15 minutes. All stages produce exit code 0.
## Deployment Strategy Selection
| Strategy | Risk | Rollback Speed | Infra Cost | Best For |
|---|---|---|---|---|
| Rolling | Medium | Minutes | 1x | Stateless services, internal APIs |
| Blue-Green | Low | Seconds | 2x | Mission-critical, zero-downtime |
| Canary | Low | Seconds | 1.1x | User-facing, gradual validation |
| Feature Flags | Lowest | Instant | 1x | Granular control, A/B testing |
Canary promotion ladder:
- Deploy at 5% traffic. Monitor error rate and latency for 10 min.
- Promote to 25%. Monitor 10 min.
- Promote to 50%. Monitor 15 min.
- Promote to 100%.
- Automated rollback if error rate exceeds baseline by 2x at any step.
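The ladder above can be sketched as a loop; `measure_error_rate` and `shift_traffic` are hypothetical hooks into your metrics backend and traffic router:

```python
LADDER = [(5, 10), (25, 10), (50, 15), (100, 0)]  # (traffic %, soak minutes)

def run_canary(measure_error_rate, baseline: float, shift_traffic=lambda pct: None):
    """Walk the promotion ladder, rolling back when the canary's error rate
    exceeds 2x the baseline at any step."""
    for weight, soak_minutes in LADDER:
        shift_traffic(weight)
        rate = measure_error_rate(weight, soak_minutes)  # observed over the soak window
        if rate > 2 * baseline:
            shift_traffic(0)  # automated rollback
            return ("rolled_back", weight)
    return ("promoted", 100)
```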
## Monitoring Essentials
Every service dashboard includes the Four Golden Signals:
- Latency -- P50, P90, P99 response times
- Traffic -- Requests per second by endpoint and status code
- Errors -- 5xx rate, 4xx rate, application error codes
- Saturation -- CPU, memory, connection pool, queue depth
SLO targets (example):
| Service | SLI | SLO | Error Budget |
|---|---|---|---|
| API Gateway | Successful requests / Total | 99.9% (43.8 min/month downtime) | 0.1% |
| API Latency | Requests < 500ms / Total | P99 < 500ms | 1% |
When the error budget is exhausted, the agent recommends freezing feature deployments until the budget recovers.
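The 43.8 min/month figure comes straight from the SLO arithmetic (using a 730-hour month):

```python
def error_budget_minutes(slo: float, hours_per_month: float = 730) -> float:
    """Allowed downtime per month for an availability SLO:
    (1 - SLO) * minutes in the window."""
    return (1 - slo) * hours_per_month * 60

# 99.9% availability leaves (1 - 0.999) * 43,800 = 43.8 minutes per month.
```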
## Anti-Patterns
- Monolithic state -- one Terraform state for everything. Split by component and environment.
- `latest` tag in production -- always use specific image tags.
- Secrets in image layers -- inject at runtime via environment variables or mounted secrets. Verify with `docker history --no-trunc`.
- No resource limits -- every container needs CPU/memory limits to prevent noisy-neighbor problems.
- Manual deployments -- automate with approval gates instead.
## Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| Terraform state lock stuck | Interrupted `apply` left the lock held | Confirm no apply is still running, then `terraform force-unlock <LOCK_ID>` |
| Pods stuck in `CrashLoopBackOff` | Failing health checks or missing config/secrets | Inspect `kubectl describe pod` events and `kubectl logs --previous` |
| Docker builds slow (10+ min) | Layer cache invalidated by early COPY of changing files | Copy dependency manifests before source; use BuildKit cache mounts |
| Helm upgrade fails with "another operation in progress" | Previous release stuck in pending/failed state | Roll back to the last deployed revision with `helm rollback` |
| Canary shows healthy but users report errors | Metrics aggregated across all pods mask canary errors | Use per-revision metric labels; configure Istio/Nginx to tag canary traffic |
## References
| Guide | Path | Content |
|---|---|---|
| CI/CD Pipeline Guide | | Pipeline patterns, platform comparisons, optimization |
| Infrastructure as Code | | Terraform patterns, module design, state management |
| Deployment Strategies | | Strategy details, rollback procedures, traffic management |

See also: `references/kubernetes_patterns.md` for Helm charts, HPA/VPA/KEDA decisions, network policies, and RBAC patterns; `references/cloud_platform_guide.md` for AWS/GCP/Azure service comparison, multi-cloud strategy, and cost optimization.
## Integration Points
| Skill | Integration |
|---|---|
| | Security scanning in CI/CD, container image scanning, compliance checks |
| | Infrastructure design decisions, service topology |
| | Application containerization, health endpoints, config management |
| | Terraform plan review, pipeline config review |
| | Incident escalation, postmortem, rollback procedures |
Last Updated: April 2026
Version: 2.1.0