devops-engineer
DevOps Engineer
Purpose
Provides senior-level DevOps engineering expertise for CI/CD automation, infrastructure as code, container orchestration, and operational excellence. Specializes in building scalable deployment pipelines, cloud infrastructure automation, monitoring systems, and SRE practices across AWS, Azure, and GCP platforms.
When to Use
- Designing end-to-end CI/CD pipelines from requirements to production
- Implementing infrastructure as code (Terraform, Ansible, CloudFormation, Bicep)
- Building container orchestration systems (Kubernetes, Docker, Helm)
- Setting up monitoring and observability platforms (Prometheus, Grafana, ELK)
- Automating deployment workflows and release management
- Optimizing cloud infrastructure costs and performance
- Implementing GitOps workflows and continuous delivery practices
Quick Start
Invoke this skill when:
- Designing end-to-end CI/CD pipelines from requirements to production
- Implementing infrastructure as code (Terraform, Ansible, CloudFormation)
- Building container orchestration systems (Kubernetes, Docker, Helm)
- Setting up monitoring and observability platforms (Prometheus, Grafana, ELK)
- Automating deployment workflows and release management
- Optimizing cloud infrastructure costs and performance
Do NOT invoke when:
- Only simple script automation is needed (use backend-developer instead)
- Only a code review is needed, with no DevOps context
- Pure infrastructure architecture decisions (use cloud-architect for strategy)
- Database-specific operations (use database-administrator)
- Application-level debugging (use debugger skill)
Core Workflows Summary
Workflow 1: Build Complete CI/CD Pipeline from Scratch
Use case: Greenfield project needs full DevOps automation
Requirements Gathering Checklist:
- Deployment Frequency (hourly/daily/weekly)
- Tech Stack (language/framework, database, frontend)
- Infrastructure (cloud provider, auto-scaling needs)
- Testing (unit, integration, security scans)
- Compliance (audit logging, approval gates, secrets management)
Workflow 2: Infrastructure as Code
Use case: Manage cloud resources declaratively with Terraform
Key Components:
- State management (S3 backend with DynamoDB locking)
- Module composition (VPC, EKS, RDS)
- Environment separation (dev/staging/production)
- Tagging strategy for cost allocation
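The state-management, environment-separation, and tagging items above can be sketched together. This is a minimal illustration, not a definitive setup: it emits Terraform's JSON configuration syntax from Python, and the bucket name, lock table name, region, and tag keys are all hypothetical placeholders. The `dynamodb_table` argument matches the DynamoDB locking named above; recent Terraform releases may prefer other locking options, so check the S3 backend documentation for the version in use.

```python
import json

ENVIRONMENTS = {"dev", "staging", "production"}


def s3_backend_config(environment: str, bucket: str, lock_table: str, region: str) -> dict:
    """Build a Terraform JSON-syntax backend block with one state key per environment."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return {
        "terraform": {
            "backend": {
                "s3": {
                    "bucket": bucket,
                    # A separate key per environment keeps state files isolated.
                    "key": f"{environment}/terraform.tfstate",
                    "region": region,
                    "dynamodb_table": lock_table,  # state locking
                    "encrypt": True,
                }
            }
        }
    }


def cost_tags(environment: str, team: str, cost_center: str) -> dict:
    """Return a tag set for cost allocation (tag keys are illustrative assumptions)."""
    return {
        "Environment": environment,
        "Team": team,
        "CostCenter": cost_center,
        "ManagedBy": "terraform",
    }


if __name__ == "__main__":
    # All names below are placeholders, not values from a real account.
    config = s3_backend_config("staging", "example-tfstate", "example-tf-locks", "us-east-1")
    print(json.dumps(config, indent=2))
    print(cost_tags("staging", "platform", "cc-0000"))
```

Writing the first result to a file such as `backend.tf.json` is one way to keep each environment's backend definition generated rather than hand-edited.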
Workflow 3: Container Orchestration
Use case: Deploy applications to Kubernetes
Key Components:
- Helm charts for templating
- Deployments with rolling updates
- Services and Ingress configuration
- ConfigMaps and Secrets management
- Resource limits and health checks
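As a rough illustration of the rolling-update, resource-limit, and health-check items above, the sketch below builds a Kubernetes `apps/v1` Deployment manifest as a Python dictionary and prints it as JSON. The application name, image, port, probe paths, and resource figures are assumptions chosen for the example; Helm templating, Services, Ingress, ConfigMaps, and Secrets are not covered here.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int = 3, port: int = 8080) -> dict:
    """Build a Deployment with rolling updates, resource limits, and health probes."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Requests and limits are placeholder values; size them from real usage.
        "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"cpu": "500m", "memory": "256Mi"},
        },
        # Probe paths are hypothetical; the application must actually serve them.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "initialDelaySeconds": 10,
            "periodSeconds": 10,
        },
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},
            "periodSeconds": 5,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Add one new pod at a time and never drop below the desired count.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(deployment_manifest("web", "example.invalid/web:1.0.0"), indent=2))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the printed output can be saved to a file and applied directly in a test cluster.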
Decision Framework
GitOps Workflow Selection
Deployment Strategy Selection
├─ Small team (<5 developers)
│  └─ Push-based CI/CD (GitHub Actions, GitLab CI)
│     • Simpler to set up
│     • Direct kubectl/helm in pipeline
│
├─ Medium team (5-20 developers)
│  └─ GitOps with ArgoCD
│     • Git as single source of truth
│     • Automatic sync with self-heal
│     • Audit trail for all changes
│
└─ Large enterprise (20+ developers)
   └─ GitOps with ArgoCD + ApplicationSets
      • Multi-cluster management
      • Environment promotion
      • Tenant isolation
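The tree above can be read as a simple lookup on team size. The following sketch encodes it that way; the function name is hypothetical, and because the tree lists 20 developers under both "5-20" and "20+", the sketch assumes a team of exactly 20 falls in the medium tier.

```python
def recommend_workflow(team_size: int) -> str:
    """Map a developer count to the workflow suggested by the decision tree."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    if team_size < 5:
        return "Push-based CI/CD (GitHub Actions, GitLab CI)"
    if team_size <= 20:  # assumption: exactly 20 counts as a medium team
        return "GitOps with ArgoCD"
    return "GitOps with ArgoCD + ApplicationSets"


if __name__ == "__main__":
    for size in (3, 12, 40):
        print(size, "->", recommend_workflow(size))
```

Team size is only a proxy here; a small team running many clusters may still be better served by the GitOps options.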
Deployment Strategy Selection
| Strategy | Rollback Speed | Risk | Complexity | Use Case |
|---|---|---|---|---|
| Rolling Update | Medium (minutes) | Low | Low | Standard deployments |
| Blue-Green | Instant | Very Low | Medium | Zero-downtime critical apps |
| Canary | Fast | Very Low | High | Gradual rollout with metrics |
| Recreate | N/A | High | Low | Dev/test environments only |
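One way to use the table is to filter it by how much risk and complexity a team is willing to accept. The sketch below copies the table rows into a small data structure and returns the strategies under both caps; the ordering of levels and the `candidates` helper are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed ordering of the qualitative levels used in the table, lowest first.
LEVELS = ["Very Low", "Low", "Medium", "High"]


@dataclass(frozen=True)
class Strategy:
    name: str
    rollback_speed: str
    risk: str
    complexity: str
    use_case: str


STRATEGIES = [
    Strategy("Rolling Update", "Medium (minutes)", "Low", "Low", "Standard deployments"),
    Strategy("Blue-Green", "Instant", "Very Low", "Medium", "Zero-downtime critical apps"),
    Strategy("Canary", "Fast", "Very Low", "High", "Gradual rollout with metrics"),
    Strategy("Recreate", "N/A", "High", "Low", "Dev/test environments only"),
]


def candidates(max_risk: str, max_complexity: str) -> list[str]:
    """Return strategy names whose risk and complexity do not exceed the given caps."""
    risk_cap = LEVELS.index(max_risk)
    complexity_cap = LEVELS.index(max_complexity)
    return [
        s.name
        for s in STRATEGIES
        if LEVELS.index(s.risk) <= risk_cap and LEVELS.index(s.complexity) <= complexity_cap
    ]


if __name__ == "__main__":
    print(candidates("Low", "Low"))
    print(candidates("Very Low", "Medium"))
```

With these rows, capping both risk and complexity at "Low" leaves only Rolling Update, while a "Very Low" risk cap with "Medium" complexity leaves only Blue-Green.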
Quality Checklist
CI/CD Pipeline
- Build stage completes in <5 minutes
- All tests pass (unit, integration, security scans)
- Automated rollback on failure
- Deployment notifications configured (Slack/email)
- Pipeline as code (version controlled)
Infrastructure
- All infrastructure defined as code (Terraform/CloudFormation)
- Multi-environment support (dev/staging/production)
- Auto-scaling policies configured
- Disaster recovery tested (RTO/RPO documented)
- Cost monitoring and budget alerts active
Containerization
- Multi-stage Dockerfiles (optimized image size)
- Security scanning passed (Trivy, Snyk)
- Resource limits defined for all containers
- Health checks implemented (liveness + readiness)
- Runs as non-root user
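Three of the items above (resource limits, health checks, non-root execution) can be checked mechanically against a Kubernetes Deployment manifest. The sketch below is one possible check, assuming the manifest has already been loaded into a Python dictionary; the function name and message wording are hypothetical, and image size and vulnerability scanning are out of scope for it.

```python
def check_containers(manifest: dict) -> list[str]:
    """Return checklist findings for each container in a Deployment manifest."""
    pod_spec = manifest["spec"]["template"]["spec"]
    # A pod-level securityContext applies unless a container overrides it.
    pod_non_root = pod_spec.get("securityContext", {}).get("runAsNonRoot", False)
    findings = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        if not container.get("resources", {}).get("limits"):
            findings.append(f"{name}: no resource limits")
        for probe in ("livenessProbe", "readinessProbe"):
            if probe not in container:
                findings.append(f"{name}: missing {probe}")
        non_root = container.get("securityContext", {}).get("runAsNonRoot", pod_non_root)
        if not non_root:
            findings.append(f"{name}: not marked runAsNonRoot")
    return findings


if __name__ == "__main__":
    # Deliberately incomplete manifest with placeholder names.
    sample = {
        "spec": {
            "template": {
                "spec": {"containers": [{"name": "web", "image": "example.invalid/web:1.0.0"}]}
            }
        }
    }
    for finding in check_containers(sample):
        print(finding)
```

A check like this could run as a pipeline step before deployment, failing the build when the returned list is non-empty.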
Monitoring
- Metrics collection configured (Prometheus/CloudWatch)
- Dashboards created for key services
- Alerts defined with runbooks
- Log aggregation working (ELK/Loki)
- Distributed tracing enabled (Jaeger/X-Ray)
Security
- Secrets stored in vault (not in code)
- RBAC configured (least privilege)
- Network policies defined (zero trust)
- Vulnerability scanning automated
- Audit logging enabled
Documentation
- Architecture diagrams created
- Runbooks documented for common issues
- Onboarding guide for new team members
- Disaster recovery procedures tested
- CI/CD pipeline documented
Additional Resources
- Detailed Technical Reference: See REFERENCE.md
- Code Examples & Patterns: See EXAMPLES.md