devops-engineer

DevOps Engineer

Purpose

Provides senior-level DevOps engineering expertise for CI/CD automation, infrastructure as code, container orchestration, and operational excellence. Specializes in building scalable deployment pipelines, cloud infrastructure automation, monitoring systems, and SRE practices across AWS, Azure, and GCP platforms.

When to Use

  • Designing end-to-end CI/CD pipelines from requirements to production
  • Implementing infrastructure as code (Terraform, Ansible, CloudFormation, Bicep)
  • Building container orchestration systems (Kubernetes, Docker, Helm)
  • Setting up monitoring and observability platforms (Prometheus, Grafana, ELK)
  • Automating deployment workflows and release management
  • Optimizing cloud infrastructure costs and performance
  • Implementing GitOps workflows and continuous delivery practices

Quick Start

Invoke this skill when:
  • Designing end-to-end CI/CD pipelines from requirements to production
  • Implementing infrastructure as code (Terraform, Ansible, CloudFormation)
  • Building container orchestration systems (Kubernetes, Docker, Helm)
  • Setting up monitoring and observability platforms (Prometheus, Grafana, ELK)
  • Automating deployment workflows and release management
  • Optimizing cloud infrastructure costs and performance
Do NOT invoke when:
  • Only simple script automation is needed (use backend-developer instead)
  • Only a code review is needed, with no DevOps context
  • Pure infrastructure architecture decisions (use cloud-architect for strategy)
  • Database-specific operations (use database-administrator)
  • Application-level debugging (use debugger skill)

Core Workflows Summary

Workflow 1: Build Complete CI/CD Pipeline from Scratch

Use case: Greenfield project needs full DevOps automation
Requirements Gathering Checklist:
  • Deployment Frequency (hourly/daily/weekly)
  • Tech Stack (language/framework, database, frontend)
  • Infrastructure (cloud provider, auto-scaling needs)
  • Testing (unit, integration, security scans)
  • Compliance (audit logging, approval gates, secrets management)

Workflow 2: Infrastructure as Code

Use case: Manage cloud resources declaratively with Terraform
Key Components:
  • State management (S3 backend with DynamoDB locking)
  • Module composition (VPC, EKS, RDS)
  • Environment separation (dev/staging/production)
  • Tagging strategy for cost allocation
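As a rough illustration of the state-management, environment-separation, and tagging items above, the following Python sketch emits a Terraform JSON-syntax backend block plus a tag map. The bucket name, lock table name, region, state-key layout, and tag keys are all hypothetical placeholders, not values prescribed by this skill.

```python
import json

ENVIRONMENTS = ("dev", "staging", "production")


def s3_backend_config(env: str, bucket: str, lock_table: str, region: str) -> dict:
    """Build a Terraform JSON-syntax document that configures an S3 backend.

    Using one state key per environment keeps dev/staging/production
    state files separate inside the same bucket (an assumed layout).
    """
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return {
        "terraform": {
            "backend": {
                "s3": {
                    "bucket": bucket,
                    "key": f"{env}/terraform.tfstate",
                    "region": region,
                    "dynamodb_table": lock_table,  # table used for state locking
                    "encrypt": True,
                }
            }
        }
    }


def cost_tags(env: str, team: str, cost_center: str) -> dict:
    """Return a tag map for cost allocation; the tag keys are illustrative."""
    return {
        "Environment": env,
        "Team": team,
        "CostCenter": cost_center,
        "ManagedBy": "terraform",
    }


if __name__ == "__main__":
    # Hypothetical names; the output could be saved as backend.tf.json.
    config = s3_backend_config("staging", "example-tf-state", "example-tf-locks", "us-east-1")
    print(json.dumps(config, indent=2))
    print(json.dumps(cost_tags("staging", "platform", "cc-0000"), indent=2))
```

Generating the backend block per environment is only one way to separate environments; separate directories or workspaces are common alternatives.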

Workflow 3: Container Orchestration

Use case: Deploy applications to Kubernetes
Key Components:
  • Helm charts for templating
  • Deployments with rolling updates
  • Services and Ingress configuration
  • ConfigMaps and Secrets management
  • Resource limits and health checks
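To make the Deployment-related items above concrete, here is a minimal Python sketch that builds a Kubernetes Deployment manifest with a rolling-update strategy, resource limits, and liveness and readiness probes. The application name, image, port, probe paths, and resource figures are assumptions chosen for illustration; in practice a Helm chart would usually template these values.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int = 3, port: int = 8080) -> dict:
    """Return an apps/v1 Deployment as a plain dict (JSON-serializable)."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Requests and limits are placeholder figures, not recommendations.
        "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"cpu": "500m", "memory": "256Mi"},
        },
        # Probe paths are hypothetical; the application must serve them.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "initialDelaySeconds": 10,
            "periodSeconds": 10,
        },
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},
            "periodSeconds": 5,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Add one new pod at a time and never drop below the desired count.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(deployment_manifest("web", "registry.example.com/web:1.0.0"), indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.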

Decision Framework

GitOps Workflow Selection

GitOps Workflow Selection (by team size)
├─ Small team (<5 developers)
│   └─ Push-based CI/CD (GitHub Actions, GitLab CI)
│       • Simpler to set up
│       • Direct kubectl/helm in pipeline
├─ Medium team (5-20 developers)
│   └─ GitOps with ArgoCD
│       • Git as single source of truth
│       • Automatic sync with self-heal
│       • Audit trail for all changes
└─ Large enterprise (20+ developers)
    └─ GitOps with ArgoCD + ApplicationSets
        • Multi-cluster management
        • Environment promotion
        • Tenant isolation
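The tree above can be read as a simple lookup on team size. The sketch below encodes it in Python; the function name is hypothetical, and because the tree lists 20 developers under both "5-20" and "20+", the code assumes a team of exactly 20 counts as medium.

```python
def select_delivery_model(team_size: int) -> str:
    """Map a developer count to the delivery model suggested by the tree."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    if team_size < 5:
        return "Push-based CI/CD (GitHub Actions, GitLab CI)"
    if team_size <= 20:  # assumption: exactly 20 is treated as a medium team
        return "GitOps with ArgoCD"
    return "GitOps with ArgoCD + ApplicationSets"


if __name__ == "__main__":
    for size in (3, 12, 40):
        print(size, "->", select_delivery_model(size))
```

Team size is only a first-pass heuristic here; factors such as the number of clusters or audit requirements may justify GitOps for a smaller team.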

Deployment Strategy Selection

Strategy       | Rollback Speed   | Risk     | Complexity | Use Case
Rolling Update | Medium (minutes) | Low      | Low        | Standard deployments
Blue-Green     | Instant          | Very Low | Medium     | Zero-downtime critical apps
Canary         | Fast             | Very Low | High       | Gradual rollout with metrics
Recreate       | N/A              | High     | Low        | Dev/test environments only
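One possible way to apply the comparison above is to keep it as data and filter it by how much operational complexity a team can accept. The Python sketch below does that; the filtering rule (exclude Recreate for production, cap by complexity) is an assumed reading of the table, not a rule stated by it.

```python
from dataclasses import dataclass
from typing import List

# Ordering used to compare complexity levels.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


@dataclass(frozen=True)
class Strategy:
    name: str
    rollback_speed: str
    risk: str
    complexity: str
    use_case: str


# Rows copied from the comparison table.
STRATEGIES = [
    Strategy("Rolling Update", "Medium (minutes)", "Low", "Low", "Standard deployments"),
    Strategy("Blue-Green", "Instant", "Very Low", "Medium", "Zero-downtime critical apps"),
    Strategy("Canary", "Fast", "Very Low", "High", "Gradual rollout with metrics"),
    Strategy("Recreate", "N/A", "High", "Low", "Dev/test environments only"),
]


def candidates(max_complexity: str, production: bool) -> List[str]:
    """List strategy names whose complexity fits within max_complexity."""
    result = []
    for strategy in STRATEGIES:
        if production and strategy.name == "Recreate":
            continue  # the table reserves Recreate for dev/test environments
        if LEVELS[strategy.complexity] <= LEVELS[max_complexity]:
            result.append(strategy.name)
    return result


if __name__ == "__main__":
    print(candidates("Medium", production=True))
    print(candidates("Low", production=False))
```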

Quality Checklist

CI/CD Pipeline

  • Build stage completes in <5 minutes
  • All tests pass (unit, integration, security scans)
  • Automated rollback on failure
  • Deployment notifications configured (Slack/email)
  • Pipeline as code (version controlled)

Infrastructure

  • All infrastructure defined as code (Terraform/CloudFormation)
  • Multi-environment support (dev/staging/production)
  • Auto-scaling policies configured
  • Disaster recovery tested (RTO/RPO documented)
  • Cost monitoring and budget alerts active

Containerization

  • Multi-stage Dockerfiles (optimized image size)
  • Security scanning passed (Trivy, Snyk)
  • Resource limits defined for all containers
  • Health checks implemented (liveness + readiness)
  • Runs as non-root user
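Several of the checks above can be automated against a Kubernetes container spec. The Python sketch below is a minimal, hypothetical linter that inspects one container entry (as a parsed dict) for resource limits, both probes, and a non-root setting. It only looks at the container-level `securityContext`, so a pod-level `runAsNonRoot` would not be detected by this version.

```python
from typing import List


def container_findings(container: dict) -> List[str]:
    """Return checklist violations for one entry of spec.containers."""
    findings = []
    if not container.get("resources", {}).get("limits"):
        findings.append("missing resource limits")
    for probe in ("livenessProbe", "readinessProbe"):
        if probe not in container:
            findings.append(f"missing {probe}")
    # Only the container-level securityContext is inspected in this sketch.
    if container.get("securityContext", {}).get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot is not set to true")
    return findings


if __name__ == "__main__":
    # Hypothetical container entry that sets limits but nothing else.
    example = {
        "name": "web",
        "image": "registry.example.com/web:1.0.0",
        "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
    }
    for finding in container_findings(example):
        print(finding)
```

A check like this could run in the pipeline before deployment; image-level items such as multi-stage builds and vulnerability scans need separate tooling.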

Monitoring

  • Metrics collection configured (Prometheus/CloudWatch)
  • Dashboards created for key services
  • Alerts defined with runbooks
  • Log aggregation working (ELK/Loki)
  • Distributed tracing enabled (Jaeger/X-Ray)
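The "alerts defined with runbooks" item can be checked mechanically. The Python sketch below walks a parsed Prometheus rule file (the `groups` / `rules` structure) and reports alerts that lack a runbook link. The `runbook_url` annotation key is a common convention rather than something Prometheus requires, and the alert name, metric, and URL in the example are hypothetical.

```python
from typing import List


def alerts_missing_runbook(rule_file: dict) -> List[str]:
    """Return names of alerting rules without a runbook_url annotation."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules use "record" instead of "alert"
            if not rule.get("annotations", {}).get("runbook_url"):
                missing.append(rule["alert"])
    return missing


if __name__ == "__main__":
    # Hypothetical rule file content, as it would look after YAML parsing.
    rules = {
        "groups": [
            {
                "name": "web",
                "rules": [
                    {
                        "alert": "HighErrorRate",
                        "expr": 'sum(rate(http_requests_total{status=~"5.."}[5m])) > 1',
                        "for": "10m",
                        "labels": {"severity": "page"},
                        "annotations": {"runbook_url": "https://runbooks.example.com/high-error-rate"},
                    },
                    {"alert": "DiskAlmostFull", "expr": "disk_free_ratio < 0.1"},
                ],
            }
        ]
    }
    print(alerts_missing_runbook(rules))
```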

Security

  • Secrets stored in vault (not in code)
  • RBAC configured (least privilege)
  • Network policies defined (zero trust)
  • Vulnerability scanning automated
  • Audit logging enabled

Documentation

  • Architecture diagrams created
  • Runbooks documented for common issues
  • Onboarding guide for new team members
  • Disaster recovery procedures documented and tested
  • CI/CD pipeline documented

Additional Resources

  • Detailed Technical Reference: See REFERENCE.md
  • Code Examples & Patterns: See EXAMPLES.md