DevOps Engineer


Purpose


Provides senior-level DevOps engineering expertise for CI/CD automation, infrastructure as code, container orchestration, and operational excellence. Specializes in building scalable deployment pipelines, cloud infrastructure automation, monitoring systems, and SRE practices across AWS, Azure, and GCP platforms.


When to Use


Designing end-to-end CI/CD pipelines from requirements to production
Implementing infrastructure as code (Terraform, Ansible, CloudFormation, Bicep)
Building container orchestration systems (Kubernetes, Docker, Helm)
Setting up monitoring and observability platforms (Prometheus, Grafana, ELK)
Automating deployment workflows and release management
Optimizing cloud infrastructure costs and performance
Implementing GitOps workflows and continuous delivery practices


Quick Start


Invoke this skill when:

Designing end-to-end CI/CD pipelines from requirements to production
Implementing infrastructure as code (Terraform, Ansible, CloudFormation)
Building container orchestration systems (Kubernetes, Docker, Helm)
Setting up monitoring and observability platforms (Prometheus, Grafana, ELK)
Automating deployment workflows and release management
Optimizing cloud infrastructure costs and performance

Do NOT invoke when:

Only simple script automation is needed (use backend-developer instead)
Only code review needed without DevOps context
Pure infrastructure architecture decisions (use cloud-architect for strategy)
Database-specific operations (use database-administrator)
Application-level debugging (use debugger skill)


Core Workflows Summary


Workflow 1: Build Complete CI/CD Pipeline from Scratch


Use case: Greenfield project needs full DevOps automation

Requirements Gathering Checklist:

Deployment Frequency (hourly/daily/weekly)
Tech Stack (language/framework, database, frontend)
Infrastructure (cloud provider, auto-scaling needs)
Testing (unit, integration, security scans)
Compliance (audit logging, approval gates, secrets management)
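As a minimal sketch, the checklist above can be captured as a small data structure that drives the pipeline layout. The class name, field names, and stage labels here are hypothetical and chosen only for illustration; they do not come from any particular CI system.

```python
from dataclasses import dataclass, field

ALLOWED_FREQUENCIES = ("hourly", "daily", "weekly")


@dataclass
class PipelineRequirements:
    """Answers to the requirements checklist (illustrative field names)."""
    deploy_frequency: str                 # one of ALLOWED_FREQUENCIES
    tech_stack: str                       # e.g. "python/django + postgres"
    cloud_provider: str                   # e.g. "aws", "azure", "gcp"
    needs_autoscaling: bool = False
    test_types: list = field(default_factory=lambda: ["unit"])
    needs_approval_gate: bool = False     # compliance: manual approval before deploy

    def __post_init__(self):
        if self.deploy_frequency not in ALLOWED_FREQUENCIES:
            raise ValueError(f"unknown deployment frequency: {self.deploy_frequency}")


def plan_stages(req: PipelineRequirements) -> list:
    """Derive an ordered list of pipeline stages from the requirements."""
    stages = ["build"]
    stages += [f"test:{t}" for t in req.test_types]
    if req.needs_approval_gate:
        stages.append("manual-approval")
    stages.append("deploy")
    return stages


if __name__ == "__main__":
    req = PipelineRequirements(
        deploy_frequency="daily",
        tech_stack="python/django + postgres",
        cloud_provider="aws",
        test_types=["unit", "integration", "security"],
        needs_approval_gate=True,
    )
    print(plan_stages(req))
```

Recording the answers in one place makes it easier to review them with stakeholders before any pipeline code is written.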


Workflow 2: Infrastructure as Code


Use case: Manage cloud resources declaratively with Terraform

Key Components:

State management (S3 backend with DynamoDB locking)
Module composition (VPC, EKS, RDS)
Environment separation (dev/staging/production)
Tagging strategy for cost allocation
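The components above could be sketched as follows. This example generates a Terraform backend block in Terraform's JSON configuration syntax, with one state key per environment, and builds a shared tag set for cost allocation. The bucket and lock-table naming scheme, the region, and the tag keys are assumptions made for illustration, not values from this skill.

```python
import json

ENVIRONMENTS = ("dev", "staging", "production")


def backend_config(env: str, project: str, region: str = "us-east-1") -> dict:
    """Return an S3 backend block (Terraform JSON syntax) for one environment.

    The naming scheme below is hypothetical; adapt it to your own conventions.
    """
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return {
        "terraform": {
            "backend": {
                "s3": {
                    "bucket": f"{project}-terraform-state",
                    "key": f"{env}/terraform.tfstate",   # separate state per environment
                    "region": region,
                    "dynamodb_table": f"{project}-terraform-locks",  # state locking
                    "encrypt": True,
                }
            }
        }
    }


def cost_tags(env: str, project: str, cost_center: str) -> dict:
    """Tags applied to every resource so spend can be grouped in billing reports."""
    return {
        "Environment": env,
        "Project": project,
        "CostCenter": cost_center,
        "ManagedBy": "terraform",
    }


if __name__ == "__main__":
    # Write the output to e.g. backend.tf.json in the environment's directory.
    print(json.dumps(backend_config("staging", "shop"), indent=2))
    print(cost_tags("staging", "shop", "cc-1234"))
```

Keeping a distinct state key per environment means a mistake in dev cannot corrupt production state, which is the main point of the environment separation listed above.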


Workflow 3: Container Orchestration


Use case: Deploy applications to Kubernetes

Key Components:

Helm charts for templating
Deployments with rolling updates
Services and Ingress configuration
ConfigMaps and Secrets management
Resource limits and health checks
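To make several of these components concrete, here is a rough sketch that builds a Kubernetes Deployment manifest as a Python dictionary, covering the rolling update strategy, resource limits, and liveness and readiness probes. The probe paths (`/healthz`, `/ready`), the port, and the resource values are placeholders assumed for the example; in practice a Helm chart would template the same fields.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int = 3, port: int = 8080) -> dict:
    """Build an apps/v1 Deployment with rolling updates, limits, and probes."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Requests and limits are placeholder values, not recommendations.
        "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"cpu": "500m", "memory": "256Mi"},
        },
        # Liveness: restart the container if it stops responding.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "initialDelaySeconds": 10,
            "periodSeconds": 10,
        },
        # Readiness: only route traffic once the app reports ready.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},
            "periodSeconds": 5,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "strategy": {
                "type": "RollingUpdate",
                # Add one new pod at a time and never drop below the desired count.
                "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(deployment_manifest("web", "registry.example.com/web:1.0.0"), indent=2))
```

Setting `maxUnavailable` to 0 trades a slightly slower rollout for keeping full capacity during the update; other values may suit services that can tolerate reduced capacity.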


Decision Framework


GitOps Workflow Selection


GitOps Workflow Selection
├─ Small team (<5 developers)
│   └─ Push-based CI/CD (GitHub Actions, GitLab CI)
│       • Simpler to set up
│       • Direct kubectl/helm in pipeline
│
├─ Medium team (5-20 developers)
│   └─ GitOps with ArgoCD
│       • Git as single source of truth
│       • Automatic sync with self-heal
│       • Audit trail for all changes
│
└─ Large enterprise (20+ developers)
    └─ GitOps with ArgoCD + ApplicationSets
        • Multi-cluster management
        • Environment promotion
        • Tenant isolation
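The decision tree above can be expressed as a small lookup function. This is only a sketch: the tree lists both "5-20" and "20+", so the code assumes that a team of exactly 20 developers falls in the medium tier, and the function name is hypothetical.

```python
def choose_gitops_workflow(team_size: int) -> str:
    """Map team size to the workflow suggested by the decision tree.

    Assumption: exactly 20 developers counts as a medium team.
    """
    if team_size < 1:
        raise ValueError("team_size must be a positive integer")
    if team_size < 5:
        return "Push-based CI/CD (GitHub Actions, GitLab CI)"
    if team_size <= 20:
        return "GitOps with ArgoCD"
    return "GitOps with ArgoCD + ApplicationSets"


if __name__ == "__main__":
    for size in (3, 12, 50):
        print(size, "->", choose_gitops_workflow(size))
```

Team size is a proxy here; a small team running many clusters may still benefit from the larger-tier setup.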


Deployment Strategy Selection


Strategy	Rollback Speed	Risk	Complexity	Use Case
Rolling Update	Medium (minutes)	Low	Low	Standard deployments
Blue-Green	Instant	Very Low	Medium	Zero-downtime critical apps
Canary	Fast	Very Low	High	Gradual rollout with metrics
Recreate	N/A	High	Low	Dev/test environments only
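One way to apply the table is to encode it as data and filter by the risk and complexity a team is willing to accept. The sketch below does that; the ordering of the risk and complexity levels and the function name are assumptions introduced for the example.

```python
# Rows copied from the table above: (strategy, rollback speed, risk, complexity, use case)
STRATEGIES = [
    ("Rolling Update", "Medium (minutes)", "Low", "Low", "Standard deployments"),
    ("Blue-Green", "Instant", "Very Low", "Medium", "Zero-downtime critical apps"),
    ("Canary", "Fast", "Very Low", "High", "Gradual rollout with metrics"),
    ("Recreate", "N/A", "High", "Low", "Dev/test environments only"),
]

# Assumed orderings, lowest first.
RISK_ORDER = ["Very Low", "Low", "High"]
COMPLEXITY_ORDER = ["Low", "Medium", "High"]


def candidate_strategies(max_risk: str, max_complexity: str) -> list:
    """Return strategy names whose risk and complexity are within the given limits."""
    risk_limit = RISK_ORDER.index(max_risk)
    complexity_limit = COMPLEXITY_ORDER.index(max_complexity)
    return [
        name
        for name, _rollback, risk, complexity, _use_case in STRATEGIES
        if RISK_ORDER.index(risk) <= risk_limit
        and COMPLEXITY_ORDER.index(complexity) <= complexity_limit
    ]


if __name__ == "__main__":
    # A critical service whose team can handle medium operational complexity.
    print(candidate_strategies(max_risk="Very Low", max_complexity="Medium"))
```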


Quality Checklist


CI/CD Pipeline


Build stage completes in <5 minutes
All tests pass (unit, integration, security scans)
Automated rollback on failure
Deployment notifications configured (Slack/email)
Pipeline as code (version controlled)
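Some of these checks can be automated as a gate at the end of a pipeline run. The following is a minimal sketch that covers only the build-time and test-result items; the function name and the shape of its inputs are hypothetical, and a real pipeline would read these values from its CI system.

```python
def pipeline_gate(build_seconds: float, test_results: dict, max_build_seconds: float = 300) -> list:
    """Return a list of checklist violations; an empty list means the gate passes.

    test_results maps a suite name (e.g. "unit") to True if that suite passed.
    """
    failures = []
    # Checklist item: build stage completes in under 5 minutes (300 seconds).
    if build_seconds >= max_build_seconds:
        failures.append(
            f"build took {build_seconds:.0f}s, limit is under {max_build_seconds:.0f}s"
        )
    # Checklist item: all tests pass (unit, integration, security scans).
    for suite, passed in test_results.items():
        if not passed:
            failures.append(f"{suite} suite failed")
    return failures


if __name__ == "__main__":
    problems = pipeline_gate(
        build_seconds=412,
        test_results={"unit": True, "integration": True, "security": False},
    )
    for problem in problems:
        print("GATE FAILURE:", problem)
```

A non-empty result would be the natural trigger for the automated rollback and the deployment notification listed above.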


Infrastructure


All infrastructure defined as code (Terraform/CloudFormation)
Multi-environment support (dev/staging/production)
Auto-scaling policies configured
Disaster recovery tested (RTO/RPO documented)
Cost monitoring and budget alerts active


Containerization


Multi-stage Dockerfiles (optimized image size)
Security scanning passed (Trivy, Snyk)
Resource limits defined for all containers
Health checks implemented (liveness + readiness)
Runs as non-root user


Monitoring


Metrics collection configured (Prometheus/CloudWatch)
Dashboards created for key services
Alerts defined with runbooks
Log aggregation working (ELK/Loki)
Distributed tracing enabled (Jaeger/X-Ray)


Security


Secrets stored in vault (not in code)
RBAC configured (least privilege)
Network policies defined (zero trust)
Vulnerability scanning automated
Audit logging enabled


Documentation


Architecture diagrams created
Runbooks documented for common issues
Onboarding guide for new team members
Disaster recovery procedures documented
CI/CD pipeline documented


Additional Resources


Detailed Technical Reference: See REFERENCE.md
Code Examples & Patterns: See EXAMPLES.md

