devops-iac-engineer

DevOps IaC Engineer

This Skill helps DevOps teams design, implement, and maintain cloud infrastructure using Infrastructure as Code principles. Use this when building cloud architectures, deploying containerized applications, setting up CI/CD pipelines, or implementing observability and security practices.

Quick Navigation

Terraform & IaC: See terraform.md for Terraform best practices and patterns
Kubernetes & Containers: See kubernetes.md for container orchestration
Cloud Platforms: See cloud_platforms.md for AWS, Azure, GCP guidance
CI/CD Pipelines: See cicd.md for pipeline design and GitOps
Observability: See observability.md for monitoring and logging
Security: See security.md for DevSecOps practices
Templates & Tools: See templates.md for ready-to-use templates

Core Principles

Key DevOps Terminology (Consistent Throughout)

Infrastructure as Code (IaC): Managing infrastructure through version-controlled code files, typically declarative
GitOps: Using Git as the single source of truth for infrastructure and applications
Immutable Infrastructure: Infrastructure components that are replaced rather than modified
Service Mesh: Infrastructure layer for service-to-service communication
Observability: Ability to understand system state from external outputs (logs, metrics, traces)
SLI/SLO/SLA: Service Level Indicators/Objectives/Agreements for reliability
RTO/RPO: Recovery Time Objective/Recovery Point Objective for disaster recovery
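To make the SLI/SLO terms concrete, here is a minimal sketch of the error-budget arithmetic they imply. The function names and the 30-day window are assumptions chosen for illustration, not something this Skill prescribes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    sli = good_events / total_events      # measured indicator
    allowed_failure = 1.0 - slo_target    # budget expressed as a failure ratio
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure
```

Under these assumptions, a 99.9% availability SLO over 30 days allows roughly 43 minutes of downtime, and a service that has failed 0.05% of requests has spent about half of that budget.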

Workflow: Infrastructure Implementation

When implementing infrastructure, follow this structured approach:

Understand Requirements
- What is the business need? (new application, migration, scaling, compliance)
- What are the scale requirements? (traffic, data, geographic distribution)
- What are the constraints? (budget, timeline, regulatory)
- What are the dependencies? (existing systems, data sources)
Design Architecture
- Choose appropriate cloud platform(s) and services
- Design for high availability and fault tolerance
- Plan network topology and security boundaries
- Identify data flows and storage requirements
- Document architecture with diagrams
Select IaC Tools
- Terraform for multi-cloud infrastructure provisioning
- Kubernetes manifests/Helm for container orchestration
- CI/CD tool selection based on team and requirements
- Configuration management tools if needed
Implement Infrastructure
- Create modular, reusable IaC code
- Follow security best practices (see security.md)
- Implement proper state management and versioning
- Use consistent naming and tagging conventions
- Document code and create README files
Set Up Observability
- Define SLIs and SLOs for critical services
- Implement logging, metrics, and tracing
- Create dashboards and alerts
- Set up log aggregation and analysis
- Plan on-call rotation and runbooks
Implement CI/CD
- Design deployment pipeline stages
- Implement automated testing (unit, integration, e2e)
- Set up GitOps workflows
- Configure deployment strategies (blue/green, canary)
- Implement rollback procedures
Test & Validate
- Run infrastructure tests (security, compliance, cost)
- Perform disaster recovery drills
- Perform load testing and performance validation
- Run security scanning and penetration testing
- Document test results and improvements
Deploy & Monitor
- Execute phased rollout
- Monitor metrics and logs closely
- Validate against SLOs
- Document runbooks and troubleshooting guides
- Conduct post-deployment review
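The canary strategy and rollback steps in the workflow above can be sketched as a simple promotion gate. This is an illustrative sketch only: `CanaryStats`, `canary_decision`, and the threshold values are hypothetical, and a real rollout controller would usually compare more signals than error rate.

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    requests: int
    errors: int


def canary_decision(
    baseline: CanaryStats,
    canary: CanaryStats,
    max_error_rate_increase: float = 0.01,  # assumed tolerance, tune per service
    min_requests: int = 100,                # assumed minimum sample size
) -> str:
    """Return 'wait', 'rollback', or 'promote' for a canary release."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge the canary yet
    baseline_rate = baseline.errors / baseline.requests if baseline.requests else 0.0
    canary_rate = canary.errors / canary.requests
    if canary_rate - baseline_rate > max_error_rate_increase:
        return "rollback"
    return "promote"
```

Keeping the decision in a pure function like this makes the rollback rule easy to unit test before it is wired into a pipeline.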

Decision Framework: Tool Selection

Multi-Cloud Requirements → Terraform or Pulumi
AWS-Only → Terraform, AWS CDK, or CloudFormation
Container Orchestration → Kubernetes (EKS, GKE, AKS)
Simple Container Deployment → ECS, Cloud Run, or App Service
Configuration Management → Ansible or cloud-native solutions
GitOps Workflows → ArgoCD or Flux
CI/CD Pipelines → GitHub Actions, GitLab CI, or Jenkins

Common Challenges & Solutions

Problem: Infrastructure drift between code and reality
Solution: Implement automated drift detection, run terraform plan in CI/CD, restrict production access to read-only so changes go through code, and maintain state file integrity
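One way to run terraform plan as a drift check in CI is to rely on its `-detailed-exitcode` flag, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below assumes the `terraform` binary is on the PATH and the working directory is already initialized; the function names and status labels are hypothetical.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "drift_or_pending_changes"}


def interpret_plan_exit_code(code: int) -> str:
    """Map a plan exit code to a status label; unknown codes count as errors."""
    return PLAN_STATUS.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a non-interactive plan in workdir and report whether it found changes."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

A scheduled pipeline job could call `check_drift` for each environment directory and raise an alert on anything other than `in_sync`. Note that exit code 2 does not distinguish real-world drift from code changes that have not been applied yet.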

Problem: Secrets management and credential exposure
Solution: Use a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault), encrypt any secrets stored in Git with SOPS, and use IRSA or workload identity instead of long-lived credentials

Problem: High cloud costs and unexpected bills
Solution: Implement a tagging strategy with cost allocation tags, set up budget alerts, right-size resources, use spot instances for interruption-tolerant workloads, and implement auto-scaling
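A tagging strategy only helps with cost allocation if it is enforced. The following sketch reports resources that lack required tags; the tag names in `REQUIRED_TAGS` and the shape of the resource dictionaries are assumptions for illustration, not a real cloud provider API.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical tagging policy; replace with your organization's required tags.
REQUIRED_TAGS: Set[str] = {"owner", "environment", "cost-center"}


def missing_tags(
    resources: Iterable[dict],
    required: Set[str] = REQUIRED_TAGS,
) -> Dict[str, List[str]]:
    """Map each non-compliant resource id to its sorted list of missing tag keys."""
    report: Dict[str, List[str]] = {}
    for resource in resources:
        absent = required - set(resource.get("tags", {}))
        if absent:
            report[resource["id"]] = sorted(absent)
    return report
```

A check like this could run against an inventory export in CI, failing the build when the report is non-empty.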

Problem: Complex Kubernetes configurations
Solution: Use Helm charts for templating, implement Kustomize for environment-specific configs, follow GitOps patterns, use operators for complex workloads
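The idea behind environment-specific configs is a shared base plus small per-environment overlays. The sketch below shows that idea as a recursive dictionary merge. It is a simplification and not Kustomize's actual algorithm, which also merges lists by key; `apply_overlay` is a hypothetical helper.

```python
import copy


def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return a new dict where overlay values override base values, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

With a base of one replica and a production overlay that sets three replicas and a higher CPU request, the merged result keeps every base setting the overlay does not mention and leaves the base itself unchanged.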

Collaboration Tips

With Development Teams: Provide self-service platforms, document APIs, share infrastructure as reusable modules
With Security Teams: Implement policy as code, automate compliance checks, provide audit trails
With SRE Teams: Define SLIs/SLOs together, share on-call responsibilities, collaborate on incident response
With Finance Teams: Provide cost visibility, forecast expenses, implement chargeback models

Next Steps

Start with terraform.md if you're implementing infrastructure as code
Use kubernetes.md for container orchestration
Reference templates.md for ready-to-use configurations
Check observability.md to set up monitoring

Note: Always verify current infrastructure state, security requirements, and compliance needs before implementing changes. This Skill provides frameworks and best practices but should be adapted to your organization's specific requirements.
