devops-iac-engineer
DevOps IaC Engineer
This Skill helps DevOps teams design, implement, and maintain cloud infrastructure using Infrastructure as Code principles. Use this when building cloud architectures, deploying containerized applications, setting up CI/CD pipelines, or implementing observability and security practices.
Quick Navigation
- Terraform & IaC: See terraform.md for Terraform best practices and patterns
- Kubernetes & Containers: See kubernetes.md for container orchestration
- Cloud Platforms: See cloud_platforms.md for AWS, Azure, GCP guidance
- CI/CD Pipelines: See cicd.md for pipeline design and GitOps
- Observability: See observability.md for monitoring and logging
- Security: See security.md for DevSecOps practices
- Templates & Tools: See templates.md for ready-to-use templates
Core Principles
Key DevOps Terminology (Consistent Throughout)
- Infrastructure as Code (IaC): Managing infrastructure through version-controlled code files, usually declarative, rather than manual changes
- GitOps: Using Git as the single source of truth for infrastructure and applications
- Immutable Infrastructure: Infrastructure components that are replaced rather than modified
- Service Mesh: Infrastructure layer for service-to-service communication
- Observability: Ability to understand system state from external outputs (logs, metrics, traces)
- SLI/SLO/SLA: Service Level Indicators/Objectives/Agreements for reliability
- RTO/RPO: Recovery Time Objective/Recovery Point Objective for disaster recovery
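As a worked illustration of how an SLI relates to an SLO, the sketch below computes an error budget. It is a minimal example under stated assumptions: the function names, the 30-day window, and the event-based SLI are illustrative choices, not definitions taken from this Skill.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines" (assumed convention).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent.

    A negative result means the measured SLI is below the SLO.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    sli = good_events / total_events      # measured SLI (ratio of good events)
    allowed_failure = 1.0 - slo_target    # the budget, as a failure ratio
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure
```

For a 99.9% availability SLO over 30 days, the first function gives roughly 43.2 minutes of allowed downtime. An SLA would typically be set looser than the internal SLO so that the budget is exhausted before a contractual breach.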
Workflow: Infrastructure Implementation
When implementing infrastructure, follow this structured approach:
1. Understand Requirements
   - What is the business need? (new application, migration, scaling, compliance)
   - What are the scale requirements? (traffic, data, geographic distribution)
   - What are the constraints? (budget, timeline, regulatory)
   - What are the dependencies? (existing systems, data sources)
2. Design Architecture
   - Choose appropriate cloud platform(s) and services
   - Design for high availability and fault tolerance
   - Plan network topology and security boundaries
   - Identify data flows and storage requirements
   - Document architecture with diagrams
3. Select IaC Tools
   - Terraform for multi-cloud infrastructure provisioning
   - Kubernetes manifests/Helm for container orchestration
   - CI/CD tool selection based on team and requirements
   - Configuration management tools if needed
4. Implement Infrastructure
   - Create modular, reusable IaC code
   - Follow security best practices (see security.md)
   - Implement proper state management and versioning
   - Use consistent naming and tagging conventions
   - Document code and create README files
5. Set Up Observability
   - Define SLIs and SLOs for critical services
   - Implement logging, metrics, and tracing
   - Create dashboards and alerts
   - Set up log aggregation and analysis
   - Plan on-call rotation and runbooks
6. Implement CI/CD
   - Design deployment pipeline stages
   - Implement automated testing (unit, integration, e2e)
   - Set up GitOps workflows
   - Configure deployment strategies (blue/green, canary)
   - Implement rollback procedures
7. Test & Validate
   - Run infrastructure tests (security, compliance, cost)
   - Perform disaster recovery drills
   - Load testing and performance validation
   - Security scanning and penetration testing
   - Document test results and improvements
8. Deploy & Monitor
   - Execute phased rollout
   - Monitor metrics and logs closely
   - Validate against SLOs
   - Document runbooks and troubleshooting guides
   - Conduct post-deployment review
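The final steps (phased rollout, validation against SLOs, rollback) can be sketched as a simple gate loop. This is an illustrative sketch only: `phased_rollout`, the `measure_sli` callback, and the stage percentages are hypothetical names and values, and a real pipeline would delegate traffic shifting and rollback to its deployment tooling.

```python
from typing import Callable, Sequence


def phased_rollout(
    stages: Sequence[int],
    measure_sli: Callable[[int], float],
    slo_target: float,
) -> dict:
    """Advance through traffic stages, stopping at the first SLO violation.

    stages      -- traffic percentages to step through, e.g. (5, 25, 50, 100)
    measure_sli -- hypothetical callback returning the SLI observed at a stage
    slo_target  -- minimum acceptable SLI, as a fraction (e.g. 0.995)
    """
    completed = []
    for percent in stages:
        sli = measure_sli(percent)
        if sli < slo_target:
            # SLO violated: report the failing stage so the caller can roll back.
            return {"status": "rolled_back", "failed_at": percent, "completed": completed}
        completed.append(percent)
    return {"status": "promoted", "failed_at": None, "completed": completed}
```

Keeping the measurement behind a callback makes the gate easy to test with canned values before it is wired to a real metrics backend.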
Decision Framework: Tool Selection
Multi-Cloud Requirements → Terraform or Pulumi
AWS-Only → Terraform, AWS CDK, or CloudFormation
Container Orchestration → Kubernetes (EKS, GKE, AKS)
Simple Container Deployment → ECS, Cloud Run, or App Service
Configuration Management → Ansible or cloud-native solutions
GitOps Workflows → ArgoCD or Flux
CI/CD Pipelines → GitHub Actions, GitLab CI, or Jenkins
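The mapping above can also be kept as data so that scripts or checklists can query it. The sketch below is one possible encoding; the dictionary keys and the `candidate_tools` helper are hypothetical names chosen for illustration, while the tool lists mirror the framework above.

```python
# Requirement -> candidate tools, copied from the decision framework above.
# The keys are illustrative identifiers, not an established vocabulary.
TOOL_OPTIONS = {
    "multi-cloud": ["Terraform", "Pulumi"],
    "aws-only": ["Terraform", "AWS CDK", "CloudFormation"],
    "container-orchestration": ["Kubernetes (EKS, GKE, AKS)"],
    "simple-container-deployment": ["ECS", "Cloud Run", "App Service"],
    "configuration-management": ["Ansible", "cloud-native solutions"],
    "gitops": ["ArgoCD", "Flux"],
    "ci-cd": ["GitHub Actions", "GitLab CI", "Jenkins"],
}


def candidate_tools(requirements):
    """Return the candidate tools for each requested requirement key."""
    selected = {}
    for requirement in requirements:
        key = requirement.strip().lower()
        if key not in TOOL_OPTIONS:
            raise KeyError(f"unknown requirement: {requirement!r}")
        selected[key] = list(TOOL_OPTIONS[key])  # copy so callers cannot mutate the table
    return selected
```

The lookup only narrows the options; the final choice still depends on team skills and existing tooling.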
Common Challenges & Solutions
Problem: Infrastructure drift between code and reality
Solution: Implement automated drift detection, run terraform plan in CI/CD, restrict manual production access to read-only so that changes go through code, and maintain state file integrity
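One way to run terraform plan as a drift check in CI/CD is to use its -detailed-exitcode flag, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below wraps that behavior; the function names and the "in-sync"/"drift"/"error" labels are hypothetical, and it assumes the working directory is already initialized and the code on the checked-out branch has been applied, so a non-empty plan suggests drift rather than pending work.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a label."""
    if code == 0:
        return "in-sync"   # plan succeeded, no changes
    if code == 2:
        return "drift"     # plan succeeded, changes are present
    return "error"         # exit code 1 (or anything unexpected)


def detect_drift(workdir: str) -> str:
    """Run a plan in workdir and classify the result.

    Assumes the terraform binary is on PATH and `terraform init` has been run.
    """
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(proc.returncode)
```

A scheduled job could call `detect_drift` for each environment and raise an alert whenever the result is not "in-sync".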
Problem: Secrets management and credential exposure
Solution: Use a dedicated secret manager (AWS Secrets Manager, HashiCorp Vault), encrypt any secrets stored in Git with SOPS, and use IRSA (IAM Roles for Service Accounts) or workload identity instead of long-lived credentials
Problem: High cloud costs and unexpected bills
Solution: Implement a tagging strategy with cost allocation tags, set up budget alerts, right-size resources, use spot instances for interruption-tolerant workloads, and implement auto-scaling
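Tag enforcement and budget alerts are easy to prototype before adopting a cloud provider's native tooling. The sketch below is illustrative only: the required tag keys, the alert thresholds, and the resource dictionaries are assumptions, and in practice the inventory and spend figures would come from the provider's APIs or billing exports.

```python
def untagged_resources(resources, required=("cost-center", "environment", "owner")):
    """Return {resource_id: [missing tag keys]} for resources that violate the policy.

    Each resource is a dict with an "id" and an optional "tags" dict
    (a hypothetical inventory format).
    """
    violations = {}
    for resource in resources:
        tags = resource.get("tags") or {}
        missing = [key for key in required if not tags.get(key)]
        if missing:
            violations[resource["id"]] = missing
    return violations


def budget_alerts(spend_to_date, monthly_budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the thresholds (as fractions of the budget) that spend has reached."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    ratio = spend_to_date / monthly_budget
    return [t for t in thresholds if ratio >= t]
```

Running the tag check in CI against planned resources catches untagged infrastructure before it starts accruing unattributed cost.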
Problem: Complex Kubernetes configurations
Solution: Use Helm charts for templating, implement Kustomize for environment-specific configs, follow GitOps patterns, use operators for complex workloads
Collaboration Tips
- With Development Teams: Provide self-service platforms, document APIs, share infrastructure as reusable modules
- With Security Teams: Implement policy as code, automate compliance checks, provide audit trails
- With SRE Teams: Define SLIs/SLOs together, share on-call responsibilities, collaborate on incident response
- With Finance Teams: Provide cost visibility, forecast expenses, implement chargeback models
Next Steps
- Start with terraform.md if you're implementing infrastructure as code
- Use kubernetes.md for container orchestration
- Reference templates.md for ready-to-use configurations
- Check observability.md to set up monitoring
Note: Always verify current infrastructure state, security requirements, and compliance needs before implementing changes. This Skill provides frameworks and best practices but should be adapted to your organization's specific requirements.