writing-infrastructure-code

Infrastructure as Code

Provision and manage cloud infrastructure using code-based automation tools. This skill covers tool selection, state management, module design, and operational patterns across Terraform/OpenTofu, Pulumi, and AWS CDK.

When to Use

Use this skill when:
  • Provisioning cloud infrastructure (compute, networking, databases, storage)
  • Migrating from manual infrastructure to code-based workflows
  • Designing reusable infrastructure modules
  • Implementing multi-cloud or hybrid-cloud deployments
  • Establishing state management and drift detection patterns
  • Integrating infrastructure provisioning into CI/CD pipelines
  • Evaluating IaC tools (Terraform vs Pulumi vs CDK)
Common requests:
  • "Create a Terraform module for VPC provisioning"
  • "Set up remote state with locking for team collaboration"
  • "Compare Pulumi vs Terraform for our use case"
  • "Design composable infrastructure modules"
  • "Implement drift detection for existing infrastructure"

Core Concepts

Infrastructure as Code Fundamentals

Key Principles:
  1. Declarative vs Imperative - Describe the desired state in a configuration language (Terraform) or build it with general-purpose code (Pulumi)
  2. Idempotency - Re-applying the same configuration converges on the same infrastructure, so runs are safe to repeat
  3. Version Control - Infrastructure changes tracked in Git
  4. State Management - Track actual infrastructure state
  5. Module Composition - Reusable, versioned infrastructure components
Benefits:
  • Reproducibility (same code = same infrastructure)
  • Auditability (Git history shows all changes)
  • Collaboration (code reviews for infrastructure changes)
  • Automation (CI/CD deploys infrastructure)
  • Disaster recovery (rebuild from code)

Tool Selection Framework

Choose IaC tools based on team composition and cloud strategy:
Terraform/OpenTofu - Declarative, HCL-based
  • Multi-cloud and hybrid-cloud deployments
  • Operations/SRE teams that prefer a declarative approach
  • Largest provider ecosystem (AWS, GCP, Azure, 3000+ providers)
  • Mature module registry and community
Pulumi - General-purpose programming languages (imperative code driving a declarative engine)
  • Developer-centric teams familiar with TypeScript/Python/Go
  • Complex logic that needs programming constructs (loops, conditionals, functions)
  • Native unit testing using familiar test frameworks
  • Strong typing and IDE support
AWS CDK - AWS-native, programming language-based
  • AWS-only infrastructure
  • Tight integration with AWS services
  • L1/L2/L3 construct abstractions
  • CloudFormation under the hood
Decision Tree:
Multi-cloud required?
├─ YES → Team composition?
│  ├─ Ops/SRE focused → Terraform/OpenTofu
│  └─ Developer focused → Pulumi
└─ NO → AWS only?
   ├─ YES → Language preference?
   │  ├─ HCL/declarative → Terraform
   │  ├─ TypeScript/Python → AWS CDK
   │  └─ YAML/simple → CloudFormation
   └─ NO → GCP/Azure only?
      └─ Terraform or Pulumi

State Management Architecture

Remote state with locking enables team collaboration:
Backend Selection:
| Cloud Provider | Recommended Backend | Locking Mechanism |
|---|---|---|
| AWS | S3 + DynamoDB | DynamoDB table |
| GCP | Google Cloud Storage | Native |
| Azure | Azure Blob Storage | Lease-based |
| Multi-cloud | Terraform Cloud/Enterprise | Built-in |
| Pulumi | Pulumi Service | Built-in |
State Isolation Strategies:
  1. Directory Separation (recommended for most teams)
    • Separate directories per environment (prod/, staging/, dev/)
    • Complete state file isolation
    • No risk of cross-environment contamination
  2. Workspaces
    • Single codebase, multiple environments
    • Shared state backend, environment namespacing
    • Risk: accidental cross-environment operations
  3. Layered Architecture
    • Separate state files for networking, compute, data layers
    • Blast radius reduction
    • Cross-layer references via remote state data sources
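The cross-layer reference in the layered approach can be sketched with a `terraform_remote_state` data source. This is a minimal illustration, assuming an S3 backend for the networking layer; the bucket name, state key, region, the `private_subnet_id` output, and the `ami_id` variable are hypothetical placeholders, not names taken from this skill.

```hcl
# Compute layer: read outputs published by the networking layer's state.
# All names below are illustrative assumptions.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"        # hypothetical state bucket
    key    = "prod/network/terraform.tfstate" # state file of the networking layer
    region = "us-east-1"
  }
}

variable "ami_id" {
  description = "AMI used for the example instance."
  type        = string
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Only outputs declared by the networking layer are visible here.
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
```

The compute layer sees only what the networking layer chooses to export as outputs, so each layer can be planned and applied on its own.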
Critical State Management Rules:
  • Always use remote state for team environments
  • Enable state file encryption at rest
  • Enable versioning on state storage
  • Use state locking to prevent concurrent modifications
  • Never commit state files to Git
  • Mark sensitive outputs as sensitive = true
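Several of these rules can be expressed directly in configuration. The sketch below assumes the S3 + DynamoDB backend from the table above; the bucket, key, region, and table names are hypothetical. Bucket versioning is configured on the bucket itself and is not shown. Newer Terraform and OpenTofu releases also offer S3-native locking, so check the backend documentation for the version in use.

```hcl
# Remote state in S3 with encryption and DynamoDB locking.
# Bucket, key, region, and table names are hypothetical placeholders.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "prod/network/terraform.tfstate" # one key per environment and layer
    region         = "us-east-1"
    encrypt        = true                      # server-side encryption at rest
    dynamodb_table = "example-terraform-locks" # table needs a "LockID" string hash key
  }
}

# A sensitive input passed through to an output.
variable "db_password" {
  description = "Database master password, supplied at apply time."
  type        = string
  sensitive   = true
}

output "db_password" {
  description = "Redacted in plan and apply output, but still stored in state."
  value       = var.db_password
  sensitive   = true
}
```

Marking a value as sensitive only hides it from CLI output. It is still written to the state file in plain text, which is why encryption and restricted access to the state bucket matter.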

Module Design Patterns

Composable Module Structure:
modules/
├── vpc/              # Network foundation
├── security-group/   # Reusable security group patterns
├── rds/              # Database with backups, encryption
├── ecs-cluster/      # Container orchestration base
├── ecs-service/      # Individual microservice
└── alb/              # Application load balancer
Module Versioning:
  • Pin module versions in production (version = "5.1.0")
  • Use semantic versioning for internal modules
  • Test module updates in non-prod first
  • Maintain CHANGELOG for module releases
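A pinned module call might look like the following sketch. It assumes the public `terraform-aws-modules/vpc/aws` registry module as the source, since the skill does not say which module the `5.1.0` example refers to; the name, CIDR ranges, and availability zones are illustrative values.

```hcl
# Module call pinned to an exact version, so upgrades are explicit changes.
# Source and input values are assumptions for illustration.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.0" # exact pin; change it deliberately, after testing in non-prod

  name = "example-prod"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]
}
```

An exact pin trades automatic patch updates for predictability. A constraint such as `~> 5.1` is a looser alternative for non-production environments.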
Module Design Principles:
  • Clear input contract (required vs optional variables)
  • Documented outputs (what consumers can reference)
  • Sane defaults where possible
  • Validation rules for inputs
  • Examples directory showing usage
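The input contract, defaults, validation, and documented outputs can be sketched in a module's variable and output files. The variable names and allowed values below are hypothetical.

```hcl
# Required input: no default, so callers must set it.
variable "environment" {
  description = "Deployment environment name."
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}

# Optional input: a default makes it safe to omit.
variable "instance_count" {
  description = "Number of instances to run."
  type        = number
  default     = 2

  validation {
    condition     = var.instance_count >= 1 && var.instance_count <= 10
    error_message = "The instance count must be between 1 and 10."
  }
}

# Documented output that consumers of the module can reference.
output "environment" {
  description = "Environment this module instance was deployed to."
  value       = var.environment
}
```

Validation failures surface at plan time, before any resource is touched, so a bad input is cheap to catch.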
When to Create a Module:
  • Resource group is reused 3+ times
  • Clear boundaries and responsibilities
  • Stable interface contract
  • Team has module maintenance capacity
When to Keep Monolithic:
  • One-off infrastructure
  • Rapid prototyping phase
  • High coupling between resources
  • Small team, simple infrastructure

Quick Reference

Terraform/OpenTofu Commands

```bash
# Initialize providers and backend
terraform init

# Plan changes (preview)
terraform plan

# Apply changes
terraform apply

# Destroy infrastructure
terraform destroy

# Format HCL files
terraform fmt

# Validate syntax
terraform validate

# Show state
terraform state list
terraform state show <resource>

# Import existing resources
terraform import <resource.name> <id>

# Workspace management
terraform workspace list
terraform workspace new staging
terraform workspace select prod
```

Pulumi Commands

```bash
# Initialize new project
pulumi new aws-typescript

# Preview changes
pulumi preview

# Apply changes
pulumi up

# Destroy infrastructure
pulumi destroy

# Show stack outputs
pulumi stack output

# Manage stacks
pulumi stack ls
pulumi stack select prod

# Import existing resources
pulumi import <type> <name> <id>

# Export/import state
pulumi stack export > state.json
pulumi stack import < state.json
```

AWS CDK Commands

```bash
# Initialize new app
cdk init app --language typescript

# Synthesize CloudFormation
cdk synth

# Preview changes
cdk diff

# Deploy stack
cdk deploy

# Destroy stack
cdk destroy

# Bootstrap account/region
cdk bootstrap

# List stacks
cdk list
```

Common Patterns Checklist

Infrastructure Provisioning:
  • Remote state configured with locking
  • State file encryption enabled
  • Provider versions pinned
  • Module versions pinned (production)
  • Variables have descriptions and types
  • Sensitive outputs marked as sensitive
  • Tagging strategy implemented
  • Cost allocation tags applied
Module Development:
  • Clear README with usage examples
  • Required vs optional variables documented
  • Outputs documented with descriptions
  • Validation rules for critical inputs
  • Examples directory with working code
  • Tests for module behavior (Terratest/CDK assertions)
  • CHANGELOG for version tracking
  • Semantic versioning followed
Operational Readiness:
  • Drift detection scheduled
  • CI/CD pipeline for plan/apply
  • State backup strategy
  • Disaster recovery documented
  • Team access controls configured (IAM/RBAC)
  • Cost estimation integrated (Infracost)
  • Security scanning integrated (Checkov/tfsec)
  • Documentation kept current
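The version-pinning and tagging items from the provisioning checklist can be sketched in a root module as follows. The version constraints, region, and tag keys are illustrative assumptions, not values prescribed by this skill.

```hcl
# Pin the CLI and provider versions so every run resolves the same releases.
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # allows 5.x updates, blocks the next major version
    }
  }
}

# default_tags applies the same tags to every taggable resource from this provider.
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Environment = "prod"
      CostCenter  = "example-team" # hypothetical cost allocation tag
      ManagedBy   = "terraform"
    }
  }
}
```

Committing the generated `.terraform.lock.hcl` file alongside this configuration records the exact provider versions and checksums selected within the constraint.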

Detailed Documentation

For comprehensive patterns and implementation details:
Tool-Specific Patterns:
  • references/terraform-patterns.md - Terraform/OpenTofu best practices, HCL patterns
  • references/pulumi-patterns.md - Pulumi across TypeScript/Python/Go
Architecture and Design:
  • references/state-management.md - Remote state, locking, isolation strategies
  • references/module-design.md - Composable modules, versioning, registries
Operations:
  • references/drift-detection.md - Detecting and remediating infrastructure drift

Working Examples

Practical implementations demonstrating IaC patterns:
Terraform Examples:
  • examples/terraform/vpc-module/ - Multi-AZ VPC with public/private subnets
  • examples/terraform/ecs-service/ - ECS service with ALB, autoscaling
  • examples/terraform/rds-cluster/ - Aurora cluster with backups, encryption
  • examples/terraform/state-backend/ - S3 + DynamoDB backend setup
Pulumi Examples:
  • examples/pulumi/typescript/vpc/ - TypeScript VPC component
  • examples/pulumi/python/ecs-service/ - Python ECS service
  • examples/pulumi/go/rds-cluster/ - Go RDS cluster
  • examples/pulumi/testing/ - Unit tests for Pulumi programs
AWS CDK Examples:
  • examples/cdk/typescript/vpc-stack/ - VPC using L2 constructs
  • examples/cdk/typescript/ecs-fargate/ - Fargate service with ALB
  • examples/cdk/typescript/pipeline-stack/ - Self-mutating CDK pipeline
  • examples/cdk/testing/ - CDK assertions and snapshot tests

Utility Scripts

Automated validation and operational tools:
  • scripts/validate-terraform.sh - Terraform fmt, validate, tflint
  • scripts/cost-estimate.sh - Infracost wrapper for cost analysis
  • scripts/drift-check.sh - Scheduled drift detection
  • scripts/security-scan.sh - Checkov/tfsec security scanning
  • scripts/state-backup.sh - State file backup automation
  • scripts/module-release.sh - Module versioning and publishing

Integration with Other Skills

Deployment Pipeline:
  • building-ci-pipelines - Automate terraform plan/apply in CI/CD
  • gitops-workflows - GitOps-based infrastructure deployment
Platform Engineering:
  • kubernetes-operations - Provision EKS, GKE, AKS clusters
  • platform-engineering - Internal developer platform infrastructure
Security:
  • secret-management - Provision Vault, External Secrets Operator
  • security-hardening - Implement infrastructure security controls
  • compliance-frameworks - Policy-as-code for compliance
Operations:
  • observability - Provision monitoring infrastructure (Prometheus, Grafana)
  • disaster-recovery - Infrastructure rebuild procedures
  • cost-optimization - Implement cost controls via IaC
Data Platform:
  • data-architecture - Provision data lakes, warehouses
  • streaming-data - Provision Kafka, Kinesis infrastructure

Best Practices

Development Workflow:
  1. Write infrastructure code in feature branches
  2. Run terraform plan / pulumi preview locally
  3. Submit pull request with plan output
  4. Code review focuses on security, cost, blast radius
  5. CI runs automated tests and security scans
  6. Apply only after approval and CI passes
  7. Monitor for drift post-deployment
State Management:
  • Use remote state from day one (never local state for teams)
  • Separate state files per environment
  • Enable state locking to prevent concurrent modifications
  • Version state storage for rollback capability
  • Encrypt state at rest (contains sensitive data)
  • Regular state backups to separate location
Module Development:
  • Start with monolithic code, extract modules when patterns emerge
  • Design for reusability but avoid premature abstraction
  • Document all inputs and outputs
  • Provide working examples in the examples/ directory
  • Pin provider versions in modules
  • Test modules before publishing
  • Use semantic versioning for releases
Security:
  • Scan IaC for security issues before apply (Checkov, tfsec)
  • Never commit secrets to code (use secret references)
  • Mark sensitive outputs as sensitive = true
  • Implement least-privilege IAM policies
  • Enable resource encryption by default
  • Use private module registries for internal modules
Cost Management:
  • Estimate costs before applying changes (Infracost)
  • Tag all resources for cost allocation
  • Review cost impact in pull requests
  • Set up alerts for cost drift (unexpected changes in spend)
  • Rightsize resources based on usage
Operational Excellence:
  • Schedule regular drift detection
  • Document disaster recovery procedures
  • Maintain runbooks for common operations
  • Monitor state file access logs
  • Practice infrastructure rebuilds periodically
  • Keep provider versions current with testing

Common Pitfalls

State File Issues:
  • Manual state editing - Use terraform state commands, not direct edits
  • No state locking - Race conditions corrupt state
  • Local state for teams - State divergence across team members
  • Large state files - Break into multiple state files by layer
Module Design:
  • Over-abstraction - Too generic, hard to understand
  • Under-abstraction - Copy-paste code everywhere
  • No version pinning - Unexpected breaking changes
  • No examples - Users don't know how to consume module
Operations:
  • No drift detection - Manual changes go unnoticed
  • Direct resource modification - Bypassing IaC creates drift
  • No rollback plan - Can't recover from failed apply
  • Ignoring plan output - Surprises during apply
Security:
  • Secrets in code - Hard-coded credentials
  • No security scanning - Vulnerabilities in production
  • Overly permissive IAM - Excessive privileges
  • No state encryption - Sensitive data exposed

Troubleshooting Guide

State Lock Issues:
```bash
terraform force-unlock <lock-id>  # Use only when certain no other process holds the lock
```
Import Existing Resources:
```bash
terraform import aws_vpc.main vpc-12345678
pulumi import aws:ec2/vpc:Vpc main vpc-12345678
```
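Terraform 1.5 and later (and OpenTofu 1.6 and later) also accept a declarative `import` block, which lets the import go through the normal plan and review cycle instead of a one-off CLI command. A minimal sketch, reusing the VPC ID from the command above and assuming a CIDR block that would need to match the real VPC:

```hcl
# Declarative import: the next plan shows the import, and apply records it in state.
import {
  to = aws_vpc.main
  id = "vpc-12345678"
}

# The resource block must describe the existing VPC.
# The CIDR here is an assumed value for illustration.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}
```

After a successful apply the `import` block can be removed, since the resource is then tracked in state like any other.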
Drift Detection:
```bash
terraform plan -detailed-exitcode  # Exit 0 = no changes, 1 = error, 2 = changes present (drift)
pulumi preview --diff
```
For detailed drift remediation, see references/drift-detection.md.
State Recovery:
```bash
# Terraform: restore a state backup from S3
aws s3 cp s3://bucket/backup/terraform.tfstate terraform.tfstate

# Pulumi: restore from an earlier stack version (<version> is a stack update version number)
pulumi stack export --version <version> | pulumi stack import
```

Related Skills

For cloud-specific implementations:
  • aws-patterns - AWS-specific resource patterns
  • gcp-patterns - GCP-specific resource patterns
  • azure-patterns - Azure-specific resource patterns
For infrastructure operations:
  • kubernetes-operations - Manage Kubernetes clusters provisioned via IaC
  • gitops-workflows - GitOps-based infrastructure deployment
  • platform-engineering - Internal developer platforms
For security and compliance:
  • security-hardening - Infrastructure security controls
  • secret-management - Secret injection and rotation
  • compliance-frameworks - Policy-as-code for compliance
For deployment automation:
  • building-ci-pipelines - CI/CD for infrastructure code
  • deploying-applications - Application deployment to provisioned infrastructure
For cost and observability:
  • cost-optimization - FinOps practices for infrastructure
  • observability - Monitoring infrastructure health