Infrastructure as Code
Provision and manage cloud infrastructure using code-based automation tools. This skill covers tool selection, state management, module design, and operational patterns across Terraform/OpenTofu, Pulumi, and AWS CDK.
When to Use
Use this skill when:
- Provisioning cloud infrastructure (compute, networking, databases, storage)
- Migrating from manual infrastructure to code-based workflows
- Designing reusable infrastructure modules
- Implementing multi-cloud or hybrid-cloud deployments
- Establishing state management and drift detection patterns
- Integrating infrastructure provisioning into CI/CD pipelines
- Evaluating IaC tools (Terraform vs Pulumi vs CDK)
Common requests:
- "Create a Terraform module for VPC provisioning"
- "Set up remote state with locking for team collaboration"
- "Compare Pulumi vs Terraform for our use case"
- "Design composable infrastructure modules"
- "Implement drift detection for existing infrastructure"
Core Concepts
Infrastructure as Code Fundamentals
Key Principles:
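- Declarative vs Imperative - Describe desired state in a DSL (Terraform HCL) or build it with a general-purpose language (Pulumi, AWS CDK); in both cases the engine reconciles real infrastructure toward the declared state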
- Idempotency - Same input produces same output, safe to re-run
- Version Control - Infrastructure changes tracked in Git
- State Management - Track actual infrastructure state
- Module Composition - Reusable, versioned infrastructure components
Benefits:
- Reproducibility (same code = same infrastructure)
- Auditability (Git history shows all changes)
- Collaboration (code reviews for infrastructure changes)
- Automation (CI/CD deploys infrastructure)
- Disaster recovery (rebuild from code)
Tool Selection Framework
Choose IaC tools based on team composition and cloud strategy:
Terraform/OpenTofu - Declarative, HCL-based
- Multi-cloud and hybrid-cloud deployments
- Operations/SRE teams that prefer a declarative approach
- Largest provider ecosystem (AWS, GCP, Azure, 3000+ providers)
- Mature module registry and community
Pulumi - Programming language-based (imperative code that declares desired state)
- Developer-centric teams familiar with TypeScript/Python/Go
- Complex logic that requires programming constructs (loops, conditionals, functions)
- Native unit testing using familiar test frameworks
- Strong typing and IDE support
AWS CDK - AWS-native, programming language-based
- AWS-only infrastructure
- Tight integration with AWS services
- L1/L2/L3 construct abstractions
- CloudFormation under the hood
Decision Tree:
Multi-cloud required?
├─ YES → Team composition?
│   ├─ Ops/SRE focused → Terraform/OpenTofu
│   └─ Developer focused → Pulumi
└─ NO → AWS only?
    ├─ YES → Language preference?
    │   ├─ HCL/declarative → Terraform
    │   ├─ TypeScript/Python → AWS CDK
    │   └─ YAML/simple → CloudFormation
    └─ NO → GCP/Azure only?
        └─ Terraform or Pulumi
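
The decision tree can be read as a small lookup function. The following is a minimal sketch only: the function name `recommend_iac_tool`, its argument order, and the accepted values (`ops`/`dev`, `hcl`/`code`/`yaml`) are illustrative assumptions, not part of any tool.

```bash
# Sketch of the decision tree above as a POSIX shell function.
# Usage: recommend_iac_tool <multi_cloud: yes|no> <team: ops|dev> [aws_only: yes|no] [style: hcl|code|yaml]
recommend_iac_tool() {
  multi_cloud=${1:-}
  team=${2:-}
  aws_only=${3:-no}
  style=${4:-hcl}

  # Multi-cloud branch: the choice depends on team composition.
  if [ "$multi_cloud" = "yes" ]; then
    case "$team" in
      ops) echo "Terraform/OpenTofu" ;;
      dev) echo "Pulumi" ;;
      *)   echo "unknown team: $team" >&2; return 1 ;;
    esac
    return 0
  fi

  # AWS-only branch: the choice depends on language preference.
  if [ "$aws_only" = "yes" ]; then
    case "$style" in
      hcl)  echo "Terraform" ;;
      code) echo "AWS CDK" ;;
      yaml) echo "CloudFormation" ;;
      *)    echo "unknown style: $style" >&2; return 1 ;;
    esac
    return 0
  fi

  # Single cloud other than AWS (GCP or Azure).
  echo "Terraform or Pulumi"
}
```

For example, `recommend_iac_tool no dev yes code` prints `AWS CDK`. The tree is a starting heuristic; existing team skills and tooling usually weigh in as well.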
State Management Architecture
Remote state with locking enables team collaboration:
Backend Selection:
| Cloud Provider | Recommended Backend | Locking Mechanism |
|---|---|---|
| AWS | S3 + DynamoDB | DynamoDB table |
| GCP | Google Cloud Storage | Native |
| Azure | Azure Blob Storage | Lease-based |
| Multi-cloud | Terraform Cloud/Enterprise | Built-in |
| Pulumi | Pulumi Service | Built-in |
State Isolation Strategies:
- Directory Separation (recommended for most teams)
  - Separate directories per environment (for example, dev/, staging/, and prod/)
  - Complete state file isolation
  - No risk of cross-environment contamination
- Workspaces
  - Single codebase, multiple environments
  - Shared state backend, environment namespacing
  - Risk: accidental cross-environment operations
- Layered Architecture
  - Separate state files for networking, compute, data layers
  - Blast radius reduction
  - Cross-layer references via remote state data sources
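
With directory separation, every Terraform command should target exactly one environment directory. The wrapper below is a minimal sketch of that habit; the `environments/<name>` layout, the allowed environment names, and the function name `tf_env` are assumptions made for illustration.

```bash
# Run a Terraform subcommand against a single environment directory.
# Assumes a hypothetical layout: environments/dev, environments/staging, environments/prod.
tf_env() {
  if [ "$#" -lt 2 ]; then
    echo "usage: tf_env <dev|staging|prod> <terraform args...>" >&2
    return 2
  fi
  env_name=$1
  shift

  # Reject anything that is not a known environment to avoid typos
  # silently targeting the wrong directory.
  case "$env_name" in
    dev|staging|prod) ;;
    *) echo "unknown environment: $env_name" >&2; return 1 ;;
  esac

  # -chdir switches the working directory before the subcommand runs,
  # so each environment uses its own backend configuration and state file.
  terraform -chdir="environments/$env_name" "$@"
}
```

A call such as `tf_env staging plan -input=false` then runs the plan inside `environments/staging` only.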
Critical State Management Rules:
- Always use remote state for team environments
- Enable state file encryption at rest
- Enable versioning on state storage
- Use state locking to prevent concurrent modifications
- Never commit state files to Git
- Mark sensitive outputs as sensitive (sensitive = true in Terraform)
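
Several of these rules (remote state, encryption, locking, one state file per environment) can be applied at `terraform init` time through partial backend configuration. The sketch below assumes the configuration already contains an empty `backend "s3" {}` block; the bucket name, lock table name, default region, and the function name `init_remote_state` are hypothetical placeholders.

```bash
# Initialize the S3 backend for one environment using -backend-config overrides.
# STATE_BUCKET, LOCK_TABLE, and the defaults below are placeholder values.
init_remote_state() {
  env_name=${1:-}
  bucket=${STATE_BUCKET:-example-terraform-state}
  lock_table=${LOCK_TABLE:-example-terraform-locks}
  region=${AWS_REGION:-us-east-1}

  if [ -z "$env_name" ]; then
    echo "usage: init_remote_state <environment>" >&2
    return 2
  fi

  # Each environment gets its own state key; encryption and DynamoDB
  # locking are switched on for every environment.
  terraform init -input=false \
    -backend-config="bucket=$bucket" \
    -backend-config="key=$env_name/terraform.tfstate" \
    -backend-config="region=$region" \
    -backend-config="encrypt=true" \
    -backend-config="dynamodb_table=$lock_table"
}
```

Bucket versioning is a property of the bucket itself, so it has to be enabled when the bucket is created rather than in this step.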
Module Design Patterns
Composable Module Structure:
modules/
├── vpc/ # Network foundation
├── security-group/ # Reusable security group patterns
├── rds/ # Database with backups, encryption
├── ecs-cluster/ # Container orchestration base
├── ecs-service/ # Individual microservice
└── alb/ # Application load balancer
Module Versioning:
- Pin module versions in production (for example, an exact version for registry modules or a ?ref= tag for Git sources)
- Use semantic versioning for internal modules
- Test module updates in non-prod first
- Maintain CHANGELOG for module releases
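
Version pinning can be checked mechanically in CI. The function below is a rough sketch that flags Git-sourced modules lacking a `?ref=` pin. It is a text-level heuristic rather than an HCL parser, it does not cover registry modules, and the name `find_unpinned_git_modules` is made up for this example.

```bash
# Print module source lines that use a Git source without a ?ref= pin.
# Returns 1 when any unpinned source is found, 0 otherwise.
find_unpinned_git_modules() {
  dir=${1:-.}

  # Search *.tf files, skipping downloaded modules under .terraform/.
  # Passing /dev/null to grep forces file names to appear in the output.
  offenders=$(find "$dir" -type f -name '*.tf' ! -path '*/.terraform/*' \
      -exec grep -nE 'source[[:space:]]*=[[:space:]]*"git::' /dev/null {} + \
    | grep -v '[?&]ref=' || true)

  if [ -n "$offenders" ]; then
    printf '%s\n' "$offenders"
    return 1
  fi
  return 0
}
```

Running `find_unpinned_git_modules modules/` in a pipeline step fails the step whenever an unpinned Git source appears.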
Module Design Principles:
- Clear input contract (required vs optional variables)
- Documented outputs (what consumers can reference)
- Sane defaults where possible
- Validation rules for inputs
- Examples directory showing usage
When to Create a Module:
- Resource group is reused 3+ times
- Clear boundaries and responsibilities
- Stable interface contract
- Team has module maintenance capacity
When to Keep Monolithic:
- One-off infrastructure
- Rapid prototyping phase
- High coupling between resources
- Small team, simple infrastructure
Quick Reference
Terraform/OpenTofu Commands
bash
# Initialize providers and backend
terraform init
# Plan changes (preview)
terraform plan
# Apply changes
terraform apply
# Destroy infrastructure
terraform destroy
# Format HCL files
terraform fmt
# Validate syntax
terraform validate
# Show state
terraform state list
terraform state show <resource>
# Import existing resources
terraform import <resource.name> <id>
# Workspace management
terraform workspace list
terraform workspace new staging
terraform workspace select prod
Pulumi Commands
bash
# Initialize new project
pulumi new aws-typescript
# Preview changes
pulumi preview
# Apply changes
pulumi up
# Destroy infrastructure
pulumi destroy
# Show stack outputs
pulumi stack output
# Manage stacks
pulumi stack ls
pulumi stack select prod
# Import existing resources
pulumi import <type> <name> <id>
# Export/import state
pulumi stack export > state.json
pulumi stack import < state.json
AWS CDK Commands
bash
# Initialize new app
cdk init app --language typescript
# Synthesize CloudFormation
cdk synth
# Preview changes
cdk diff
# Deploy stack
cdk deploy
# Destroy stack
cdk destroy
# Bootstrap account/region
cdk bootstrap
# List stacks
cdk list
Common Patterns Checklist
Infrastructure Provisioning:
Module Development:
Operational Readiness:
Detailed Documentation
For comprehensive patterns and implementation details:
Tool-Specific Patterns:
- references/terraform-patterns.md - Terraform/OpenTofu best practices, HCL patterns
- references/pulumi-patterns.md - Pulumi across TypeScript/Python/Go
Architecture and Design:
- references/state-management.md - Remote state, locking, isolation strategies
- references/module-design.md - Composable modules, versioning, registries
Operations:
- references/drift-detection.md - Detecting and remediating infrastructure drift
Working Examples
Practical implementations demonstrating IaC patterns:
Terraform Examples:
- examples/terraform/vpc-module/ - Multi-AZ VPC with public/private subnets
- examples/terraform/ecs-service/ - ECS service with ALB, autoscaling
- examples/terraform/rds-cluster/ - Aurora cluster with backups, encryption
- examples/terraform/state-backend/ - S3 + DynamoDB backend setup
Pulumi Examples:
- examples/pulumi/typescript/vpc/ - TypeScript VPC component
- examples/pulumi/python/ecs-service/ - Python ECS service
- examples/pulumi/go/rds-cluster/ - Go RDS cluster
- - Unit tests for Pulumi programs
AWS CDK Examples:
- examples/cdk/typescript/vpc-stack/ - VPC using L2 constructs
- examples/cdk/typescript/ecs-fargate/ - Fargate service with ALB
- examples/cdk/typescript/pipeline-stack/ - Self-mutating CDK pipeline
- - CDK assertions and snapshot tests
Utility Scripts
Automated validation and operational tools:
- scripts/validate-terraform.sh - Terraform fmt, validate, tflint
- - Infracost wrapper for cost analysis
- - Scheduled drift detection
- - Checkov/tfsec security scanning
- - State file backup automation
- scripts/module-release.sh - Module versioning and publishing
Integration with Other Skills
Deployment Pipeline:
- - Automate terraform plan/apply in CI/CD
- - GitOps-based infrastructure deployment
Platform Engineering:
- - Provision EKS, GKE, AKS clusters
- - Internal developer platform infrastructure
Security:
- - Provision Vault, External Secrets Operator
- - Implement infrastructure security controls
- - Policy-as-code for compliance
Operations:
- - Provision monitoring infrastructure (Prometheus, Grafana)
- - Infrastructure rebuild procedures
- - Implement cost controls via IaC
Data Platform:
- - Provision data lakes, warehouses
- - Provision Kafka, Kinesis infrastructure
Best Practices
Development Workflow:
- Write infrastructure code in feature branches
- Run a plan or preview locally (for example, terraform plan or pulumi preview)
- Submit pull request with plan output
- Code review focuses on security, cost, blast radius
- CI runs automated tests and security scans
- Apply only after approval and CI passes
- Monitor for drift post-deployment
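
The local part of this workflow can be scripted so that every pull request carries the same checks. The following is a minimal sketch for Terraform; the function name `run_local_checks` and the output file names `tfplan` and `plan.txt` are arbitrary choices for this example.

```bash
# Local pre-PR checks: format, validate, then save a plan for review.
run_local_checks() {
  # Fail if any file is not canonically formatted (nothing is rewritten).
  terraform fmt -check -recursive \
    || { echo "formatting issues: run 'terraform fmt -recursive'" >&2; return 1; }

  # Catch syntax and internal consistency errors before planning.
  terraform validate \
    || { echo "validation failed" >&2; return 1; }

  # Save the plan so the exact reviewed plan can be applied later.
  terraform plan -input=false -out=tfplan \
    || { echo "plan failed" >&2; return 1; }

  # Render the saved plan as plain text for the pull request description.
  terraform show -no-color tfplan > plan.txt
}
```

Applying the saved file with `terraform apply tfplan` after approval ensures that what was reviewed is what gets applied.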
State Management:
- Use remote state from day one (never local state for teams)
- Separate state files per environment
- Enable state locking to prevent concurrent modifications
- Version state storage for rollback capability
- Encrypt state at rest (contains sensitive data)
- Regular state backups to separate location
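
For the backup practice, one possible approach is to pull the current state and write a timestamped copy that a separate job then ships to another location. This sketch assumes Terraform is already initialized in the working directory; the function name `backup_state` and the file naming scheme are hypothetical. State files contain sensitive data, so the copy is made readable by the owner only.

```bash
# Write a timestamped copy of the current Terraform state into <dest-dir>.
# Prints the path of the backup file on success.
backup_state() {
  dest=${1:-}
  if [ -z "$dest" ]; then
    echo "usage: backup_state <dest-dir>" >&2
    return 2
  fi
  mkdir -p "$dest" || return 1

  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  target="$dest/terraform-$stamp.tfstate"

  # `terraform state pull` prints the current state to stdout.
  # Write to a temp file first so a failed pull never leaves a partial backup.
  if terraform state pull > "$target.tmp" && [ -s "$target.tmp" ]; then
    mv "$target.tmp" "$target"
    chmod 600 "$target"
    echo "$target"
  else
    rm -f "$target.tmp"
    echo "state pull failed or returned empty state" >&2
    return 1
  fi
}
```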
Module Development:
- Start with monolithic code, extract modules when patterns emerge
- Design for reusability but avoid premature abstraction
- Document all inputs and outputs
- Provide working examples in an examples/ directory
- Pin provider versions in modules
- Test modules before publishing
- Use semantic versioning for releases
Security:
- Scan IaC for security issues before apply (Checkov, tfsec)
- Never commit secrets to code (use secret references)
- Mark sensitive outputs as sensitive (sensitive = true in Terraform)
- Implement least-privilege IAM policies
- Enable resource encryption by default
- Use private module registries for internal modules
Cost Management:
- Estimate costs before applying changes (Infracost)
- Tag all resources for cost allocation
- Review cost impact in pull requests
- Set up budget alerts to catch unexpected cost drift
- Rightsize resources based on usage
Operational Excellence:
- Schedule regular drift detection
- Document disaster recovery procedures
- Maintain runbooks for common operations
- Monitor state file access logs
- Practice infrastructure rebuilds periodically
- Keep provider versions current with testing
Common Pitfalls
State File Issues:
- Manual state editing - Use terraform state commands, not direct edits
- No state locking - Race conditions corrupt state
- Local state for teams - State divergence across team members
- Large state files - Break into multiple state files by layer
Module Design:
- Over-abstraction - Too generic, hard to understand
- Under-abstraction - Copy-paste code everywhere
- No version pinning - Unexpected breaking changes
- No examples - Users don't know how to consume module
Operations:
- No drift detection - Manual changes go unnoticed
- Direct resource modification - Bypassing IaC creates drift
- No rollback plan - Can't recover from failed apply
- Ignoring plan output - Surprises during apply
Security:
- Secrets in code - Hard-coded credentials
- No security scanning - Vulnerabilities in production
- Overly permissive IAM - Excessive privileges
- No state encryption - Sensitive data exposed
Troubleshooting Guide
State Lock Issues:
bash
terraform force-unlock <lock-id> # Use only if certain no other process running
Import Existing Resources:
bash
terraform import aws_vpc.main vpc-12345678
pulumi import aws:ec2/vpc:Vpc main vpc-12345678
Drift Detection:
bash
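terraform plan -detailed-exitcode # Exit 0 = no changes, 1 = error, 2 = changes or drift detected
pulumi preview --diff --refresh # Refresh state first so out-of-band changes appear in the diff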
For detailed drift remediation, see references/drift-detection.md.
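
For scheduled runs, the exit codes of `terraform plan -detailed-exitcode` can be turned into a simple status. The function below is a sketch under the assumption that the target directory is already initialized; the name `check_drift` and the status strings are invented for this example. Note that exit code 2 also covers code changes that have not been applied yet, so the check is most meaningful on a branch that matches what is deployed.

```bash
# Report whether a Terraform directory matches its recorded state.
# Prints "in-sync", "drift", or "error"; returns 0, 2, or 1 respectively.
check_drift() {
  dir=${1:-.}
  status=0

  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
  terraform -chdir="$dir" plan -detailed-exitcode -input=false -no-color \
    >/dev/null 2>&1 || status=$?

  case "$status" in
    0) echo "in-sync" ;;
    2) echo "drift"; return 2 ;;
    *) echo "error"; return 1 ;;
  esac
}
```

A scheduler can call `check_drift environments/prod` and raise an alert on any non-zero return value.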
State Recovery:
bash
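# Terraform: restore an earlier object version from a versioned S3 state bucket
# (<bucket>, <key>, and <version-id> are placeholders)
aws s3api list-object-versions --bucket <bucket> --prefix <key>
aws s3api get-object --bucket <bucket> --key <key> --version-id <version-id> terraform.tfstate
# Pulumi: re-import an earlier stack version (<version> is a stack version number, not a timestamp)
pulumi stack export --version <version> | pulumi stack import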
Related Skills
For cloud-specific implementations:
- - AWS-specific resource patterns
- - GCP-specific resource patterns
- - Azure-specific resource patterns
For infrastructure operations:
- - Manage Kubernetes clusters provisioned via IaC
- - GitOps-based infrastructure deployment
- - Internal developer platforms
For security and compliance:
- - Infrastructure security controls
- - Secret injection and rotation
- - Policy-as-code for compliance
For deployment automation:
- - CI/CD for infrastructure code
- - Application deployment to provisioned infrastructure
For cost and observability:
- - FinOps practices for infrastructure
- - Monitoring infrastructure health