iac-terraform-data-engineering


IaC for Data Engineering with Terraform

Skill by ara.so — Data Skills collection.
This project provides Infrastructure-as-Code (IaC) templates and patterns for data engineers using Terraform to provision and manage AWS resources. It focuses on creating reproducible, version-controlled infrastructure for data platforms including S3 storage, EC2 compute instances, and IAM permissions.

What This Project Does

  • Provides Terraform configurations for common data engineering infrastructure on AWS
  • Demonstrates IaC best practices for S3 buckets, EC2 instances, and IAM roles
  • Shows state management and lifecycle operations for data infrastructure
  • Teaches reproducible infrastructure provisioning for data pipelines

Prerequisites

Before using this project, ensure you have:
  1. AWS account with administrator access (avoid the root user for routine work)
  2. Terraform CLI installed (installation guide)
  3. AWS CLI installed and configured (setup guide)
  4. AWS credentials configured via aws configure

AWS IAM Setup

Create an IAM user with appropriate permissions:
  1. Create IAM User: Navigate to AWS Console → IAM → Users → Create user
  2. Create Inline Policy: Attach a custom policy to the user
  3. Grant Permissions: For development/learning, grant full access to:
    • Amazon S3
    • Amazon EC2
    • AWS IAM
⚠️ Security Note: Full service access is NOT recommended for production. Use least-privilege policies in production environments.

Project Structure

terraform/
├── main.tf           # Main Terraform configuration
├── variables.tf      # Input variables (if present)
├── outputs.tf        # Output values (if present)
└── terraform.tfstate # State file (generated)

Key Terraform Commands

Initialize Terraform

Initialize the working directory and download provider plugins:
bash
terraform -chdir=terraform init

Validate Configuration

Check if the configuration is syntactically valid:
bash
terraform -chdir=terraform validate

Format Code

Automatically format Terraform files to canonical style:
bash
terraform -chdir=terraform fmt

Plan Infrastructure Changes

Preview what Terraform will create/modify/destroy:
bash
terraform -chdir=terraform plan

Apply Configuration

Create or update infrastructure:
bash
terraform -chdir=terraform apply
Terraform will show a plan and ask for confirmation. Type yes to proceed.

Auto-approve (for automation)

bash
terraform -chdir=terraform apply -auto-approve

Destroy Infrastructure

Remove all resources managed by Terraform:
bash
terraform -chdir=terraform destroy

Configuration

Basic Terraform Configuration Example

Before applying, modify terraform/main.tf to customize resource names:
hcl
# terraform/main.tf

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# S3 bucket for data storage
resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-unique-data-engineering-bucket-12345"

  tags = {
    Name        = "Data Engineering Bucket"
    Environment = "dev"
    ManagedBy   = "Terraform"
  }
}

# EC2 instance for data processing
resource "aws_instance" "data_processor" {
  ami           = "ami-0c55b159cbfafe1f0" # Amazon Linux 2
  instance_type = "t2.micro"

  tags = {
    Name        = "Data Processor"
    Environment = "dev"
    ManagedBy   = "Terraform"
  }
}

# IAM role for EC2 instance
resource "aws_iam_role" "ec2_s3_role" {
  name = "ec2-s3-access-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}
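
The role above defines only a trust policy, so an instance that assumes it has no S3 permissions yet. The following is a minimal sketch of a least-privilege inline policy, assuming the instance only needs to list, read, and write objects in data_bucket. The resource label ec2_s3_access and the policy name are hypothetical and not part of the original project.

```hcl
# Hypothetical inline policy: the names below are illustrative assumptions.
resource "aws_iam_role_policy" "ec2_s3_access" {
  name = "ec2-s3-data-bucket-access"
  role = aws_iam_role.ec2_s3_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Listing applies to the bucket itself
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = aws_s3_bucket.data_bucket.arn
      },
      {
        # Object reads and writes apply to keys inside the bucket
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "${aws_s3_bucket.data_bucket.arn}/*"
      }
    ]
  })
}
```

For the instance to use the role, it also needs an instance profile, as in Pattern 2.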

Variables Configuration

Create terraform/variables.tf for reusable configurations:
hcl
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "bucket_name" {
  description = "S3 bucket name for data storage"
  type        = string
  # Set via terraform.tfvars or -var flag
}
Use variables in main.tf:
hcl
provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "data_bucket" {
  bucket = var.bucket_name
  
  tags = {
    Environment = var.environment
  }
}
Create terraform/terraform.tfvars:
hcl
bucket_name  = "my-unique-bucket-name-2026"
aws_region   = "us-west-2"
environment  = "production"
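
Because terraform.tfvars can set environment to any string, a typo would silently flow into tags and bucket names. One way to guard against that is a validation block. The sketch below would replace the environment declaration in variables.tf; the allowed values are an assumption, not a list defined by this project.

```hcl
variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"

  validation {
    # Assumed set of environments; adjust to match your own naming
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "The environment must be one of: dev, staging, production."
  }
}
```

With this in place, terraform plan fails early with the error message if an unexpected value is supplied.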

State Management

Inspect State

List all resources in the state:
bash
terraform -chdir=terraform state list
Summarize resource types and names from the local state file (requires jq):
bash
jq -r '.resources[] | [.type, .name] | join(",")' terraform/terraform.tfstate

Remote State (Production Pattern)

For production, store state remotely in S3:
hcl
# terraform/backend.tf

terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "data-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
Initialize with backend configuration:
bash
terraform -chdir=terraform init -backend-config="bucket=${TERRAFORM_STATE_BUCKET}"
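
The backend block assumes the state bucket and the lock table already exist. A minimal sketch of the lock table follows, typically created once in a separate bootstrap configuration. The S3 backend expects a string partition key named LockID; the billing mode here is an assumption chosen for low-traffic use.

```hcl
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock" # Must match dynamodb_table in the backend block
  billing_mode = "PAY_PER_REQUEST"      # Assumption: on-demand capacity
  hash_key     = "LockID"               # Key name required by the S3 backend

  attribute {
    name = "LockID"
    type = "S"
  }
}
```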

Verification Commands

Verify S3 Bucket Creation

bash
aws s3 ls

Verify EC2 Instance

bash
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId, Name:Tags[?Key==`Name`].Value, Type:InstanceType, State:State.Name, PublicIP:PublicIpAddress, PrivateIP:PrivateIpAddress}' \
  --output table

Check Specific Resource

bash
terraform -chdir=terraform state show aws_s3_bucket.data_bucket

Common Patterns for Data Engineering

Pattern 1: Data Lake with Multiple Buckets

hcl
# Raw data bucket
resource "aws_s3_bucket" "raw_data" {
  bucket = "my-data-lake-raw-${var.environment}"
}

# Processed data bucket
resource "aws_s3_bucket" "processed_data" {
  bucket = "my-data-lake-processed-${var.environment}"
}

# Enable versioning for data lineage
resource "aws_s3_bucket_versioning" "raw_data_versioning" {
  bucket = aws_s3_bucket.raw_data.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lifecycle rules for cost optimization
resource "aws_s3_bucket_lifecycle_configuration" "raw_data_lifecycle" {
  bucket = aws_s3_bucket.raw_data.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}
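
When the zones of a data lake share the same settings, per-bucket resources like the ones in this pattern can also be generated from a single definition. The sketch below uses for_each over a set of zone names. The local value data_lake_zones and the resource label data_lake are hypothetical names, and this is an alternative to the separate raw_data and processed_data resources rather than an addition to them.

```hcl
locals {
  # Hypothetical list of data lake zones
  data_lake_zones = toset(["raw", "processed"])
}

resource "aws_s3_bucket" "data_lake" {
  for_each = local.data_lake_zones
  bucket   = "my-data-lake-${each.key}-${var.environment}"
}

# One versioning configuration per bucket created above
resource "aws_s3_bucket_versioning" "data_lake" {
  for_each = aws_s3_bucket.data_lake
  bucket   = each.value.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

Individual buckets are then addressed by key, for example aws_s3_bucket.data_lake["raw"].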

Pattern 2: EC2 with Data Processing Tools

hcl
# Security group for data processor
resource "aws_security_group" "data_processor_sg" {
  name        = "data-processor-sg"
  description = "Security group for data processing instances"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # Restrict in production
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# EC2 instance with user data for setup
resource "aws_instance" "data_processor" {
  ami           = var.ami_id
  instance_type = "t3.medium"

  vpc_security_group_ids = [aws_security_group.data_processor_sg.id]
  iam_instance_profile   = aws_iam_instance_profile.ec2_profile.name

  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y python3 python3-pip
    pip3 install pandas boto3
  EOF

  tags = {
    Name = "Data Processor Instance"
  }
}

# IAM instance profile
resource "aws_iam_instance_profile" "ec2_profile" {
  name = "ec2-data-processor-profile"
  role = aws_iam_role.ec2_s3_role.name
}
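
The instance in this pattern reads var.ami_id, which is not declared in the variables.tf shown earlier. One option is to declare that variable. Another, sketched below, is to look up an image with the aws_ami data source. The name filter is an assumption about Amazon Linux 2 image naming and should be checked against the images available in your region.

```hcl
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"] # Assumed naming pattern
  }
}

# In aws_instance.data_processor, the lookup would replace var.ami_id:
#   ami = data.aws_ami.amazon_linux_2.id
```

A lookup avoids hard-coding a region-specific AMI ID, at the cost of the instance being replaced when a newer image is published and the plan is re-applied.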

Pattern 3: Outputs for Integration

hcl
# terraform/outputs.tf

output "s3_bucket_name" {
  description = "Name of the S3 bucket"
  value       = aws_s3_bucket.data_bucket.id
}

output "s3_bucket_arn" {
  description = "ARN of the S3 bucket"
  value       = aws_s3_bucket.data_bucket.arn
}

output "ec2_instance_id" {
  description = "ID of the EC2 instance"
  value       = aws_instance.data_processor.id
}

output "ec2_public_ip" {
  description = "Public IP of the EC2 instance"
  value       = aws_instance.data_processor.public_ip
}
Access outputs:
bash
terraform -chdir=terraform output
terraform -chdir=terraform output -json | jq -r '.s3_bucket_name.value'

Troubleshooting

Issue: "Error acquiring the state lock"

Cause: Another Terraform process is running or a previous run didn't release the lock.
Solution:
bash
# Force unlock (use with caution)
terraform -chdir=terraform force-unlock <LOCK_ID>

Issue: "bucket name already exists"

Cause: S3 bucket names must be globally unique across all AWS accounts.
Solution: Change the bucket name in main.tf to something unique:
hcl
resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-unique-name-${random_id.bucket_suffix.hex}"
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}
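
The random_id resource comes from the hashicorp/random provider, which the required_providers block shown earlier does not list. Declaring it explicitly lets you pin its version. A sketch of the extended block follows; the random version constraint is an assumption.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0" # Assumed constraint
    }
  }
}
```

Run terraform init again after adding a provider so that its plugin is downloaded.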

Issue: "insufficient IAM permissions"

Cause: The IAM user doesn't have required permissions.
Solution: Verify IAM policy includes necessary actions:
json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*",
        "ec2:*",
        "iam:*"
      ],
      "Resource": "*"
    }
  ]
}

Issue: State file out of sync

Cause: Manual changes made outside Terraform.
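Solution: Refresh the state. The standalone terraform refresh command is deprecated in favor of a refresh-only apply, which shows the detected changes and asks for confirmation before updating the state:
bash
terraform -chdir=terraform apply -refresh-only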
Or import existing resources:
bash
terraform -chdir=terraform import aws_s3_bucket.data_bucket my-existing-bucket
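
On Terraform 1.5 and later, the same import can be declared in configuration so that it appears in terraform plan before anything changes. A minimal sketch, reusing the resource address and bucket name from the command above:

```hcl
import {
  to = aws_s3_bucket.data_bucket
  id = "my-existing-bucket"
}
```

Once the import has been applied, the block can be removed from the configuration.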

Workflow Example

Complete workflow for setting up data infrastructure:
bash
# 1. Configure AWS credentials
export AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}"
export AWS_DEFAULT_REGION="us-east-1"

# 2. Customize configuration
cd terraform
# Edit main.tf to set unique bucket name

# 3. Initialize Terraform
terraform init

# 4. Validate configuration
terraform validate

# 5. Format code
terraform fmt

# 6. Preview changes
terraform plan

# 7. Apply configuration
terraform apply

# 8. Verify resources
aws s3 ls
aws ec2 describe-instances --output table

# 9. When done, clean up
terraform destroy

Best Practices for Data Engineering IaC

  1. Use variables for environment-specific values
  2. Enable S3 versioning for data lineage and recovery
  3. Tag all resources for cost tracking and management
  4. Store state remotely in S3 with encryption and locking
  5. Use modules to organize reusable infrastructure components
  6. Never commit .tfstate files or AWS credentials to version control
  7. Implement lifecycle rules on S3 for cost optimization
  8. Use IAM roles instead of access keys for EC2 instances
  9. Plan before apply to review changes
  10. Destroy unused resources to avoid unnecessary costs
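
Several of these practices can be enforced in configuration rather than by convention. As one example, the sketch below applies tags to every taggable resource through the provider's default_tags block, assuming the aws_region and environment variables from variables.tf. The tag keys mirror those used in main.tf.

```hcl
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
    }
  }
}
```

Tags set on an individual resource are merged with these defaults, and the resource-level value takes precedence when a key appears in both.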