iac-data-engineering-terraform

IaC for Data Engineering with Terraform

Skill by ara.so — Data Skills collection.
This project demonstrates Infrastructure-as-Code (IaC) fundamentals for data engineers using Terraform to provision AWS resources including S3 buckets, EC2 instances, and IAM configurations. It provides reusable patterns for managing data infrastructure declaratively.

What This Project Does

  • Provisions AWS S3 buckets for data storage
  • Creates and configures EC2 instances for data processing
  • Sets up IAM roles and policies with proper permissions
  • Manages infrastructure state with Terraform
  • Provides reproducible data engineering environments

Prerequisites

Before using this project, install and configure the following tools:

```bash
# Install Terraform
brew tap hashicorp/tap
brew install hashicorp/tap/terraform

# Install AWS CLI
brew install awscli

# Configure AWS credentials
aws configure
# Enter your AWS Access Key ID, Secret Access Key, region, and output format
```

Set up required environment variables:

```bash
export AWS_ACCESS_KEY_ID="$YOUR_ACCESS_KEY"
export AWS_SECRET_ACCESS_KEY="$YOUR_SECRET_KEY"
export AWS_DEFAULT_REGION=us-east-1
```

Project Structure

```
terraform/
├── main.tf           # Main infrastructure definitions
├── variables.tf      # Input variables
├── outputs.tf        # Output values
└── terraform.tfstate # State file (auto-generated)
```

Core Terraform Commands

Initialize Terraform

```bash
# Initialize the working directory and download providers
terraform -chdir=terraform init

# Validate configuration syntax
terraform -chdir=terraform validate

# Format configuration files
terraform -chdir=terraform fmt
```

Plan and Apply Infrastructure

```bash
# Preview changes without applying
terraform -chdir=terraform plan

# Apply infrastructure changes
terraform -chdir=terraform apply

# Auto-approve without prompts (use carefully)
terraform -chdir=terraform apply -auto-approve
```

Inspect Infrastructure

```bash
# List all resources in state
terraform -chdir=terraform state list

# Show detailed state information
terraform -chdir=terraform show

# Print the output values defined in outputs.tf
terraform -chdir=terraform output
```

Destroy Infrastructure

```bash
# Destroy all managed infrastructure
terraform -chdir=terraform destroy

# Destroy a specific resource (the address must match a resource in main.tf)
terraform -chdir=terraform destroy -target=aws_s3_bucket.data_lake
```

Key Configuration Patterns

S3 Bucket for Data Storage

```hcl
# main.tf
resource "aws_s3_bucket" "data_lake" {
  bucket = "my-data-engineering-bucket-${random_id.bucket_suffix.hex}"

  tags = {
    Environment = "dev"
    Purpose     = "data-engineering"
    ManagedBy   = "terraform"
  }
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}

# Enable versioning for data protection
resource "aws_s3_bucket_versioning" "data_lake_versioning" {
  bucket = aws_s3_bucket.data_lake.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Configure lifecycle rules
resource "aws_s3_bucket_lifecycle_configuration" "data_lake_lifecycle" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }
}
```
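
The bucket definition above depends on two providers: `aws` for the bucket itself and `random` for the name suffix. A minimal sketch of how those requirements might be declared follows; the version constraints are assumptions chosen for illustration, not values taken from this project.

```hcl
# versions.tf (hypothetical file name)
terraform {
  # Assumed minimum CLI version; adjust to whatever the team actually runs
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed constraint
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.5" # assumed constraint
    }
  }
}
```

Declaring both providers explicitly lets `terraform init` download them up front and keeps provider upgrades a deliberate change.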

EC2 Instance for Data Processing

```hcl
# main.tf
resource "aws_instance" "data_processor" {
  ami           = "ami-0c55b159cbfafe1f0" # Amazon Linux 2 (AMI IDs are region-specific)
  instance_type = "t3.medium"

  key_name               = aws_key_pair.data_eng_key.key_name
  vpc_security_group_ids = [aws_security_group.data_processor_sg.id]
  iam_instance_profile   = aws_iam_instance_profile.data_processor_profile.name

  # Bootstrap script that runs on first boot
  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y python3 python3-pip
    pip3 install pandas boto3 awscli
  EOF

  tags = {
    Name        = "data-processor"
    Environment = "dev"
    ManagedBy   = "terraform"
  }

  root_block_device {
    volume_size = 50
    volume_type = "gp3"
  }
}

resource "aws_key_pair" "data_eng_key" {
  key_name   = "data-engineering-key"
  public_key = file("~/.ssh/id_rsa.pub")
}
```
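
Because AMI IDs differ between regions, a hardcoded ID only works in the region it was copied from. One possible alternative is to look the image up with a data source. The sketch below assumes the standard Amazon Linux 2 naming pattern; the data source name `amazon_linux_2` is made up for this example.

```hcl
# Look up the most recent Amazon Linux 2 AMI in the configured region
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"] # assumed name pattern
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
```

The instance would then reference `data.aws_ami.amazon_linux_2.id` in its `ami` argument instead of the literal ID.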

Security Group Configuration

```hcl
resource "aws_security_group" "data_processor_sg" {
  name        = "data-processor-sg"
  description = "Security group for data processing EC2 instances"

  # SSH access
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # Restrict in production
  }

  # Allow all outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "data-processor-sg"
  }
}
```
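
The "Restrict in production" note on the SSH rule can be made enforceable by moving the allowed CIDR ranges into a validated variable. This is a sketch under assumptions: the variable `allowed_ssh_cidrs` and the rule name `ssh_restricted` are hypothetical. The standalone rule is meant to replace the inline `ingress` block, since inline rules and standalone rule resources should not be mixed on the same security group.

```hcl
# Hypothetical variable: callers must state which networks may use SSH
variable "allowed_ssh_cidrs" {
  description = "CIDR blocks allowed to reach port 22"
  type        = list(string)

  validation {
    condition     = !contains(var.allowed_ssh_cidrs, "0.0.0.0/0")
    error_message = "Open SSH access from 0.0.0.0/0 is not allowed."
  }
}

# Standalone SSH rule; use instead of the inline ingress block
resource "aws_security_group_rule" "ssh_restricted" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = var.allowed_ssh_cidrs
  security_group_id = aws_security_group.data_processor_sg.id
}
```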

IAM Role for EC2 with S3 Access

```hcl
# Role that EC2 instances are allowed to assume
resource "aws_iam_role" "data_processor_role" {
  name = "data-processor-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

# Inline policy granting read/write access to the data lake bucket
resource "aws_iam_role_policy" "s3_access_policy" {
  name = "s3-access-policy"
  role = aws_iam_role.data_processor_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Resource = [
          aws_s3_bucket.data_lake.arn,
          "${aws_s3_bucket.data_lake.arn}/*"
        ]
      }
    ]
  })
}

# Instance profile that attaches the role to the EC2 instance
resource "aws_iam_instance_profile" "data_processor_profile" {
  name = "data-processor-profile"
  role = aws_iam_role.data_processor_role.name
}
```
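
In the policy above, `s3:ListBucket` takes effect on the bucket ARN while `s3:GetObject` and `s3:PutObject` take effect on the object ARNs. One way to make that split explicit is the `aws_iam_policy_document` data source, sketched here as an alternative rather than as this project's method; the data source name `s3_access` and the `sid` values are invented for the example.

```hcl
data "aws_iam_policy_document" "s3_access" {
  # Bucket-level permission: listing keys
  statement {
    sid       = "ListDataLakeBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.data_lake.arn]
  }

  # Object-level permissions: reading and writing objects
  statement {
    sid       = "ReadWriteDataLakeObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.data_lake.arn}/*"]
  }
}
```

The role policy could then set `policy = data.aws_iam_policy_document.s3_access.json` in place of the `jsonencode` call.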

Variables and Outputs

Define Variables

```hcl
# variables.tf
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"
}

variable "bucket_prefix" {
  description = "Prefix for S3 bucket names"
  type        = string
  default     = "data-engineering"
}
```
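
Variables can also reject bad input before any resource is planned. The sketch below shows a `validation` block for `environment`; it would replace the plain definition rather than sit alongside it. The allowed set is an assumption, taken from the environment names that appear elsewhere in this document (dev, staging, prod).

```hcl
variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"

  validation {
    # Assumed allowed values; extend the list if other environments exist
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}
```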

Configure Outputs

```hcl
# outputs.tf
output "s3_bucket_name" {
  description = "Name of the created S3 bucket"
  value       = aws_s3_bucket.data_lake.id
}

output "s3_bucket_arn" {
  description = "ARN of the S3 bucket"
  value       = aws_s3_bucket.data_lake.arn
}

output "ec2_instance_id" {
  description = "ID of the EC2 instance"
  value       = aws_instance.data_processor.id
}

output "ec2_public_ip" {
  description = "Public IP of the EC2 instance"
  value       = aws_instance.data_processor.public_ip
}

output "ec2_private_ip" {
  description = "Private IP of the EC2 instance"
  value       = aws_instance.data_processor.private_ip
}
```

Remote State Management

For team collaboration, store state in an S3 backend:

```hcl
# backend.tf
terraform {
  backend "s3" {
    bucket         = "terraform-state-bucket-name"
    key            = "data-engineering/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```

Create the backend resources. They must exist before `terraform init` can use the backend, so they are usually managed from a separate configuration:

```hcl
resource "aws_s3_bucket" "terraform_state" {
  bucket = "terraform-state-bucket-name"

  # Guard against accidentally deleting the state bucket
  lifecycle {
    prevent_destroy = true
  }
}

# Lock table; the S3 backend expects a string hash key named LockID
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```
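
State files can contain sensitive values, so the state bucket is a reasonable place to apply the same protections recommended for data buckets. The following is a sketch of optional hardening, not part of the original project; the resource names simply reuse `terraform_state` for readability.

```hcl
# Keep prior state versions so a bad write can be rolled back
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state objects at rest with S3-managed keys
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Block every form of public access to the state bucket
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```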

Verification Commands

After applying infrastructure:

```bash
# Verify S3 buckets
aws s3 ls

# Verify EC2 instances
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId,Name:Tags[?Key==`Name`].Value,Type:InstanceType,State:State.Name,PublicIP:PublicIpAddress,PrivateIP:PrivateIpAddress}' \
  --output table

# Check IAM roles
aws iam list-roles --query 'Roles[?contains(RoleName, `data-processor`)].RoleName'

# Inspect Terraform state
terraform -chdir=terraform state list
jq -r '.resources[] | [.type, .name] | join(",")' terraform/terraform.tfstate
```

Common Patterns

Multi-Environment Setup

```hcl
# environments/dev/main.tf
module "data_infrastructure" {
  source = "../../modules/data-infra"

  environment   = "dev"
  instance_type = "t3.small"
  bucket_prefix = "dev-data"
}

# environments/prod/main.tf
module "data_infrastructure" {
  source = "../../modules/data-infra"

  environment   = "prod"
  instance_type = "t3.large"
  bucket_prefix = "prod-data"
}
```
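
The environment files above call a module at `../../modules/data-infra` whose contents are not shown. As a rough sketch of what such a module might contain, assuming it accepts exactly the three inputs passed above (the file layout and resource bodies here are hypothetical):

```hcl
# modules/data-infra/variables.tf (hypothetical)
variable "environment" {
  description = "Environment name"
  type        = string
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
}

variable "bucket_prefix" {
  description = "Prefix for S3 bucket names"
  type        = string
}

# modules/data-infra/main.tf (hypothetical)
resource "random_id" "bucket_suffix" {
  byte_length = 4
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.bucket_prefix}-${random_id.bucket_suffix.hex}"

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```

Leaving out defaults in the module forces each environment to state its own values, which keeps dev and prod settings from silently converging.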

Using terraform.tfvars

```hcl
# terraform.tfvars
aws_region    = "us-west-2"
environment   = "staging"
instance_type = "t3.medium"
bucket_prefix = "staging-data-lake"
```

Apply with variables:

```bash
terraform -chdir=terraform apply -var-file="terraform.tfvars"
```

Terraform loads a file named `terraform.tfvars` automatically, so `-var-file` is only required for files with other names.

Troubleshooting

State Lock Issues

```bash
# Force unlock if state is stuck (LOCK_ID is shown in the lock error message)
terraform -chdir=terraform force-unlock LOCK_ID

# View current state
terraform -chdir=terraform show
```

S3 Bucket Name Conflicts

If the bucket name is already taken:

```hcl
# Use a random suffix
resource "random_id" "bucket_suffix" {
  byte_length = 8
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.bucket_prefix}-${random_id.bucket_suffix.hex}"
}
```
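
As an alternative to a `random_id` suffix, the `aws_s3_bucket` resource also accepts a `bucket_prefix` argument and generates the rest of the name itself. This sketch would replace the definition above rather than coexist with it, and it cannot be combined with the `bucket` argument.

```hcl
# Alternative: let the AWS provider append a unique suffix
resource "aws_s3_bucket" "data_lake" {
  # The resource argument happens to share its name with the input variable
  bucket_prefix = "${var.bucket_prefix}-"
}
```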

Import Existing Resources

```bash
# Import existing S3 bucket
terraform -chdir=terraform import aws_s3_bucket.data_lake existing-bucket-name

# Import EC2 instance
terraform -chdir=terraform import aws_instance.data_processor i-1234567890abcdef0
```
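
On Terraform 1.5 or later, the same imports can be expressed declaratively with `import` blocks, so they show up in `terraform plan` before anything is written to state. A minimal sketch using the placeholder IDs from the commands above:

```hcl
# Declarative equivalent of the CLI imports (requires Terraform >= 1.5)
import {
  to = aws_s3_bucket.data_lake
  id = "existing-bucket-name"
}

import {
  to = aws_instance.data_processor
  id = "i-1234567890abcdef0"
}
```

Once the apply succeeds, the `import` blocks can be removed from the configuration.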

Debugging Terraform

```bash
# Enable detailed logging
export TF_LOG=DEBUG
terraform -chdir=terraform apply

# Disable logging
unset TF_LOG
```

Refresh State

```bash
# Sync state with real infrastructure
# (the older `terraform refresh` command is deprecated in favor of -refresh-only)
terraform -chdir=terraform apply -refresh-only

# Replace corrupted resource
terraform -chdir=terraform apply -replace=aws_instance.data_processor
```

Best Practices

  1. Always use variables for environment-specific values
  2. Enable S3 versioning for data protection
  3. Use IAM roles instead of access keys for EC2
  4. Tag all resources for cost tracking and management
  5. Store state remotely for team collaboration
  6. Use modules for reusable infrastructure patterns
  7. Run `terraform plan` before every apply
  8. Never commit `.tfstate` files or sensitive variables to Git
  9. Use `.gitignore` for Terraform files:

```gitignore
# .gitignore
.terraform/
*.tfstate
*.tfstate.backup
terraform.tfvars
*.auto.tfvars
# .terraform.lock.hcl is not ignored: HashiCorp recommends committing it
# so that every run uses the same provider versions
```
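
The tagging practice in the list above can be applied once at the provider level instead of being repeated on every resource. The sketch below uses the AWS provider's `default_tags` block; it assumes the `aws_region` and `environment` variables from `variables.tf`, and the tag keys mirror the ones used in the resource examples.

```hcl
provider "aws" {
  region = var.aws_region

  # Tags merged into every taggable resource this provider creates
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}
```

Resource-level `tags` still work alongside this and override a default tag with the same key.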