terraform-data-engineering-iac

Terraform Data Engineering IaC

Skill by ara.so — Data Skills collection.
This project demonstrates Infrastructure-as-Code (IaC) fundamentals for data engineering using Terraform. It provisions AWS resources commonly used in data pipelines including S3 buckets for data storage and EC2 instances for data processing workloads.

What It Does

  • Provisions AWS S3 buckets for data lake storage
  • Creates EC2 instances for data processing and pipeline execution
  • Manages IAM policies for secure resource access
  • Uses Terraform state to track and manage infrastructure changes
  • Provides reproducible infrastructure for data engineering environments

Prerequisites

Before using this project, ensure you have:
  1. AWS Account with appropriate permissions
  2. Terraform CLI installed
  3. AWS CLI installed and configured
  4. IAM user with S3, EC2, and IAM permissions

Installation

1. Install Terraform

bash
# macOS
brew install terraform

# Linux
wget https://releases.hashicorp.com/terraform/1.5.0/terraform_1.5.0_linux_amd64.zip
unzip terraform_1.5.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/

# Verify installation
terraform version

2. Install AWS CLI

bash
# macOS
brew install awscli

# Linux
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Configure AWS credentials
aws configure

3. Set Up IAM Permissions

Create an IAM user with the following managed policies:
  • AmazonS3FullAccess
  • AmazonEC2FullAccess
  • IAMFullAccess
Note: For production, use fine-grained permissions instead of full access.

Project Structure

terraform/
├── main.tf           # Main infrastructure definitions
├── variables.tf      # Input variables
├── outputs.tf        # Output values
└── terraform.tfstate # State file (generated)

Key Terraform Commands

Initialize Terraform

bash
# Initialize backend and download providers
terraform -chdir=terraform init

Validate Configuration

bash
# Check syntax and validate configuration
terraform -chdir=terraform validate

Format Code

bash
# Auto-format HCL files
terraform -chdir=terraform fmt

Plan Infrastructure Changes

bash
# Preview what will be created or changed
terraform -chdir=terraform plan

Apply Infrastructure

bash
# Create or update infrastructure
terraform -chdir=terraform apply

# Auto-approve without confirmation (use carefully)
terraform -chdir=terraform apply -auto-approve

Destroy Infrastructure

bash
# Remove all managed infrastructure
terraform -chdir=terraform destroy

# Auto-approve destruction (use carefully)
terraform -chdir=terraform destroy -auto-approve

State Management

bash
# List all resources in state
terraform -chdir=terraform state list

# Show detailed resource information
terraform -chdir=terraform state show aws_s3_bucket.data_lake

# List resource types and names from the JSON state file
jq -r '.resources[] | [.type, .name] | join(",")' terraform/terraform.tfstate

Configuration Examples

Basic S3 Bucket for Data Storage

hcl
# terraform/main.tf

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "my-unique-data-lake-bucket-${var.environment}"

  tags = {
    Name        = "Data Lake Bucket"
    Environment = var.environment
    Project     = "data-engineering"
  }
}

resource "aws_s3_bucket_versioning" "data_lake_versioning" {
  bucket = aws_s3_bucket.data_lake.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_lake_lifecycle" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "archive_old_data"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }
}

EC2 Instance for Data Processing

hcl
# terraform/main.tf (continued)

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "data_processor" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  tags = {
    Name        = "Data Processing Server"
    Environment = var.environment
  }

  user_data = <<-EOF
    #!/bin/bash
    sudo apt-get update
    sudo apt-get install -y python3-pip
    pip3 install pandas boto3 apache-airflow
  EOF
}

resource "aws_eip" "data_processor_eip" {
  instance = aws_instance.data_processor.id
  domain   = "vpc"
}

Variables Configuration

hcl
# terraform/variables.tf

variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"
}
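Input variables can also carry validation rules, so a mistyped environment name fails at plan time instead of producing oddly named resources. A minimal sketch, assuming the allowed values are `dev`, `staging`, and `prod` (that list is an assumption, not something the project defines); it would replace the plain `environment` declaration in variables.tf:

```hcl
variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"

  validation {
    # Assumed set of environments; adjust to match your own naming.
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}
```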

Outputs Configuration

hcl
# terraform/outputs.tf

output "s3_bucket_name" {
  description = "Name of the S3 data lake bucket"
  value       = aws_s3_bucket.data_lake.id
}

output "ec2_public_ip" {
  description = "Public IP of data processing EC2 instance"
  value       = aws_eip.data_processor_eip.public_ip
}

output "ec2_instance_id" {
  description = "Instance ID of data processor"
  value       = aws_instance.data_processor.id
}

Common Patterns

Multi-Environment Setup

bash
# Use workspaces or separate state files
terraform workspace new staging
terraform workspace new production

# Or use variable files
terraform apply -var-file="environments/dev.tfvars"
terraform apply -var-file="environments/prod.tfvars"
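The project does not show the contents of the variable files passed with `-var-file`. A minimal sketch of what `environments/dev.tfvars` might contain, using only the variables declared in variables.tf (the values are illustrative assumptions):

```hcl
# environments/dev.tfvars (hypothetical contents)
aws_region    = "us-east-1"
environment   = "dev"
instance_type = "t3.medium"
```

A matching `prod.tfvars` would set the same keys to production values, for example a larger `instance_type`.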

Remote State with S3 Backend

hcl
# terraform/backend.tf

terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "data-engineering/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
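The S3 backend expects the state bucket and the lock table to exist before `terraform init` runs. One way to create them is a small, separate bootstrap configuration. The sketch below reuses the bucket and table names from the backend block and assumes an AWS provider is already configured; the resource labels `tf_state` and `tf_lock` are placeholders.

```hcl
# Bootstrap configuration, applied once with local state (assumption:
# kept in its own directory, separate from terraform/).
resource "aws_s3_bucket" "tf_state" {
  bucket = "my-terraform-state-bucket"
}

# Versioning keeps earlier copies of the state file recoverable.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# The S3 backend requires a string partition key named LockID.
resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```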

IAM Role for EC2 with S3 Access

hcl
resource "aws_iam_role" "data_processor_role" {
  name = "data-processor-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "s3_access" {
  role       = aws_iam_role.data_processor_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
}

resource "aws_iam_instance_profile" "data_processor_profile" {
  name = "data-processor-profile"
  role = aws_iam_role.data_processor_role.name
}

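# Add iam_instance_profile to the aws_instance.data_processor block defined
# earlier in main.tf; declaring the same resource twice is an error.
resource "aws_instance" "data_processor" {
  ami                  = data.aws_ami.ubuntu.id
  instance_type        = var.instance_type
  iam_instance_profile = aws_iam_instance_profile.data_processor_profile.name
}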
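
For anything beyond a demo, the managed `AmazonS3FullAccess` policy can be swapped for a policy scoped to the data lake bucket. A minimal sketch, assuming the `aws_s3_bucket.data_lake` resource from main.tf and a list/read/write action set chosen for illustration; the policy name and the `data_lake_access` labels are hypothetical.

```hcl
data "aws_iam_policy_document" "data_lake_access" {
  # Listing applies to the bucket itself.
  statement {
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.data_lake.arn]
  }

  # Object reads and writes apply to keys inside the bucket.
  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.data_lake.arn}/*"]
  }
}

# Inline policy attached to the role in place of the managed policy attachment.
resource "aws_iam_role_policy" "data_lake_access" {
  name   = "data-lake-access"
  role   = aws_iam_role.data_processor_role.id
  policy = data.aws_iam_policy_document.data_lake_access.json
}
```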

Verification Commands

Verify S3 Buckets

bash
# List all S3 buckets
aws s3 ls

# Get bucket details
aws s3api get-bucket-location --bucket my-data-lake-bucket

# List bucket contents
aws s3 ls s3://my-data-lake-bucket/

Verify EC2 Instances

bash
# List running instances
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId, Name:Tags[?Key==`Name`].Value, Type:InstanceType, State:State.Name, PublicIP:PublicIpAddress, PrivateIP:PrivateIpAddress}' \
  --output table

# Get specific instance details
aws ec2 describe-instances --instance-ids i-1234567890abcdef0

Connect to EC2 Instance

bash
# SSH into instance (requires key pair)
ssh -i ~/.ssh/my-key.pem ubuntu@$(terraform -chdir=terraform output -raw ec2_public_ip)

Troubleshooting

Issue: Terraform Init Fails

bash
# Clear cache and reinitialize
rm -rf terraform/.terraform
rm terraform/.terraform.lock.hcl
terraform -chdir=terraform init

Issue: State Lock Error

bash
# Force unlock (use with caution)
terraform -chdir=terraform force-unlock LOCK_ID

Issue: AWS Credentials Not Found

bash
# Verify AWS configuration
aws configure list
aws sts get-caller-identity

# Set credentials explicitly
export AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}"
export AWS_DEFAULT_REGION="us-east-1"

Issue: Resource Already Exists

bash
# Import existing resource into state
terraform -chdir=terraform import aws_s3_bucket.data_lake my-existing-bucket

# Or recreate with a unique name (requires a bucket_suffix variable,
# which the variables.tf shown earlier does not declare)
terraform -chdir=terraform apply -var="bucket_suffix=$(date +%s)"
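Passing `-var="bucket_suffix=..."` only works if the configuration declares that variable. A minimal sketch of one way to wire it in, assuming an empty default means "no suffix"; the `bucket_name` local is hypothetical, and the `bucket` argument of `aws_s3_bucket.data_lake` would need to be changed to `local.bucket_name`.

```hcl
variable "bucket_suffix" {
  description = "Optional suffix appended to the data lake bucket name"
  type        = string
  default     = ""
}

locals {
  # Append "-<suffix>" only when a suffix was supplied.
  bucket_name = var.bucket_suffix == "" ? "my-unique-data-lake-bucket-${var.environment}" : "my-unique-data-lake-bucket-${var.environment}-${var.bucket_suffix}"
}
```

Changing the name of a bucket that Terraform already manages forces it to be destroyed and recreated, so this suits fresh deployments better than existing ones.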

Issue: Permission Denied

Check IAM policies and ensure your user has required permissions:
bash
# Test S3 permissions
aws s3 ls

# Test EC2 permissions
aws ec2 describe-instances

# Test IAM permissions
aws iam list-users

Debugging Terraform

bash
# Enable debug logging
export TF_LOG=DEBUG
terraform -chdir=terraform apply

# Show detailed plan
terraform -chdir=terraform plan -out=tfplan
terraform -chdir=terraform show tfplan

# Refresh state from actual infrastructure
# (the standalone "terraform refresh" command is deprecated in favor of -refresh-only)
terraform -chdir=terraform apply -refresh-only

Best Practices

  1. Always use unique bucket names: S3 bucket names must be globally unique
  2. Version your state files: Enable S3 versioning for state file backups
  3. Use remote state: Store state in S3 with locking via DynamoDB
  4. Tag all resources: Apply consistent tagging for cost tracking and organization
  5. Use variables: Parameterize configurations for reusability
  6. Run terraform plan before apply to review changes
  7. Destroy dev resources: Don't leave test infrastructure running to avoid costs
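
Two of these practices can be sketched in configuration: provider-level `default_tags` for consistent tagging, and a random suffix for globally unique bucket names. This assumes the `hashicorp/random` provider is added alongside the AWS provider. The resource labels and bucket prefix are illustrative, and the `terraform` and `provider` blocks would replace the ones in main.tf rather than sit next to them.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  region = var.aws_region

  # Applied to every taggable resource this provider creates.
  default_tags {
    tags = {
      Environment = var.environment
      Project     = "data-engineering"
    }
  }
}

# 4 random bytes rendered as 8 hex characters.
resource "random_id" "bucket_suffix" {
  byte_length = 4
}

# Hypothetical bucket showing the suffix in use.
resource "aws_s3_bucket" "unique_example" {
  bucket = "data-lake-${var.environment}-${random_id.bucket_suffix.hex}"
}
```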