terraform-data-engineering-iac
Terraform Data Engineering IaC
Skill by ara.so — Data Skills collection.
This project demonstrates Infrastructure-as-Code (IaC) fundamentals for data engineering with Terraform. It provisions AWS resources commonly used in data pipelines, including S3 buckets for data storage and EC2 instances for data processing workloads.
What It Does
- Provisions AWS S3 buckets for data lake storage
- Creates EC2 instances for data processing and pipeline execution
- Manages IAM policies for secure resource access
- Uses Terraform state to track and manage infrastructure changes
- Provides reproducible infrastructure for data engineering environments
Prerequisites
Before using this project, ensure you have:
- AWS Account with appropriate permissions
- Terraform CLI installed
- AWS CLI installed and configured
- IAM user with S3, EC2, and IAM permissions
Installation
1. Install Terraform
bash
# macOS
brew install terraform

# Linux
wget https://releases.hashicorp.com/terraform/1.5.0/terraform_1.5.0_linux_amd64.zip
unzip terraform_1.5.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/

# Verify installation
terraform version

2. Install AWS CLI
bash
# macOS
brew install awscli

# Linux
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Configure AWS credentials
aws configure

3. Set Up IAM Permissions
Create an IAM user with the following managed policies:
- AmazonS3FullAccess
- AmazonEC2FullAccess
- IAMFullAccess
Note: For production, use fine-grained permissions instead of full access.
Project Structure
terraform/
├── main.tf            # Main infrastructure definitions
├── variables.tf       # Input variables
├── outputs.tf         # Output values
└── terraform.tfstate  # State file (generated)

Key Terraform Commands
Initialize Terraform
bash
# Initialize backend and download providers
terraform -chdir=terraform init

Validate Configuration
bash
# Check syntax and validate configuration
terraform -chdir=terraform validate

Format Code
bash
# Auto-format HCL files
terraform -chdir=terraform fmt

Plan Infrastructure Changes
bash
# Preview what will be created/changed
terraform -chdir=terraform plan

Apply Infrastructure
bash
# Create or update infrastructure
terraform -chdir=terraform apply

# Auto-approve without confirmation (use carefully)
terraform -chdir=terraform apply -auto-approve

Destroy Infrastructure
bash
# Remove all managed infrastructure
terraform -chdir=terraform destroy

# Auto-approve destruction (use carefully)
terraform -chdir=terraform destroy -auto-approve

State Management
bash
# List all resources in state
terraform -chdir=terraform state list

# Show detailed resource information
terraform -chdir=terraform state show aws_s3_bucket.data_lake

# Read the JSON state file and print each resource as type,name
jq -r '.resources[] | [.type, .name] | join(",")' terraform/terraform.tfstate

Configuration Examples
Basic S3 Bucket for Data Storage
hcl
# terraform/main.tf

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "my-unique-data-lake-bucket-${var.environment}"

  tags = {
    Name        = "Data Lake Bucket"
    Environment = var.environment
    Project     = "data-engineering"
  }
}

resource "aws_s3_bucket_versioning" "data_lake_versioning" {
  bucket = aws_s3_bucket.data_lake.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_lake_lifecycle" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "archive_old_data"
    status = "Enabled"

    # Empty filter: the rule applies to every object in the bucket
    filter {}

    # Move objects to Glacier after 90 days
    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    # Expire objects after 365 days
    expiration {
      days = 365
    }
  }
}
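A data lake bucket is rarely meant to be publicly readable. As a hedged sketch that is not part of the original project, the two resources below could sit next to the data_lake bucket in terraform/main.tf to block public access and turn on default server-side encryption. The resource labels data_lake_public_access and data_lake_encryption are illustrative names, and SSE-S3 (AES256) is an assumed choice; a KMS key may suit stricter requirements.

```hcl
# Assumption: added to terraform/main.tf next to aws_s3_bucket.data_lake.
# Resource labels are illustrative, not from the original project.

resource "aws_s3_bucket_public_access_block" "data_lake_public_access" {
  bucket = aws_s3_bucket.data_lake.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake_encryption" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256" # SSE-S3; assumed default
    }
  }
}
```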
EC2 Instance for Data Processing
hcl
# terraform/main.tf (continued)

# Look up the most recent Ubuntu 22.04 AMI published by Canonical
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "data_processor" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  tags = {
    Name        = "Data Processing Server"
    Environment = var.environment
  }

  # Bootstrap script run at first boot
  user_data = <<-EOF
    #!/bin/bash
    sudo apt-get update
    sudo apt-get install -y python3-pip
    pip3 install pandas boto3 apache-airflow
  EOF
}

# Elastic IP so the instance keeps a stable public address
resource "aws_eip" "data_processor_eip" {
  instance = aws_instance.data_processor.id
  domain   = "vpc"
}

Variables Configuration
hcl
# terraform/variables.tf

variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"
}
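Input validation can catch a mistyped variable value before any resource is planned. The sketch below shows how the environment variable might carry a validation block. The allowed values are an assumption loosely based on the environment names used elsewhere in this document, so adjust the list to your own naming.

```hcl
# Assumption: replaces the existing "environment" declaration in
# terraform/variables.tf (a variable can only be declared once).
variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"

  validation {
    # The allowed list is illustrative; adjust it to your environments.
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment value must be one of: dev, staging, prod."
  }
}
```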
Outputs Configuration
hcl
# terraform/outputs.tf

output "s3_bucket_name" {
  description = "Name of the S3 data lake bucket"
  value       = aws_s3_bucket.data_lake.id
}

output "ec2_public_ip" {
  description = "Public IP of data processing EC2 instance"
  value       = aws_eip.data_processor_eip.public_ip
}

output "ec2_instance_id" {
  description = "Instance ID of data processor"
  value       = aws_instance.data_processor.id
}

Common Patterns
Multi-Environment Setup
bash
# Use workspaces or separate state files
terraform workspace new staging
terraform workspace new production

# Or use variable files
terraform apply -var-file="environments/dev.tfvars"
terraform apply -var-file="environments/prod.tfvars"
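The -var-file commands above expect one file per environment. Here is a minimal sketch of what environments/dev.tfvars might contain, using only the variables declared in variables.tf. The values are placeholders, not settings from the original project.

```hcl
# environments/dev.tfvars (hypothetical contents)
aws_region    = "us-east-1"
environment   = "dev"
instance_type = "t3.medium"
```

A production file would follow the same shape with its own values, for example a different environment name and a larger instance type.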
Remote State with S3 Backend
hcl
# terraform/backend.tf

terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "data-engineering/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
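The S3 backend does not create its own bucket or lock table, so both need to exist before terraform init runs against this backend. The following is a sketch of how they might be provisioned from a separate bootstrap configuration, reusing the names from the backend block above. The resource labels are illustrative, and on-demand billing is an assumed choice.

```hcl
# Assumption: lives in a separate bootstrap configuration with local state,
# applied once before the S3 backend is configured.

resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state-bucket"
}

# Keep old versions of the state file as backups
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table; the S3 backend expects a string partition key named "LockID"
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```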
IAM Role for EC2 with S3 Access
hcl
# Role that EC2 instances are allowed to assume
resource "aws_iam_role" "data_processor_role" {
  name = "data-processor-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "s3_access" {
  role       = aws_iam_role.data_processor_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
}

resource "aws_iam_instance_profile" "data_processor_profile" {
  name = "data-processor-profile"
  role = aws_iam_role.data_processor_role.name
}

# This is the same aws_instance.data_processor resource declared in main.tf:
# add iam_instance_profile to that block rather than declaring it a second time.
resource "aws_instance" "data_processor" {
  ami                  = data.aws_ami.ubuntu.id
  instance_type        = var.instance_type
  iam_instance_profile = aws_iam_instance_profile.data_processor_profile.name
}
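AmazonS3FullAccess gives the instance access to every bucket in the account. In line with the earlier note about fine-grained permissions, a narrower alternative might look like the sketch below, which limits the role to the data lake bucket. The policy name and the chosen action list are assumptions, and the policy would take the place of the s3_access attachment.

```hcl
# Assumption: replaces aws_iam_role_policy_attachment.s3_access.
# The action list is illustrative; trim it to what your jobs need.
resource "aws_iam_role_policy" "data_lake_access" {
  name = "data-lake-access"
  role = aws_iam_role.data_processor_role.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Listing applies to the bucket itself
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = aws_s3_bucket.data_lake.arn
      },
      {
        # Object reads and writes apply to keys inside the bucket
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
        Resource = "${aws_s3_bucket.data_lake.arn}/*"
      }
    ]
  })
}
```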
Verification Commands
Verify S3 Buckets
bash
# List all S3 buckets
aws s3 ls

# Get the bucket's region (replace the bucket name with your own)
aws s3api get-bucket-location --bucket my-data-lake-bucket

# List bucket contents
aws s3 ls s3://my-data-lake-bucket/

Verify EC2 Instances
bash
# List running instances
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId, Name:Tags[?Key==`Name`].Value, Type:InstanceType, State:State.Name, PublicIP:PublicIpAddress, PrivateIP:PrivateIpAddress}' \
  --output table

# Get specific instance details
aws ec2 describe-instances --instance-ids i-1234567890abcdef0

Connect to EC2 Instance
bash
# SSH into instance (requires key pair)
ssh -i ~/.ssh/my-key.pem ubuntu@$(terraform -chdir=terraform output -raw ec2_public_ip)
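The SSH command above only works if the instance was launched with a key pair and accepts inbound traffic on port 22, neither of which the EC2 example configures. Below is a sketch of the missing pieces, assuming a public key exists at ~/.ssh/my-key.pub. The resource labels and the allowed_ssh_cidr variable are hypothetical.

```hcl
# Hypothetical variable: the address range allowed to SSH in.
# No default on purpose, so a range has to be supplied explicitly.
variable "allowed_ssh_cidr" {
  description = "CIDR block allowed to reach port 22"
  type        = string
}

# Assumption: the public half of ~/.ssh/my-key.pem exists at this path
resource "aws_key_pair" "data_processor_key" {
  key_name   = "data-processor-key"
  public_key = file(pathexpand("~/.ssh/my-key.pub"))
}

resource "aws_security_group" "data_processor_ssh" {
  name        = "data-processor-ssh"
  description = "Allow inbound SSH to the data processing instance"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.allowed_ssh_cidr]
  }

  # Allow all outbound traffic so apt-get and pip can reach the internet
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Then, inside the existing aws_instance.data_processor block, add:
#   key_name               = aws_key_pair.data_processor_key.key_name
#   vpc_security_group_ids = [aws_security_group.data_processor_ssh.id]
```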
Troubleshooting
Issue: Terraform Init Fails
bash
# Clear cache and reinitialize
rm -rf terraform/.terraform
rm -f terraform/.terraform.lock.hcl
terraform -chdir=terraform init

Issue: State Lock Error
bash
# Force unlock (use with caution); LOCK_ID is the ID printed in the lock error
terraform -chdir=terraform force-unlock LOCK_ID

Issue: AWS Credentials Not Found
bash
# Verify AWS configuration
aws configure list
aws sts get-caller-identity

# Set credentials explicitly (replace the placeholders with your own values)
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="us-east-1"

Issue: Resource Already Exists
bash
# Import existing resource into state
terraform -chdir=terraform import aws_s3_bucket.data_lake my-existing-bucket

# Or recreate with a unique name (requires a bucket_suffix variable in the configuration)
terraform -chdir=terraform apply -var="bucket_suffix=$(date +%s)"
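The -var flag above only takes effect if the configuration declares a matching variable, which the variables.tf example does not show. Here is a minimal sketch of how bucket_suffix might be declared and used. The empty default, the local value name, and the naming scheme are assumptions.

```hcl
# Assumption: added to terraform/variables.tf
variable "bucket_suffix" {
  description = "Optional suffix that makes the bucket name unique"
  type        = string
  default     = ""
}

# Assumption: the suffix is appended only when one is supplied
locals {
  data_lake_base_name   = "my-unique-data-lake-bucket-${var.environment}"
  data_lake_bucket_name = var.bucket_suffix == "" ? local.data_lake_base_name : "${local.data_lake_base_name}-${var.bucket_suffix}"
}

# In resource "aws_s3_bucket" "data_lake", the bucket argument would become:
#   bucket = local.data_lake_bucket_name
```

Because a bucket cannot be renamed in place, Terraform plans a replacement whenever the suffix changes, so a timestamp suffix is best chosen once and reused rather than regenerated on every apply.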
Issue: Permission Denied
Check IAM policies and ensure your user has required permissions:
bash
# Test S3 permissions
aws s3 ls

# Test EC2 permissions
aws ec2 describe-instances

# Test IAM permissions
aws iam list-users

Debugging Terraform
bash
# Enable debug logging
export TF_LOG=DEBUG
terraform -chdir=terraform apply

# Show detailed plan
terraform -chdir=terraform plan -out=tfplan
terraform -chdir=terraform show tfplan

# Refresh state from actual infrastructure
# (terraform refresh is deprecated; -refresh-only shows the changes before updating state)
terraform -chdir=terraform apply -refresh-only

Best Practices
- Always use unique bucket names: S3 bucket names must be globally unique
- Version your state files: Enable S3 versioning for state file backups
- Use remote state: Store state in S3 with locking via DynamoDB
- Tag all resources: Apply consistent tagging for cost tracking and organization
- Use variables: Parameterize configurations for reusability
- Run terraform plan before apply to review changes
- Destroy dev resources: Tear down test infrastructure when you are done to avoid ongoing costs
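Consistent tagging is easier to enforce when it is set once at the provider level. As a sketch, the AWS provider's default_tags block could carry the tags this project already uses. ManagedBy is an extra, assumed tag key, and the block would modify the existing provider configuration in main.tf rather than add a second one.

```hcl
# Assumption: replaces the provider "aws" block in terraform/main.tf
provider "aws" {
  region = var.aws_region

  # Applied to resources created through this provider that support tags
  default_tags {
    tags = {
      Environment = var.environment
      Project     = "data-engineering"
      ManagedBy   = "terraform" # illustrative extra tag
    }
  }
}
```

Tags set on an individual resource still apply and take precedence over a default tag with the same key.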