iac-terraform-data-engineering
IaC for Data Engineering with Terraform
Skill by ara.so — Data Skills collection.
This project provides Infrastructure-as-Code (IaC) templates and patterns for data engineers using Terraform to provision and manage AWS resources. It focuses on creating reproducible, version-controlled infrastructure for data platforms including S3 storage, EC2 compute instances, and IAM permissions.
What This Project Does
- Provides Terraform configurations for common data engineering infrastructure on AWS
- Demonstrates IaC best practices for S3 buckets, EC2 instances, and IAM roles
- Shows state management and lifecycle operations for data infrastructure
- Teaches reproducible infrastructure provisioning for data pipelines
Prerequisites
Before using this project, ensure you have:
- AWS Account with root or admin access
- Terraform CLI installed (installation guide)
- AWS CLI installed and configured (setup guide)
- AWS credentials configured via `aws configure`
AWS IAM Setup
Create an IAM user with appropriate permissions:
- Create IAM User: Navigate to AWS Console → IAM → Users → Create user
- Create Inline Policy: Attach a custom policy to the user
- Grant Permissions: For development/learning, grant full access to:
  - Amazon S3
  - Amazon EC2
  - AWS IAM
⚠️ Security Note: Full service access is NOT recommended for production. Use least-privilege policies in production environments.
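As one way to move toward least privilege, the sketch below scopes S3 permissions to buckets that share a name prefix instead of granting access to every bucket. The policy name `terraform-data-platform-s3` and the bucket prefix `my-data-lake-` are illustrative assumptions, not part of the original project, and a complete policy would also need scoped EC2 and IAM statements.

```hcl
# Hypothetical policy: the name and bucket prefix are assumptions for illustration.
resource "aws_iam_policy" "terraform_s3_scoped" {
  name        = "terraform-data-platform-s3"
  description = "S3 access limited to buckets with the my-data-lake- prefix"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["s3:*"]
        Resource = [
          "arn:aws:s3:::my-data-lake-*",  # the buckets themselves
          "arn:aws:s3:::my-data-lake-*/*" # objects inside those buckets
        ]
      }
    ]
  })
}
```

Because a user usually should not manage its own permissions, a policy like this would typically be created by an administrator and attached to the Terraform user.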
Project Structure
```
terraform/
├── main.tf            # Main Terraform configuration
├── variables.tf       # Input variables (if present)
├── outputs.tf         # Output values (if present)
└── terraform.tfstate  # State file (generated)
```
Key Terraform Commands
Initialize Terraform
Initialize the working directory and download provider plugins:
```bash
terraform -chdir=terraform init
```
Validate Configuration
Check if the configuration is syntactically valid:
```bash
terraform -chdir=terraform validate
```
Format Code
Automatically format Terraform files to canonical style:
```bash
terraform -chdir=terraform fmt
```
Plan Infrastructure Changes
Preview what Terraform will create/modify/destroy:
```bash
terraform -chdir=terraform plan
```
Apply Configuration
Create or update infrastructure:
```bash
terraform -chdir=terraform apply
```
Terraform will show a plan and ask for confirmation. Type `yes` to proceed.
Auto-approve (for automation)
```bash
terraform -chdir=terraform apply -auto-approve
```
Destroy Infrastructure
Remove all resources managed by Terraform:
```bash
terraform -chdir=terraform destroy
```
Configuration
Basic Terraform Configuration Example
Before applying, modify `terraform/main.tf` to customize resource names:
```hcl
# terraform/main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# S3 bucket for data storage
resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-unique-data-engineering-bucket-12345"

  tags = {
    Name        = "Data Engineering Bucket"
    Environment = "dev"
    ManagedBy   = "Terraform"
  }
}

# EC2 instance for data processing
resource "aws_instance" "data_processor" {
  # Amazon Linux 2. AMI IDs are region-specific, so confirm this ID for your region.
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"

  tags = {
    Name        = "Data Processor"
    Environment = "dev"
    ManagedBy   = "Terraform"
  }
}

# IAM role for EC2 instance
resource "aws_iam_role" "ec2_s3_role" {
  name = "ec2-s3-access-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}
```
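The role above only defines who may assume it; it does not yet grant any S3 permissions. A minimal sketch of an inline policy that could give the role access to the data bucket follows. The resource label `ec2_s3_access`, the policy name, and the chosen actions are assumptions for illustration, not part of the original configuration.

```hcl
# Hypothetical inline policy: grants the EC2 role access to the data bucket only.
resource "aws_iam_role_policy" "ec2_s3_access" {
  name = "ec2-s3-data-access"
  role = aws_iam_role.ec2_s3_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Listing applies to the bucket itself
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = aws_s3_bucket.data_bucket.arn
      },
      {
        # Reads and writes apply to objects inside the bucket
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "${aws_s3_bucket.data_bucket.arn}/*"
      }
    ]
  })
}
```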
Variables Configuration
Create `terraform/variables.tf` for reusable configurations:
```hcl
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "bucket_name" {
  description = "S3 bucket name for data storage"
  type        = string
  # Set via terraform.tfvars or -var flag
}
```
Use variables in `main.tf`:
```hcl
provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "data_bucket" {
  bucket = var.bucket_name

  tags = {
    Environment = var.environment
  }
}
```
Create `terraform/terraform.tfvars`:
```hcl
bucket_name = "my-unique-bucket-name-2026"
aws_region  = "us-west-2"
environment = "production"
```
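Since `environment` feeds into tags and, in later patterns, bucket names, it may help to reject unexpected values early. The sketch below adds a `validation` block to the existing variable; the allowed list (`dev`, `staging`, `production`) is an assumption and should match whatever environments you actually use.

```hcl
# Replaces the plain "environment" declaration; a variable can only be declared once.
variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"

  validation {
    # Assumed set of environments, adjust as needed
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "The environment value must be one of: dev, staging, production."
  }
}
```

With this in place, `terraform plan` fails with the error message above when the value falls outside the list, before any resource is touched.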
State Management
Inspect State
List all resources in the state:
```bash
terraform -chdir=terraform state list
```
View detailed state information:
```bash
cat terraform/terraform.tfstate | jq -r '.resources[] | [.type, .name] | join(",")'
```
Remote State (Production Pattern)
For production, store state remotely in S3:
```hcl
# terraform/backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "data-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```
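The backend block assumes the state bucket and lock table already exist before `terraform init` runs. A minimal sketch of those two resources follows, typically kept in a separate bootstrap configuration. The resource labels are hypothetical; the bucket and table names mirror the backend block above. The lock table needs a string partition key named `LockID`.

```hcl
# Bucket that holds the remote state file
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state-bucket"
}

# Versioning keeps earlier state files recoverable
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# DynamoDB table used by the S3 backend for state locking
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```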
Initialize with backend configuration:
```bash
terraform -chdir=terraform init -backend-config="bucket=${TERRAFORM_STATE_BUCKET}"
```
Verification Commands
Verify S3 Bucket Creation
```bash
aws s3 ls
```
Verify EC2 Instance
```bash
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId, Name:Tags[?Key==`Name`].Value, Type:InstanceType, State:State.Name, PublicIP:PublicIpAddress, PrivateIP:PrivateIpAddress}' \
  --output table
```
Check Specific Resource
```bash
# "terraform show" expects a state or plan file path; use "state show" for a resource address
terraform -chdir=terraform state show aws_s3_bucket.data_bucket
```
Common Patterns for Data Engineering
Pattern 1: Data Lake with Multiple Buckets
```hcl
# Raw data bucket
resource "aws_s3_bucket" "raw_data" {
  bucket = "my-data-lake-raw-${var.environment}"
}

# Processed data bucket
resource "aws_s3_bucket" "processed_data" {
  bucket = "my-data-lake-processed-${var.environment}"
}

# Enable versioning for data lineage
resource "aws_s3_bucket_versioning" "raw_data_versioning" {
  bucket = aws_s3_bucket.raw_data.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lifecycle rules for cost optimization
resource "aws_s3_bucket_lifecycle_configuration" "raw_data_lifecycle" {
  bucket = aws_s3_bucket.raw_data.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}
```
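When a data lake grows beyond two zones, declaring each bucket by hand becomes repetitive. As an alternative sketch, `for_each` can generate one bucket per zone from a single resource block. The local name `data_lake_zones` and the resource label `data_lake` are hypothetical, and adopting this form changes resource addresses (for example to `aws_s3_bucket.data_lake["raw"]`).

```hcl
locals {
  # Assumed zone names, mirroring the raw/processed buckets in the pattern
  data_lake_zones = toset(["raw", "processed"])
}

# One bucket per zone
resource "aws_s3_bucket" "data_lake" {
  for_each = local.data_lake_zones

  bucket = "my-data-lake-${each.key}-${var.environment}"
}

# Versioning enabled on every bucket created above
resource "aws_s3_bucket_versioning" "data_lake" {
  for_each = aws_s3_bucket.data_lake

  bucket = each.value.id

  versioning_configuration {
    status = "Enabled"
  }
}
```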
Pattern 2: EC2 with Data Processing Tools
```hcl
# Security group for data processor
resource "aws_security_group" "data_processor_sg" {
  name        = "data-processor-sg"
  description = "Security group for data processing instances"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # Restrict in production
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# EC2 instance with user data for setup
resource "aws_instance" "data_processor" {
  ami                    = var.ami_id
  instance_type          = "t3.medium"
  vpc_security_group_ids = [aws_security_group.data_processor_sg.id]
  iam_instance_profile   = aws_iam_instance_profile.ec2_profile.name

  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y python3 python3-pip
    pip3 install pandas boto3
  EOF

  tags = {
    Name = "Data Processor Instance"
  }
}

# IAM instance profile
resource "aws_iam_instance_profile" "ec2_profile" {
  name = "ec2-data-processor-profile"
  role = aws_iam_role.ec2_s3_role.name
}
```
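The pattern references `var.ami_id`, which is not declared in the `variables.tf` shown earlier, and leaves SSH open to all addresses. One possible way to close both gaps is sketched below: a data source that looks up a current Amazon Linux 2 image for the configured region, and a variable for the allowed SSH range. The names `amazon_linux_2` and `allowed_ssh_cidr` are assumptions introduced for illustration.

```hcl
# Look up the most recent Amazon Linux 2 AMI in the provider's region
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

# Hypothetical variable for narrowing SSH ingress; no default, so it must be set explicitly
variable "allowed_ssh_cidr" {
  description = "CIDR block allowed to reach port 22"
  type        = string
}
```

To use these, the instance would set `ami = data.aws_ami.amazon_linux_2.id` and the ingress rule would set `cidr_blocks = [var.allowed_ssh_cidr]`.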
Pattern 3: Outputs for Integration
```hcl
# terraform/outputs.tf
output "s3_bucket_name" {
  description = "Name of the S3 bucket"
  value       = aws_s3_bucket.data_bucket.id
}

output "s3_bucket_arn" {
  description = "ARN of the S3 bucket"
  value       = aws_s3_bucket.data_bucket.arn
}

output "ec2_instance_id" {
  description = "ID of the EC2 instance"
  value       = aws_instance.data_processor.id
}

output "ec2_public_ip" {
  description = "Public IP of the EC2 instance"
  value       = aws_instance.data_processor.public_ip
}
```
Access outputs:
```bash
terraform -chdir=terraform output
terraform -chdir=terraform output -json | jq -r '.s3_bucket_name.value'
```
Troubleshooting
Issue: "Error acquiring the state lock"
Cause: Another Terraform process is running or a previous run didn't release the lock.
Solution:
```bash
# Force unlock (use with caution)
terraform -chdir=terraform force-unlock <LOCK_ID>
```
Issue: "bucket name already exists"
Cause: S3 bucket names must be globally unique across all AWS accounts.
Solution: Change the bucket name in `main.tf` to something unique:
```hcl
resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-unique-name-${random_id.bucket_suffix.hex}"
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}
```
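The `random_id` resource comes from the `hashicorp/random` provider rather than the AWS provider. A sketch of declaring it explicitly alongside the existing AWS requirement follows; the `~> 3.0` constraint is an assumption, so pin whichever version you have tested.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    # Needed for random_id; version constraint is an assumed example
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}
```

After adding a provider, run `terraform -chdir=terraform init` again so the plugin is downloaded.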
Issue: "insufficient IAM permissions"
Cause: The IAM user doesn't have required permissions.
Solution: Verify IAM policy includes necessary actions:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*",
        "ec2:*",
        "iam:*"
      ],
      "Resource": "*"
    }
  ]
}
```
Issue: State file out of sync
Cause: Manual changes made outside Terraform.
Solution: Refresh the state. Current Terraform releases deprecate `terraform refresh` in favor of the `-refresh-only` flag:
```bash
terraform -chdir=terraform apply -refresh-only
```
Or import existing resources:
```bash
terraform -chdir=terraform import aws_s3_bucket.data_bucket my-existing-bucket
```
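On Terraform 1.5 and later, the same import can also be expressed declaratively with an `import` block, so that it goes through the normal plan and apply review. This is a sketch reusing the bucket name from the command above; `my-existing-bucket` remains a placeholder.

```hcl
# Declarative import: picked up on the next plan/apply
import {
  to = aws_s3_bucket.data_bucket
  id = "my-existing-bucket"
}
```

Once the apply succeeds and the resource is in state, the `import` block can be removed from the configuration.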
Workflow Example
Complete workflow for setting up data infrastructure:
```bash
# 1. Configure AWS credentials
export AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}"
export AWS_DEFAULT_REGION="us-east-1"

# 2. Customize configuration
cd terraform
# Edit main.tf to set unique bucket name

# 3. Initialize Terraform
terraform init

# 4. Validate configuration
terraform validate

# 5. Format code
terraform fmt

# 6. Preview changes
terraform plan

# 7. Apply configuration
terraform apply

# 8. Verify resources
aws s3 ls
aws ec2 describe-instances --output table

# 9. When done, clean up
terraform destroy
```
Best Practices for Data Engineering IaC
- Use variables for environment-specific values
- Enable S3 versioning for data lineage and recovery
- Tag all resources for cost tracking and management
- Store state remotely in S3 with encryption and locking
- Use modules to organize reusable infrastructure components
- Never commit `.tfstate` files or AWS credentials to version control
- Implement lifecycle rules on S3 for cost optimization
- Use IAM roles instead of access keys for EC2 instances
- Plan before apply to review changes
- Destroy unused resources to avoid unnecessary costs
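For the tagging practice in particular, the AWS provider's `default_tags` block offers one way to apply common tags to every taggable resource without repeating them. The sketch below assumes the `aws_region` and `environment` variables from the variables section; the tag keys follow the ones used in the earlier examples.

```hcl
provider "aws" {
  region = var.aws_region

  # Applied to every taggable resource created through this provider
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
    }
  }
}
```

Resource-level `tags` such as `Name` can still be set per resource and are merged with these defaults.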