iac-data-engineering-terraform
IaC for Data Engineering with Terraform
Skill by ara.so — Data Skills collection.
This project demonstrates Infrastructure-as-Code (IaC) fundamentals for data engineers using Terraform to provision AWS resources including S3 buckets, EC2 instances, and IAM configurations. It provides reusable patterns for managing data infrastructure declaratively.
What This Project Does
- Provisions AWS S3 buckets for data storage
- Creates and configures EC2 instances for data processing
- Sets up IAM roles and policies with proper permissions
- Manages infrastructure state with Terraform
- Provides reproducible data engineering environments
Prerequisites
Before using this project, install Terraform and the AWS CLI, then configure your AWS credentials:
```bash
# Install Terraform
brew tap hashicorp/tap
brew install hashicorp/tap/terraform

# Install AWS CLI
brew install awscli

# Configure AWS credentials
aws configure
# Enter your AWS Access Key ID, Secret Access Key, region, and output format
```
Set up required environment variables:
```bash
export AWS_ACCESS_KEY_ID=$YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=$YOUR_SECRET_KEY
export AWS_DEFAULT_REGION=us-east-1
```
Project Structure
```
terraform/
├── main.tf            # Main infrastructure definitions
├── variables.tf       # Input variables
├── outputs.tf         # Output values
└── terraform.tfstate  # State file (auto-generated)
```
Core Terraform Commands
Initialize Terraform
```bash
# Initialize the working directory and download providers
terraform -chdir=terraform init

# Validate configuration syntax
terraform -chdir=terraform validate

# Format configuration files
terraform -chdir=terraform fmt
```
Plan and Apply Infrastructure
```bash
# Preview changes without applying
terraform -chdir=terraform plan

# Apply infrastructure changes
terraform -chdir=terraform apply

# Auto-approve without prompts (use carefully)
terraform -chdir=terraform apply -auto-approve
```
Inspect Infrastructure
```bash
# List all resources in state
terraform -chdir=terraform state list

# Show detailed state information
terraform -chdir=terraform show

# Show all output values (append an output name to show just one)
terraform -chdir=terraform output
```
Destroy Infrastructure
```bash
# Destroy all managed infrastructure
terraform -chdir=terraform destroy

# Destroy a specific resource (address must match a resource in your configuration)
terraform -chdir=terraform destroy -target=aws_s3_bucket.data_lake
```
Key Configuration Patterns
S3 Bucket for Data Storage
```hcl
# main.tf
resource "aws_s3_bucket" "data_lake" {
  bucket = "my-data-engineering-bucket-${random_id.bucket_suffix.hex}"

  tags = {
    Environment = "dev"
    Purpose     = "data-engineering"
    ManagedBy   = "terraform"
  }
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}

# Enable versioning for data protection
resource "aws_s3_bucket_versioning" "data_lake_versioning" {
  bucket = aws_s3_bucket.data_lake.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Configure lifecycle rules
resource "aws_s3_bucket_lifecycle_configuration" "data_lake_lifecycle" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # Empty filter: the rule applies to every object in the bucket
    filter {}

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }
}
```
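The bucket definition above does not state its public-access or encryption settings. The following is a minimal sketch of how those could be declared explicitly, assuming the `aws_s3_bucket.data_lake` resource from this section. The resource labels `data_lake_public_access` and `data_lake_encryption` are hypothetical names chosen for illustration, and `AES256` is just one possible algorithm.

```hcl
# Hypothetical addition: block every form of public access to the data lake bucket
resource "aws_s3_bucket_public_access_block" "data_lake_public_access" {
  bucket = aws_s3_bucket.data_lake.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Hypothetical addition: encrypt new objects at rest with S3-managed keys
resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake_encryption" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```

Both resources reference the bucket by `id`, so Terraform creates them after the bucket without any explicit `depends_on`.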
EC2 Instance for Data Processing
```hcl
# main.tf
resource "aws_instance" "data_processor" {
  # Amazon Linux 2. AMI IDs are region-specific, so confirm this ID for your region.
  ami                    = "ami-0c55b159cbfafe1f0"
  instance_type          = "t3.medium"
  key_name               = aws_key_pair.data_eng_key.key_name
  vpc_security_group_ids = [aws_security_group.data_processor_sg.id]
  iam_instance_profile   = aws_iam_instance_profile.data_processor_profile.name

  # Bootstrap script: install Python and the data tooling on first boot
  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y python3 python3-pip
    pip3 install pandas boto3 awscli
  EOF

  tags = {
    Name        = "data-processor"
    Environment = "dev"
    ManagedBy   = "terraform"
  }

  root_block_device {
    volume_size = 50
    volume_type = "gp3"
  }
}

resource "aws_key_pair" "data_eng_key" {
  key_name   = "data-engineering-key"
  public_key = file("~/.ssh/id_rsa.pub")
}
```
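Because a hard-coded AMI ID only resolves in the region it was published in, one common alternative is to look the image up at plan time. This is a sketch under assumptions: the data source label `amazon_linux_2` is hypothetical, and the name pattern targets the x86_64 Amazon Linux 2 images published by Amazon.

```hcl
# Hypothetical lookup: newest Amazon Linux 2 AMI in whatever region the provider uses
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# In aws_instance.data_processor, the ami argument would then become:
#   ami = data.aws_ami.amazon_linux_2.id
```

With `most_recent = true` the resolved ID can change between plans when a new image is released, which would replace the instance; pinning the ID remains the more predictable choice for long-lived hosts.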
Security Group Configuration
```hcl
resource "aws_security_group" "data_processor_sg" {
  name        = "data-processor-sg"
  description = "Security group for data processing EC2 instances"

  # SSH access
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # Restrict in production
  }

  # Allow all outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "data-processor-sg"
  }
}
```
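The "Restrict in production" comment can be made concrete by moving the SSH source range into a variable. This is a minimal sketch; `allowed_ssh_cidr` is a hypothetical variable name that is not part of the project, and the example address comes from a documentation-only range.

```hcl
# Hypothetical variable: the only network allowed to reach port 22
variable "allowed_ssh_cidr" {
  description = "CIDR block permitted to SSH into data processing instances"
  type        = string

  validation {
    # cidrhost() fails on malformed input, so can() turns that failure into false
    condition     = can(cidrhost(var.allowed_ssh_cidr, 0))
    error_message = "allowed_ssh_cidr must be a valid CIDR block, for example 203.0.113.10/32."
  }
}

# In aws_security_group.data_processor_sg, the SSH ingress block would then read:
#   ingress {
#     from_port   = 22
#     to_port     = 22
#     protocol    = "tcp"
#     cidr_blocks = [var.allowed_ssh_cidr]
#   }
```

Leaving the variable without a default forces each environment to state its SSH source range explicitly rather than inheriting an open one.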
IAM Role for EC2 with S3 Access
```hcl
resource "aws_iam_role" "data_processor_role" {
  name = "data-processor-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy" "s3_access_policy" {
  name = "s3-access-policy"
  role = aws_iam_role.data_processor_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Resource = [
          aws_s3_bucket.data_lake.arn,
          "${aws_s3_bucket.data_lake.arn}/*"
        ]
      }
    ]
  })
}

resource "aws_iam_instance_profile" "data_processor_profile" {
  name = "data-processor-profile"
  role = aws_iam_role.data_processor_role.name
}
```
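The S3 policy above grants all three actions on both the bucket ARN and the object ARNs. As an alternative sketch, the same permissions can be written with the `aws_iam_policy_document` data source, pairing each action with the ARN form it applies to. The data source label `s3_access` and the `sid` values are hypothetical names.

```hcl
# Hypothetical rewrite of the inline policy as a policy document
data "aws_iam_policy_document" "s3_access" {
  # s3:ListBucket is a bucket-level action, so it targets the bucket ARN
  statement {
    sid       = "ListDataLakeBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.data_lake.arn]
  }

  # Object reads and writes target the keys inside the bucket
  statement {
    sid       = "ReadWriteDataLakeObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.data_lake.arn}/*"]
  }
}

# In aws_iam_role_policy.s3_access_policy, the policy argument would then become:
#   policy = data.aws_iam_policy_document.s3_access.json
```

The effective access is the same as the `jsonencode` version; the split mainly makes it easier to see which action applies at which level.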
Variables and Outputs
Define Variables
```hcl
# variables.tf
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"
}

variable "bucket_prefix" {
  description = "Prefix for S3 bucket names"
  type        = string
  default     = "data-engineering"
}
```
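These variables only take effect once something reads them. The sketch below shows one place that could happen: a provider configuration that consumes `var.aws_region` and `var.environment`. The file name `providers.tf`, the version constraints, and the use of `default_tags` are assumptions, not taken from the project.

```hcl
# providers.tf (hypothetical file name)
terraform {
  required_version = ">= 1.5.0" # assumed minimum, adjust to your installation

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed constraint
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0" # assumed constraint, needed for random_id
    }
  }
}

provider "aws" {
  region = var.aws_region

  # Tags applied to every taggable resource created through this provider
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}
```

With `default_tags` in place, the per-resource `tags` blocks only need the keys that differ between resources, such as `Name` or `Purpose`.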
Configure Outputs
```hcl
# outputs.tf
output "s3_bucket_name" {
  description = "Name of the created S3 bucket"
  value       = aws_s3_bucket.data_lake.id
}

output "s3_bucket_arn" {
  description = "ARN of the S3 bucket"
  value       = aws_s3_bucket.data_lake.arn
}

output "ec2_instance_id" {
  description = "ID of the EC2 instance"
  value       = aws_instance.data_processor.id
}

output "ec2_public_ip" {
  description = "Public IP of the EC2 instance"
  value       = aws_instance.data_processor.public_ip
}

output "ec2_private_ip" {
  description = "Private IP of the EC2 instance"
  value       = aws_instance.data_processor.private_ip
}
```
Remote State Management
For team collaboration, store the state in an S3 backend:
```hcl
# backend.tf
terraform {
  backend "s3" {
    bucket         = "terraform-state-bucket-name"
    key            = "data-engineering/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```
Create the backend resources first, in a separate configuration, because the bucket and lock table must already exist when `terraform init` configures the backend:
```hcl
resource "aws_s3_bucket" "terraform_state" {
  bucket = "terraform-state-bucket-name"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```
Verification Commands
After applying the infrastructure:
```bash
# Verify S3 buckets
aws s3 ls

# Verify EC2 instances
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId,Name:Tags[?Key==`Name`].Value,Type:InstanceType,State:State.Name,PublicIP:PublicIpAddress,PrivateIP:PrivateIpAddress}' \
  --output table

# Check IAM roles
aws iam list-roles --query 'Roles[?contains(RoleName, `data-processor`)].RoleName'

# Inspect Terraform state (the jq command reads the local state file)
terraform -chdir=terraform state list
cat terraform/terraform.tfstate | jq -r '.resources[] | [.type, .name] | join(",")'
```
Common Patterns
Multi-Environment Setup
```hcl
# environments/dev/main.tf
module "data_infrastructure" {
  source = "../../modules/data-infra"

  environment   = "dev"
  instance_type = "t3.small"
  bucket_prefix = "dev-data"
}

# environments/prod/main.tf
module "data_infrastructure" {
  source = "../../modules/data-infra"

  environment   = "prod"
  instance_type = "t3.large"
  bucket_prefix = "prod-data"
}
```
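Both environments call a module at `../../modules/data-infra`, whose contents are not shown here. The following is a hypothetical sketch of what that module might declare so that the three arguments resolve; the file layout and resource bodies are assumptions, and only the bucket is included to keep it short.

```hcl
# modules/data-infra/variables.tf (hypothetical)
variable "environment" {
  description = "Environment name"
  type        = string
}

variable "instance_type" {
  description = "EC2 instance type (would feed an aws_instance in the full module)"
  type        = string
}

variable "bucket_prefix" {
  description = "Prefix for S3 bucket names"
  type        = string
}

# modules/data-infra/main.tf (hypothetical, bucket only)
resource "random_id" "bucket_suffix" {
  byte_length = 4
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.bucket_prefix}-${random_id.bucket_suffix.hex}"

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# modules/data-infra/outputs.tf (hypothetical)
output "s3_bucket_name" {
  description = "Name of the bucket created by this module"
  value       = aws_s3_bucket.data_lake.id
}
```

Declaring the module variables without defaults means each environment directory has to pass all three values, which keeps dev and prod settings visible in one place.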
Using terraform.tfvars
```hcl
# terraform.tfvars
aws_region    = "us-west-2"
environment   = "staging"
instance_type = "t3.medium"
bucket_prefix = "staging-data-lake"
```
Apply with variables:
```bash
terraform -chdir=terraform apply -var-file="terraform.tfvars"
```
Troubleshooting
State Lock Issues
```bash
# Force unlock if state is stuck
terraform -chdir=terraform force-unlock LOCK_ID

# View current state
terraform -chdir=terraform show
```
S3 Bucket Name Conflicts
If the bucket name is already taken:
```hcl
# Use random suffix
resource "random_id" "bucket_suffix" {
  byte_length = 8
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.bucket_prefix}-${random_id.bucket_suffix.hex}"
}
```
Import Existing Resources
```bash
# Import existing S3 bucket
terraform -chdir=terraform import aws_s3_bucket.data_lake existing-bucket-name

# Import EC2 instance
terraform -chdir=terraform import aws_instance.data_processor i-1234567890abcdef0
```
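On Terraform 1.5 or later, the same imports can also be declared in configuration with `import` blocks, so they appear in `terraform plan` before anything is written to state. This is a minimal sketch, assuming the matching `resource` blocks already exist in the configuration; the bucket name and instance ID are the same placeholders used in the commands above.

```hcl
# Declarative equivalent of the two import commands (requires Terraform >= 1.5)
import {
  to = aws_s3_bucket.data_lake
  id = "existing-bucket-name"
}

import {
  to = aws_instance.data_processor
  id = "i-1234567890abcdef0"
}
```

After a successful `terraform apply`, the `import` blocks have done their job and can be removed from the configuration.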
Debugging Terraform
```bash
# Enable detailed logging
export TF_LOG=DEBUG
terraform -chdir=terraform apply

# Disable logging
unset TF_LOG
```
Refresh State
```bash
# Sync state with real infrastructure
# (deprecated in recent Terraform releases in favor of `terraform apply -refresh-only`)
terraform -chdir=terraform refresh

# Replace corrupted resource
terraform -chdir=terraform apply -replace=aws_instance.data_processor
```
Best Practices
- Always use variables for environment-specific values
- Enable S3 versioning for data protection
- Use IAM roles instead of access keys for EC2
- Tag all resources for cost tracking and management
- Store state remotely for team collaboration
- Use modules for reusable infrastructure patterns
- Run `terraform plan` before every apply
- Never commit `.tfstate` files or sensitive variables to Git
- Use `.gitignore` for Terraform files:
```gitignore
# .gitignore
.terraform/
*.tfstate
*.tfstate.backup
.terraform.lock.hcl
terraform.tfvars
*.auto.tfvars
```