
# Create Data Lake Tables with Amazon S3 Tables

## Overview

Amazon S3 Tables provides managed Iceberg tables with automatic compaction and snapshot management. Tables are queryable via Athena and Iceberg-compatible engines.

## Common Tasks

You MUST use AWS MCP server tools when connected; they provide command validation, sandboxed execution, and audit logging. Fall back to the AWS CLI if MCP is unavailable.

## Decision Guide

Before creating anything, you MUST check what already exists: run `aws glue get-tables --database-name <NAME>` whenever the user mentions a database.

| What you find | Action |
| --- | --- |
| Fuzzy database name ("our analytics db") | You MUST STOP. Delegate to `finding-data-lake-assets` to resolve. |
| Non-S3-Tables table with matching name | You MUST STOP. Delegate to `finding-data-lake-assets`. You MUST NOT create until the user confirms. |
| Existing S3 Tables table with matching name | You MUST check for a schema match. Reuse if compatible; recreate only if the user confirms. |
| No matching tables | Proceed with creation (Steps 1-8). |
| User explicitly requests a new S3 Tables table | Skip the checks and proceed with creation. |

Creation paths:

- Existing data in S3: create an empty table (Steps 1-8), then use the `ingesting-into-data-lake` skill.
- Glue ETL pipeline: read `references/table-creation-glue-etl.md` first, then follow Steps 1-6.
- Lake Formation access control: search the AWS docs for "S3 Tables integration with Lake Formation".
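The decision table above can be condensed into a small helper. This is an illustrative sketch only -- the finding labels are hypothetical shorthand, not tool identifiers:

```shell
# Illustrative sketch of the decision table (finding labels are hypothetical).
decide() {
  case "$1" in
    fuzzy-db-name|non-s3tables-name-clash)
      echo "STOP: delegate to finding-data-lake-assets" ;;
    s3tables-name-match)
      echo "CHECK: compare schemas; reuse if compatible, recreate only on user confirmation" ;;
    no-match|explicit-new-table)
      echo "PROCEED: Steps 1-8" ;;
    *)
      echo "UNKNOWN finding: $1" >&2; return 1 ;;
  esac
}

decide fuzzy-db-name
decide no-match
```

Unrecognized findings fail loudly rather than defaulting to creation, matching the STOP-first posture of the table.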
## 1. Verify Dependencies

Constraints:

- You MUST check whether AWS MCP server tools or the AWS CLI are available, and inform the user if either is missing
- You MUST confirm the target AWS region and verify credentials with `aws sts get-caller-identity`
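A minimal pre-flight sketch for these checks, assuming a POSIX shell (a real run also needs valid AWS credentials and a configured region):

```shell
# Pre-flight sketch: confirm the CLI exists before touching credentials.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing dependency: $1" >&2; return 1; }
}

if require_cmd aws; then
  # Verifies credentials; prints the account ID on success.
  aws sts get-caller-identity --query Account --output text \
    || echo "AWS CLI found, but credentials are not configured" >&2
else
  echo "Install the AWS CLI or connect the AWS MCP server before proceeding." >&2
fi
```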

## 2. Understand the Schema

- Explicit schema: validate the Iceberg types.
- Loose description: ask for columns, types, and grain. Propose a schema and confirm it.
- Existing S3 data: infer the schema from file headers only. Create an empty table first, then use the `ingesting-into-data-lake` skill.

Constraints:

- You MUST read `references/best-practices.md` for Iceberg type mapping, partitions, and naming.
- You MUST ask for all required parameters upfront: table name, columns, types, partition strategy. For schema evolution, see `references/athena-ddl-path.md`.
- You MUST use all-lowercase names -- Glue rejects mixed case with `GENERIC_INTERNAL_ERROR`. Namespace and table names MUST NOT contain hyphens.
- You SHOULD suggest partition columns based on access patterns.

## 3. Create Table Bucket

Bucket names: 3-63 characters; lowercase letters, numbers, hyphens.

```bash
aws s3tables create-table-bucket --name <BUCKET_NAME> --region <REGION>
```

Capture the `table-bucket-arn` from the response. Encryption (SSE-S3 default, or SSE-KMS) and storage class (STANDARD, INTELLIGENT_TIERING) are set at creation. See `references/best-practices.md`.

Constraints:

- You MUST check existing buckets with `aws s3tables list-table-buckets` and ask the user to select one or create new.
- If using SSE-KMS, the KMS key policy MUST allow the S3 Tables maintenance service principal to read data. Search the AWS docs for "S3 Tables KMS key policy" for the required policy.
- If bucket creation fails, see `references/best-practices.md` for common errors.
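The naming rules from Steps 2 and 3 can be validated before any AWS call. A sketch using POSIX `grep -E` (the no-leading/trailing-hyphen detail for buckets follows general S3 bucket conventions and is an assumption here):

```shell
# Bucket names: 3-63 chars; lowercase letters, digits, hyphens.
valid_bucket_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$'
}

# Namespace and table names: lowercase, no hyphens
# (Glue rejects mixed case with GENERIC_INTERNAL_ERROR).
valid_table_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9_]+$'
}

valid_bucket_name "analytics-lake" && echo "bucket name OK"
valid_table_name  "daily_orders"   && echo "table name OK"
valid_table_name  "Daily-Orders"   || echo "rejected: mixed case / hyphen"
```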

## 4. Create Namespace

```bash
aws s3tables create-namespace --table-bucket-arn <ARN> --namespace <NAMESPACE>
```

Constraints:

- You MUST list existing namespaces first and suggest reuse where relevant
- You MUST use lowercase names with no hyphens

## 5. Create Glue Data Catalog Integration

Check whether `s3tablescatalog` exists (created once per region per account):

```bash
aws glue get-catalog --catalog-id s3tablescatalog
```

If not found, create it (requires `glue:CreateCatalog` and `glue:PassConnection`):

```bash
aws glue create-catalog --name "s3tablescatalog" --catalog-input '{
  "FederatedCatalog": {
    "Identifier": "arn:aws:s3tables:<REGION>:<ACCOUNT_ID>:bucket/*",
    "ConnectionName": "aws:s3tables"
  },
  "CreateDatabaseDefaultPermissions": [{"Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"}, "Permissions": ["ALL"]}],
  "CreateTableDefaultPermissions": [{"Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"}, "Permissions": ["ALL"]}],
  "AllowFullTableExternalDataAccess": "True"
}'
```

Verify with `aws glue get-catalogs --parent-catalog-id s3tablescatalog`.
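The check-then-create pair can be wrapped so it is idempotent. In this sketch the CLI prefix is injectable purely so the flow can be exercised without real AWS access, and `catalog-input.json` is assumed to hold the JSON shown above:

```shell
# Idempotent sketch: probe for the catalog, create only when the probe fails.
# "$@" is the CLI prefix (normally: aws glue), injectable for testing.
ensure_s3tables_catalog() {
  if "$@" get-catalog --catalog-id s3tablescatalog >/dev/null 2>&1; then
    echo "s3tablescatalog already exists"
  else
    "$@" create-catalog --name "s3tablescatalog" \
      --catalog-input file://catalog-input.json
  fi
}

# Real usage (requires glue:CreateCatalog and glue:PassConnection):
#   ensure_s3tables_catalog aws glue
```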

## 6. Configure Access Control

S3 Tables uses the `s3tables:*` IAM namespace (not `s3:*`).

Querying principal permissions (bucket policy):

- `s3tables:GetTableBucket`, `s3tables:GetNamespace`, `s3tables:GetTable`, `s3tables:GetTableMetadataLocation`, `s3tables:GetTableData`

Querying principal permissions (IAM policy):

- `glue:GetCatalog`, `glue:GetDatabase`, `glue:GetTable`

You MUST scope permissions to the correct ARN patterns. You MUST read `references/access-control.md` for the exact resource ARNs.

Constraints:

- You MUST ask the user for the querying principal's ARN
- You MUST NOT grant broader permissions than necessary
- You MUST NOT create IAM roles automatically; verify existing roles and guide the user

## 7. Create the Table

| Context | Path |
| --- | --- |
| Default (any user) | S3 Tables API (below) |
| User specifically wants SQL DDL | Athena DDL (see `references/athena-ddl-path.md`) |
| Glue ETL pipeline | Spark DDL via `--conf` job args (not `spark.conf.set()`). You MUST read `references/table-creation-glue-etl.md` for the `--conf` string. |

Default path, the S3 Tables API:

```bash
aws s3tables create-table \
  --table-bucket-arn <ARN> \
  --namespace <NAMESPACE> \
  --name <TABLE_NAME> \
  --format ICEBERG \
  --metadata '<METADATA_JSON>'
```

The metadata JSON MUST nest under the `"iceberg"` key:

```json
{"iceberg":{"schema":{"fields":[
  {"name":"order_date","type":"date","required":true},
  {"name":"customer_id","type":"string","required":true},
  {"name":"amount","type":"double","required":false}
]},
"partitionSpec":{"fields":[
  {"sourceId":1,"fieldId":1000,"transform":"month","name":"order_date_month"}
]}}}
```

Constraints:

- `partitionSpec.sourceId` MUST reference a valid schema field ID
- For schema evolution after creation, use Athena DDL; see `references/athena-ddl-path.md`
- You MUST use `schemaV2` for complex types (list, map, struct) with explicit field IDs. See `references/best-practices.md`.
- You SHOULD search the AWS docs for "IcebergPartitionField S3 Tables" for supported partition transforms
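Quoting the metadata JSON inline is error-prone; one way to keep it readable is a quoted here-doc. A sketch with hypothetical namespace and table names, guarded so the `aws` call only runs once `TABLE_BUCKET_ARN` is set:

```shell
# Build the metadata JSON in a quoted here-doc so the shell leaves it untouched.
METADATA=$(cat <<'EOF'
{"iceberg":{"schema":{"fields":[
  {"name":"order_date","type":"date","required":true},
  {"name":"customer_id","type":"string","required":true},
  {"name":"amount","type":"double","required":false}
]},
"partitionSpec":{"fields":[
  {"sourceId":1,"fieldId":1000,"transform":"month","name":"order_date_month"}
]}}}
EOF
)

# Guarded so the sketch is inert until you export TABLE_BUCKET_ARN.
if [ -n "${TABLE_BUCKET_ARN:-}" ]; then
  aws s3tables create-table \
    --table-bucket-arn "$TABLE_BUCKET_ARN" \
    --namespace analytics \
    --name daily_orders \
    --format ICEBERG \
    --metadata "$METADATA"
fi
```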

## 8. Verify and Confirm

You MUST verify the table with `aws s3tables get-table` and confirm queryability by running `DESCRIBE <table_name>` via Athena using `--query-execution-context '{"Catalog":"s3tablescatalog/<BUCKET_NAME>","Database":"<NAMESPACE>"}'`. Do NOT put the catalog in the SQL. Present a summary: bucket ARN, namespace, table, schema, partitions.
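A sketch of the two verification calls. The names are hypothetical, `OutputLocation` must point at an Athena results bucket you own, and the block is guarded so it stays inert until `TABLE_BUCKET_ARN` is set:

```shell
# Catalog and database go in the execution context, never in the SQL.
CONTEXT='{"Catalog":"s3tablescatalog/analytics-lake","Database":"analytics"}'

if [ -n "${TABLE_BUCKET_ARN:-}" ]; then
  # 1. Confirm the table exists in S3 Tables.
  aws s3tables get-table \
    --table-bucket-arn "$TABLE_BUCKET_ARN" \
    --namespace analytics \
    --name daily_orders
  # 2. Confirm queryability through Athena.
  aws athena start-query-execution \
    --query-string 'DESCRIBE daily_orders' \
    --query-execution-context "$CONTEXT" \
    --result-configuration 'OutputLocation=s3://my-athena-results/'
fi
```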

## Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| "Table location can not be specified" | LOCATION clause in CREATE TABLE | Remove the LOCATION clause. S3 Tables manages storage automatically. |
| AccessDeniedException with an `s3:*` policy | Using `s3:*`, not `s3tables:*` | S3 Tables uses the `s3tables:*` namespace. Update the IAM policy. |

## Additional Resources

- `access-control.md` -- IAM permissions, ARN patterns, permission errors
- `best-practices.md` -- Iceberg types, partitions, naming, common errors
- `athena-ddl-path.md` -- Athena DDL, schema evolution
- `table-creation-glue-etl.md` -- Spark DDL via Glue ETL
- Loading data: the `ingesting-into-data-lake` skill