
# Create Data Lake Tables with Amazon S3 Tables

## Overview

Amazon S3 Tables provides managed Iceberg tables with automatic compaction and snapshot management. Tables are queryable via Athena and Iceberg-compatible engines.

## Common Tasks

You MUST use AWS MCP server tools when connected; they provide command validation, sandboxed execution, and audit logging. Fall back to the AWS CLI if MCP is unavailable.

## Decision Guide

Before creating anything, you MUST check what already exists: run `aws glue get-tables --database-name <NAME>` whenever the user mentions a database.

| What you find | Action |
| --- | --- |
| Fuzzy database name ("our analytics db") | You MUST STOP. Delegate to `finding-data-lake-assets` to resolve. |
| Non-S3-Tables table with matching name | You MUST STOP. Delegate to `finding-data-lake-assets`. You MUST NOT create until the user confirms. |
| Existing S3 Tables table with matching name | You MUST check for a schema match. Reuse if compatible; recreate only if the user confirms. |
| No matching tables | Proceed with creation (Steps 1-8). |
| User explicitly requests a new S3 Tables table | Skip the checks and proceed with creation. |

Creation paths:

- Existing data in S3: create an empty table (Steps 1-8), then use the `ingesting-into-data-lake` skill.
- Glue ETL pipeline: read `references/table-creation-glue-etl.md` first, then follow Steps 1-6.
- Lake Formation access control: search the AWS docs for "S3 Tables integration with Lake Formation".
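The decision table above can be condensed into a small helper. This is an illustrative sketch only -- the finding labels are hypothetical shorthand, not tool identifiers:

```shell
# Illustrative sketch of the decision table (finding labels are hypothetical).
decide() {
  case "$1" in
    fuzzy-db-name|non-s3tables-name-clash)
      echo "STOP: delegate to finding-data-lake-assets" ;;
    s3tables-name-match)
      echo "CHECK: compare schemas; reuse if compatible, recreate only on user confirmation" ;;
    no-match|explicit-new-table)
      echo "PROCEED: Steps 1-8" ;;
    *)
      echo "UNKNOWN finding: $1" >&2; return 1 ;;
  esac
}

decide fuzzy-db-name
decide no-match
```

Unrecognized findings fail loudly rather than defaulting to creation, matching the STOP-first posture of the table.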
## 1. Verify Dependencies

Constraints:

- You MUST check whether AWS MCP server tools or the AWS CLI are available, and inform the user if either is missing
- You MUST confirm the target AWS region and verify credentials with `aws sts get-caller-identity`
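A minimal pre-flight sketch for these checks, assuming a POSIX shell (a real run also needs valid AWS credentials and a configured region):

```shell
# Pre-flight sketch: confirm the CLI exists before touching credentials.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing dependency: $1" >&2; return 1; }
}

if require_cmd aws; then
  # Verifies credentials; prints the account ID on success.
  aws sts get-caller-identity --query Account --output text \
    || echo "AWS CLI found, but credentials are not configured" >&2
else
  echo "Install the AWS CLI or connect the AWS MCP server before proceeding." >&2
fi
```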

## 2. Understand the Schema

- Explicit schema: validate the Iceberg types.
- Loose description: ask for columns, types, and grain. Propose a schema and confirm it.
- Existing S3 data: infer the schema from file headers only. Create an empty table first, then use the `ingesting-into-data-lake` skill.

Constraints:

- You MUST read `references/best-practices.md` for Iceberg type mapping, partitions, and naming.
- You MUST ask for all required parameters upfront: table name, columns, types, partition strategy. For schema evolution, see `references/athena-ddl-path.md`.
- You MUST use all-lowercase names -- Glue rejects mixed case with `GENERIC_INTERNAL_ERROR`. Namespace and table names MUST NOT contain hyphens.
- You SHOULD suggest partition columns based on access patterns.

## 3. Create Table Bucket

Bucket names: 3-63 characters; lowercase letters, numbers, hyphens.

```bash
aws s3tables create-table-bucket --name <BUCKET_NAME> --region <REGION>
```

Capture the `table-bucket-arn` from the response. Encryption (SSE-S3 default, or SSE-KMS) and storage class (STANDARD, INTELLIGENT_TIERING) are set at creation. See `references/best-practices.md`.

Constraints:

- You MUST check existing buckets with `aws s3tables list-table-buckets` and ask the user to select one or create new.
- If using SSE-KMS, the KMS key policy MUST allow the S3 Tables maintenance service principal to read data. Search the AWS docs for "S3 Tables KMS key policy" for the required policy.
- If bucket creation fails, see `references/best-practices.md` for common errors.
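The naming rules from Steps 2 and 3 can be validated before any AWS call. A sketch using POSIX `grep -E` (the no-leading/trailing-hyphen detail for buckets follows general S3 bucket conventions and is an assumption here):

```shell
# Bucket names: 3-63 chars; lowercase letters, digits, hyphens.
valid_bucket_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$'
}

# Namespace and table names: lowercase, no hyphens
# (Glue rejects mixed case with GENERIC_INTERNAL_ERROR).
valid_table_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9_]+$'
}

valid_bucket_name "analytics-lake" && echo "bucket name OK"
valid_table_name  "daily_orders"   && echo "table name OK"
valid_table_name  "Daily-Orders"   || echo "rejected: mixed case / hyphen"
```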

## 4. Create Namespace

```bash
aws s3tables create-namespace --table-bucket-arn <ARN> --namespace <NAMESPACE>
```

Constraints:

- You MUST list existing namespaces first and suggest reuse where relevant
- You MUST use lowercase names with no hyphens

## 5. Create Glue Data Catalog Integration

Check whether `s3tablescatalog` exists (created once per region per account):

```bash
aws glue get-catalog --catalog-id s3tablescatalog
```

If not found, create it (requires `glue:CreateCatalog` and `glue:PassConnection`):

```bash
aws glue create-catalog --name "s3tablescatalog" --catalog-input '{
  "FederatedCatalog": {
    "Identifier": "arn:aws:s3tables:<REGION>:<ACCOUNT_ID>:bucket/*",
    "ConnectionName": "aws:s3tables"
  },
  "CreateDatabaseDefaultPermissions": [{"Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"}, "Permissions": ["ALL"]}],
  "CreateTableDefaultPermissions": [{"Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"}, "Permissions": ["ALL"]}],
  "AllowFullTableExternalDataAccess": "True"
}'
```

Verify with `aws glue get-catalogs --parent-catalog-id s3tablescatalog`.
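The check-then-create pair can be wrapped so it is idempotent. In this sketch the CLI prefix is injectable purely so the flow can be exercised without real AWS access, and `catalog-input.json` is assumed to hold the JSON shown above:

```shell
# Idempotent sketch: probe for the catalog, create only when the probe fails.
# "$@" is the CLI prefix (normally: aws glue), injectable for testing.
ensure_s3tables_catalog() {
  if "$@" get-catalog --catalog-id s3tablescatalog >/dev/null 2>&1; then
    echo "s3tablescatalog already exists"
  else
    "$@" create-catalog --name "s3tablescatalog" \
      --catalog-input file://catalog-input.json
  fi
}

# Real usage (requires glue:CreateCatalog and glue:PassConnection):
#   ensure_s3tables_catalog aws glue
```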

## 6. Configure Access Control

S3 Tables uses the `s3tables:*` IAM namespace (not `s3:*`).

Querying principal permissions (bucket policy):

- `s3tables:GetTableBucket`, `s3tables:GetNamespace`, `s3tables:GetTable`, `s3tables:GetTableMetadataLocation`, `s3tables:GetTableData`

Querying principal permissions (IAM policy):

- `glue:GetCatalog`, `glue:GetDatabase`, `glue:GetTable`

You MUST scope permissions to the correct ARN patterns. You MUST read `references/access-control.md` for the exact resource ARNs.

Constraints:

- You MUST ask the user for the querying principal's ARN
- You MUST NOT grant broader permissions than necessary
- You MUST NOT create IAM roles automatically; verify existing roles and guide the user

## 7. Create the Table

| Context | Path |
| --- | --- |
| Default (any user) | S3 Tables API (below) |
| User specifically wants SQL DDL | Athena DDL (see `references/athena-ddl-path.md`) |
| Glue ETL pipeline | Spark DDL via `--conf` job args (not `spark.conf.set()`). You MUST read `references/table-creation-glue-etl.md` for the `--conf` string. |

Default path, the S3 Tables API:

```bash
aws s3tables create-table \
  --table-bucket-arn <ARN> \
  --namespace <NAMESPACE> \
  --name <TABLE_NAME> \
  --format ICEBERG \
  --metadata '<METADATA_JSON>'
```

The metadata JSON MUST nest under the `"iceberg"` key:

```json
{"iceberg":{"schema":{"fields":[
  {"name":"order_date","type":"date","required":true},
  {"name":"customer_id","type":"string","required":true},
  {"name":"amount","type":"double","required":false}
]},
"partitionSpec":{"fields":[
  {"sourceId":1,"fieldId":1000,"transform":"month","name":"order_date_month"}
]}}}
```

Constraints:

- `partitionSpec.sourceId` MUST reference a valid schema field ID
- For schema evolution after creation, use Athena DDL; see `references/athena-ddl-path.md`
- You MUST use `schemaV2` for complex types (list, map, struct) with explicit field IDs. See `references/best-practices.md`.
- You SHOULD search the AWS docs for "IcebergPartitionField S3 Tables" for supported partition transforms
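Quoting the metadata JSON inline is error-prone; one way to keep it readable is a quoted here-doc. A sketch with hypothetical namespace and table names, guarded so the `aws` call only runs once `TABLE_BUCKET_ARN` is set:

```shell
# Build the metadata JSON in a quoted here-doc so the shell leaves it untouched.
METADATA=$(cat <<'EOF'
{"iceberg":{"schema":{"fields":[
  {"name":"order_date","type":"date","required":true},
  {"name":"customer_id","type":"string","required":true},
  {"name":"amount","type":"double","required":false}
]},
"partitionSpec":{"fields":[
  {"sourceId":1,"fieldId":1000,"transform":"month","name":"order_date_month"}
]}}}
EOF
)

# Guarded so the sketch is inert until you export TABLE_BUCKET_ARN.
if [ -n "${TABLE_BUCKET_ARN:-}" ]; then
  aws s3tables create-table \
    --table-bucket-arn "$TABLE_BUCKET_ARN" \
    --namespace analytics \
    --name daily_orders \
    --format ICEBERG \
    --metadata "$METADATA"
fi
```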

## 8. Verify and Confirm

You MUST verify the table with `aws s3tables get-table` and confirm queryability by running `DESCRIBE <table_name>` via Athena using `--query-execution-context '{"Catalog":"s3tablescatalog/<BUCKET_NAME>","Database":"<NAMESPACE>"}'`. Do NOT put the catalog in the SQL. Present a summary: bucket ARN, namespace, table, schema, partitions.
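A sketch of the two verification calls. The names are hypothetical, `OutputLocation` must point at an Athena results bucket you own, and the block is guarded so it stays inert until `TABLE_BUCKET_ARN` is set:

```shell
# Catalog and database go in the execution context, never in the SQL.
CONTEXT='{"Catalog":"s3tablescatalog/analytics-lake","Database":"analytics"}'

if [ -n "${TABLE_BUCKET_ARN:-}" ]; then
  # 1. Confirm the table exists in S3 Tables.
  aws s3tables get-table \
    --table-bucket-arn "$TABLE_BUCKET_ARN" \
    --namespace analytics \
    --name daily_orders
  # 2. Confirm queryability through Athena.
  aws athena start-query-execution \
    --query-string 'DESCRIBE daily_orders' \
    --query-execution-context "$CONTEXT" \
    --result-configuration 'OutputLocation=s3://my-athena-results/'
fi
```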

## Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| "Table location can not be specified" | LOCATION clause in CREATE TABLE | Remove the LOCATION clause. S3 Tables manages storage automatically. |
| AccessDeniedException with an `s3:*` policy | Using `s3:*`, not `s3tables:*` | S3 Tables uses the `s3tables:*` namespace. Update the IAM policy. |

## Additional Resources

- `access-control.md` -- IAM permissions, ARN patterns, permission errors
- `best-practices.md` -- Iceberg types, partitions, naming, common errors
- `athena-ddl-path.md` -- Athena DDL, schema evolution
- `table-creation-glue-etl.md` -- Spark DDL via Glue ETL
- Loading data: the `ingesting-into-data-lake` skill