exploring-data-catalog
Structured inventory and cataloging across your AWS data landscape: Glue Data Catalog with S3 Tables, Redshift-federated, and remote Iceberg catalogs.
Overview
Maps data in an AWS account. Starts with catalog landscape (Glue, S3 Tables, federated), then drills into databases and tables. Read-only — no query execution.
Constraints for parameter acquisition:
- You MUST ask for the target AWS region upfront if not provided
- You MUST support a single optional argument: search term, catalog name, database name, S3 path, or table name
- You MUST accept the argument as direct input or a pointer to a file containing the spec
- You MUST confirm the scope (full landscape vs. targeted deep dive) before making API calls
- You MUST respect the user's decision to abort at any step
Common Tasks
Pagination: All list and search calls in this workflow may return paginated results. You MUST pass `--next-token` from the previous response until no more tokens are returned. You MUST NOT assume a single page contains all results.
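The pagination rule can be sketched as a small loop. This is a minimal illustration, not the workflow's actual implementation: `paginate` and `fetch_page` are hypothetical names, and the `NextToken` and `DatabaseList` keys assume the JSON shape returned by Glue list calls such as `get-databases`.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def paginate(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]],
    items_key: str,
    token_key: str = "NextToken",
) -> Iterator[Any]:
    """Yield every item across all pages of a list or search call.

    fetch_page is a hypothetical callable: it receives the token from the
    previous response (None on the first call) and returns one response dict.
    """
    token: Optional[str] = None
    while True:
        response = fetch_page(token)
        yield from response.get(items_key, [])
        token = response.get(token_key)
        if not token:  # no token returned means this was the last page
            break


# Fake two-page response standing in for a real API call.
_pages = {
    None: {"DatabaseList": [{"Name": "sales"}], "NextToken": "t1"},
    "t1": {"DatabaseList": [{"Name": "logs"}]},
}
names = [db["Name"] for db in paginate(lambda t: _pages[t], "DatabaseList")]
print(names)
```

Passing the page fetcher in as a callable keeps the loop independent of whether the call goes through the MCP tools or the AWS CLI.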
1. Verify Dependencies
Check for required tools and AWS access before discovery.
Constraints:
- You MUST verify AWS MCP server tools are available (`aws___call_aws`, `aws___search_documentation`) and fall back to AWS CLI if not
- You MUST confirm credentials are valid: `aws sts get-caller-identity`
- You MUST inform the user about any missing tools and ask whether to proceed
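One possible way to script the credential check when falling back to the AWS CLI is shown below. It is a sketch under assumptions: `caller_account` and its injectable `run` parameter are hypothetical helpers, and the code only relies on `aws sts get-caller-identity` returning JSON with an `Account` field.

```python
import json
import subprocess
from typing import Callable, Optional, Sequence


def _run_cli(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout; raises on a non-zero exit code."""
    return subprocess.run(
        list(cmd), capture_output=True, text=True, check=True
    ).stdout


def caller_account(run: Callable[[Sequence[str]], str] = _run_cli) -> Optional[str]:
    """Return the AWS account ID if credentials are valid, else None."""
    cmd = ["aws", "sts", "get-caller-identity", "--output", "json"]
    try:
        identity = json.loads(run(cmd))
    except (OSError, subprocess.CalledProcessError, json.JSONDecodeError):
        # CLI missing, credentials rejected, or output not parseable
        return None
    return identity.get("Account")


# Fake runner standing in for the real CLI.
fake_ok = lambda cmd: '{"UserId": "X", "Account": "123456789012", "Arn": "arn"}'
print(caller_account(fake_ok))
```

A `None` result is the signal to tell the user what is missing and ask whether to proceed.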
2. Discover Catalogs
List catalogs in account:
```bash
aws glue get-catalogs --recursive --include-root
```
Classify each catalog by type:
| Field Present | Catalog Type | What It Contains |
|---|---|---|
| Neither | Default (Glue) | Standard Glue databases and tables |
|  | S3 Tables | Managed Iceberg table buckets |
|  | Redshift-federated | Redshift databases exposed as Glue catalogs |
|  | Remote Iceberg | External catalogs (Snowflake, Databricks, Iceberg REST) |
Constraints:
- You MUST include `--include-root` to capture the default account catalog
- You MUST present a summary of catalog counts by type
- If only the default catalog exists, you SHOULD skip the catalog overview and go to step 3
3. Enumerate Databases and Tables
For each catalog (or the user-specified one):
```bash
aws glue get-databases --catalog-id <catalog-id>
aws glue get-tables --database-name <db> --catalog-id <catalog-id>
```
For S3 Tables catalogs, also enumerate via the S3 Tables API:
```bash
aws s3tables list-table-buckets
aws s3tables list-namespaces --table-bucket-arn <arn>
aws s3tables list-tables --table-bucket-arn <arn> --namespace <ns>
```
Constraints:
- You MUST flag S3 Tables not registered in Glue; you SHOULD suggest registration
- For sub-catalogs, `--catalog-id` accepts the catalog name (not the ARN)
- For the default catalog, omit `--catalog-id` or pass the account ID
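Flagging S3 Tables that are missing from Glue amounts to a set difference between the two listings. The sketch below assumes both listings have already been collected as `(namespace, table)` pairs and that an S3 Tables namespace corresponds to a Glue database name; `find_unregistered` is a hypothetical helper.

```python
from typing import Iterable, List, Set, Tuple

TableKey = Tuple[str, str]  # (namespace or database, table name)


def find_unregistered(
    s3_tables: Iterable[TableKey],
    glue_tables: Iterable[TableKey],
) -> List[TableKey]:
    """Return S3 Tables entries with no matching Glue entry, sorted."""
    registered: Set[TableKey] = set(glue_tables)
    return sorted(key for key in set(s3_tables) if key not in registered)


# Example listings; the names are made up for illustration.
s3_listing = [("analytics", "events"), ("analytics", "orders")]
glue_listing = [("analytics", "events")]
missing = find_unregistered(s3_listing, glue_listing)
print(missing)
```

Each entry in `missing` is a candidate to report as not registered, together with a suggestion to register it.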
4. Capture Details and Analyze
For each database, capture table count, formats, partitioning, and S3 locations. For each table of interest, capture column schemas, types, partition keys, SerDe format, and last access time.
You MUST report data formats in human-readable terms (Parquet, CSV, JSON), not raw SerDe class names.
See discovery-checklist.md for analysis framework.
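Reporting readable formats instead of SerDe class names can be done with a lookup table. This is a sketch, not an exhaustive mapping: `readable_format` is a hypothetical helper, the input is assumed to be a table definition as returned by `get-tables`, and only a handful of common Hive SerDe classes are covered.

```python
from typing import Any, Dict

# Common Hive SerDe classes mapped to readable format names.
SERDE_FORMATS = {
    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe": "Parquet",
    "org.apache.hadoop.hive.ql.io.orc.OrcSerde": "ORC",
    "org.apache.hadoop.hive.serde2.avro.AvroSerDe": "Avro",
    "org.apache.hadoop.hive.serde2.OpenCSVSerde": "CSV",
    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe": "Delimited text",
    "org.openx.data.jsonserde.JsonSerDe": "JSON",
    "org.apache.hive.hcatalog.data.JsonSerDe": "JSON",
}


def readable_format(table: Dict[str, Any]) -> str:
    """Translate a Glue table definition into a human-readable format name."""
    # Iceberg tables are marked through a table parameter rather than a SerDe.
    params = table.get("Parameters") or {}
    if str(params.get("table_type", "")).upper() == "ICEBERG":
        return "Iceberg"
    serde_info = (table.get("StorageDescriptor") or {}).get("SerdeInfo") or {}
    serde = serde_info.get("SerializationLibrary", "")
    if serde in SERDE_FORMATS:
        return SERDE_FORMATS[serde]
    # Keep the raw class visible so unknown formats can still be investigated.
    return f"Unknown ({serde})" if serde else "Unknown"


parquet_table = {
    "StorageDescriptor": {
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        }
    }
}
print(readable_format(parquet_table))
```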
Argument Routing
Resolve the argument in this order; stop at the first match:
- Starts with `s3://`: S3 path (explore unregistered data, detect formats)
- Matches a known catalog from step 2 (`get-catalogs`): deep dive into that catalog
- Matches a known database (`get-databases`): deep dive into that database
- Matches a known table (`get-tables`): detailed table analysis with schema and partitions
- No match: treat as search term (Glue `search-tables`)
- No args: full landscape discovery (catalogs, then databases and tables)
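The first-match ordering above can be sketched as a chain of checks. The function name and the mode labels (`"s3_path"`, `"catalog"`, and so on) are hypothetical, and the known catalog, database, and table names are assumed to have been collected by the earlier listing steps.

```python
from typing import Iterable, Optional, Tuple


def route_argument(
    arg: Optional[str],
    catalogs: Iterable[str] = (),
    databases: Iterable[str] = (),
    tables: Iterable[str] = (),
) -> Tuple[str, Optional[str]]:
    """Return (mode, value); the first matching rule wins."""
    if not arg:
        return ("full_landscape", None)  # no argument: discover everything
    if arg.startswith("s3://"):
        return ("s3_path", arg)
    if arg in set(catalogs):
        return ("catalog", arg)
    if arg in set(databases):
        return ("database", arg)
    if arg in set(tables):
        return ("table", arg)
    return ("search", arg)  # nothing matched: treat as a search term


# "sales" is both a catalog and a database here; the catalog rule comes first.
print(route_argument("sales", catalogs=["sales"], databases=["sales"]))
```

Checking catalogs before databases before tables means an ambiguous name resolves to the broadest scope, which matches the order in the list.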
Principles
- Start with catalog landscape, then narrow based on user interest
- Always report catalog types — users need to know where data lives
- Always report data formats — they drive cost and performance decisions
- Flag stale tables and missing descriptions
- Suggest partitioning for large unpartitioned tables
- Summary first, details on request
- You MUST NOT execute Athena queries (`start-query-execution`) during discovery; query execution belongs to `querying-data-lake`
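The flagging principles can be approximated from table metadata alone, without running queries. This sketch makes several assumptions: the 90-day staleness threshold is arbitrary and not from this document, `table_flags` is a hypothetical helper, timestamps are timezone-aware `datetime` values, and table size is not checked, so the `unpartitioned` flag is only a starting point for a partitioning suggestion.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional

STALE_AFTER = timedelta(days=90)  # assumed threshold, not from the source


def table_flags(table: Dict[str, Any], now: Optional[datetime] = None) -> List[str]:
    """Return review flags for one Glue table definition."""
    now = now or datetime.now(timezone.utc)
    flags: List[str] = []
    # Fall back to the update time when no access time is recorded.
    last_seen = table.get("LastAccessTime") or table.get("UpdateTime")
    if last_seen is not None and now - last_seen > STALE_AFTER:
        flags.append("stale")
    if not table.get("Description"):
        flags.append("missing description")
    if not table.get("PartitionKeys"):
        flags.append("unpartitioned")
    return flags


example = {
    "Name": "orders",
    "UpdateTime": datetime(2024, 6, 1, tzinfo=timezone.utc),
    "PartitionKeys": [],
}
flags = table_flags(example, now=datetime(2025, 1, 1, tzinfo=timezone.utc))
print(flags)
```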
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| Only sub-catalogs returned, default missing | `--include-root` omitted | Re-run `get-catalogs` with `--include-root` |
| Federated catalog query slow or failing | Network call to remote source; connection misconfigured | Report connection errors clearly rather than silently skipping |
| S3 Tables not queryable via Athena | Tables exist in S3 Tables API but not registered in Glue | Flag as "not queryable"; suggest registration |
| Error when using `--catalog-id` | Default catalog requires omitting `--catalog-id` or passing the account ID | Omit `--catalog-id` for the default catalog |
Additional Resources
- Discovery checklist
- AWS Glue Data Catalog API
- S3 Tables list operations