Connect to Data Source
Register an external data source with AWS Glue so downstream skills (ingesting-into-data-lake) can move data from it. A Glue connection stores the network config, driver, and credential reference for one source. Create once per source, reuse across jobs.
Philosophy
A connection is a named pipe, not a pipeline. This skill produces a tested, reusable Glue connection. It does not move data.
Common Tasks
You MUST execute commands using AWS MCP server tools when connected -- they provide validation, sandboxed execution, and audit logging. Fall back to AWS CLI only if MCP is unavailable. You MUST explain each step before executing.
Workflow
1. Verify Dependencies and Context
- You MUST check whether AWS MCP tools or AWS CLI are available and inform the user if missing
- You MUST confirm the target AWS region and verify credentials with `aws sts get-caller-identity`
2. Classify the Source
Ask the user which source type they want to connect to, or infer from hints:
| User says... | Source type | Connection type | Reference |
|---|---|---|---|
| "Oracle", "SQL Server", "Postgres", "MySQL", "RDS <engine>" | JDBC database | `JDBC` | jdbc-setup.md |
| "Redshift", "my cluster", "my data warehouse on AWS" | Redshift | `JDBC` | jdbc-setup.md (Redshift section) |
| "Snowflake" | Snowflake | `SNOWFLAKE` | snowflake-setup.md |
| "BigQuery", "Google analytics warehouse" | BigQuery | `BIGQUERY` | bigquery-setup.md |
If the user names DynamoDB or a local file, stop and tell them: DynamoDB is read directly by Glue without a connection, and local files belong in the ingesting-into-data-lake skill's local-upload workflow.
3. Gather Connection Hints from the User
You MUST ask for hints the user can provide -- do not guess.
For all sources:
- Desired connection name (lowercase, hyphens: `oracle-prod-sales`, `snowflake-analytics`)
- Existing Secrets Manager secret, or create one
- Is the source reachable from a Glue VPC (same, peered, VPN, Direct Connect)?
JDBC: hostname/endpoint, port, database, whether RDS/Aurora/self-managed, IAM DB auth enabled (Aurora/RDS MySQL/Postgres), SSL required.
Snowflake: account identifier, warehouse, role, default database, auth (password, key-pair, OAuth).
BigQuery: GCP project ID, location, whether service account JSON is provisioned.
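The naming convention and JDBC hints above can be sanity-checked locally before anything is created. A minimal sketch, assuming a PostgreSQL source; the name, host, port, and database values are all placeholders:

```bash
# Hypothetical pre-checks before creating the connection.
# Validate the proposed name: lowercase letters/digits, hyphen-separated.
name="oracle-prod-sales"
if printf '%s' "$name" | grep -Eq '^[a-z][a-z0-9]*(-[a-z0-9]+)*$'; then
  echo "name ok: $name"
else
  echo "name rejected: use lowercase letters, digits, and hyphens" >&2
fi

# Assemble the JDBC URL from the gathered hints (placeholder values).
host="db.example.internal"; port=5432; database="orders"
jdbc_url="jdbc:postgresql://${host}:${port}/${database}"
echo "$jdbc_url"
```

Catching a bad name or malformed URL here is cheaper than discovering it at Step 7, since connection names are immutable once created.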
4. Discover Existing Connections and Candidate Sources
Check what exists before creating.
Existing Glue connections:
```bash
aws glue get-connections --filter ConnectionType=<TYPE> --region <REGION>
```
If a suitable one exists, confirm and skip to Step 7.
Candidate sources in account (JDBC/Redshift only):
- RDS: `aws rds describe-db-instances`
- Aurora: `aws rds describe-db-clusters`
- Redshift: `aws redshift describe-clusters`
Present candidates to user; let them pick. See discovery.md.
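When presenting candidates, trimming the output to the fields the user needs to choose (identifier, endpoint, engine) keeps the list readable. A sketch using standard AWS CLI `--query` JMESPath filters:

```bash
# Hedged sketch: list only the fields needed to pick a candidate source.
aws rds describe-db-instances \
  --query 'DBInstances[].[DBInstanceIdentifier,Endpoint.Address,Endpoint.Port,Engine]' \
  --output table
aws rds describe-db-clusters \
  --query 'DBClusters[].[DBClusterIdentifier,Endpoint,Engine]' \
  --output table
aws redshift describe-clusters \
  --query 'Clusters[].[ClusterIdentifier,Endpoint.Address,Endpoint.Port]' \
  --output table
```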
5. Register Credentials
You MUST encourage AWS Secrets Manager over plaintext passwords. You SHOULD prefer IAM database authentication where supported (Aurora/RDS MySQL and PostgreSQL, Redshift). See credential-security.md.
- You MUST confirm with user before creating a new Secrets Manager secret
- You MUST NOT write plaintext credentials into chat or logs
- For IAM DB auth, no secret is needed
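For reference, a secret backing a JDBC connection commonly stores username/password keys, which Glue resolves via the connection's secret reference. A hypothetical shape (placeholder values; never echo real credentials into chat or logs):

```json
{
  "username": "glue_ingest",
  "password": "<stored-value>"
}
```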
6. Create the Glue Connection
Follow the source-specific reference for connection properties:
```bash
aws glue create-connection --connection-input '<JSON>' --region <REGION>
```
Private sources require `PhysicalConnectionRequirements` (SubnetId, SecurityGroupIdList, AvailabilityZone). See network-setup.md.
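For illustration, a hypothetical connection input for a private PostgreSQL source might look like this; every identifier below (name, URL, secret, subnet, security group, AZ) is a placeholder, and the source-specific reference files remain authoritative for the properties:

```json
{
  "Name": "postgres-prod-orders",
  "ConnectionType": "JDBC",
  "ConnectionProperties": {
    "JDBC_CONNECTION_URL": "jdbc:postgresql://db.example.internal:5432/orders",
    "SECRET_ID": "glue/postgres-prod-orders"
  },
  "PhysicalConnectionRequirements": {
    "SubnetId": "subnet-0abc1234def567890",
    "SecurityGroupIdList": ["sg-0123456789abcdef0"],
    "AvailabilityZone": "us-east-1a"
  }
}
```

Passing it as `--connection-input file://connection-input.json` avoids shell-quoting issues with inline JSON.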
7. Test the Connection
You MUST test before handing off. Testing is two-phase: a quick API check, then an engine-level verification.
Phase A: Glue TestConnection (network and credential sanity check)
```bash
aws glue test-connection --connection-name <NAME> --region <REGION>
```
This validates that Glue can reach the source and authenticate. It does NOT prove the connection works end-to-end with the query engine the user plans to use.
Phase B: Engine-level verification
After TestConnection passes, verify the connection works with the user's intended engine by running a minimal query through it:
- Glue ETL (default): Run a smoke-test Glue job that reads one row via the connection. See troubleshooting.md.
- Athena: If the user plans to query via Athena with a federated connector, run a `SELECT 1` through the Athena connection to confirm the Lambda-based connector can reach the source.
- Glue Crawler: If the user plans to crawl the source, run a test crawl on a single table.
Phase B catches issues that TestConnection misses: driver compatibility at job runtime, catalog configuration, Spark-level serialization, and engine-specific auth flows (e.g., Snowflake SNOWFLAKE type works in ETL but not via JDBC crawlers).
On success in both phases, tell the user the connection name is ready for ingesting-into-data-lake. On failure in either phase, go to Step 8.
8. Troubleshoot (only if test failed)
Diagnose in order: network, credentials, driver. See troubleshooting.md.
Constraints:
- You MUST check VPC routing, security groups, and S3 VPC endpoint before blaming credentials
- You MUST verify Glue role can read the Secrets Manager secret
- You MUST NOT rotate credentials without user confirmation
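One way to check the secret-access constraint without touching the source, assuming the Glue role ARN and secret ARN are known (both are placeholders below), is IAM policy simulation:

```bash
# Hypothetical check: can the Glue job role read the connection's secret?
# Both ARNs are placeholders to replace with real values.
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/GlueJobRole \
  --action-names secretsmanager:GetSecretValue \
  --resource-arns 'arn:aws:secretsmanager:us-east-1:123456789012:secret:glue/postgres-prod-orders-AbCdEf' \
  --query 'EvaluationResults[].EvalDecision' --output text
```

A decision other than "allowed" points at the Glue role's policy rather than the credentials themselves.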
Argument Routing
- No args: Walk through Steps 1-7 interactively
- Source type keyword (e.g., `snowflake`, `oracle`): Skip to Step 2 with the type prefilled
- Existing connection name: Skip to Step 7 (test), then Step 8 if failing
- Hostname or RDS endpoint: Skip to Step 4 with the candidate prefilled
Gotchas
- Glue's `SNOWFLAKE` connection type is distinct from a `JDBC` connection configured for Snowflake. You MUST use `SNOWFLAKE` for Spark ETL jobs; do not use `JDBC`.
- Connection names are immutable. Choose carefully.
- `PhysicalConnectionRequirements.AvailabilityZone` MUST match the subnet's AZ, or the connection fails at job runtime, not creation time.
- IAM database authentication tokens expire in 15 minutes. The Glue job generates a fresh token on each connection; do not cache them.
- An S3 VPC gateway endpoint MUST exist in the VPC used by private-source connections. Without it, Glue jobs cannot read their scripts or write results to S3.
Troubleshooting
| Likely cause | Fix |
|---|---|
| VPC routing, SG rule, or NAT gateway missing | See troubleshooting.md |
| Credentials wrong, Secrets Manager access missing, or IAM DB auth misconfigured | See troubleshooting.md |
| Custom driver JAR not set or wrong class name | See troubleshooting.md |
| S3 VPC endpoint missing | Create an S3 gateway endpoint in the connection's VPC |
References
- jdbc-setup.md -- Oracle, SQL Server, PostgreSQL, MySQL, RDS, Redshift
- snowflake-setup.md -- Glue `SNOWFLAKE` type, auth modes
- bigquery-setup.md -- Glue `BIGQUERY` type, GCP service accounts
- discovery.md -- Finding existing connections and candidate sources
- credential-security.md -- Secrets Manager and IAM DB auth
- network-setup.md -- VPC, subnets, security groups, endpoints
- troubleshooting.md -- Connection errors and diagnostic flow