wren-dlt-connector
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
Chinesewren-dlt-connector
wren-dlt-connector
Connect SaaS data to Wren Engine for SQL analysis — from zero to a verified, queryable project in one conversation.
将SaaS数据连接至Wren Engine以进行SQL分析——只需一次对话,即可从零搭建出可验证、可查询的项目。
Who this is for
适用人群
Data analysts who know SQL and some Python, but may not have used dlt or Wren before. Explain concepts briefly when they first appear, but don't over-explain things a SQL-literate person would already know.
熟悉SQL和基础Python,但可能从未使用过dlt或Wren的数据分析师。首次出现相关概念时可简要解释,但无需过度讲解懂SQL的用户已了解的内容。
Overview
流程概述
This skill walks through a four-phase workflow:
- Extract — Use dlt (data load tool) to pull data from a SaaS API into a local DuckDB file
- Model — Introspect the DuckDB schema and auto-generate a Wren semantic project (YAML models, relationships, profile)
- Build & Verify — Build the project and run actual SQL queries to confirm everything works end-to-end
- Handoff — Show the user their data and next steps
The user might enter at any phase. Ask which phase they're starting from — they may already have a file and just need phases 2–4.
.duckdbThe goal is a project that actually queries successfully, not just files that look correct. Always run the verification step before declaring success.
此技能会引导用户完成四阶段工作流:
- 提取 — 使用dlt(data load tool)从SaaS API拉取数据到本地DuckDB文件
- 建模 — 探查DuckDB的Schema并自动生成Wren语义项目(YAML模型、关联关系、配置文件)
- 构建与验证 — 构建项目并运行实际SQL查询,确认端到端流程正常运行
- 交付 — 向用户展示其数据及后续步骤
用户可能从任意阶段开始操作。询问用户的起始阶段——他们可能已经拥有文件,只需完成第2至第4阶段。
.duckdb**目标是打造一个能成功执行查询的项目,而非仅生成看似正确的文件。**在宣布成功前务必执行验证步骤。
Critical: DuckDB catalog naming
关键注意事项:DuckDB catalog命名规则
When wren engine connects to a DuckDB file, it ATTACHes it using the filename (without extension) as the catalog alias:
.duckdbATTACH DATABASE 'stripe_data.duckdb' AS "stripe_data" (READ_ONLY)This means every model's must equal the DuckDB filename stem. If the file is , the catalog is . If it's , the catalog is .
table_reference.cataloghubspot.duckdbhubspotmy_pipeline.duckdbmy_pipelineGetting this wrong causes "table not found" errors at query time. The script handles this automatically.
introspect_dlt.py当Wren Engine连接DuckDB文件时,会使用文件名(不含扩展名)作为catalog别名来ATTACH数据库:
.duckdbATTACH DATABASE 'stripe_data.duckdb' AS "stripe_data" (READ_ONLY)这意味着每个模型的必须与DuckDB文件名的主干部分一致。如果文件名为,则catalog为;如果文件名为,则catalog为。
table_reference.cataloghubspot.duckdbhubspotmy_pipeline.duckdbmy_pipeline此处设置错误会导致查询时出现“表未找到”错误。脚本会自动处理此问题。
introspect_dlt.pyCritical: Type normalization
关键注意事项:类型标准化
Column types must be normalized using wren SDK's function, which uses sqlglot to convert database-specific types (like DuckDB's , ) into canonical SQL types that wren-core understands. Do not hardcode type mappings — always delegate to .
type_mapping.parse_type()HUGEINTTIMESTAMP WITH TIME ZONEparse_type(raw_type, "duckdb")The script does this automatically when wren SDK is installed.
introspect_dlt.py必须使用Wren SDK的函数对列类型进行标准化,该函数通过sqlglot将数据库特定类型(如DuckDB的、)转换为Wren-core可识别的标准SQL类型。请勿硬编码类型映射——始终委托给处理。
type_mapping.parse_type()HUGEINTTIMESTAMP WITH TIME ZONEparse_type(raw_type, "duckdb")安装Wren SDK后,脚本会自动完成此操作。
introspect_dlt.pyPhase 1: Extract — dlt Pipeline Setup
阶段1:提取 — dlt管道设置
Step 1: Pick the SaaS source
步骤1:选择SaaS数据源
Ask the user which SaaS service they want to connect. Read for a list of popular verified sources and their auth requirements. If the source isn't listed, check whether dlt has a verified source for it by searching .
references/dlt_sources.mddlthub.com/docs/dlt-ecosystem/verified-sources询问用户想要连接的SaaS服务。查看获取热门已验证数据源及其认证要求的列表。如果该数据源未在列表中,可通过搜索确认dlt是否有对应的已验证数据源。
references/dlt_sources.mddlthub.com/docs/dlt-ecosystem/verified-sourcesStep 2: Install dlt
步骤2:安装dlt
bash
pip install "dlt[duckdb]" --break-system-packagesbash
pip install "dlt[duckdb]" --break-system-packagesStep 3: Write the pipeline script
步骤3:编写管道脚本
Create a Python script that:
- Imports the dlt source function for the chosen SaaS
- Configures the pipeline with and a local file path
destination='duckdb' - Runs the pipeline with
pipeline.run(source)
Here's the general pattern — adapt it per source (check for source-specific templates):
references/dlt_sources.mdpython
import dlt
pipeline = dlt.pipeline(
pipeline_name="<source>_pipeline",
destination="duckdb",
dataset_name="<source>_data",
)创建Python脚本,完成以下操作:
- 导入所选SaaS对应的dlt源函数
- 配置管道,设置及本地文件路径
destination='duckdb' - 运行执行管道
pipeline.run(source)
以下是通用模板——可根据数据源调整(查看获取数据源特定模板):
references/dlt_sources.mdpython
import dlt
pipeline = dlt.pipeline(
pipeline_name="<source>_pipeline",
destination="duckdb",
dataset_name="<source>_data",
)Source-specific: check references/dlt_sources.md for auth patterns
数据源特定:查看references/dlt_sources.md获取认证模式
source = <source_function>(api_key=dlt.secrets.value)
info = pipeline.run(source)
print(info)
undefinedsource = <source_function>(api_key=dlt.secrets.value)
info = pipeline.run(source)
print(info)
undefinedStep 4: Set up credentials
步骤4:配置凭证
dlt reads credentials from environment variables or . The simplest approach for a one-time run:
.dlt/secrets.tomlbash
undefineddlt从环境变量或读取凭证。一次性运行的最简方式:
.dlt/secrets.tomlbash
undefinedSet the credential as an environment variable
将凭证设置为环境变量
The exact variable name depends on the source — check references/dlt_sources.md
具体变量名取决于数据源——查看references/dlt_sources.md
export SOURCES__<SOURCE>__API_KEY="the-actual-key"
Ask the user for their API key or token. Remind them:
- Never commit credentials to git
- Environment variables are the simplest way for a one-time run
- For repeated use, they can create `.dlt/secrets.toml`export SOURCES__<SOURCE>__API_KEY="the-actual-key"
向用户索要其API密钥或令牌。提醒用户:
- 切勿将凭证提交至git
- 环境变量是一次性运行的最简方式
- 如需重复使用,可创建`.dlt/secrets.toml`文件Step 5: Run the pipeline
步骤5:运行管道
bash
python <pipeline_script>.pyAfter the run, confirm:
- The pipeline completed without errors
- A file was created (usually at
.duckdb)<pipeline_name>.duckdb - Print discovered tables and their column counts
python
import duckdb
con = duckdb.connect("<pipeline_name>.duckdb", read_only=True)
for row in con.execute("""
SELECT table_schema, table_name,
(SELECT COUNT(*) FROM information_schema.columns c
WHERE c.table_schema = t.table_schema AND c.table_name = t.table_name) as col_count
FROM information_schema.tables t
WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
AND table_name NOT LIKE '_dlt_%'
ORDER BY table_schema, table_name
""").fetchall():
print(f" {row[0]}.{row[1]} ({row[2]} columns)")
con.close()bash
python <pipeline_script>.py运行完成后,确认:
- 管道无错误完成运行
- 已创建文件(通常位于
.duckdb路径)<pipeline_name>.duckdb - 打印已发现的表及其列数
python
import duckdb
con = duckdb.connect("<pipeline_name>.duckdb", read_only=True)
for row in con.execute("""
SELECT table_schema, table_name,
(SELECT COUNT(*) FROM information_schema.columns c
WHERE c.table_schema = t.table_schema AND c.table_name = t.table_name) as col_count
FROM information_schema.tables t
WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
AND table_name NOT LIKE '_dlt_%'
ORDER BY table_schema, table_name
""").fetchall():
print(f" {row[0]}.{row[1]} ({row[2]} columns)")
con.close()Phase 2: Model — Generate Wren Project
阶段2:建模 — 生成Wren项目
Run the introspection script to auto-generate a complete Wren project from the DuckDB file:
bash
python <path-to-this-skill>/scripts/introspect_dlt.py \
--duckdb-path <path-to-duckdb-file> \
--output-dir <project-directory> \
--project-name <name>This script:
- Connects to the DuckDB file (read-only)
- Sets to the DuckDB filename stem (matching wren engine's ATTACH behavior)
table_reference.catalog - Discovers all tables and columns via
information_schema - Filters out dlt internal tables (,
_dlt_loads, etc.)_dlt_pipeline_state - Filters out dlt metadata columns (,
_dlt_id,_dlt_load_id) from model definitions_dlt_list_idx - Detects parent-child relationships from columns and table naming conventions
_dlt_parent_id - Normalizes column types using (sqlglot-based)
wren.type_mapping.parse_type() - Generates a complete v2 YAML project (wren_project.yml, models/, relationships.yml, instructions.md)
After running, show the user what was generated:
bash
undefined运行探查脚本,从DuckDB文件自动生成完整的Wren项目:
bash
python <path-to-this-skill>/scripts/introspect_dlt.py \
--duckdb-path <path-to-duckdb-file> \
--output-dir <project-directory> \
--project-name <name>此脚本会:
- 以只读模式连接DuckDB文件
- 将设置为DuckDB文件名的主干部分(匹配Wren Engine的ATTACH行为)
table_reference.catalog - 通过发现所有表和列
information_schema - 过滤掉dlt内部表(、
_dlt_loads等)_dlt_pipeline_state - 从模型定义中过滤掉dlt元数据列(、
_dlt_id、_dlt_load_id)_dlt_list_idx - 从列和表命名规则中检测父子关联关系
_dlt_parent_id - 使用标准化列类型(基于sqlglot)
wren.type_mapping.parse_type() - 生成完整的v2 YAML项目(wren_project.yml、models/、relationships.yml、instructions.md)
运行完成后,向用户展示生成的内容:
bash
undefinedShow project summary
展示项目摘要
cat <project-directory>/wren_project.yml
echo "---"
ls <project-directory>/models/
echo "---"
cat <project-directory>/relationships.yml
undefinedcat <project-directory>/wren_project.yml
echo "---"
ls <project-directory>/models/
echo "---"
cat <project-directory>/relationships.yml
undefinedVerify model correctness
验证模型正确性
Spot-check one generated model to confirm:
- matches the DuckDB filename (e.g.,
table_reference.catalogforstripe_data)stripe_data.duckdb - matches the DuckDB schema (usually
table_reference.schema)main - No columns appear in the columns list
_dlt_* - Column types look reasonable (VARCHAR, BIGINT, BOOLEAN, TIMESTAMP, etc.)
抽查一个生成的模型,确认:
- 与DuckDB文件名匹配(例如
table_reference.catalog对应stripe_data.duckdb)stripe_data - 与DuckDB的schema匹配(通常为
table_reference.schema)main - 列列表中无列
_dlt_* - 列类型合理(VARCHAR、BIGINT、BOOLEAN、TIMESTAMP等)
Set up the connection profile
设置连接配置文件
Create a Wren profile so the user can query without specifying connection details every time. The must point to the directory containing the file (not the file itself):
url.duckdbpython
import yaml
from pathlib import Path
wren_home = Path.home() / ".wren"
wren_home.mkdir(exist_ok=True)
profiles_file = wren_home / "profiles.yml"
existing = (yaml.safe_load(profiles_file.read_text()) or {}) if profiles_file.exists() else {}
existing.setdefault("profiles", {})
profile_name = "<source>_dlt"
existing["profiles"][profile_name] = {
"datasource": "duckdb",
"url": str(Path("<duckdb-path>").resolve().parent),
"format": "duckdb",
}
existing["active"] = profile_name
profiles_file.write_text(yaml.dump(existing, default_flow_style=False, sort_keys=False))创建Wren配置文件,让用户无需每次查询都指定连接详情。必须指向包含文件的目录(而非文件本身):
url.duckdbpython
import yaml
from pathlib import Path
wren_home = Path.home() / ".wren"
wren_home.mkdir(exist_ok=True)
profiles_file = wren_home / "profiles.yml"
existing = (yaml.safe_load(profiles_file.read_text()) or {}) if profiles_file.exists() else {}
existing.setdefault("profiles", {})
profile_name = "<source>_dlt"
existing["profiles"][profile_name] = {
"datasource": "duckdb",
"url": str(Path("<duckdb-path>").resolve().parent),
"format": "duckdb",
}
existing["active"] = profile_name
profiles_file.write_text(yaml.dump(existing, default_flow_style=False, sort_keys=False))Phase 3: Build & Verify — The Project Must Actually Work
阶段3:构建与验证 — 项目必须能实际运行
This phase is not optional. A project that generates YAML but fails at query time is not a success.
此阶段为必填项。仅生成YAML但查询失败的项目不算成功。
Step 1: Build the MDL
步骤1:构建MDL
bash
cd <project-directory>
wren context buildThis compiles the YAML models into . If this fails, fix the issues before proceeding (see Troubleshooting below).
target/mdl.jsonbash
cd <project-directory>
wren context build此操作会将YAML模型编译为。如果编译失败,需先修复问题再继续(见下方故障排除)。
target/mdl.jsonStep 2: Validate with a real query
步骤2:通过真实查询验证
Run at least one query per generated model to confirm the project is functional:
bash
undefined为每个生成的模型至少运行一次查询,确认项目可用:
bash
undefinedFor each model, verify it resolves correctly
针对每个模型,验证其可正确解析
wren --sql 'SELECT COUNT(*) as total FROM "<table_name>"'
If any query fails, debug and fix the model before moving on. Common issues:
- Wrong catalog in table_reference → "table not found"
- Type mismatch → fix the column type in metadata.yml
- Missing profile → check `wren profile list`wren --sql 'SELECT COUNT(*) as total FROM "<table_name>"'
如果任何查询失败,需先调试并修复模型再继续。常见问题:
- `table_reference.catalog`设置错误 → “表未找到”
- 类型不匹配 → 在metadata.yml中修复列类型
- 缺少配置文件 → 检查`wren profile list`Step 3: Run interesting queries
步骤3:运行实用查询
Once basic queries pass, run 2–3 more interesting queries to show the user what their data looks like:
bash
undefined基础查询通过后,运行2-3个更实用的查询,向用户展示其数据情况:
bash
undefinedPreview data
预览数据
wren --sql 'SELECT * FROM "<table_name>" LIMIT 5'
wren --sql 'SELECT * FROM "<table_name>" LIMIT 5'
If there's a relationship, verify both models are queryable
如果有关联关系,验证两个模型均可查询
wren --sql 'SELECT * FROM "<parent>" LIMIT 5'
wren --sql 'SELECT * FROM "<child>" LIMIT 5'
Show the results to the user and explain what they're seeing. This is their first look at the data through Wren — make it count.wren --sql 'SELECT * FROM "<parent>" LIMIT 5'
wren --sql 'SELECT * FROM "<child>" LIMIT 5'
向用户展示结果并解释内容。这是用户首次通过Wren查看数据,需确保体验良好。Step 4: Confirm success
步骤4:确认成功
Only after queries return real data, tell the user the setup is complete. Summarize:
- How many models were created
- What relationships were detected
- Which profile is active
- Example queries they can try next
只有当查询返回真实数据后,才可告知用户设置已完成。总结:
- 创建的模型数量
- 检测到的关联关系
- 当前激活的配置文件
- 可供尝试的示例查询
Troubleshooting
故障排除
If fails:
wren context build- Check that is set in
data_source: duckdbwren_project.yml - Verify the DuckDB file path in the profile is correct
- Run for detailed error messages
wren context validate
If queries fail with "table not found":
- Most likely cause: doesn't match the DuckDB filename. If the file is
table_reference.catalog, the catalog must bepipeline.duckdb, not empty string.pipeline - Check the profile's points to the directory containing the
urlfile.duckdb - Table names with double underscores need quoting:
"hubspot__contacts"
If queries fail with type errors:
- Check column types in the model YAML — they should be canonical SQL types (VARCHAR, BIGINT, etc.)
- Re-run with wren SDK installed to get proper type normalization
introspect_dlt.py
General:
- Check that the profile is active:
wren profile list - The DuckDB file might be locked if a dlt pipeline is running — wait for it to finish
如果失败:
wren context build- 检查中是否设置了
wren_project.ymldata_source: duckdb - 验证配置文件中的DuckDB文件路径是否正确
- 运行获取详细错误信息
wren context validate
如果查询失败并提示“表未找到”:
- 最可能原因:与DuckDB文件名不匹配。如果文件名为
table_reference.catalog,则catalog必须为pipeline.duckdb,而非空字符串。pipeline - 检查配置文件的是否指向包含
url文件的目录.duckdb - 含双下划线的表名需加引号:
"hubspot__contacts"
如果查询失败并提示类型错误:
- 检查模型YAML中的列类型——应为标准SQL类型(VARCHAR、BIGINT等)
- 安装Wren SDK后重新运行以获取正确的类型标准化结果
introspect_dlt.py
通用问题:
- 检查配置文件是否激活:
wren profile list - 如果dlt管道正在运行,DuckDB文件可能被锁定——等待管道运行完成
Important notes
重要说明
- dlt's /
_dlt_parent_idcolumns are kept in the actual DuckDB tables but hidden from Wren model definitions. They're only used in relationship conditions._dlt_id - DuckDB has a single-writer limitation. Don't run a dlt sync while querying. For concurrent access, dlt should write to a separate file and swap atomically.
- The generated models use (not
table_reference) since they map directly to DuckDB tables created by dlt.ref_sql - Column types are normalized using wren SDK's with sqlglot's DuckDB dialect. If a type looks wrong, the user can edit the model's
parse_type()directly.metadata.yml
- dlt的/
_dlt_parent_id列会保留在实际DuckDB表中,但会从Wren模型定义中隐藏。它们仅用于关联关系条件。_dlt_id - DuckDB存在单写入者限制。查询时请勿运行dlt同步。如需并发访问,dlt应写入单独文件并以原子方式替换。
- 生成的模型使用(而非
table_reference),因为它们直接映射到dlt创建的DuckDB表。ref_sql - 列类型通过Wren SDK的结合sqlglot的DuckDB方言进行标准化。如果类型显示错误,用户可直接编辑模型的
parse_type()文件。metadata.yml