wren-dlt-connector

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

wren-dlt-connector

wren-dlt-connector

Connect SaaS data to Wren Engine for SQL analysis — from zero to a verified, queryable project in one conversation.
将SaaS数据连接至Wren Engine以进行SQL分析——只需一次对话,即可从零搭建出可验证、可查询的项目。

Who this is for

适用人群

Data analysts who know SQL and some Python, but may not have used dlt or Wren before. Explain concepts briefly when they first appear, but don't over-explain things a SQL-literate person would already know.
熟悉SQL和基础Python,但可能从未使用过dlt或Wren的数据分析师。首次出现相关概念时可简要解释,但无需过度讲解懂SQL的用户已了解的内容。

Overview

流程概述

This skill walks through a four-phase workflow:
  1. Extract — Use dlt (data load tool) to pull data from a SaaS API into a local DuckDB file
  2. Model — Introspect the DuckDB schema and auto-generate a Wren semantic project (YAML models, relationships, profile)
  3. Build & Verify — Build the project and run actual SQL queries to confirm everything works end-to-end
  4. Handoff — Show the user their data and next steps
The user might enter at any phase. Ask which phase they're starting from — they may already have a
.duckdb
file and just need phases 2–4.
The goal is a project that actually queries successfully, not just files that look correct. Always run the verification step before declaring success.
此技能会引导用户完成四阶段工作流:
  1. 提取 — 使用dlt(data load tool)从SaaS API拉取数据到本地DuckDB文件
  2. 建模 — 探查DuckDB的Schema并自动生成Wren语义项目(YAML模型、关联关系、配置文件)
  3. 构建与验证 — 构建项目并运行实际SQL查询,确认端到端流程正常运行
  4. 交付 — 向用户展示其数据及后续步骤
用户可能从任意阶段开始操作。询问用户的起始阶段——他们可能已经拥有
.duckdb
文件,只需完成第2至第4阶段。
**目标是打造一个能成功执行查询的项目,而非仅生成看似正确的文件。**在宣布成功前务必执行验证步骤。

Critical: DuckDB catalog naming

关键注意事项:DuckDB catalog命名规则

When wren engine connects to a DuckDB file, it ATTACHes it using the filename (without
.duckdb
extension) as the catalog alias:
ATTACH DATABASE 'stripe_data.duckdb' AS "stripe_data" (READ_ONLY)
This means every model's
table_reference.catalog
must equal the DuckDB filename stem
. If the file is
hubspot.duckdb
, the catalog is
hubspot
. If it's
my_pipeline.duckdb
, the catalog is
my_pipeline
.
Getting this wrong causes "table not found" errors at query time. The
introspect_dlt.py
script handles this automatically.
当Wren Engine连接DuckDB文件时,会使用文件名(不含
.duckdb
扩展名)作为catalog别名来ATTACH数据库:
ATTACH DATABASE 'stripe_data.duckdb' AS "stripe_data" (READ_ONLY)
这意味着每个模型的
table_reference.catalog
必须与DuckDB文件名的主干部分一致
。如果文件名为
hubspot.duckdb
,则catalog为
hubspot
;如果文件名为
my_pipeline.duckdb
,则catalog为
my_pipeline
此处设置错误会导致查询时出现“表未找到”错误。
introspect_dlt.py
脚本会自动处理此问题。

Critical: Type normalization

关键注意事项:类型标准化

Column types must be normalized using wren SDK's
type_mapping.parse_type()
function, which uses sqlglot to convert database-specific types (like DuckDB's
HUGEINT
,
TIMESTAMP WITH TIME ZONE
) into canonical SQL types that wren-core understands. Do not hardcode type mappings — always delegate to
parse_type(raw_type, "duckdb")
.
The
introspect_dlt.py
script does this automatically when wren SDK is installed.
必须使用Wren SDK的
type_mapping.parse_type()
函数对列类型进行标准化,该函数通过sqlglot将数据库特定类型(如DuckDB的
HUGEINT
TIMESTAMP WITH TIME ZONE
)转换为Wren-core可识别的标准SQL类型。请勿硬编码类型映射——始终委托给
parse_type(raw_type, "duckdb")
处理。
安装Wren SDK后,
introspect_dlt.py
脚本会自动完成此操作。

Phase 1: Extract — dlt Pipeline Setup

阶段1:提取 — dlt管道设置

Step 1: Pick the SaaS source

步骤1:选择SaaS数据源

Ask the user which SaaS service they want to connect. Read
references/dlt_sources.md
for a list of popular verified sources and their auth requirements. If the source isn't listed, check whether dlt has a verified source for it by searching
dlthub.com/docs/dlt-ecosystem/verified-sources
.
询问用户想要连接的SaaS服务。查看
references/dlt_sources.md
获取热门已验证数据源及其认证要求的列表。如果该数据源未在列表中,可通过搜索
dlthub.com/docs/dlt-ecosystem/verified-sources
确认dlt是否有对应的已验证数据源。

Step 2: Install dlt

步骤2:安装dlt

bash
pip install "dlt[duckdb]" --break-system-packages
bash
pip install "dlt[duckdb]" --break-system-packages

Step 3: Write the pipeline script

步骤3:编写管道脚本

Create a Python script that:
  1. Imports the dlt source function for the chosen SaaS
  2. Configures the pipeline with
    destination='duckdb'
    and a local file path
  3. Runs the pipeline with
    pipeline.run(source)
Here's the general pattern — adapt it per source (check
references/dlt_sources.md
for source-specific templates):
python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="<source>_pipeline",
    destination="duckdb",
    dataset_name="<source>_data",
)
创建Python脚本,完成以下操作:
  1. 导入所选SaaS对应的dlt源函数
  2. 配置管道,设置
    destination='duckdb'
    及本地文件路径
  3. 运行
    pipeline.run(source)
    执行管道
以下是通用模板——可根据数据源调整(查看
references/dlt_sources.md
获取数据源特定模板):
python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="<source>_pipeline",
    destination="duckdb",
    dataset_name="<source>_data",
)

Source-specific: check references/dlt_sources.md for auth patterns

数据源特定:查看references/dlt_sources.md获取认证模式

source = <source_function>(api_key=dlt.secrets.value)
info = pipeline.run(source) print(info)
undefined
source = <source_function>(api_key=dlt.secrets.value)
info = pipeline.run(source) print(info)
undefined

Step 4: Set up credentials

步骤4:配置凭证

dlt reads credentials from environment variables or
.dlt/secrets.toml
. The simplest approach for a one-time run:
bash
undefined
dlt从环境变量或
.dlt/secrets.toml
读取凭证。一次性运行的最简方式:
bash
undefined

Set the credential as an environment variable

将凭证设置为环境变量

The exact variable name depends on the source — check references/dlt_sources.md

具体变量名取决于数据源——查看references/dlt_sources.md

export SOURCES__<SOURCE>__API_KEY="the-actual-key"

Ask the user for their API key or token. Remind them:
- Never commit credentials to git
- Environment variables are the simplest way for a one-time run
- For repeated use, they can create `.dlt/secrets.toml`
export SOURCES__<SOURCE>__API_KEY="the-actual-key"

向用户索要其API密钥或令牌。提醒用户:
- 切勿将凭证提交至git
- 环境变量是一次性运行的最简方式
- 如需重复使用,可创建`.dlt/secrets.toml`文件

Step 5: Run the pipeline

步骤5:运行管道

bash
python <pipeline_script>.py
After the run, confirm:
  1. The pipeline completed without errors
  2. A
    .duckdb
    file was created (usually at
    <pipeline_name>.duckdb
    )
  3. Print discovered tables and their column counts
python
import duckdb
con = duckdb.connect("<pipeline_name>.duckdb", read_only=True)
for row in con.execute("""
    SELECT table_schema, table_name,
           (SELECT COUNT(*) FROM information_schema.columns c
            WHERE c.table_schema = t.table_schema AND c.table_name = t.table_name) as col_count
    FROM information_schema.tables t
    WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
      AND table_name NOT LIKE '_dlt_%'
    ORDER BY table_schema, table_name
""").fetchall():
    print(f"  {row[0]}.{row[1]} ({row[2]} columns)")
con.close()
bash
python <pipeline_script>.py
运行完成后,确认:
  1. 管道无错误完成运行
  2. 已创建
    .duckdb
    文件(通常位于
    <pipeline_name>.duckdb
    路径)
  3. 打印已发现的表及其列数
python
import duckdb
con = duckdb.connect("<pipeline_name>.duckdb", read_only=True)
for row in con.execute("""
    SELECT table_schema, table_name,
           (SELECT COUNT(*) FROM information_schema.columns c
            WHERE c.table_schema = t.table_schema AND c.table_name = t.table_name) as col_count
    FROM information_schema.tables t
    WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
      AND table_name NOT LIKE '_dlt_%'
    ORDER BY table_schema, table_name
""").fetchall():
    print(f"  {row[0]}.{row[1]} ({row[2]} columns)")
con.close()

Phase 2: Model — Generate Wren Project

阶段2:建模 — 生成Wren项目

Run the introspection script to auto-generate a complete Wren project from the DuckDB file:
bash
python <path-to-this-skill>/scripts/introspect_dlt.py \
    --duckdb-path <path-to-duckdb-file> \
    --output-dir <project-directory> \
    --project-name <name>
This script:
  • Connects to the DuckDB file (read-only)
  • Sets
    table_reference.catalog
    to the DuckDB filename stem
    (matching wren engine's ATTACH behavior)
  • Discovers all tables and columns via
    information_schema
  • Filters out dlt internal tables (
    _dlt_loads
    ,
    _dlt_pipeline_state
    , etc.)
  • Filters out dlt metadata columns (
    _dlt_id
    ,
    _dlt_load_id
    ,
    _dlt_list_idx
    ) from model definitions
  • Detects parent-child relationships from
    _dlt_parent_id
    columns and table naming conventions
  • Normalizes column types using
    wren.type_mapping.parse_type()
    (sqlglot-based)
  • Generates a complete v2 YAML project (wren_project.yml, models/, relationships.yml, instructions.md)
After running, show the user what was generated:
bash
undefined
运行探查脚本,从DuckDB文件自动生成完整的Wren项目:
bash
python <path-to-this-skill>/scripts/introspect_dlt.py \
    --duckdb-path <path-to-duckdb-file> \
    --output-dir <project-directory> \
    --project-name <name>
此脚本会:
  • 以只读模式连接DuckDB文件
  • table_reference.catalog
    设置为DuckDB文件名的主干部分
    (匹配Wren Engine的ATTACH行为)
  • 通过
    information_schema
    发现所有表和列
  • 过滤掉dlt内部表(
    _dlt_loads
    _dlt_pipeline_state
    等)
  • 从模型定义中过滤掉dlt元数据列(
    _dlt_id
    _dlt_load_id
    _dlt_list_idx
  • _dlt_parent_id
    列和表命名规则中检测父子关联关系
  • 使用
    wren.type_mapping.parse_type()
    标准化列类型
    (基于sqlglot)
  • 生成完整的v2 YAML项目(wren_project.yml、models/、relationships.yml、instructions.md)
运行完成后,向用户展示生成的内容:
bash
undefined

Show project summary

展示项目摘要

cat <project-directory>/wren_project.yml echo "---" ls <project-directory>/models/ echo "---" cat <project-directory>/relationships.yml
undefined
cat <project-directory>/wren_project.yml echo "---" ls <project-directory>/models/ echo "---" cat <project-directory>/relationships.yml
undefined

Verify model correctness

验证模型正确性

Spot-check one generated model to confirm:
  1. table_reference.catalog
    matches the DuckDB filename (e.g.,
    stripe_data
    for
    stripe_data.duckdb
    )
  2. table_reference.schema
    matches the DuckDB schema (usually
    main
    )
  3. No
    _dlt_*
    columns appear in the columns list
  4. Column types look reasonable (VARCHAR, BIGINT, BOOLEAN, TIMESTAMP, etc.)
抽查一个生成的模型,确认:
  1. table_reference.catalog
    与DuckDB文件名匹配(例如
    stripe_data.duckdb
    对应
    stripe_data
  2. table_reference.schema
    与DuckDB的schema匹配(通常为
    main
  3. 列列表中无
    _dlt_*
  4. 列类型合理(VARCHAR、BIGINT、BOOLEAN、TIMESTAMP等)

Set up the connection profile

设置连接配置文件

Create a Wren profile so the user can query without specifying connection details every time. The
url
must point to the directory containing the
.duckdb
file (not the file itself):
python
import yaml
from pathlib import Path

wren_home = Path.home() / ".wren"
wren_home.mkdir(exist_ok=True)
profiles_file = wren_home / "profiles.yml"

existing = (yaml.safe_load(profiles_file.read_text()) or {}) if profiles_file.exists() else {}
existing.setdefault("profiles", {})

profile_name = "<source>_dlt"
existing["profiles"][profile_name] = {
    "datasource": "duckdb",
    "url": str(Path("<duckdb-path>").resolve().parent),
    "format": "duckdb",
}
existing["active"] = profile_name

profiles_file.write_text(yaml.dump(existing, default_flow_style=False, sort_keys=False))
创建Wren配置文件,让用户无需每次查询都指定连接详情。
url
必须指向包含
.duckdb
文件的目录
(而非文件本身):
python
import yaml
from pathlib import Path

wren_home = Path.home() / ".wren"
wren_home.mkdir(exist_ok=True)
profiles_file = wren_home / "profiles.yml"

existing = (yaml.safe_load(profiles_file.read_text()) or {}) if profiles_file.exists() else {}
existing.setdefault("profiles", {})

profile_name = "<source>_dlt"
existing["profiles"][profile_name] = {
    "datasource": "duckdb",
    "url": str(Path("<duckdb-path>").resolve().parent),
    "format": "duckdb",
}
existing["active"] = profile_name

profiles_file.write_text(yaml.dump(existing, default_flow_style=False, sort_keys=False))

Phase 3: Build & Verify — The Project Must Actually Work

阶段3:构建与验证 — 项目必须能实际运行

This phase is not optional. A project that generates YAML but fails at query time is not a success.
此阶段为必填项。仅生成YAML但查询失败的项目不算成功。

Step 1: Build the MDL

步骤1:构建MDL

bash
cd <project-directory>
wren context build
This compiles the YAML models into
target/mdl.json
. If this fails, fix the issues before proceeding (see Troubleshooting below).
bash
cd <project-directory>
wren context build
此操作会将YAML模型编译为
target/mdl.json
。如果编译失败,需先修复问题再继续(见下方故障排除)。

Step 2: Validate with a real query

步骤2:通过真实查询验证

Run at least one query per generated model to confirm the project is functional:
bash
undefined
为每个生成的模型至少运行一次查询,确认项目可用:
bash
undefined

For each model, verify it resolves correctly

针对每个模型,验证其可正确解析

wren --sql 'SELECT COUNT(*) as total FROM "<table_name>"'

If any query fails, debug and fix the model before moving on. Common issues:
- Wrong catalog in table_reference → "table not found"
- Type mismatch → fix the column type in metadata.yml
- Missing profile → check `wren profile list`
wren --sql 'SELECT COUNT(*) as total FROM "<table_name>"'

如果任何查询失败,需先调试并修复模型再继续。常见问题:
- `table_reference.catalog`设置错误 → “表未找到”
- 类型不匹配 → 在metadata.yml中修复列类型
- 缺少配置文件 → 检查`wren profile list`

Step 3: Run interesting queries

步骤3:运行实用查询

Once basic queries pass, run 2–3 more interesting queries to show the user what their data looks like:
bash
undefined
基础查询通过后,运行2-3个更实用的查询,向用户展示其数据情况:
bash
undefined

Preview data

预览数据

wren --sql 'SELECT * FROM "<table_name>" LIMIT 5'
wren --sql 'SELECT * FROM "<table_name>" LIMIT 5'

If there's a relationship, verify both models are queryable

如果有关联关系,验证两个模型均可查询

wren --sql 'SELECT * FROM "<parent>" LIMIT 5' wren --sql 'SELECT * FROM "<child>" LIMIT 5'

Show the results to the user and explain what they're seeing. This is their first look at the data through Wren — make it count.
wren --sql 'SELECT * FROM "<parent>" LIMIT 5' wren --sql 'SELECT * FROM "<child>" LIMIT 5'

向用户展示结果并解释内容。这是用户首次通过Wren查看数据,需确保体验良好。

Step 4: Confirm success

步骤4:确认成功

Only after queries return real data, tell the user the setup is complete. Summarize:
  • How many models were created
  • What relationships were detected
  • Which profile is active
  • Example queries they can try next
只有当查询返回真实数据后,才可告知用户设置已完成。总结:
  • 创建的模型数量
  • 检测到的关联关系
  • 当前激活的配置文件
  • 可供尝试的示例查询

Troubleshooting

故障排除

If
wren context build
fails:
  • Check that
    data_source: duckdb
    is set in
    wren_project.yml
  • Verify the DuckDB file path in the profile is correct
  • Run
    wren context validate
    for detailed error messages
If queries fail with "table not found":
  • Most likely cause:
    table_reference.catalog
    doesn't match the DuckDB filename. If the file is
    pipeline.duckdb
    , the catalog must be
    pipeline
    , not empty string.
  • Check the profile's
    url
    points to the directory containing the
    .duckdb
    file
  • Table names with double underscores need quoting:
    "hubspot__contacts"
If queries fail with type errors:
  • Check column types in the model YAML — they should be canonical SQL types (VARCHAR, BIGINT, etc.)
  • Re-run
    introspect_dlt.py
    with wren SDK installed to get proper type normalization
General:
  • Check that the profile is active:
    wren profile list
  • The DuckDB file might be locked if a dlt pipeline is running — wait for it to finish
如果
wren context build
失败:
  • 检查
    wren_project.yml
    中是否设置了
    data_source: duckdb
  • 验证配置文件中的DuckDB文件路径是否正确
  • 运行
    wren context validate
    获取详细错误信息
如果查询失败并提示“表未找到”:
  • 最可能原因
    table_reference.catalog
    与DuckDB文件名不匹配。如果文件名为
    pipeline.duckdb
    ,则catalog必须为
    pipeline
    ,而非空字符串。
  • 检查配置文件的
    url
    是否指向包含
    .duckdb
    文件的目录
  • 含双下划线的表名需加引号:
    "hubspot__contacts"
如果查询失败并提示类型错误:
  • 检查模型YAML中的列类型——应为标准SQL类型(VARCHAR、BIGINT等)
  • 安装Wren SDK后重新运行
    introspect_dlt.py
    以获取正确的类型标准化结果
通用问题:
  • 检查配置文件是否激活:
    wren profile list
  • 如果dlt管道正在运行,DuckDB文件可能被锁定——等待管道运行完成

Important notes

重要说明

  • dlt's
    _dlt_parent_id
    /
    _dlt_id
    columns are kept in the actual DuckDB tables but hidden from Wren model definitions. They're only used in relationship conditions.
  • DuckDB has a single-writer limitation. Don't run a dlt sync while querying. For concurrent access, dlt should write to a separate file and swap atomically.
  • The generated models use
    table_reference
    (not
    ref_sql
    ) since they map directly to DuckDB tables created by dlt.
  • Column types are normalized using wren SDK's
    parse_type()
    with sqlglot's DuckDB dialect. If a type looks wrong, the user can edit the model's
    metadata.yml
    directly.
  • dlt的
    _dlt_parent_id
    /
    _dlt_id
    列会保留在实际DuckDB表中,但会从Wren模型定义中隐藏。它们仅用于关联关系条件。
  • DuckDB存在单写入者限制。查询时请勿运行dlt同步。如需并发访问,dlt应写入单独文件并以原子方式替换。
  • 生成的模型使用
    table_reference
    (而非
    ref_sql
    ),因为它们直接映射到dlt创建的DuckDB表。
  • 列类型通过Wren SDK的
    parse_type()
    结合sqlglot的DuckDB方言进行标准化。如果类型显示错误,用户可直接编辑模型的
    metadata.yml
    文件。