query

You are helping the user query data using DuckDB.

Input: `$@`

Follow these steps in order.

Step 1 — Resolve state and determine the mode


Look for an existing state file in either location:
```bash
STATE_DIR=""
test -f .duckdb-skills/state.sql && STATE_DIR=".duckdb-skills"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo "$PWD")"
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
test -f "$HOME/.duckdb-skills/$PROJECT_ID/state.sql" && STATE_DIR="$HOME/.duckdb-skills/$PROJECT_ID"
```
If found, verify the databases it references are still accessible:
```bash
duckdb -init "$STATE_DIR/state.sql" -c "SHOW DATABASES;"
```
Now determine the mode:

- **Ad-hoc mode** if: the `--file` flag is present, or the SQL references file paths/literals (e.g. `FROM 'data.csv'`), or `STATE_DIR` is empty.
- **Session mode** if: `STATE_DIR` is set and the input references table names, is natural language, or is SQL without file references.

If no state file exists and no file is referenced, fall back to ad-hoc mode against `:memory:` — the user must reference files directly in their SQL.

If the state file exists but any ATTACH in it fails, warn the user and fall back to ad-hoc mode.
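As a concrete sketch of the state-resolution step above: the project ID is just the checkout path with `/` replaced by `-`. The path below is hypothetical; a real run derives it via `git rev-parse`.

```bash
# Hypothetical checkout path (normally from: git rev-parse --show-toplevel)
PROJECT_ROOT="/home/alice/analytics"
# Same transformation the resolution script applies:
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
echo "$PROJECT_ID"   # prints: -home-alice-analytics
# The state file would then live at "$HOME/.duckdb-skills/$PROJECT_ID/state.sql"
```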

Step 2 — Check DuckDB is installed


```bash
command -v duckdb
```

If not found, delegate to `/duckdb-skills:install-duckdb` and then continue.

Step 3 — Generate SQL if needed


If the input is natural language (not valid SQL), generate SQL using the Friendly SQL reference below.

In session mode, first retrieve the schema to inform query generation:

```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "
SELECT table_name FROM duckdb_tables() ORDER BY table_name;
"
```

Then for relevant tables:

```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "DESCRIBE <table_name>;"
```

Use the schema context and the Friendly SQL reference to generate the most appropriate query.
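For instance, given a hypothetical `orders` table with `customer_id` and `amount` columns, the question "who are the top 5 customers by total spend?" might translate to:

```sql
FROM orders
SELECT customer_id, sum(amount) AS total_spend
GROUP BY ALL
ORDER BY total_spend DESC
LIMIT 5;
```

Note the query is already bounded by `LIMIT`, so it would also pass the size check in Step 4.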

Step 4 — Estimate result size


Before executing, estimate whether the query could produce a very large result that would consume excessive tokens when returned to this conversation.

Session mode — check row counts for the tables involved:

```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "
SELECT table_name, estimated_size, column_count
FROM duckdb_tables()
WHERE table_name IN ('<table1>', '<table2>');
"
```

Ad-hoc mode — probe the source:

```bash
duckdb :memory: -csv -c "
SET allowed_paths=['FILE_PATH'];
SET enable_external_access=false;
SET allow_persistent_secrets=false;
SET lock_configuration=true;
SELECT count() AS row_count FROM 'FILE_PATH';
"
```

Evaluate:

- If the query already has a `LIMIT`, `count()`, or other aggregation that bounds the output -> safe, proceed.
- If the source has >1M rows and the query has no LIMIT or aggregation -> tell the user: "This query would return a very large result set. Displaying it here would consume a lot of tokens and increase cost. I'd recommend adding `LIMIT 1000` or an aggregation to keep the output manageable." Ask for confirmation before running as-is.
- If the data size is >10 GB -> additionally warn: "This table is over 10 GB — the query may take a while to complete." Proceed if the user confirms.

Skip this step for queries that are intrinsically bounded (e.g. `DESCRIBE`, `SUMMARIZE`, aggregations, `count()`).
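As an illustration of the evaluation rules, an unbounded scan of a large hypothetical `events` table would trigger the warning, while either of these bounded rewrites would pass the check:

```sql
-- Unbounded: returns every row; warn before running.
-- FROM events;

-- Bounded by LIMIT: passes.
FROM events LIMIT 1000;

-- Bounded by aggregation: passes.
FROM events SELECT count() AS n;
```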

Step 5 — Execute the query


Ad-hoc mode (sandboxed — only the referenced file is accessible):

```bash
duckdb :memory: -csv <<'SQL'
SET allowed_paths=['FILE_PATH'];
SET enable_external_access=false;
SET allow_persistent_secrets=false;
SET lock_configuration=true;
<QUERY>;
SQL
```

Replace `FILE_PATH` with the actual file path extracted from the query or `--file` argument. If multiple files are referenced, include all paths in the `allowed_paths` list.

Session mode (user-trusted database):

```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "<QUERY>"
```

For multi-line queries, use a heredoc with `-init`:

```bash
duckdb -init "$STATE_DIR/state.sql" -csv <<'SQL'
<QUERY>;
SQL
```

Always use heredocs (`<<'SQL'`) for multi-line queries to avoid shell quoting issues.
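A filled-in sketch of the multi-file case, with hypothetical file names — both paths are whitelisted in `allowed_paths` before external access is locked down:

```sql
SET allowed_paths=['orders.csv', 'customers.csv'];
SET enable_external_access=false;
SET allow_persistent_secrets=false;
SET lock_configuration=true;
SELECT c.name, sum(o.amount) AS total
FROM 'orders.csv' o
JOIN 'customers.csv' c ON o.customer_id = c.id
GROUP BY ALL;
```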

Step 6 — Handle errors


- Syntax error: show the error, suggest a corrected query, and re-run.
- Missing extension (e.g. `Extension "X" not loaded`): delegate to `/duckdb-skills:install-duckdb <ext>`, then retry.
- Table not found (session mode): list available tables with `FROM duckdb_tables()` and suggest corrections.
- File not found (ad-hoc mode): use `find "$PWD" -name "<filename>" 2>/dev/null` to locate the file and suggest the corrected path.
- Persistent or unclear DuckDB error: use `/duckdb-skills:duckdb-docs <error message or relevant keywords>` to search the documentation for guidance, then apply the fix and retry.

Step 7 — Present results


Show the query output to the user. If the result has more than 100 rows, note the truncation and suggest adding `LIMIT` to the query.

For natural language questions, also provide a brief interpretation of the results.


DuckDB Friendly SQL Reference


When generating SQL, prefer these idiomatic DuckDB constructs:

Compact clauses


- **FROM-first**: `FROM table WHERE x > 10` (implicit `SELECT *`)
- **GROUP BY ALL**: auto-groups by all non-aggregate columns
- **ORDER BY ALL**: orders by all columns for deterministic results
- **SELECT * EXCLUDE (col1, col2)**: drop columns from the wildcard
- **SELECT * REPLACE (expr AS col)**: transform a column in place
- **UNION ALL BY NAME**: combine tables with different column orders
- **Percentage LIMIT**: `LIMIT 10%` returns a percentage of rows
- **Prefix aliases**: `SELECT x: 42` instead of `SELECT 42 AS x`
- **Trailing commas** allowed in SELECT lists
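Several of these clauses compose naturally. A sketch over a hypothetical `sales` table, using FROM-first, GROUP BY ALL, ORDER BY ALL, and a trailing comma:

```sql
FROM sales
SELECT
    region,
    sum(amount) AS total,
GROUP BY ALL
ORDER BY ALL;
```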

Query features


- **count()**: no need for `count(*)`
- **Reusable aliases**: use column aliases in WHERE / GROUP BY / HAVING
- **Lateral column aliases**: `SELECT i+1 AS j, j+2 AS k`
- **COLUMNS(*)**: apply expressions across columns; supports regex, EXCLUDE, REPLACE, lambdas
- **FILTER clause**: `count() FILTER (WHERE x > 10)` for conditional aggregation
- **GROUPING SETS / CUBE / ROLLUP**: advanced multi-level aggregation
- **Top-N per group**: `max(col, 3)` returns the top 3 as a list; also `arg_max(arg, val, n)`, `min_by(arg, val, n)`
- **DESCRIBE table_name**: schema summary (column names and types)
- **SUMMARIZE table_name**: instant statistical profile
- **PIVOT / UNPIVOT**: reshape between wide and long formats
- **SET VARIABLE x = expr**: define SQL-level variables, reference with `getvariable('x')`
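For instance, the FILTER clause plus a reusable alias in HAVING can replace a verbose CASE-based aggregation (hypothetical `orders` table):

```sql
FROM orders
SELECT
    customer_id,
    count() AS n_orders,
    count() FILTER (WHERE amount > 100) AS n_large
GROUP BY ALL
HAVING n_orders > 10;   -- alias reused instead of repeating count()
```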

Data import


- **Direct file queries**: `FROM 'file.csv'`, `FROM 'data.parquet'`
- **Globbing**: `FROM 'data/part-*.parquet'` reads multiple files
- **Auto-detection**: CSV headers and schemas are inferred automatically
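Globbing and auto-detection combine, so counting rows across a hypothetical directory of Parquet partitions is a one-liner; DuckDB infers the schema from the files themselves:

```sql
FROM 'data/part-*.parquet'
SELECT count() AS rows_across_all_parts;
```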

Expressions and types


- **Dot operator chaining**: `'hello'.upper()` or `col.trim().lower()`
- **List comprehensions**: `[x*2 FOR x IN list_col]`
- **List/string slicing**: `col[1:3]`, negative indexing `col[-1]`
- **`STRUCT.*` notation**: `SELECT s.* FROM (SELECT {'a': 1, 'b': 2} AS s)`
- **Square bracket lists**: `[1, 2, 3]`
- **format()**: `format('{}->{}', a, b)` for string formatting
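These expression features work together in a single SELECT; a small sketch over literals:

```sql
SELECT
    'hello'.upper() AS shout,                 -- dot chaining
    'hello world'[1:5] AS sliced,             -- string slicing
    [x*2 FOR x IN [1, 2, 3]] AS doubled,      -- list comprehension
    format('{} -> {}', 1, 2) AS arrow;        -- string formatting
```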

Joins


- **ASOF joins**: approximate matching on ordered data (e.g. timestamps)
- **POSITIONAL joins**: match rows by position, not keys
- **LATERAL joins**: reference prior table expressions in subqueries
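An ASOF join picks, for each left row, the most recent matching right row rather than requiring an exact key match. A sketch with hypothetical `trades` and `prices` tables keyed by symbol and timestamp:

```sql
SELECT t.symbol, t.ts, p.price
FROM trades t
ASOF JOIN prices p
    ON t.symbol = p.symbol AND t.ts >= p.ts;
```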

Data modification


- **CREATE OR REPLACE TABLE**: no need for `DROP TABLE IF EXISTS` first
- **CREATE TABLE ... AS SELECT (CTAS)**: create tables from query results
- **INSERT INTO ... BY NAME**: match columns by name, not position
- **INSERT OR IGNORE INTO / INSERT OR REPLACE INTO**: upsert patterns
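CTAS and BY NAME inserts pair well when loading files whose column order differs (file names here are hypothetical):

```sql
CREATE OR REPLACE TABLE staging AS SELECT * FROM 'batch1.csv';
-- batch2.csv may order its columns differently; BY NAME still lines them up.
INSERT INTO staging BY NAME SELECT * FROM 'batch2.csv';
```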