query
You are helping the user query data using DuckDB.
Input: $@

Follow these steps in order.
Step 1 — Resolve state and determine the mode
Look for an existing state file in either location:
```bash
STATE_DIR=""
test -f .duckdb-skills/state.sql && STATE_DIR=".duckdb-skills"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo "$PWD")"
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
test -f "$HOME/.duckdb-skills/$PROJECT_ID/state.sql" && STATE_DIR="$HOME/.duckdb-skills/$PROJECT_ID"
```

If found, verify the databases it references are still accessible:
```bash
duckdb -init "$STATE_DIR/state.sql" -c "SHOW DATABASES;"
```

Now determine the mode:
- Ad-hoc mode if: the `--file` flag is present, or the SQL references file paths/literals (e.g. `FROM 'data.csv'`), or `STATE_DIR` is empty.
- Session mode if: `STATE_DIR` is set and the input references table names, is natural language, or is SQL without file references.

If no state file exists and no file is referenced, fall back to ad-hoc mode against `:memory:` — the user must reference files directly in their SQL.

If the state file exists but any ATTACH in it fails, warn the user and fall back to ad-hoc mode.
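The mode decision can be sketched in shell; the variable names, sample input, and file-extension patterns below are illustrative assumptions, not part of the skill:

```shell
# Hypothetical sketch of the mode decision. INPUT stands in for the user's
# argument string; only .csv/.parquet literals and --file are checked here.
STATE_DIR=".duckdb-skills"
INPUT="FROM 'data.csv' LIMIT 5"
MODE="session"
case "$INPUT" in
  *--file*|*.csv\'*|*.parquet\'*) MODE="ad-hoc" ;;   # explicit file reference wins
esac
[ -n "$STATE_DIR" ] || MODE="ad-hoc"                 # no state file -> ad-hoc
echo "$MODE"
```

A real implementation would also detect other file formats and bare paths; the point is only that file references force ad-hoc mode regardless of session state.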
Step 2 — Check DuckDB is installed
```bash
command -v duckdb
```

If not found, delegate to `/duckdb-skills:install-duckdb` and then continue.
Step 3 — Generate SQL if needed
If the input is natural language (not valid SQL), generate SQL using the Friendly SQL reference below.
In session mode, first retrieve the schema to inform query generation:
```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "
SELECT table_name FROM duckdb_tables() ORDER BY table_name;
"
```

Then for relevant tables:

```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "DESCRIBE <table_name>;"
```

Use the schema context and the Friendly SQL reference to generate the most appropriate query.
Step 4 — Estimate result size
Before executing, estimate whether the query could produce a very large result that would
consume excessive tokens when returned to this conversation.
Session mode — check row counts for the tables involved:
```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "
SELECT table_name, estimated_size, column_count
FROM duckdb_tables()
WHERE table_name IN ('<table1>', '<table2>');
"
```

Ad-hoc mode — probe the source:
```bash
duckdb :memory: -csv -c "
SET allowed_paths=['FILE_PATH'];
SET enable_external_access=false;
SET allow_persistent_secrets=false;
SET lock_configuration=true;
SELECT count() AS row_count FROM 'FILE_PATH';
"
```

Evaluate:
- If the query already has a `LIMIT`, `count()`, or other aggregation that bounds the output -> safe, proceed.
- If the source has >1M rows and the query has no LIMIT or aggregation -> tell the user: "This query would return a very large result set. Displaying it here would consume a lot of tokens and increase cost. I'd recommend adding `LIMIT 1000` or an aggregation to keep the output manageable." Ask for confirmation before running as-is.
- If the data size is >10 GB -> additionally warn: "This table is over 10 GB — the query may take a while to complete." Proceed if the user confirms.

Skip this step for queries that are intrinsically bounded (e.g. `DESCRIBE`, `SUMMARIZE`, aggregations, `count()`).
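The size gate reduces to a small decision; the variable names and the hard-coded probe result below are assumptions for illustration:

```shell
# Hypothetical sketch of the size gate. ROW_COUNT would come from the count()
# probe; BOUNDED would be 1 when the SQL already has a LIMIT or aggregation.
ROW_COUNT=2500000
BOUNDED=0
if [ "$BOUNDED" -eq 1 ]; then
  DECISION="run"
elif [ "$ROW_COUNT" -gt 1000000 ]; then
  DECISION="ask-user-to-confirm"
else
  DECISION="run"
fi
echo "$DECISION"
```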
Step 5 — Execute the query
Ad-hoc mode (sandboxed — only the referenced file is accessible):
```bash
duckdb :memory: -csv <<'SQL'
SET allowed_paths=['FILE_PATH'];
SET enable_external_access=false;
SET allow_persistent_secrets=false;
SET lock_configuration=true;
<QUERY>;
SQL
```

Replace `FILE_PATH` with the actual file path extracted from the query or `--file` argument. If multiple files are referenced, include all paths in the `allowed_paths` list.

Session mode (user-trusted database):
```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "<QUERY>"
```

For multi-line queries, use a heredoc with `-init`:

```bash
duckdb -init "$STATE_DIR/state.sql" -csv <<'SQL'
<QUERY>;
SQL
```

Always use heredocs (`<<'SQL'`) for multi-line queries to avoid shell quoting issues.
Step 6 — Handle errors
- Syntax error: show the error, suggest a corrected query, and re-run.
- Missing extension (e.g. `Extension "X" not loaded`): delegate to `/duckdb-skills:install-duckdb <ext>`, then retry.
- Table not found (session mode): list available tables with `FROM duckdb_tables()` and suggest corrections.
- File not found (ad-hoc mode): use `find "$PWD" -name "<filename>" 2>/dev/null` to locate the file and suggest the corrected path.
- Persistent or unclear DuckDB error: use `/duckdb-skills:duckdb-docs <error message or relevant keywords>` to search the documentation for guidance, then apply the fix and retry.
Step 7 — Present results
Show the query output to the user. If the result has more than 100 rows, note the truncation and suggest adding `LIMIT` to the query.

For natural language questions, also provide a brief interpretation of the results.
DuckDB Friendly SQL Reference
When generating SQL, prefer these idiomatic DuckDB constructs:
Compact clauses
- FROM-first: `FROM table WHERE x > 10` (implicit `SELECT *`)
- GROUP BY ALL: auto-groups by all non-aggregate columns
- ORDER BY ALL: orders by all columns for deterministic results
- SELECT * EXCLUDE (col1, col2): drop columns from wildcard
- SELECT * REPLACE (expr AS col): transform a column in-place
- UNION ALL BY NAME: combine tables with different column orders
- Percentage LIMIT: `LIMIT 10%` returns a percentage of rows
- Prefix aliases: `SELECT x: 42` instead of `SELECT 42 AS x`
- Trailing commas allowed in SELECT lists
Query features
- count(): no need for `count(*)`
- Reusable aliases: use column aliases in WHERE / GROUP BY / HAVING
- Lateral column aliases: `SELECT i+1 AS j, j+2 AS k`
- COLUMNS(*): apply expressions across columns; supports regex, EXCLUDE, REPLACE, lambdas
- FILTER clause: `count() FILTER (WHERE x > 10)` for conditional aggregation
- GROUPING SETS / CUBE / ROLLUP: advanced multi-level aggregation
- Top-N per group: `max(col, 3)` returns top 3 as a list; also `arg_max(arg, val, n)`, `min_by(arg, val, n)`
- DESCRIBE table_name: schema summary (column names and types)
- SUMMARIZE table_name: instant statistical profile
- PIVOT / UNPIVOT: reshape between wide and long formats
- SET VARIABLE x = expr: define SQL-level variables, reference with `getvariable('x')`
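A sketch of some of these features together; the `orders` table and its columns are assumptions:

```sql
-- Hypothetical examples; the "orders" table is assumed.
SELECT price * qty AS total FROM orders WHERE total > 100;        -- reusable alias
SELECT count() FILTER (WHERE status = 'late') AS late FROM orders;
SELECT customer, max(total, 3) AS top3_totals FROM orders GROUP BY ALL;
SUMMARIZE orders;                                                 -- instant profile
```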
Data import
- Direct file queries: `FROM 'file.csv'`, `FROM 'data.parquet'`
- Globbing: `FROM 'data/part-*.parquet'` reads multiple files
- Auto-detection: CSV headers and schemas are inferred automatically
Expressions and types
- Dot operator chaining: `'hello'.upper()` or `col.trim().lower()`
- List comprehensions: `[x*2 FOR x IN list_col]`
- List/string slicing: `col[1:3]`, negative indexing `col[-1]`
- STRUCT.* notation: `SELECT s.* FROM (SELECT {'a': 1, 'b': 2} AS s)`
- Square bracket lists: `[1, 2, 3]`
- format(): `format('{}->{}', a, b)` for string formatting
Joins
- ASOF joins: approximate matching on ordered data (e.g. timestamps)
- POSITIONAL joins: match rows by position, not keys
- LATERAL joins: reference prior table expressions in subqueries
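The ASOF case is the least familiar of the three; a sketch, with the `trades` and `quotes` tables and their columns assumed:

```sql
-- Hypothetical sketch: match each trade to the latest quote at or before it.
SELECT t.symbol, t.price, q.bid
FROM trades t
ASOF JOIN quotes q
  ON t.symbol = q.symbol AND t.ts >= q.ts;
```

The inequality on `ts` is what makes the join "as of": each left row pairs with the single closest right row satisfying it, rather than every match.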
Data modification
- CREATE OR REPLACE TABLE: no need for `DROP TABLE IF EXISTS` first
- CREATE TABLE ... AS SELECT (CTAS): create tables from query results
- INSERT INTO ... BY NAME: match columns by name, not position
- INSERT OR IGNORE INTO / INSERT OR REPLACE INTO: upsert patterns
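These combine naturally in a load pipeline; the `staging` and `users` tables are assumptions, and `INSERT OR REPLACE` needs a primary key on the target:

```sql
-- Hypothetical sketch; "staging" and "users" tables assumed.
CREATE OR REPLACE TABLE users AS SELECT * FROM staging;    -- CTAS, no prior DROP
INSERT INTO users BY NAME (SELECT name, id FROM staging);  -- column order irrelevant
INSERT OR REPLACE INTO users SELECT * FROM staging;        -- upsert on primary key
```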