query
You are helping the user query data using DuckDB.
Input: $@

Follow these steps in order.
Step 1 — Resolve state and determine the mode
Look for an existing state file in either location:
```bash
STATE_DIR=""
test -f .duckdb-skills/state.sql && STATE_DIR=".duckdb-skills"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo "$PWD")"
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
test -f "$HOME/.duckdb-skills/$PROJECT_ID/state.sql" && STATE_DIR="$HOME/.duckdb-skills/$PROJECT_ID"
```

If found, verify the databases it references are still accessible:
```bash
duckdb -init "$STATE_DIR/state.sql" -c "SHOW DATABASES;"
```

Now determine the mode:
- Ad-hoc mode if: the `--file` flag is present, or the SQL references file paths/literals (e.g. `FROM 'data.csv'`), or `STATE_DIR` is empty.
- Session mode if: `STATE_DIR` is set and the input references table names, is natural language, or is SQL without file references.

If no state file exists and no file is referenced, fall back to ad-hoc mode against `:memory:` — the user must reference files directly in their SQL.

If the state file exists but any ATTACH in it fails, warn the user and fall back to ad-hoc mode.
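The mode decision can be sketched in shell; the variable names, sample input, and file-extension patterns below are illustrative assumptions, not part of the skill:

```shell
# Hypothetical sketch of the mode decision. INPUT stands in for the user's
# argument string; only .csv/.parquet literals and --file are checked here.
STATE_DIR=".duckdb-skills"
INPUT="FROM 'data.csv' LIMIT 5"
MODE="session"
case "$INPUT" in
  *--file*|*.csv\'*|*.parquet\'*) MODE="ad-hoc" ;;   # explicit file reference wins
esac
[ -n "$STATE_DIR" ] || MODE="ad-hoc"                 # no state file -> ad-hoc
echo "$MODE"
```

A real implementation would also detect other file formats and bare paths; the point is only that file references force ad-hoc mode regardless of session state.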
Step 2 — Check DuckDB is installed
```bash
command -v duckdb
```

If not found, delegate to `/duckdb-skills:install-duckdb` and then continue.
Step 3 — Generate SQL if needed
If the input is natural language (not valid SQL), generate SQL using the Friendly SQL reference below.
In session mode, first retrieve the schema to inform query generation:
```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "
SELECT table_name FROM duckdb_tables() ORDER BY table_name;
"
```

Then for relevant tables:

```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "DESCRIBE <table_name>;"
```

Use the schema context and the Friendly SQL reference to generate the most appropriate query.
Step 4 — Estimate result size
Before executing, estimate whether the query could produce a very large result that would
consume excessive tokens when returned to this conversation.
Session mode — check row counts for the tables involved:
```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "
SELECT table_name, estimated_size, column_count
FROM duckdb_tables()
WHERE table_name IN ('<table1>', '<table2>');
"
```

Ad-hoc mode — probe the source:
```bash
duckdb :memory: -csv -c "
SET allowed_paths=['FILE_PATH'];
SET enable_external_access=false;
SET allow_persistent_secrets=false;
SET lock_configuration=true;
SELECT count() AS row_count FROM 'FILE_PATH';
"
```

Evaluate:
- If the query already has a `LIMIT`, `count()`, or other aggregation that bounds the output -> safe, proceed.
- If the source has >1M rows and the query has no LIMIT or aggregation -> tell the user: "This query would return a very large result set. Displaying it here would consume a lot of tokens and increase cost. I'd recommend adding `LIMIT 1000` or an aggregation to keep the output manageable." Ask for confirmation before running as-is.
- If the data size is >10 GB -> additionally warn: "This table is over 10 GB — the query may take a while to complete." Proceed if the user confirms.

Skip this step for queries that are intrinsically bounded (e.g. `DESCRIBE`, `SUMMARIZE`, aggregations, `count()`).
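The size gate reduces to a small decision; the variable names and the hard-coded probe result below are assumptions for illustration:

```shell
# Hypothetical sketch of the size gate. ROW_COUNT would come from the count()
# probe; BOUNDED would be 1 when the SQL already has a LIMIT or aggregation.
ROW_COUNT=2500000
BOUNDED=0
if [ "$BOUNDED" -eq 1 ]; then
  DECISION="run"
elif [ "$ROW_COUNT" -gt 1000000 ]; then
  DECISION="ask-user-to-confirm"
else
  DECISION="run"
fi
echo "$DECISION"
```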
Step 5 — Execute the query
Ad-hoc mode (sandboxed — only the referenced file is accessible):
```bash
duckdb :memory: -csv <<'SQL'
SET allowed_paths=['FILE_PATH'];
SET enable_external_access=false;
SET allow_persistent_secrets=false;
SET lock_configuration=true;
<QUERY>;
SQL
```

Replace `FILE_PATH` with the actual file path extracted from the query or `--file` argument. If multiple files are referenced, include all paths in the `allowed_paths` list.

Session mode (user-trusted database):
```bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "<QUERY>"
```

For multi-line queries, use a heredoc with `-init`:

```bash
duckdb -init "$STATE_DIR/state.sql" -csv <<'SQL'
<QUERY>;
SQL
```

Always use heredocs (`<<'SQL'`) for multi-line queries to avoid shell quoting issues.
Step 6 — Handle errors
- Syntax error: show the error, suggest a corrected query, and re-run.
- Missing extension (e.g. `Extension "X" not loaded`): delegate to `/duckdb-skills:install-duckdb <ext>`, then retry.
- Table not found (session mode): list available tables with `FROM duckdb_tables()` and suggest corrections.
- File not found (ad-hoc mode): use `find "$PWD" -name "<filename>" 2>/dev/null` to locate the file and suggest the corrected path.
- Persistent or unclear DuckDB error: use `/duckdb-skills:duckdb-docs <error message or relevant keywords>` to search the documentation for guidance, then apply the fix and retry.
Step 7 — Present results
Show the query output to the user. If the result has more than 100 rows, note the truncation and suggest adding `LIMIT` to the query.

For natural language questions, also provide a brief interpretation of the results.
DuckDB Friendly SQL Reference
When generating SQL, prefer these idiomatic DuckDB constructs:
Compact clauses
- FROM-first: `FROM table WHERE x > 10` (implicit `SELECT *`)
- GROUP BY ALL: auto-groups by all non-aggregate columns
- ORDER BY ALL: orders by all columns for deterministic results
- SELECT * EXCLUDE (col1, col2): drop columns from wildcard
- SELECT * REPLACE (expr AS col): transform a column in-place
- UNION ALL BY NAME: combine tables with different column orders
- Percentage LIMIT: `LIMIT 10%` returns a percentage of rows
- Prefix aliases: `SELECT x: 42` instead of `SELECT 42 AS x`
- Trailing commas allowed in SELECT lists
Query features
- count(): no need for `count(*)`
- Reusable aliases: use column aliases in WHERE / GROUP BY / HAVING
- Lateral column aliases: `SELECT i+1 AS j, j+2 AS k`
- COLUMNS(*): apply expressions across columns; supports regex, EXCLUDE, REPLACE, lambdas
- FILTER clause: `count() FILTER (WHERE x > 10)` for conditional aggregation
- GROUPING SETS / CUBE / ROLLUP: advanced multi-level aggregation
- Top-N per group: `max(col, 3)` returns top 3 as a list; also `arg_max(arg, val, n)`, `min_by(arg, val, n)`
- DESCRIBE table_name: schema summary (column names and types)
- SUMMARIZE table_name: instant statistical profile
- PIVOT / UNPIVOT: reshape between wide and long formats
- SET VARIABLE x = expr: define SQL-level variables, reference with `getvariable('x')`
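A sketch of some of these features together; the `orders` table and its columns are assumptions:

```sql
-- Hypothetical examples; the "orders" table is assumed.
SELECT price * qty AS total FROM orders WHERE total > 100;        -- reusable alias
SELECT count() FILTER (WHERE status = 'late') AS late FROM orders;
SELECT customer, max(total, 3) AS top3_totals FROM orders GROUP BY ALL;
SUMMARIZE orders;                                                 -- instant profile
```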
Data import
- Direct file queries: `FROM 'file.csv'`, `FROM 'data.parquet'`
- Globbing: `FROM 'data/part-*.parquet'` reads multiple files
- Auto-detection: CSV headers and schemas are inferred automatically
Expressions and types
- Dot operator chaining: `'hello'.upper()` or `col.trim().lower()`
- List comprehensions: `[x*2 FOR x IN list_col]`
- List/string slicing: `col[1:3]`, negative indexing `col[-1]`
- STRUCT.* notation: `SELECT s.* FROM (SELECT {'a': 1, 'b': 2} AS s)`
- Square bracket lists: `[1, 2, 3]`
- format(): `format('{}->{}', a, b)` for string formatting
Joins
- ASOF joins: approximate matching on ordered data (e.g. timestamps)
- POSITIONAL joins: match rows by position, not keys
- LATERAL joins: reference prior table expressions in subqueries
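The ASOF case is the least familiar of the three; a sketch, with the `trades` and `quotes` tables and their columns assumed:

```sql
-- Hypothetical sketch: match each trade to the latest quote at or before it.
SELECT t.symbol, t.price, q.bid
FROM trades t
ASOF JOIN quotes q
  ON t.symbol = q.symbol AND t.ts >= q.ts;
```

The inequality on `ts` is what makes the join "as of": each left row pairs with the single closest right row satisfying it, rather than every match.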
Data modification
- CREATE OR REPLACE TABLE: no need for `DROP TABLE IF EXISTS` first
- CREATE TABLE ... AS SELECT (CTAS): create tables from query results
- INSERT INTO ... BY NAME: match columns by name, not position
- INSERT OR IGNORE INTO / INSERT OR REPLACE INTO: upsert patterns
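These combine naturally in a load pipeline; the `staging` and `users` tables are assumptions, and `INSERT OR REPLACE` needs a primary key on the target:

```sql
-- Hypothetical sketch; "staging" and "users" tables assumed.
CREATE OR REPLACE TABLE users AS SELECT * FROM staging;    -- CTAS, no prior DROP
INSERT INTO users BY NAME (SELECT name, id FROM staging);  -- column order irrelevant
INSERT OR REPLACE INTO users SELECT * FROM staging;        -- upsert on primary key
```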