You are helping the user query data using DuckDB.
Follow these steps in order.
Step 1 — Resolve state and determine the mode
Look for an existing state file in either location:
bash
STATE_DIR=""
test -f .duckdb-skills/state.sql && STATE_DIR=".duckdb-skills"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo "$PWD")"
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
test -f "$HOME/.duckdb-skills/$PROJECT_ID/state.sql" && STATE_DIR="$HOME/.duckdb-skills/$PROJECT_ID"
If found, verify the databases it references are still accessible:
bash
duckdb -init "$STATE_DIR/state.sql" -c "SHOW DATABASES;"
Now determine the mode:
- Ad-hoc mode if: the flag is present, or the SQL references file paths/literals (e.g. ), or is empty.
- Session mode if: is set and the input references table names, is natural language, or is SQL without file references.
If no state file exists and no file is referenced, fall back to ad-hoc mode against
— the user must reference files directly in their SQL.
If the state file exists but any ATTACH in it fails, warn the user and fall back to ad-hoc mode.
Step 2 — Check DuckDB is installed
If not found, delegate to
/duckdb-skills:install-duckdb
and then continue.
Step 3 — Generate SQL if needed
If the input is natural language (not valid SQL), generate SQL using the Friendly SQL reference below.
In session mode, first retrieve the schema to inform query generation:
bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "
SELECT table_name FROM duckdb_tables() ORDER BY table_name;
"
Then for relevant tables:
bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "DESCRIBE <table_name>;"
Use the schema context and the Friendly SQL reference to generate the most appropriate query.
Step 4 — Estimate result size
Before executing, estimate whether the query could produce a very large result that would
consume excessive tokens when returned to this conversation.
Session mode — check row counts for the tables involved:
bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "
SELECT table_name, estimated_size, column_count
FROM duckdb_tables()
WHERE table_name IN ('<table1>', '<table2>');
"
Ad-hoc mode — probe the source:
bash
duckdb :memory: -csv -c "
SET allowed_paths=['FILE_PATH'];
SET enable_external_access=false;
SET allow_persistent_secrets=false;
SET lock_configuration=true;
SELECT count() AS row_count FROM 'FILE_PATH';
"
Evaluate:
- If the query already has a , , or other aggregation that bounds the output -> safe, proceed.
- If the source has >1M rows and the query has no LIMIT or aggregation -> tell the user:
"This query would return a very large result set. Displaying it here would consume a lot of tokens and increase cost. I'd recommend adding or an aggregation to keep the output manageable."
Ask for confirmation before running as-is.
- If the data size is >10 GB -> additionally warn:
"This table is over 10 GB — the query may take a while to complete."
Proceed if the user confirms.
Skip this step for queries that are intrinsically bounded (e.g.
,
, aggregations,
).
Step 5 — Execute the query
Ad-hoc mode (sandboxed — only the referenced file is accessible):
bash
duckdb :memory: -csv <<'SQL'
SET allowed_paths=['FILE_PATH'];
SET enable_external_access=false;
SET allow_persistent_secrets=false;
SET lock_configuration=true;
<QUERY>;
SQL
Replace
with the actual file path extracted from the query or
argument.
If multiple files are referenced, include all paths in the
list.
Session mode (user-trusted database):
bash
duckdb -init "$STATE_DIR/state.sql" -csv -c "<QUERY>"
For multi-line queries, use a heredoc with
:
bash
duckdb -init "$STATE_DIR/state.sql" -csv <<'SQL'
<QUERY>;
SQL
Always use heredocs (
) for multi-line queries to avoid shell quoting issues.
Step 6 — Handle errors
- Syntax error: show the error, suggest a corrected query, and re-run.
- Missing extension (e.g. ): delegate to
/duckdb-skills:install-duckdb <ext>
, then retry.
- Table not found (session mode): list available tables with and suggest corrections.
- File not found (ad-hoc mode): use
find "$PWD" -name "<filename>" 2>/dev/null
to locate the file and suggest the corrected path.
- Persistent or unclear DuckDB error: use
/duckdb-skills:duckdb-docs <error message or relevant keywords>
to search the documentation for guidance, then apply the fix and retry.
Step 7 — Present results
Show the query output to the user. If the result has more than 100 rows, note the truncation and suggest adding
to the query.
For natural language questions, also provide a brief interpretation of the results.
DuckDB Friendly SQL Reference
When generating SQL, prefer these idiomatic DuckDB constructs:
Compact clauses
- FROM-first: (implicit )
- GROUP BY ALL: auto-groups by all non-aggregate columns
- ORDER BY ALL: orders by all columns for deterministic results
- SELECT * EXCLUDE (col1, col2): drop columns from wildcard
- SELECT * REPLACE (expr AS col): transform a column in-place
- UNION ALL BY NAME: combine tables with different column orders
- Percentage LIMIT: returns a percentage of rows
- Prefix aliases: instead of
- Trailing commas allowed in SELECT lists
Query features
- count(): no need for
- Reusable aliases: use column aliases in WHERE / GROUP BY / HAVING
- Lateral column aliases:
SELECT i+1 AS j, j+2 AS k
- COLUMNS(*): apply expressions across columns; supports regex, EXCLUDE, REPLACE, lambdas
- FILTER clause:
count() FILTER (WHERE x > 10)
for conditional aggregation
- GROUPING SETS / CUBE / ROLLUP: advanced multi-level aggregation
- Top-N per group: returns top 3 as a list; also ,
- DESCRIBE table_name: schema summary (column names and types)
- SUMMARIZE table_name: instant statistical profile
- PIVOT / UNPIVOT: reshape between wide and long formats
- SET VARIABLE x = expr: define SQL-level variables, reference with
Data import
- Direct file queries: ,
- Globbing:
FROM 'data/part-*.parquet'
reads multiple files
- Auto-detection: CSV headers and schemas are inferred automatically
Expressions and types
- Dot operator chaining: or
- List comprehensions:
- List/string slicing: , negative indexing
- STRUCT. notation*:
SELECT s.* FROM (SELECT {'a': 1, 'b': 2} AS s)
- Square bracket lists:
- format(): for string formatting
Joins
- ASOF joins: approximate matching on ordered data (e.g. timestamps)
- POSITIONAL joins: match rows by position, not keys
- LATERAL joins: reference prior table expressions in subqueries
Data modification
- CREATE OR REPLACE TABLE: no need for first
- CREATE TABLE ... AS SELECT (CTAS): create tables from query results
- INSERT INTO ... BY NAME: match columns by name, not position
- INSERT OR IGNORE INTO / INSERT OR REPLACE INTO: upsert patterns