huggingface-datasets
Hugging Face Dataset Viewer
Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.
Core workflow
- Optionally validate dataset availability with `/is-valid`.
- Resolve `config` + `split` with `/splits`.
- Preview with `/first-rows`.
- Paginate content with `/rows` using `offset` and `length` (max 100).
- Use `/search` for text matching and `/filter` for row predicates.
- Retrieve parquet links via `/parquet`, and totals/metadata via `/size` and `/statistics`.
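The workflow above can be sketched as plain URL construction, with no network calls; the `viewer_url` helper and the `stanfordnlp/imdb` example values are illustrative, not part of the API:

```python
from urllib.parse import urlencode

BASE = "https://datasets-server.huggingface.co"

def viewer_url(endpoint: str, **params) -> str:
    """Build a read-only Dataset Viewer URL with encoded query params."""
    # safe="/" keeps the namespace/repo slash readable, as in the docs' examples
    return f"{BASE}/{endpoint}?{urlencode(params, safe='/')}"

# The core workflow, expressed as the URLs you would GET in order:
dataset = "stanfordnlp/imdb"
steps = [
    viewer_url("is-valid", dataset=dataset),
    viewer_url("splits", dataset=dataset),
    viewer_url("first-rows", dataset=dataset, config="plain_text", split="train"),
    viewer_url("rows", dataset=dataset, config="plain_text", split="train",
               offset=0, length=100),
]
for url in steps:
    print(url)
```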
Defaults
- Base URL: `https://datasets-server.huggingface.co`
- Default API method: `GET`
- Query params should be URL-encoded.
- `offset` is 0-based.
- `length` max is usually `100` for row-like endpoints.
- Gated/private datasets require `Authorization: Bearer <HF_TOKEN>`.
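A small sketch of the paging arithmetic these defaults imply (0-based `offset`, `length` capped at 100); `page_offsets` is an illustrative helper, not an API feature:

```python
def page_offsets(num_rows_total: int, length: int = 100) -> list[int]:
    """0-based offsets that cover num_rows_total rows, one per request."""
    length = min(length, 100)  # row-like endpoints usually cap length at 100
    return list(range(0, num_rows_total, length))

print(page_offsets(250))  # → [0, 100, 200]
```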
Dataset Viewer
- Validate dataset: `/is-valid?dataset=<namespace/repo>`
- List subsets and splits: `/splits?dataset=<namespace/repo>`
- Preview first rows: `/first-rows?dataset=<namespace/repo>&config=<config>&split=<split>`
- Paginate rows: `/rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int>`
- Search text: `/search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int>`
- Filter with predicates: `/filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int>`
- List parquet shards: `/parquet?dataset=<namespace/repo>`
- Get size totals: `/size?dataset=<namespace/repo>`
- Get column statistics: `/statistics?dataset=<namespace/repo>&config=<config>&split=<split>`
- Get Croissant metadata (if available): `/croissant?dataset=<namespace/repo>`
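As an illustration of assembling a `/filter` call, the snippet below URL-encodes a hypothetical `where` predicate and `orderby` clause; the exact predicate syntax the API accepts is not specified here, only the encoding step is being shown:

```python
from urllib.parse import urlencode

base = "https://datasets-server.huggingface.co"
params = {
    "dataset": "stanfordnlp/imdb",
    "config": "plain_text",
    "split": "train",
    "where": '"label" = 1',    # hypothetical predicate for illustration
    "orderby": '"text" ASC',   # hypothetical sort clause
    "offset": 0,
    "length": 10,
}
# quotes, spaces, and '=' inside the predicate must be percent-encoded
url = f"{base}/filter?{urlencode(params, safe='/')}"
print(url)
```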
Pagination pattern:

```bash
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
```

When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.

Search/filter notes:
- `/search` matches string columns (full-text-style behavior is internal to the API).
- `/filter` requires predicate syntax in `where` and optional sorting in `orderby`.
- Keep filtering and searches read-only and side-effect free.
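The continuation logic can be sketched around those response fields; `fetch` is stubbed with canned pages here rather than real HTTP, and only the `rows`, `num_rows_total`, and `partial` fields are assumed:

```python
def paginate(fetch, length=100):
    """Collect rows by repeatedly calling fetch(offset, length).

    fetch returns a dict shaped like a /rows response, with "rows",
    "num_rows_total", and an optional "partial" flag.
    """
    rows, offset = [], 0
    while True:
        page = fetch(offset, length)
        rows.extend(page["rows"])
        offset += len(page["rows"])
        # Stop when every row is collected, or the server returns no
        # further rows (e.g. a partial result).
        if offset >= page["num_rows_total"] or not page["rows"]:
            break
    return rows

# Stub standing in for GET /rows over a 250-row split.
def fake_fetch(offset, length):
    total = 250
    chunk = [{"row_idx": i} for i in range(offset, min(offset + length, total))]
    return {"rows": chunk, "num_rows_total": total, "partial": False}

print(len(paginate(fake_fetch)))  # → 250
```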
Querying Datasets
Use `npx parquetlens` with Hub parquet alias paths for SQL querying.

Parquet alias shape:

```text
hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet
```

Derive `<config>`, `<split>`, and `<shard>` from the Dataset Viewer `/parquet` endpoint:

```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```

Run a SQL query:

```bash
npx -y -p parquetlens -p @parquetlens/sql parquetlens \
  "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \
  --sql "SELECT * FROM data LIMIT 20"
```
--sql "SELECT * FROM data LIMIT 20"SQL export
- CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
- JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
- Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`
Creating and Uploading Datasets
Use one of these flows depending on dependency constraints.

Zero local dependencies (Hub UI):
- Create a dataset repo in the browser: `https://huggingface.co/new-dataset`
- Upload parquet files on the repo's "Files and versions" page.
- Verify the shards appear in the Dataset Viewer:

```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
```

Low-dependency CLI flow (`npx @huggingface/hub` / `hfjs`):
- Set the auth token:

```bash
export HF_TOKEN=<your_hf_token>
```

- Upload a parquet folder to a dataset repo (auto-creates the repo if missing):

```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
```

- Upload as a private repo on creation:

```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
```

After upload, call `/parquet` to discover `<config>/<split>/<shard>` values for querying with `@~parquet`.
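The CLI invocation above can also be assembled programmatically; this sketch only builds the argv (pass it to `subprocess.run` yourself) and treats the repo and folder names as placeholders:

```python
import os

def upload_command(repo: str, folder: str, dest: str = "data",
                   private: bool = False) -> list[str]:
    """argv for the @huggingface/hub upload shown above."""
    if not os.environ.get("HF_TOKEN"):
        raise RuntimeError("set HF_TOKEN before uploading")
    cmd = ["npx", "-y", "@huggingface/hub", "upload", f"datasets/{repo}", folder, dest]
    if private:
        cmd.append("--private")
    return cmd

os.environ.setdefault("HF_TOKEN", "<your_hf_token>")  # placeholder for the example
print(" ".join(upload_command("<namespace>/<repo>", "./local/parquet-folder", private=True)))
```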