# Hugging Face Dataset Viewer
Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.
## Core workflow
- Optionally validate dataset availability with `/is-valid`.
- Resolve `config` + `split` with `/splits`.
- Preview with `/first-rows`.
- Paginate content with `/rows` using `offset` and `length` (max 100).
- Use `/search` for text matching and `/filter` for row predicates.
- Retrieve parquet links via `/parquet` and totals/metadata via `/size` and `/statistics`.
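The request construction behind these steps can be sketched in Python; the `viewer_url` helper is illustrative, but the endpoint paths and parameters are the ones listed above:

```python
from urllib.parse import urlencode

BASE = "https://datasets-server.huggingface.co"

def viewer_url(endpoint: str, **params) -> str:
    """Build a read-only Dataset Viewer GET URL with URL-encoded params."""
    return f"{BASE}/{endpoint}?{urlencode(params)}"

# The core workflow, expressed as the URLs each step would GET:
steps = [
    viewer_url("is-valid", dataset="stanfordnlp/imdb"),
    viewer_url("splits", dataset="stanfordnlp/imdb"),
    viewer_url("first-rows", dataset="stanfordnlp/imdb",
               config="plain_text", split="train"),
    viewer_url("rows", dataset="stanfordnlp/imdb",
               config="plain_text", split="train", offset=0, length=100),
]
for url in steps:
    print(url)
```

Note that `urlencode` percent-encodes the `/` in `namespace/repo`; the API accepts both the encoded and literal forms.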
## Defaults
- Base URL: `https://datasets-server.huggingface.co`
- Default API method: `GET`
- Query params should be URL-encoded.
- `offset` is 0-based.
- `length` max is usually 100 for row-like endpoints.
- Gated/private datasets require `Authorization: Bearer <HF_TOKEN>`.
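For gated/private datasets, the same GET calls just carry the bearer header; a minimal standard-library sketch, assuming the token lives in the `HF_TOKEN` environment variable:

```python
import os
import urllib.request

def authed_request(url: str) -> urllib.request.Request:
    """Attach Authorization: Bearer <HF_TOKEN> when the env var is set."""
    req = urllib.request.Request(url, method="GET")
    token = os.environ.get("HF_TOKEN")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req

# urllib.request.urlopen(authed_request(...)) would then perform the call.
```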
## Dataset Viewer
- Validate dataset: `/is-valid?dataset=<namespace/repo>`
- List subsets and splits: `/splits?dataset=<namespace/repo>`
- Preview first rows: `/first-rows?dataset=<namespace/repo>&config=<config>&split=<split>`
- Paginate rows: `/rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int>`
- Search text: `/search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int>`
- Filter with predicates: `/filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int>`
- List parquet shards: `/parquet?dataset=<namespace/repo>`
- Get size totals: `/size?dataset=<namespace/repo>`
- Get column statistics: `/statistics?dataset=<namespace/repo>&config=<config>&split=<split>`
- Get Croissant metadata (if available): `/croissant?dataset=<namespace/repo>`
Pagination pattern:

```bash
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
```

When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.

Search/filter notes:

- `/search` matches string columns (full-text style behavior is internal to the API).
- `/filter` requires predicate syntax in `where` and optional sort in `orderby`.
- Keep filtering and searches read-only and side-effect free.
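The continuation logic can be isolated as a pure function over those response fields; the minimal response dict below is an assumption containing only the fields named above:

```python
def next_offset(response: dict, offset: int, length: int):
    """Return the next offset to request, or None when done.

    Stops at num_rows_total, and caps the step at what the server
    actually serves per page (num_rows_per_page, typically 100).
    A true `partial` flag means the total itself may be truncated.
    """
    page = min(length, response.get("num_rows_per_page", length))
    advanced = offset + page
    if advanced >= response["num_rows_total"]:
        return None
    return advanced

# Simulated /rows response for a 250-row split served 100 rows at a time:
resp = {"num_rows_total": 250, "num_rows_per_page": 100, "partial": False}
offsets = []
offset = 0
while offset is not None:
    offsets.append(offset)
    offset = next_offset(resp, offset, 100)
print(offsets)  # [0, 100, 200]
```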
## Querying Datasets
Use `npx parquetlens` with Hub parquet alias paths for SQL querying.

Parquet alias shape:

```text
hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet
```

Derive `<config>`, `<split>`, and `<shard>` from Dataset Viewer `/parquet`:

```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```

Run SQL query:

```bash
npx -y -p parquetlens -p @parquetlens/sql parquetlens \
  "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \
  --sql "SELECT * FROM data LIMIT 20"
```
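The jq one-liner maps each `parquet_files` entry to an alias path; the same transformation in Python, with a sample payload that mirrors only the fields the jq filter reads (the config name is illustrative):

```python
def alias_paths(parquet_response: dict) -> list:
    """Mirror the jq filter: one hf:// alias per parquet shard."""
    return [
        f"hf://datasets/{f['dataset']}@~parquet/{f['config']}/{f['split']}/{f['filename']}"
        for f in parquet_response["parquet_files"]
    ]

# Illustrative /parquet payload:
sample = {"parquet_files": [
    {"dataset": "cfahlgren1/hub-stats", "config": "default",
     "split": "train", "filename": "0000.parquet"},
]}
print(alias_paths(sample))
# ['hf://datasets/cfahlgren1/hub-stats@~parquet/default/train/0000.parquet']
```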
## SQL export
- CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
- JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
- Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`
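A quick sanity check on the JSON export, assuming DuckDB-style `COPY ... (FORMAT JSON)` output, which is newline-delimited by default:

```python
import json

def count_exported_rows(path: str) -> int:
    """Count records in a newline-delimited JSON export."""
    with open(path, encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    return len(rows)

# e.g. count_exported_rows("export.json") after the JSON export above
```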
## Creating and Uploading Datasets
Use one of these flows depending on dependency constraints.

Zero local dependencies (Hub UI):

- Create dataset repo in browser: `https://huggingface.co/new-dataset`
- Upload parquet files in the repo "Files and versions" page.
- Verify shards appear in Dataset Viewer:

```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
```

Low dependency CLI flow (`npx @huggingface/hub` / `hfjs`):

- Set auth token:

```bash
export HF_TOKEN=<your_hf_token>
```

- Upload parquet folder to a dataset repo (auto-creates repo if missing):

```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
```

- Upload as private repo on creation:

```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
```

After upload, call `/parquet` to discover `<config>/<split>/<shard>` values for querying with `@~parquet`.
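Discovering `<config>` and `<split>` values from a `/parquet` response can be sketched as a small grouping step; the payload below is illustrative, using the same field names the jq example reads:

```python
from collections import defaultdict

def discover_shards(parquet_response: dict) -> dict:
    """Group shard filenames by (config, split) to find queryable values."""
    groups = defaultdict(list)
    for f in parquet_response["parquet_files"]:
        groups[(f["config"], f["split"])].append(f["filename"])
    return dict(groups)

# Illustrative payload after an upload:
sample = {"parquet_files": [
    {"config": "default", "split": "train", "filename": "0000.parquet"},
    {"config": "default", "split": "train", "filename": "0001.parquet"},
    {"config": "default", "split": "test", "filename": "0000.parquet"},
]}
print(discover_shards(sample))
# {('default', 'train'): ['0000.parquet', '0001.parquet'], ('default', 'test'): ['0000.parquet']}
```

Each `(config, split)` key plus a filename from its list fills the `@~parquet` alias slots for querying.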