Hugging Face Dataset Viewer


Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.

Core workflow


  1. Optionally validate dataset availability with `/is-valid`.
  2. Resolve `config` + `split` with `/splits`.
  3. Preview with `/first-rows`.
  4. Paginate content with `/rows` using `offset` and `length` (max 100).
  5. Use `/search` for text matching and `/filter` for row predicates.
  6. Retrieve parquet links via `/parquet` and totals/metadata via `/size` and `/statistics`.
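The workflow above boils down to a sequence of read-only GET requests. A minimal bash helper for composing those request URLs (the dataset, config, and split values below are examples):

```shell
# Build Dataset Viewer request URLs for the core workflow steps.
BASE="https://datasets-server.huggingface.co"
DATASET="stanfordnlp/imdb"

viewer_url() {                 # viewer_url <endpoint> <query-string>
  printf '%s/%s?%s\n' "$BASE" "$1" "$2"
}

viewer_url is-valid   "dataset=${DATASET}"
viewer_url splits     "dataset=${DATASET}"
viewer_url first-rows "dataset=${DATASET}&config=plain_text&split=train"
viewer_url rows       "dataset=${DATASET}&config=plain_text&split=train&offset=0&length=100"
```

Each URL can then be fetched read-only, e.g. `curl -s "$(viewer_url splits "dataset=${DATASET}")"`.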

Defaults


  • Base URL: `https://datasets-server.huggingface.co`
  • Default API method: `GET`
  • Query params should be URL-encoded.
  • `offset` is 0-based.
  • `length` max is usually `100` for row-like endpoints.
  • Gated/private datasets require `Authorization: Bearer <HF_TOKEN>`.
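For URL-encoding, `curl -G --data-urlencode` handles the work automatically; as a bash-only sketch (ASCII input assumed), a hand-rolled encoder might look like:

```shell
# Percent-encode a query parameter value (ASCII-only sketch; prefer
# curl --data-urlencode in real use).
urlencode() {
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;                 # unreserved characters pass through
      *) out+=$(printf '%%%02X' "'$c") ;;          # everything else becomes %XX
    esac
  done
  printf '%s\n' "$out"
}

urlencode "label = 'pos'"
```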

Dataset Viewer


  • Validate dataset: `/is-valid?dataset=<namespace/repo>`
  • List subsets and splits: `/splits?dataset=<namespace/repo>`
  • Preview first rows: `/first-rows?dataset=<namespace/repo>&config=<config>&split=<split>`
  • Paginate rows: `/rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int>`
  • Search text: `/search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int>`
  • Filter with predicates: `/filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int>`
  • List parquet shards: `/parquet?dataset=<namespace/repo>`
  • Get size totals: `/size?dataset=<namespace/repo>`
  • Get column statistics: `/statistics?dataset=<namespace/repo>&config=<config>&split=<split>`
  • Get Croissant metadata (if available): `/croissant?dataset=<namespace/repo>`
Pagination pattern:

```bash
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
```

When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.
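As a sketch of that continuation logic, assuming `num_rows_total` has already been read from a response (the total of 250 below is made up):

```shell
# Given the total row count and a page length, emit every offset needed
# to page through the whole split.
next_offsets() {               # next_offsets <num_rows_total> <length>
  local total="$1" length="$2" offset=0
  while [ "$offset" -lt "$total" ]; do
    echo "$offset"
    offset=$(( offset + length ))
  done
}

next_offsets 250 100           # 250 rows at length 100 -> offsets 0, 100, 200
```

Each emitted offset becomes one `/rows?...&offset=<offset>&length=100` request.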
Search/filter notes:

  • `/search` matches string columns (full-text style behavior is internal to the API).
  • `/filter` requires predicate syntax in `where` and optional sort in `orderby`.
  • Keep filtering and searches read-only and side-effect free.
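For example, a `/filter` request can be assembled with a pre-encoded predicate; the predicate `"label"=0` used here is illustrative (it assumes an integer `label` column, as in imdb):

```shell
# Assemble a /filter request URL. The where value must be URL-encoded.
BASE="https://datasets-server.huggingface.co"
where_encoded='%22label%22%3D0'            # URL-encoded form of: "label"=0
filter_url="${BASE}/filter?dataset=stanfordnlp/imdb&config=plain_text&split=train&where=${where_encoded}&offset=0&length=10"
echo "$filter_url"
# Fetch read-only with: curl -s "$filter_url"
```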

Querying Datasets


Use `npx parquetlens` with Hub parquet alias paths for SQL querying.

Parquet alias shape:

```text
hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet
```

Derive `<config>`, `<split>`, and `<shard>` from Dataset Viewer `/parquet`:

```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```

Run SQL query:

```bash
npx -y -p parquetlens -p @parquetlens/sql parquetlens \
  "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \
  --sql "SELECT * FROM data LIMIT 20"
```

SQL export


  • CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
  • JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
  • Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`

Creating and Uploading Datasets


Use one of these flows depending on dependency constraints.

Zero local dependencies (Hub UI):

  • Create a dataset repo in the browser: https://huggingface.co/new-dataset
  • Upload parquet files on the repo's "Files and versions" page.
  • Verify that shards appear in the Dataset Viewer:

```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
```

Low-dependency CLI flow (`npx @huggingface/hub` / `hfjs`):

  • Set the auth token:

```bash
export HF_TOKEN=<your_hf_token>
```

  • Upload a parquet folder to a dataset repo (auto-creates the repo if missing):

```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
```

  • Upload as a private repo on creation:

```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
```

After upload, call `/parquet` to discover `<config>/<split>/<shard>` values for querying with `@~parquet`.
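The discovered values slot directly into the alias shape from the querying section; a minimal assembly sketch (all four values below are placeholders for the real `/parquet` response fields):

```shell
# Assemble a parquet alias path from the per-shard fields /parquet reports.
dataset="<namespace>/<repo>"   # .dataset field (placeholder)
config="default"               # .config field (example value)
split="train"                  # .split field (example value)
filename="0000.parquet"        # .filename field (example value)

alias_path="hf://datasets/${dataset}@~parquet/${config}/${split}/${filename}"
echo "$alias_path"
```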