hugging-face-evaluation


Overview


This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
  • Extracting existing evaluation tables from README content
  • Importing benchmark scores from Artificial Analysis
  • Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)

Integration with HF Ecosystem


  • Model Cards: Updates model-index metadata for leaderboard integration
  • Artificial Analysis: Direct API integration for benchmark imports
  • Papers with Code: Compatible with their model-index specification
  • Jobs: Run evaluations directly on Hugging Face Jobs with `uv` integration
  • vLLM: Efficient GPU inference for custom model evaluation
  • lighteval: HuggingFace's evaluation library with vLLM/accelerate backends
  • inspect-ai: UK AI Safety Institute's evaluation framework

Version


1.3.0

Dependencies


Core Dependencies


  • huggingface_hub>=0.26.0
  • markdown-it-py>=3.0.0
  • python-dotenv>=1.2.1
  • pyyaml>=6.0.3
  • requests>=2.32.5
  • re (built-in)

Inference Provider Evaluation


  • inspect-ai>=0.3.0
  • inspect-evals
  • openai

vLLM Custom Model Evaluation (GPU required)


  • lighteval[accelerate,vllm]>=0.6.0
  • vllm>=0.4.0
  • torch>=2.0.0
  • transformers>=4.40.0
  • accelerate>=0.30.0
Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`.
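For reference, a PEP 723 header is a commented TOML block at the top of the script. The sketch below shows the general shape; the exact dependency pins in the skill's scripts may differ:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "vllm>=0.4.0",
#   "lighteval[accelerate,vllm]>=0.6.0",
# ]
# ///
# When invoked with `uv run`, uv reads the header above and installs the
# listed dependencies into an ephemeral environment before execution.


def main():
    # Placeholder body; a real script would import vllm/lighteval here.
    print("dependencies resolved by uv from the PEP 723 header")


if __name__ == "__main__":
    main()
```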

IMPORTANT: Using This Skill


⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones


Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```
If open PRs exist:
  1. DO NOT create a new PR - this creates duplicate work for maintainers
  2. Warn the user that open PRs already exist
  3. Show the user the existing PR URLs so they can review them
  4. Only proceed if the user explicitly confirms they want to create another PR
This prevents spamming model repositories with duplicate evaluation PRs.
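The decision rule above can be sketched in Python. This is pure illustration: the `Discussion` record below is a hypothetical stand-in for the objects returned by `huggingface_hub`'s `get_repo_discussions`, and `should_create_pr` is not part of the skill's scripts:

```python
from dataclasses import dataclass


@dataclass
class Discussion:
    """Minimal hypothetical record mirroring the fields the Hub returns."""
    num: int
    title: str
    status: str           # "open" or "closed"
    is_pull_request: bool


def open_prs(discussions):
    """Return only open pull requests, mirroring the `get-prs` check."""
    return [d for d in discussions if d.is_pull_request and d.status == "open"]


def should_create_pr(discussions, user_confirmed=False):
    """Create a new PR only if no PRs are open, or the user explicitly confirms."""
    return not open_prs(discussions) or user_confirmed


discussions = [
    Discussion(1, "Add eval results", "open", True),
    Discussion(2, "Typo fix", "closed", True),
    Discussion(3, "Question about license", "open", False),
]
print(should_create_pr(discussions))                       # blocked: PR #1 is open
print(should_create_pr(discussions, user_confirmed=True))  # allowed after confirmation
```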

All paths are relative to the directory containing this SKILL.md file. Before running any script, first `cd` to that directory or use the full path.
Use `--help` for the latest workflow guidance.
Works with plain Python or `uv run`:

```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
```
Key workflow (matches CLI help):
  1. `get-prs` → check for existing open PRs first
  2. `inspect-tables` → find table numbers/columns
  3. `extract-readme --table N` → prints YAML by default
  4. Add `--apply` (push) or `--create-pr` to write changes

Core Capabilities


1. Inspect and Extract Evaluation Tables from README


  • Inspect Tables: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows
  • Parse Markdown Tables: Accurate parsing using markdown-it-py (ignores code blocks and examples)
  • Table Selection: Use `--table N` to extract from a specific table (required when multiple tables exist)
  • Format Detection: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
  • Column Matching: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output). Use `--model-name-override` only with exact column header text.
  • YAML Generation: Convert selected table to model-index YAML format
  • Task Typing: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`)
2. Import from Artificial Analysis


  • API Integration: Fetch benchmark scores directly from Artificial Analysis
  • Automatic Formatting: Convert API responses to model-index format
  • Metadata Preservation: Maintain source attribution and URLs
  • PR Creation: Automatically create pull requests with evaluation updates

3. Model-Index Management


  • YAML Generation: Create properly formatted model-index entries
  • Merge Support: Add evaluations to existing model cards without overwriting
  • Validation: Ensure compliance with Papers with Code specification
  • Batch Operations: Process multiple models efficiently

4. Run Evaluations on HF Jobs (Inference Providers)


  • Inspect-AI Integration: Run standard evaluations using the `inspect-ai` library
  • UV Integration: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
  • Zero-Config: No Dockerfiles or Space management required
  • Hardware Selection: Configure CPU or GPU hardware for the evaluation job
  • Secure Execution: Handles API tokens safely via secrets passed through the CLI

5. Run Custom Model Evaluations with vLLM (NEW)


⚠️ Important: This approach is only possible on devices with `uv` installed and sufficient GPU memory.
Benefits: No need to use the `hf_jobs()` MCP tool; scripts can be run directly in the terminal.
When to use: The user is working directly on a local device with a GPU available.

Before running the script


  • Check the script path
  • Check that uv is installed
  • Check that a GPU is available with `nvidia-smi`

Running the script


```bash
uv run scripts/train_sft_example.py
```

Features


  • vLLM Backend: High-performance GPU inference (5-10x faster than standard HF methods)
  • lighteval Framework: HuggingFace's evaluation library with Open LLM Leaderboard tasks
  • inspect-ai Framework: UK AI Safety Institute's evaluation library
  • Standalone or Jobs: Run locally or submit to HF Jobs infrastructure

Usage Instructions


The skill includes Python scripts in `scripts/` to perform operations.

Prerequisites


  • Preferred: use `uv run` (PEP 723 header auto-installs deps)
  • Or install manually: `pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests`
  • Set the `HF_TOKEN` environment variable to a token with write access
  • For Artificial Analysis: set the `AA_API_KEY` environment variable
  • `.env` is loaded automatically if `python-dotenv` is installed

Method 1: Extract from README (CLI workflow)


Recommended flow (matches `--help`):

1) Inspect tables to get table numbers and column hints


```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"
```

2) Extract a specific table (prints YAML by default)


```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  [--model-column-index <column index shown by inspect-tables>] \
  [--model-name-override "<column header/model name>"]  # use exact header text if you can't use the index
```

3) Apply changes (push or PR)


```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --apply  # push directly
```

or

```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --create-pr  # open a PR
```

Validation checklist:
- YAML is printed by default; compare against the README table before applying.
- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.
- For transposed tables (models as rows), ensure only one row is extracted.

Method 2: Import from Artificial Analysis


Fetch benchmark scores from the Artificial Analysis API and add them to a model card.
Basic Usage:

```bash
AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

With Environment File:

Create .env file


```bash
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env
```

Run import


```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

**Create Pull Request:**
```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name" \
  --create-pr
```

Method 3: Run Evaluation Job


Submit an evaluation job on Hugging Face infrastructure using the `hf jobs uv run` CLI.
Direct CLI Usage:

```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor cpu-basic \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
     --task "mmlu"
```

GPU Example (A10G):

```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
     --task "gsm8k"
```

Python Helper (optional):

```bash
uv run scripts/run_eval_job.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu" \
  --hardware "t4-small"
```

Method 4: Run Custom Model Evaluation with vLLM


Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are separate from inference provider scripts and run models locally on the job's hardware.

When to Use vLLM Evaluation (vs Inference Providers)


| Feature | vLLM Scripts | Inference Provider Scripts |
|---------|--------------|----------------------------|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |

Option A: lighteval with vLLM Backend


lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.
Standalone (local GPU):

Run MMLU 5-shot with vLLM


```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"
```

Run multiple tasks


```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
```

Use accelerate backend instead of vLLM


```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate
```

Chat/instruction-tuned models


```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --tasks "leaderboard|mmlu|5" \
  --use-chat-template
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
     --tasks "leaderboard|mmlu|5"
```
lighteval Task Format: Tasks use the format `suite|task|num_fewshot`:
  • `leaderboard|mmlu|5` - MMLU with 5-shot
  • `leaderboard|gsm8k|5` - GSM8K with 5-shot
  • `lighteval|hellaswag|0` - HellaSwag zero-shot
  • `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot
Finding Available Tasks: The complete list of available lighteval tasks can be found at: https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt
This file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include:
  • `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
  • `lighteval` - Additional lighteval tasks
  • `bigbench` - BigBench tasks
  • `original` - Original benchmark tasks
To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `0`) and pass it to the `--tasks` parameter. For example:
  • From file: `leaderboard|mmlu|0` → Use: `leaderboard|mmlu|0` (or change to `5` for 5-shot)
  • From file: `bigbench|abstract_narrative_understanding|0` → Use: `bigbench|abstract_narrative_understanding|0`
  • From file: `lighteval|wmt14:hi-en|0` → Use: `lighteval|wmt14:hi-en|0`
Multiple tasks can be specified as comma-separated values: `--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"`
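The conversion described above can be sketched in a few lines. Note that `task_from_listing` is a hypothetical helper for illustration, not part of the skill's scripts:

```python
def task_from_listing(line, num_fewshot=None):
    """Convert a line from all_tasks.txt into a --tasks value.

    Drops the trailing version flag if present and optionally overrides
    the few-shot count.
    """
    parts = line.strip().split("|")
    if len(parts) == 4:       # suite|task|num_fewshot|0 -> drop version flag
        parts = parts[:3]
    suite, task, fewshot = parts
    if num_fewshot is not None:
        fewshot = str(num_fewshot)
    return f"{suite}|{task}|{fewshot}"


# Build a comma-separated --tasks value from several listing lines:
tasks = ",".join([
    task_from_listing("leaderboard|mmlu|0|0", num_fewshot=5),
    task_from_listing("leaderboard|gsm8k|0|0", num_fewshot=5),
])
print(tasks)  # leaderboard|mmlu|5,leaderboard|gsm8k|5
```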

Option B: inspect-ai with vLLM Backend


inspect-ai is the UK AI Safety Institute's evaluation framework.
Standalone (local GPU):

Run MMLU with vLLM


```bash
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu
```

Use HuggingFace Transformers backend


```bash
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --backend hf
```

Multi-GPU with tensor parallelism


```bash
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --tensor-parallel-size 4
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
     --task mmlu
```
Available inspect-ai Tasks:
  • `mmlu` - Massive Multitask Language Understanding
  • `gsm8k` - Grade School Math
  • `hellaswag` - Common sense reasoning
  • `arc_challenge` - AI2 Reasoning Challenge
  • `truthfulqa` - TruthfulQA benchmark
  • `winogrande` - Winograd Schema Challenge
  • `humaneval` - Code generation

Option C: Python Helper Script


The helper script auto-selects hardware and simplifies job submission:

Auto-detect hardware based on model size


```bash
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-1B \
  --task "leaderboard|mmlu|5" \
  --framework lighteval
```

Explicit hardware selection


```bash
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --framework inspect \
  --hardware a100-large \
  --tensor-parallel-size 4
```

Use HF Transformers backend


```bash
uv run scripts/run_vllm_eval_job.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --framework inspect \
  --backend hf
```

**Hardware Recommendations:**
| Model Size | Recommended Hardware |
|------------|---------------------|
| < 3B params | `t4-small` |
| 3B - 13B | `a10g-small` |
| 13B - 34B | `a10g-large` |
| 34B+ | `a100-large` |
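The table above can be expressed as a small lookup. This is an illustrative sketch; `run_vllm_eval_job.py` may apply different thresholds internally:

```python
def recommend_hardware(params_billion):
    """Map a model's parameter count (in billions) to a hardware flavor,
    mirroring the recommendations table above (illustration only)."""
    if params_billion < 3:
        return "t4-small"
    if params_billion <= 13:
        return "a10g-small"
    if params_billion <= 34:
        return "a10g-large"
    return "a100-large"


print(recommend_hardware(1))   # t4-small
print(recommend_hardware(70))  # a100-large
```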

Commands Reference


Top-level help and version:

```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
```

Inspect Tables (start here):

```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
```

Extract from README:

```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model-name" \
  --table N \
  [--model-column-index N] \
  [--model-name-override "Exact Column Header or Model Name"] \
  [--task-type "text-generation"] \
  [--dataset-name "Custom Benchmarks"] \
  [--apply | --create-pr]
```

Import from Artificial Analysis:

```bash
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "creator-name" \
  --model-name "model-slug" \
  --repo-id "username/model-name" \
  [--create-pr]
```

View / Validate:

```bash
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
```

Check Open PRs (ALWAYS run before --create-pr):

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```

Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.
Run Evaluation Job (Inference Providers):

```bash
hf jobs uv run scripts/inspect_eval_uv.py \
  --flavor "cpu-basic|t4-small|..." \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --task "task-name"
```

or use the Python helper:

```bash
uv run scripts/run_eval_job.py \
  --model "model-id" \
  --task "task-name" \
  --hardware "cpu-basic|t4-small|..."
```

Run vLLM Evaluation (Custom Models):

lighteval with vLLM


```bash
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --tasks "leaderboard|mmlu|5"
```

inspect-ai with vLLM


```bash
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --task "mmlu"
```

Helper script (auto hardware selection)


```bash
uv run scripts/run_vllm_eval_job.py \
  --model "model-id" \
  --task "leaderboard|mmlu|5" \
  --framework lighteval
```

Model-Index Format


The generated model-index follows this structure:

```yaml
model-index:
  - name: Model Name
    results:
      - task:
          type: text-generation
        dataset:
          name: Benchmark Dataset
          type: benchmark_type
        metrics:
          - name: MMLU
            type: mmlu
            value: 85.2
          - name: HumanEval
            type: humaneval
            value: 72.5
        source:
          name: Source Name
          url: https://source-url.com
```

WARNING: Do not use markdown formatting in the model name. Use the exact name from the table. Only use URLs in the source.url field.
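For illustration, the same structure can be assembled programmatically before serializing it to YAML. This is a sketch: `make_result` is a hypothetical helper, not part of the skill's scripts:

```python
def make_result(task_type, dataset_name, dataset_type, metrics, source_name, source_url):
    """Assemble one model-index result entry matching the structure above."""
    return {
        "task": {"type": task_type},
        "dataset": {"name": dataset_name, "type": dataset_type},
        "metrics": [{"name": n, "type": t, "value": v} for n, t, v in metrics],
        "source": {"name": source_name, "url": source_url},
    }


model_index = [{
    "name": "Model Name",  # plain text only: no markdown, no URLs here
    "results": [make_result(
        "text-generation", "Benchmark Dataset", "benchmark_type",
        [("MMLU", "mmlu", 85.2), ("HumanEval", "humaneval", 72.5)],
        "Source Name", "https://source-url.com",
    )],
}]
print(model_index[0]["results"][0]["metrics"][0])
```

Serializing `model_index` with `yaml.safe_dump` (pyyaml is already a core dependency) yields the YAML block shown above.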

Error Handling


  • Table Not Found: Script will report if no evaluation tables are detected
  • Invalid Format: Clear error messages for malformed tables
  • API Errors: Retry logic for transient Artificial Analysis API failures
  • Token Issues: Validation before attempting updates
  • Merge Conflicts: Preserves existing model-index entries when adding new ones
  • Space Creation: Handles naming conflicts and hardware request failures gracefully

Best Practices


  1. Check for existing PRs first: Run `get-prs` before creating any new PR to avoid duplicates
  2. Always start with `inspect-tables`: See table structure and get the correct extraction command
  3. Use `--help` for guidance: Run `inspect-tables --help` to see the complete workflow
  4. Preview first: Default behavior prints YAML; review it before using `--apply` or `--create-pr`
  5. Verify extracted values: Compare YAML output against the README table manually
  6. Use `--table N` for multi-table READMEs: Required when multiple evaluation tables exist
  7. Use `--model-name-override` for comparison tables: Copy the exact column header from `inspect-tables` output
  8. Create PRs for Others: Use `--create-pr` when updating models you don't own
  9. One model per repo: Only add the main model's results to model-index
  10. No markdown in YAML names: The model name field in YAML should be plain text

Model Name Matching


When extracting evaluation tables with multiple models (either as columns or rows), the script uses exact normalized token matching:
  • Removes markdown formatting (bold `**`, links `[]()`)
  • Normalizes names (lowercase, replace `-` and `_` with spaces)
  • Compares token sets: `"OLMo-3-32B"` → `{"olmo", "3", "32b"}` matches `"**Olmo 3 32B**"` or `"[Olmo-3-32B](...)"`
  • Only extracts if tokens match exactly (handles different word orders and separators)
  • Fails if no exact match is found (rather than guessing from similar names)
For column-based tables (benchmarks as rows, models as columns):
  • Finds the column header matching the model name
  • Extracts scores from that column only
For transposed tables (models as rows, benchmarks as columns):
  • Finds the row in the first column matching the model name
  • Extracts all benchmark scores from that row only
This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints.
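The normalization rules above can be sketched as follows (an illustrative approximation, not the skill's actual implementation):

```python
import re


def normalize_tokens(name):
    """Strip markdown and normalize a model name to a token set,
    mirroring the matching rules described above."""
    name = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", name)  # [text](url) -> text
    name = name.replace("**", "")                          # drop bold markers
    name = name.lower().replace("-", " ").replace("_", " ")
    return frozenset(name.split())


target = normalize_tokens("OLMo-3-32B")
print(target == normalize_tokens("**Olmo 3 32B**"))                # exact token match
print(target == normalize_tokens("[Olmo-3-32B](https://x.test)"))  # link stripped, match
print(target == normalize_tokens("Olmo-3-7B"))                     # different size, no match
```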

Common Patterns


Update Your Own Model:

Extract from README and push directly


```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --task-type "text-generation" \
  --apply
```

**Update Someone Else's Model (Full Workflow):**

Step 1: ALWAYS check for existing PRs first


```bash
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "other-username/their-model"
```

Step 2: If NO open PRs exist, proceed with creating one


```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "other-username/their-model" \
  --create-pr
```
uv run scripts/evaluation_manager.py extract-readme
--repo-id "other-username/their-model"
--create-pr

If open PRs DO exist:

如果存在打开的PR:

- Warn the user about existing PRs

- 警告用户存在现有PR

- Show them the PR URLs

- 向他们展示PR URL

- Do NOT create a new PR unless user explicitly confirms

- 除非用户明确确认,否则请勿创建新PR


**Import Fresh Benchmarks:**
```bash

**导入新基准测试:**
```bash

Step 1: Check for existing PRs

步骤1:检查现有PR

uv run scripts/evaluation_manager.py get-prs
--repo-id "anthropic/claude-sonnet-4"
uv run scripts/evaluation_manager.py get-prs
--repo-id "anthropic/claude-sonnet-4"

Step 2: If no PRs, import from Artificial Analysis

步骤2:如果没有PR,从Artificial Analysis导入

AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa
--creator-slug "anthropic"
--model-name "claude-sonnet-4"
--repo-id "anthropic/claude-sonnet-4"
--create-pr
undefined
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa
--creator-slug "anthropic"
--model-name "claude-sonnet-4"
--repo-id "anthropic/claude-sonnet-4"
--create-pr
undefined
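The Step 1 / Step 2 guard can be expressed as a small predicate. This is a hypothetical helper, not part of `evaluation_manager.py`; discussions are represented here as plain dicts for illustration:

```python
def should_create_pr(discussions, user_confirmed=False):
    """Return (ok_to_create, open_prs).

    Only allow creating a new PR when no PRs are currently open,
    or when the user has explicitly confirmed despite existing ones.
    """
    open_prs = [
        d for d in discussions
        if d.get("is_pr") and d.get("status") == "open"
    ]
    if open_prs and not user_confirmed:
        # Caller should warn the user and surface the existing PR URLs
        return False, open_prs
    return True, open_prs
```

A caller would print the URLs from `open_prs` when the first element of the result is `False`, matching the warning behavior described above.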

Troubleshooting


Issue: "No evaluation tables found in README"
  • Solution: Check that the README contains markdown tables with numeric scores
Issue: "Could not find model 'X' in transposed table"
  • Solution: The script will display the available models. Use `--model-name-override` with the exact name from the list
  • Example: `--model-name-override "**Olmo 3-32B**"`
Issue: "AA_API_KEY not set"
  • Solution: Set the environment variable or add it to a .env file
Issue: "Token does not have write access"
  • Solution: Ensure HF_TOKEN has write permissions for the repository
Issue: "Model not found in Artificial Analysis"
  • Solution: Verify that creator-slug and model-name match the API values
Issue: "Payment required for hardware"
  • Solution: Add a payment method to your Hugging Face account to use non-CPU hardware
Issue: "vLLM out of memory" or CUDA OOM
  • Solution: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU
Issue: "Model architecture not supported by vLLM"
  • Solution: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) to fall back to HuggingFace Transformers
Issue: "Trust remote code required"
  • Solution: Add the `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)
Issue: "Chat template not found"
  • Solution: Only use `--use-chat-template` for instruction-tuned models that include a chat template
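For automation, the fixes above can be keyed off substrings of the evaluation's stderr. This is a hypothetical lookup table, not part of the skill's scripts; the patterns and suggestions simply restate the troubleshooting entries:

```python
# Map common error substrings to the suggested fixes listed above
FIXES = {
    "CUDA out of memory": "reduce --gpu-memory-utilization or use --tensor-parallel-size",
    "not supported by vLLM": "retry with --backend hf or --backend accelerate",
    "trust remote code": "add the --trust-remote-code flag",
    "chat template": "drop --use-chat-template unless the model ships one",
}

def suggest_fix(stderr):
    """Return the first matching suggestion for an error message, else None."""
    lowered = stderr.lower()
    for pattern, fix in FIXES.items():
        if pattern.lower() in lowered:
            return fix
    return None
```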

Integration Examples


Python Script Integration:

```python
import subprocess

def update_model_evaluations(repo_id):
    """Update a model card with evaluations extracted from its README."""
    result = subprocess.run([
        "python", "scripts/evaluation_manager.py",
        "extract-readme",
        "--repo-id", repo_id,
        "--create-pr",
    ], capture_output=True, text=True)

    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
```