# hugging-face-evaluation
## Overview
This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
- Extracting existing evaluation tables from README content
- Importing benchmark scores from Artificial Analysis
- Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)
## Integration with HF Ecosystem
- Model Cards: Updates model-index metadata for leaderboard integration
- Artificial Analysis: Direct API integration for benchmark imports
- Papers with Code: Compatible with their model-index specification
- Jobs: Run evaluations directly on Hugging Face Jobs with `uv` integration
- vLLM: Efficient GPU inference for custom model evaluation
- lighteval: HuggingFace's evaluation library with vLLM/accelerate backends
- inspect-ai: UK AI Safety Institute's evaluation framework
## Version

1.3.0
## Dependencies

### Core Dependencies
- huggingface_hub>=0.26.0
- markdown-it-py>=3.0.0
- python-dotenv>=1.2.1
- pyyaml>=6.0.3
- requests>=2.32.5
- re (built-in)
### Inference Provider Evaluation
- inspect-ai>=0.3.0
- inspect-evals
- openai
### vLLM Custom Model Evaluation (GPU required)
- lighteval[accelerate,vllm]>=0.6.0
- vllm>=0.4.0
- torch>=2.0.0
- transformers>=4.40.0
- accelerate>=0.30.0
Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`.
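For reference, a PEP 723 header is just a comment block at the top of a script. A minimal sketch (hypothetical file; the skill's real scripts declare their own dependency lists):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "huggingface_hub>=0.26.0",
#     "pyyaml>=6.0.3",
# ]
# ///
# `uv run this_script.py` reads the header above and installs the listed
# dependencies into an ephemeral environment before executing the code.
from huggingface_hub import ModelCard

card = ModelCard.load("gpt2")  # any public repo id works for a smoke test
print(list(card.data.to_dict().keys()))
```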
## IMPORTANT: Using This Skill

### ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones
Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```

If open PRs exist:
- DO NOT create a new PR - this creates duplicate work for maintainers
- Warn the user that open PRs already exist
- Show the user the existing PR URLs so they can review them
- Only proceed if the user explicitly confirms they want to create another PR
This prevents spamming model repositories with duplicate evaluation PRs.
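For scripted use, the same check can be sketched directly against the Hub API; `get-prs` is the supported path, and the repo id below is a placeholder:

```python
# Illustrative only: list open PRs for a repo with huggingface_hub.
from huggingface_hub import HfApi

repo_id = "username/model-name"  # placeholder
open_prs = [d for d in HfApi().get_repo_discussions(repo_id=repo_id)
            if d.is_pull_request and d.status == "open"]
for pr in open_prs:
    print(f"#{pr.num} {pr.title}: https://huggingface.co/{repo_id}/discussions/{pr.num}")
```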
All paths are relative to the directory containing this SKILL.md file. Before running any script, first `cd` to that directory or use the full path.
Use `--help` for the latest workflow guidance. Works with plain Python or `uv run`:

```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
```

Key workflow (matches CLI help; a scripted version follows the list):
- `get-prs` → check for existing open PRs first
- `inspect-tables` → find table numbers/columns
- `extract-readme --table N` → prints YAML by default
- add `--apply` (push) or `--create-pr` to write changes
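A minimal scripted pass over that workflow, mirroring the subprocess pattern used in Integration Examples below (the repo id and table number are placeholders):

```python
# Sketch: check PRs, inspect first, then preview the extraction as YAML.
import subprocess

repo_id = "username/model-name"  # placeholder
base = ["uv", "run", "scripts/evaluation_manager.py"]

subprocess.run(base + ["get-prs", "--repo-id", repo_id], check=True)
subprocess.run(base + ["inspect-tables", "--repo-id", repo_id], check=True)
# Preview only; add --apply or --create-pr once the YAML looks right.
subprocess.run(base + ["extract-readme", "--repo-id", repo_id, "--table", "1"],
               check=True)
```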
## Core Capabilities

### 1. Inspect and Extract Evaluation Tables from README
- Inspect Tables: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows
- Parse Markdown Tables: Accurate parsing using markdown-it-py (ignores code blocks and examples)
- Table Selection: Use `--table N` to extract from a specific table (required when multiple tables exist)
- Format Detection: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
- Column Matching: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output). Use `--model-name-override` only with exact column header text.
- YAML Generation: Convert selected table to model-index YAML format
- Task Typing: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`)
### 2. Import from Artificial Analysis
- API Integration: Fetch benchmark scores directly from Artificial Analysis
- Automatic Formatting: Convert API responses to model-index format
- Metadata Preservation: Maintain source attribution and URLs
- PR Creation: Automatically create pull requests with evaluation updates
### 3. Model-Index Management
- YAML Generation: Create properly formatted model-index entries
- Merge Support: Add evaluations to existing model cards without overwriting
- Validation: Ensure compliance with Papers with Code specification
- Batch Operations: Process multiple models efficiently
### 4. Run Evaluations on HF Jobs (Inference Providers)
- Inspect-AI Integration: Run standard evaluations using the `inspect-ai` library
- UV Integration: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
- Zero-Config: No Dockerfiles or Space management required
- Hardware Selection: Configure CPU or GPU hardware for the evaluation job
- Secure Execution: Handles API tokens safely via secrets passed through the CLI
### 5. Run Custom Model Evaluations with vLLM (NEW)
⚠️ Important: This approach is only possible on devices with `uv` installed and sufficient GPU memory.

Benefits: No need to use the `hf_jobs()` MCP tool; scripts can be run directly in the terminal.

When to use: the user is working directly on a local device with a GPU available.

**Before running the script:**
- check the script path
- check uv is installed
- check GPU availability with `nvidia-smi` (a scripted version of these checks follows)
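The same checks, scripted (the script path matches the example below; adjust as needed):

```python
# Sketch: fail fast before launching a local GPU run.
import pathlib
import shutil
import subprocess

assert pathlib.Path("scripts/train_sft_example.py").exists(), "script path is wrong"
assert shutil.which("uv") is not None, "uv is not installed"
subprocess.run(["nvidia-smi"], check=True)  # fails if nvidia-smi is missing or errors
```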
**Running the script:**

```bash
uv run scripts/train_sft_example.py
```

**Features:**
- vLLM Backend: High-performance GPU inference (5-10x faster than standard HF methods)
- lighteval Framework: HuggingFace's evaluation library with Open LLM Leaderboard tasks
- inspect-ai Framework: UK AI Safety Institute's evaluation library
- Standalone or Jobs: Run locally or submit to HF Jobs infrastructure
## Usage Instructions

The skill includes Python scripts in `scripts/` to perform operations.

### Prerequisites
- Preferred: use `uv run` (PEP 723 header auto-installs deps)
- Or install manually: `pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests`
- Set the `HF_TOKEN` environment variable with a write-access token (a quick verification sketch follows this list)
- For Artificial Analysis: set the `AA_API_KEY` environment variable
- `.env` is loaded automatically if `python-dotenv` is installed
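Before the first run, it can help to confirm the token actually resolves to your account (sketch; assumes `huggingface_hub` is installed and `HF_TOKEN` is set):

```python
# Sketch: verify HF_TOKEN is picked up and valid.
import os
from huggingface_hub import HfApi

assert os.environ.get("HF_TOKEN"), "HF_TOKEN is not set"
print("Authenticated as:", HfApi().whoami()["name"])
```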
### Method 1: Extract from README (CLI workflow)

Recommended flow (matches `--help`):

1) Inspect tables to get table numbers and column hints

```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"
```
2) Extract a specific table (prints YAML by default)

```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  [--model-column-index <column index shown by inspect-tables>] \
  [--model-name-override "<column header/model name>"]  # use exact header text if you can't use the index
```
3) Apply changes (push or PR)

```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --apply  # push directly
```

or
```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --create-pr  # open a PR
```

Validation checklist:
- YAML is printed by default; compare against the README table before applying.
- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.
- For transposed tables (models as rows), ensure only one row is extracted.

### Method 2: Import from Artificial Analysis
Fetch benchmark scores from the Artificial Analysis API and add them to a model card.

**Basic Usage:**

```bash
AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

**With Environment File:**

```bash
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env

# Run import
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```
**Create Pull Request:**

```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name" \
  --create-pr
```

### Method 3: Run Evaluation Job
Submit an evaluation job on Hugging Face infrastructure using the `hf jobs uv run` CLI.

**Direct CLI Usage:**

```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor cpu-basic \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu"
```

**GPU Example (A10G):**

```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
  --task "gsm8k"
```

**Python Helper (optional):**

```bash
uv run scripts/run_eval_job.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu" \
  --hardware "t4-small"
```

### Method 4: Run Custom Model Evaluation with vLLM
Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are separate from inference provider scripts and run models locally on the job's hardware.

#### When to Use vLLM Evaluation (vs Inference Providers)
| Feature | vLLM Scripts | Inference Provider Scripts |
|---|---|---|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |
#### Option A: lighteval with vLLM Backend
lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.

**Standalone (local GPU):**

```bash
# Run MMLU 5-shot with vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"

# Run multiple tasks
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"

# Use accelerate backend instead of vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate

# Chat/instruction-tuned models
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --tasks "leaderboard|mmlu|5" \
  --use-chat-template
```
**Via HF Jobs:**

```bash
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"
```

**lighteval Task Format:**

Tasks use the format `suite|task|num_fewshot`:
- `leaderboard|mmlu|5` - MMLU with 5-shot
- `leaderboard|gsm8k|5` - GSM8K with 5-shot
- `lighteval|hellaswag|0` - HellaSwag zero-shot
- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot

**Finding Available Tasks:**

The complete list of available lighteval tasks can be found at:
https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt

This file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include:
- `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
- `lighteval` - Additional lighteval tasks
- `bigbench` - BigBench tasks
- `original` - Original benchmark tasks

To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `0`) and pass it to the `--tasks` parameter. For example:
- From file: `leaderboard|mmlu|0` → Use: `leaderboard|mmlu|0` (or change the `0` to `5` for 5-shot)
- From file: `bigbench|abstract_narrative_understanding|0` → Use: `bigbench|abstract_narrative_understanding|0`
- From file: `lighteval|wmt14:hi-en|0` → Use: `lighteval|wmt14:hi-en|0`

Multiple tasks can be specified as comma-separated values:
`--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"`
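Putting that to work, a short sketch that pulls the task list and filters one suite (assumes the file format described above; network access required):

```python
# Sketch: list leaderboard tasks from lighteval's all_tasks.txt.
import urllib.request

URL = ("https://raw.githubusercontent.com/huggingface/lighteval/"
       "main/examples/tasks/all_tasks.txt")

with urllib.request.urlopen(URL) as resp:
    lines = resp.read().decode().splitlines()

# Drop the trailing "|0" version flag to get a usable --tasks string.
leaderboard = [line.rsplit("|", 1)[0]
               for line in lines if line.startswith("leaderboard|")]
print("\n".join(leaderboard[:5]))
print(f"... {len(leaderboard)} leaderboard tasks total")
```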
#### Option B: inspect-ai with vLLM Backend
inspect-ai is the UK AI Safety Institute's evaluation framework.

**Standalone (local GPU):**

```bash
# Run MMLU with vLLM
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu

# Use HuggingFace Transformers backend
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --backend hf

# Multi-GPU with tensor parallelism
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --tensor-parallel-size 4
```
**Via HF Jobs:**

```bash
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
  --task mmlu
```

**Available inspect-ai Tasks:**
- `mmlu` - Massive Multitask Language Understanding
- `gsm8k` - Grade School Math
- `hellaswag` - Common sense reasoning
- `arc_challenge` - AI2 Reasoning Challenge
- `truthfulqa` - TruthfulQA benchmark
- `winogrande` - Winograd Schema Challenge
- `humaneval` - Code generation
#### Option C: Python Helper Script
The helper script auto-selects hardware and simplifies job submission:

```bash
# Auto-detect hardware based on model size
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-1B \
  --task "leaderboard|mmlu|5" \
  --framework lighteval

# Explicit hardware selection
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --framework inspect \
  --hardware a100-large \
  --tensor-parallel-size 4

# Use HF Transformers backend
uv run scripts/run_vllm_eval_job.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --framework inspect \
  --backend hf
```
**Hardware Recommendations:**

| Model Size | Recommended Hardware |
|------------|---------------------|
| < 3B params | `t4-small` |
| 3B - 13B | `a10g-small` |
| 13B - 34B | `a10g-large` |
| 34B+ | `a100-large` |
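The helper's auto-detection follows this table; as a sketch (assumption: the real selection logic may differ):

```python
# Sketch: map a parameter count (in billions) to a job flavor.
def pick_flavor(num_params_b: float) -> str:
    if num_params_b < 3:
        return "t4-small"
    if num_params_b < 13:
        return "a10g-small"
    if num_params_b < 34:
        return "a10g-large"
    return "a100-large"

assert pick_flavor(1.2) == "t4-small"
assert pick_flavor(70) == "a100-large"
```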
## Commands Reference
**Top-level help and version:**

```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
```

**Inspect Tables (start here):**

```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
```

**Extract from README:**

```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model-name" \
  --table N \
  [--model-column-index N] \
  [--model-name-override "Exact Column Header or Model Name"] \
  [--task-type "text-generation"] \
  [--dataset-name "Custom Benchmarks"] \
  [--apply | --create-pr]
```

**Import from Artificial Analysis:**

```bash
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "creator-name" \
  --model-name "model-slug" \
  --repo-id "username/model-name" \
  [--create-pr]
```

**View / Validate:**

```bash
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
```

**Check Open PRs (ALWAYS run before --create-pr):**

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```

Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.

**Run Evaluation Job (Inference Providers):**

```bash
hf jobs uv run scripts/inspect_eval_uv.py \
  --flavor "cpu-basic|t4-small|..." \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --task "task-name"
```

or use the Python helper:

```bash
uv run scripts/run_eval_job.py \
  --model "model-id" \
  --task "task-name" \
  --hardware "cpu-basic|t4-small|..."
```

**Run vLLM Evaluation (Custom Models):**

```bash
# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --tasks "leaderboard|mmlu|5"

# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
  --task "mmlu"

# Helper script (auto hardware selection)
uv run scripts/run_vllm_eval_job.py \
  --model "model-id" \
  --task "leaderboard|mmlu|5" \
  --framework lighteval
```

## Model-Index Format
The generated model-index follows this structure:

```yaml
model-index:
- name: Model Name
  results:
  - task:
      type: text-generation
    dataset:
      name: Benchmark Dataset
      type: benchmark_type
    metrics:
    - name: MMLU
      type: mmlu
      value: 85.2
    - name: HumanEval
      type: humaneval
      value: 72.5
    source:
      name: Source Name
      url: https://source-url.com
```

WARNING: Do not use markdown formatting in the model name. Use the exact name from the table. Only use URLs in the `source.url` field.
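To spot-check the metadata after an update, the entries can be read back with `huggingface_hub` (sketch; the repo id is a placeholder):

```python
# Sketch: print the model-index entries of a card.
from huggingface_hub import ModelCard

card = ModelCard.load("username/model-name")  # placeholder
for r in card.data.eval_results or []:
    print(f"{r.dataset_name}: {r.metric_name or r.metric_type} = {r.metric_value}")
```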
## Error Handling
- Table Not Found: Script will report if no evaluation tables are detected
- Invalid Format: Clear error messages for malformed tables
- API Errors: Retry logic for transient Artificial Analysis API failures
- Token Issues: Validation before attempting updates
- Merge Conflicts: Preserves existing model-index entries when adding new ones
- Space Creation: Handles naming conflicts and hardware request failures gracefully
## Best Practices
- Check for existing PRs first: Run `get-prs` before creating any new PR to avoid duplicates
- Always start with `inspect-tables`: See table structure and get the correct extraction command
- Use `--help` for guidance: Run `inspect-tables --help` to see the complete workflow
- Preview first: Default behavior prints YAML; review it before using `--apply` or `--create-pr`
- Verify extracted values: Compare YAML output against the README table manually
- Use `--table N` for multi-table READMEs: Required when multiple evaluation tables exist
- Use `--model-name-override` for comparison tables: Copy the exact column header from `inspect-tables` output
- Create PRs for others: Use `--create-pr` when updating models you don't own
- One model per repo: Only add the main model's results to model-index
- No markdown in YAML names: The model name field in YAML should be plain text
## Model Name Matching
When extracting evaluation tables with multiple models (either as columns or rows), the script uses exact normalized token matching:
- Removes markdown formatting (bold `**`, links `[]()`)
- Normalizes names (lowercase, replace `-` and `_` with spaces)
- Compares token sets: `"OLMo-3-32B"` → `{"olmo", "3", "32b"}` matches `"**Olmo 3 32B**"` or `"[Olmo-3-32B](...)"`
- Only extracts if tokens match exactly (handles different word orders and separators)
- Fails if no exact match is found (rather than guessing from similar names)

For column-based tables (benchmarks as rows, models as columns):
- Finds the column header matching the model name
- Extracts scores from that column only

For transposed tables (models as rows, benchmarks as columns):
- Finds the row in the first column matching the model name
- Extracts all benchmark scores from that row only

This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints.
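A sketch of that normalization (assumption: the script's actual implementation may differ in details):

```python
# Sketch: exact normalized token matching for model names.
import re

def norm_tokens(name: str) -> frozenset:
    name = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", name)  # unwrap [text](url) links
    name = name.replace("**", "")                          # strip bold markers
    name = name.lower().replace("-", " ").replace("_", " ")
    return frozenset(name.split())

assert norm_tokens("OLMo-3-32B") == norm_tokens("**Olmo 3 32B**")
assert norm_tokens("OLMo-3-32B") == norm_tokens("[Olmo-3-32B](https://example.com)")
assert norm_tokens("OLMo-3-32B") != norm_tokens("OLMo-3-7B")
```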
## Common Patterns
**Update Your Own Model:**

```bash
# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --task-type "text-generation" \
  --apply
```
**Update Someone Else's Model (Full Workflow):**

```bash
# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "other-username/their-model"

# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "other-username/their-model" \
  --create-pr
```

If open PRs DO exist:
- Warn the user about existing PRs
- Show them the PR URLs
- Do NOT create a new PR unless user explicitly confirms
**Import Fresh Benchmarks:**

```bash
# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "anthropic/claude-sonnet-4"

# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "anthropic/claude-sonnet-4" \
  --create-pr
```

## Troubleshooting
Issue: "No evaluation tables found in README"
- Solution: Check if README contains markdown tables with numeric scores
Issue: "Could not find model 'X' in transposed table"
- Solution: The script will display available models. Use with the exact name from the list
--model-name-override - Example:
--model-name-override "**Olmo 3-32B**"
Issue: "AA_API_KEY not set"
- Solution: Set environment variable or add to .env file
Issue: "Token does not have write access"
- Solution: Ensure HF_TOKEN has write permissions for the repository
Issue: "Model not found in Artificial Analysis"
- Solution: Verify creator-slug and model-name match API values
Issue: "Payment required for hardware"
- Solution: Add a payment method to your Hugging Face account to use non-CPU hardware
Issue: "vLLM out of memory" or CUDA OOM
- Solution: Use a larger hardware flavor, reduce , or use
--gpu-memory-utilizationfor multi-GPU--tensor-parallel-size
Issue: "Model architecture not supported by vLLM"
- Solution: Use (inspect-ai) or
--backend hf(lighteval) for HuggingFace Transformers--backend accelerate
Issue: "Trust remote code required"
- Solution: Add flag for models with custom code (e.g., Phi-2, Qwen)
--trust-remote-code
Issue: "Chat template not found"
- Solution: Only use for instruction-tuned models that include a chat template
--use-chat-template
## Integration Examples

**Python Script Integration:**

```python
import subprocess

def update_model_evaluations(repo_id, readme_content):
    """Update model card with evaluations from README."""
    result = subprocess.run([
        "python", "scripts/evaluation_manager.py",
        "extract-readme",
        "--repo-id", repo_id,
        "--create-pr"
    ], capture_output=True, text=True)
    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
```