hugging-face-evaluation


Overview


This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
  • Extracting existing evaluation tables from README content
  • Importing benchmark scores from Artificial Analysis
  • Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)

Integration with HF Ecosystem


  • Model Cards: Updates model-index metadata for leaderboard integration
  • Artificial Analysis: Direct API integration for benchmark imports
  • Papers with Code: Compatible with their model-index specification
  • Jobs: Run evaluations directly on Hugging Face Jobs with `uv` integration
  • vLLM: Efficient GPU inference for custom model evaluation
  • lighteval: HuggingFace's evaluation library with vLLM/accelerate backends
  • inspect-ai: UK AI Safety Institute's evaluation framework

Version


1.3.0

Dependencies


Core Dependencies


  • huggingface_hub>=0.26.0
  • markdown-it-py>=3.0.0
  • python-dotenv>=1.2.1
  • pyyaml>=6.0.3
  • requests>=2.32.5
  • re (built-in)

Inference Provider Evaluation


  • inspect-ai>=0.3.0
  • inspect-evals
  • openai

vLLM Custom Model Evaluation (GPU required)


  • lighteval[accelerate,vllm]>=0.6.0
  • vllm>=0.4.0
  • torch>=2.0.0
  • transformers>=4.40.0
  • accelerate>=0.30.0
Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`.
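For reference, a PEP 723 header is a commented TOML block at the top of the script. The sketch below shows the general shape; the exact dependency pins in the skill's scripts may differ:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "vllm>=0.4.0",
#   "lighteval[accelerate,vllm]>=0.6.0",
# ]
# ///
# When invoked with `uv run`, uv reads the header above and installs the
# listed dependencies into an ephemeral environment before execution.


def main():
    # Placeholder body; a real script would import vllm/lighteval here.
    print("dependencies resolved by uv from the PEP 723 header")


if __name__ == "__main__":
    main()
```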

IMPORTANT: Using This Skill


⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones


Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```
If open PRs exist:
  1. DO NOT create a new PR - this creates duplicate work for maintainers
  2. Warn the user that open PRs already exist
  3. Show the user the existing PR URLs so they can review them
  4. Only proceed if the user explicitly confirms they want to create another PR
This prevents spamming model repositories with duplicate evaluation PRs.
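The decision rule above can be sketched in Python. This is pure illustration: the `Discussion` record below is a hypothetical stand-in for the objects returned by `huggingface_hub`'s `get_repo_discussions`, and `should_create_pr` is not part of the skill's scripts:

```python
from dataclasses import dataclass


@dataclass
class Discussion:
    """Minimal hypothetical record mirroring the fields the Hub returns."""
    num: int
    title: str
    status: str           # "open" or "closed"
    is_pull_request: bool


def open_prs(discussions):
    """Return only open pull requests, mirroring the `get-prs` check."""
    return [d for d in discussions if d.is_pull_request and d.status == "open"]


def should_create_pr(discussions, user_confirmed=False):
    """Create a new PR only if no PRs are open, or the user explicitly confirms."""
    return not open_prs(discussions) or user_confirmed


discussions = [
    Discussion(1, "Add eval results", "open", True),
    Discussion(2, "Typo fix", "closed", True),
    Discussion(3, "Question about license", "open", False),
]
print(should_create_pr(discussions))                       # blocked: PR #1 is open
print(should_create_pr(discussions, user_confirmed=True))  # allowed after confirmation
```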

All paths are relative to the directory containing this SKILL.md file. Before running any script, first `cd` to that directory or use the full path.
Use `--help` for the latest workflow guidance.
Works with plain Python or `uv run`:

```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
```
Key workflow (matches CLI help):
  1. `get-prs` → check for existing open PRs first
  2. `inspect-tables` → find table numbers/columns
  3. `extract-readme --table N` → prints YAML by default
  4. Add `--apply` (push) or `--create-pr` to write changes

Core Capabilities


1. Inspect and Extract Evaluation Tables from README


  • Inspect Tables: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows
  • Parse Markdown Tables: Accurate parsing using markdown-it-py (ignores code blocks and examples)
  • Table Selection: Use `--table N` to extract from a specific table (required when multiple tables exist)
  • Format Detection: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
  • Column Matching: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output). Use `--model-name-override` only with exact column header text.
  • YAML Generation: Convert selected table to model-index YAML format
  • Task Typing: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`)
2. Import from Artificial Analysis


  • API Integration: Fetch benchmark scores directly from Artificial Analysis
  • Automatic Formatting: Convert API responses to model-index format
  • Metadata Preservation: Maintain source attribution and URLs
  • PR Creation: Automatically create pull requests with evaluation updates

3. Model-Index Management


  • YAML Generation: Create properly formatted model-index entries
  • Merge Support: Add evaluations to existing model cards without overwriting
  • Validation: Ensure compliance with Papers with Code specification
  • Batch Operations: Process multiple models efficiently

4. Run Evaluations on HF Jobs (Inference Providers)


  • Inspect-AI Integration: Run standard evaluations using the `inspect-ai` library
  • UV Integration: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
  • Zero-Config: No Dockerfiles or Space management required
  • Hardware Selection: Configure CPU or GPU hardware for the evaluation job
  • Secure Execution: Handles API tokens safely via secrets passed through the CLI

5. Run Custom Model Evaluations with vLLM (NEW)


⚠️ Important: This approach is only possible on devices with `uv` installed and sufficient GPU memory.
Benefits: No need to use the `hf_jobs()` MCP tool; scripts can be run directly in the terminal.
When to use: The user is working directly on a local device with a GPU available.

Before running the script


  • Check the script path
  • Check that uv is installed
  • Check that a GPU is available with `nvidia-smi`

Running the script


```bash
uv run scripts/train_sft_example.py
```

Features


  • vLLM Backend: High-performance GPU inference (5-10x faster than standard HF methods)
  • lighteval Framework: HuggingFace's evaluation library with Open LLM Leaderboard tasks
  • inspect-ai Framework: UK AI Safety Institute's evaluation library
  • Standalone or Jobs: Run locally or submit to HF Jobs infrastructure

Usage Instructions


The skill includes Python scripts in `scripts/` to perform operations.

Prerequisites


  • Preferred: use `uv run` (PEP 723 header auto-installs deps)
  • Or install manually: `pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests`
  • Set the `HF_TOKEN` environment variable to a token with write access
  • For Artificial Analysis: set the `AA_API_KEY` environment variable
  • `.env` is loaded automatically if `python-dotenv` is installed

Method 1: Extract from README (CLI workflow)


Recommended flow (matches `--help`):

1) Inspect tables to get table numbers and column hints


```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"
```

2) Extract a specific table (prints YAML by default)


```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  [--model-column-index <column index shown by inspect-tables>] \
  [--model-name-override "<column header/model name>"]  # use exact header text if you can't use the index
```

3) Apply changes (push or PR)


```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --apply  # push directly
```

or

```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --create-pr  # open a PR
```

Validation checklist:
- YAML is printed by default; compare against the README table before applying.
- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.
- For transposed tables (models as rows), ensure only one row is extracted.

Method 2: Import from Artificial Analysis


Fetch benchmark scores from the Artificial Analysis API and add them to a model card.
Basic Usage:

```bash
AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

With Environment File:

Create .env file


```bash
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env
```

Run import


```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

**Create Pull Request:**
```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name" \
  --create-pr
```

Method 3: Run Evaluation Job


Submit an evaluation job on Hugging Face infrastructure using the `hf jobs uv run` CLI.
Direct CLI Usage:

```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor cpu-basic \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
     --task "mmlu"
```

GPU Example (A10G):

```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
     --task "gsm8k"
```

Python Helper (optional):

```bash
uv run scripts/run_eval_job.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu" \
  --hardware "t4-small"
```

Method 4: Run Custom Model Evaluation with vLLM


Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are separate from inference provider scripts and run models locally on the job's hardware.

When to Use vLLM Evaluation (vs Inference Providers)


| Feature | vLLM Scripts | Inference Provider Scripts |
|---------|--------------|----------------------------|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |

Option A: lighteval with vLLM Backend


lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.
Standalone (local GPU):

Run MMLU 5-shot with vLLM


```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"
```

Run multiple tasks


```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
```

Use accelerate backend instead of vLLM


```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate
```

Chat/instruction-tuned models


```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --tasks "leaderboard|mmlu|5" \
  --use-chat-template
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
     --tasks "leaderboard|mmlu|5"
```
lighteval Task Format: Tasks use the format `suite|task|num_fewshot`:
  • `leaderboard|mmlu|5` - MMLU with 5-shot
  • `leaderboard|gsm8k|5` - GSM8K with 5-shot
  • `lighteval|hellaswag|0` - HellaSwag zero-shot
  • `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot
Finding Available Tasks: The complete list of available lighteval tasks can be found at: https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt
This file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include:
  • `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
  • `lighteval` - Additional lighteval tasks
  • `bigbench` - BigBench tasks
  • `original` - Original benchmark tasks
To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `0`) and pass it to the `--tasks` parameter. For example:
  • From file: `leaderboard|mmlu|0` → Use: `leaderboard|mmlu|0` (or change to `5` for 5-shot)
  • From file: `bigbench|abstract_narrative_understanding|0` → Use: `bigbench|abstract_narrative_understanding|0`
  • From file: `lighteval|wmt14:hi-en|0` → Use: `lighteval|wmt14:hi-en|0`
Multiple tasks can be specified as comma-separated values: `--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"`
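The conversion described above can be sketched in a few lines. Note that `task_from_listing` is a hypothetical helper for illustration, not part of the skill's scripts:

```python
def task_from_listing(line, num_fewshot=None):
    """Convert a line from all_tasks.txt into a --tasks value.

    Drops the trailing version flag if present and optionally overrides
    the few-shot count.
    """
    parts = line.strip().split("|")
    if len(parts) == 4:       # suite|task|num_fewshot|0 -> drop version flag
        parts = parts[:3]
    suite, task, fewshot = parts
    if num_fewshot is not None:
        fewshot = str(num_fewshot)
    return f"{suite}|{task}|{fewshot}"


# Build a comma-separated --tasks value from several listing lines:
tasks = ",".join([
    task_from_listing("leaderboard|mmlu|0|0", num_fewshot=5),
    task_from_listing("leaderboard|gsm8k|0|0", num_fewshot=5),
])
print(tasks)  # leaderboard|mmlu|5,leaderboard|gsm8k|5
```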

Option B: inspect-ai with vLLM Backend


inspect-ai is the UK AI Safety Institute's evaluation framework.
Standalone (local GPU):

Run MMLU with vLLM


```bash
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu
```

Use HuggingFace Transformers backend


```bash
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --backend hf
```

Multi-GPU with tensor parallelism


```bash
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --tensor-parallel-size 4
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
     --task mmlu
```
Available inspect-ai Tasks:
  • `mmlu` - Massive Multitask Language Understanding
  • `gsm8k` - Grade School Math
  • `hellaswag` - Common sense reasoning
  • `arc_challenge` - AI2 Reasoning Challenge
  • `truthfulqa` - TruthfulQA benchmark
  • `winogrande` - Winograd Schema Challenge
  • `humaneval` - Code generation

Option C: Python Helper Script


The helper script auto-selects hardware and simplifies job submission:

Auto-detect hardware based on model size


```bash
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-1B \
  --task "leaderboard|mmlu|5" \
  --framework lighteval
```

Explicit hardware selection


```bash
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --framework inspect \
  --hardware a100-large \
  --tensor-parallel-size 4
```

Use HF Transformers backend


```bash
uv run scripts/run_vllm_eval_job.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --framework inspect \
  --backend hf
```

**Hardware Recommendations:**
| Model Size | Recommended Hardware |
|------------|---------------------|
| < 3B params | `t4-small` |
| 3B - 13B | `a10g-small` |
| 13B - 34B | `a10g-large` |
| 34B+ | `a100-large` |
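The table above can be expressed as a small lookup. This is an illustrative sketch; `run_vllm_eval_job.py` may apply different thresholds internally:

```python
def recommend_hardware(params_billion):
    """Map a model's parameter count (in billions) to a hardware flavor,
    mirroring the recommendations table above (illustration only)."""
    if params_billion < 3:
        return "t4-small"
    if params_billion <= 13:
        return "a10g-small"
    if params_billion <= 34:
        return "a10g-large"
    return "a100-large"


print(recommend_hardware(1))   # t4-small
print(recommend_hardware(70))  # a100-large
```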

Commands Reference


Top-level help and version:

```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
```

Inspect Tables (start here):

```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
```

Extract from README:

```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model-name" \
  --table N \
  [--model-column-index N] \
  [--model-name-override "Exact Column Header or Model Name"] \
  [--task-type "text-generation"] \
  [--dataset-name "Custom Benchmarks"] \
  [--apply | --create-pr]
```

Import from Artificial Analysis:

```bash
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "creator-name" \
  --model-name "model-slug" \
  --repo-id "username/model-name" \
  [--create-pr]
```

View / Validate:

```bash
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
```

Check Open PRs (ALWAYS run before --create-pr):

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```

Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.
Run Evaluation Job (Inference Providers):

```bash
hf jobs uv run scripts/inspect_eval_uv.py \
  --flavor "cpu-basic|t4-small|..." \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --task "task-name"
```

or use the Python helper:

```bash
uv run scripts/run_eval_job.py \
  --model "model-id" \
  --task "task-name" \
  --hardware "cpu-basic|t4-small|..."
```

Run vLLM Evaluation (Custom Models):

lighteval with vLLM


```bash
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --tasks "leaderboard|mmlu|5"
```

inspect-ai with vLLM


```bash
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --task "mmlu"
```

Helper script (auto hardware selection)


```bash
uv run scripts/run_vllm_eval_job.py \
  --model "model-id" \
  --task "leaderboard|mmlu|5" \
  --framework lighteval
```

Model-Index Format


The generated model-index follows this structure:

```yaml
model-index:
  - name: Model Name
    results:
      - task:
          type: text-generation
        dataset:
          name: Benchmark Dataset
          type: benchmark_type
        metrics:
          - name: MMLU
            type: mmlu
            value: 85.2
          - name: HumanEval
            type: humaneval
            value: 72.5
        source:
          name: Source Name
          url: https://source-url.com
```

WARNING: Do not use markdown formatting in the model name. Use the exact name from the table. Only use URLs in the source.url field.
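For illustration, the same structure can be assembled programmatically before serializing it to YAML. This is a sketch: `make_result` is a hypothetical helper, not part of the skill's scripts:

```python
def make_result(task_type, dataset_name, dataset_type, metrics, source_name, source_url):
    """Assemble one model-index result entry matching the structure above."""
    return {
        "task": {"type": task_type},
        "dataset": {"name": dataset_name, "type": dataset_type},
        "metrics": [{"name": n, "type": t, "value": v} for n, t, v in metrics],
        "source": {"name": source_name, "url": source_url},
    }


model_index = [{
    "name": "Model Name",  # plain text only: no markdown, no URLs here
    "results": [make_result(
        "text-generation", "Benchmark Dataset", "benchmark_type",
        [("MMLU", "mmlu", 85.2), ("HumanEval", "humaneval", 72.5)],
        "Source Name", "https://source-url.com",
    )],
}]
print(model_index[0]["results"][0]["metrics"][0])
```

Serializing `model_index` with `yaml.safe_dump` (pyyaml is already a core dependency) yields the YAML block shown above.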

Error Handling


  • Table Not Found: Script will report if no evaluation tables are detected
  • Invalid Format: Clear error messages for malformed tables
  • API Errors: Retry logic for transient Artificial Analysis API failures
  • Token Issues: Validation before attempting updates
  • Merge Conflicts: Preserves existing model-index entries when adding new ones
  • Space Creation: Handles naming conflicts and hardware request failures gracefully

Best Practices


  1. Check for existing PRs first: Run `get-prs` before creating any new PR to avoid duplicates
  2. Always start with `inspect-tables`: See table structure and get the correct extraction command
  3. Use `--help` for guidance: Run `inspect-tables --help` to see the complete workflow
  4. Preview first: Default behavior prints YAML; review it before using `--apply` or `--create-pr`
  5. Verify extracted values: Compare YAML output against the README table manually
  6. Use `--table N` for multi-table READMEs: Required when multiple evaluation tables exist
  7. Use `--model-name-override` for comparison tables: Copy the exact column header from `inspect-tables` output
  8. Create PRs for Others: Use `--create-pr` when updating models you don't own
  9. One model per repo: Only add the main model's results to model-index
  10. No markdown in YAML names: The model name field in YAML should be plain text

Model Name Matching


When extracting evaluation tables with multiple models (either as columns or rows), the script uses exact normalized token matching:
  • Removes markdown formatting (bold `**`, links `[]()`)
  • Normalizes names (lowercase, replace `-` and `_` with spaces)
  • Compares token sets: `"OLMo-3-32B"` → `{"olmo", "3", "32b"}` matches `"**Olmo 3 32B**"` or `"[Olmo-3-32B](...)"`
  • Only extracts if tokens match exactly (handles different word orders and separators)
  • Fails if no exact match is found (rather than guessing from similar names)
For column-based tables (benchmarks as rows, models as columns):
  • Finds the column header matching the model name
  • Extracts scores from that column only
For transposed tables (models as rows, benchmarks as columns):
  • Finds the row in the first column matching the model name
  • Extracts all benchmark scores from that row only
This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints.
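The normalization rules above can be sketched as follows (an illustrative approximation, not the skill's actual implementation):

```python
import re


def normalize_tokens(name):
    """Strip markdown and normalize a model name to a token set,
    mirroring the matching rules described above."""
    name = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", name)  # [text](url) -> text
    name = name.replace("**", "")                          # drop bold markers
    name = name.lower().replace("-", " ").replace("_", " ")
    return frozenset(name.split())


target = normalize_tokens("OLMo-3-32B")
print(target == normalize_tokens("**Olmo 3 32B**"))                # exact token match
print(target == normalize_tokens("[Olmo-3-32B](https://x.test)"))  # link stripped, match
print(target == normalize_tokens("Olmo-3-7B"))                     # different size, no match
```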

Common Patterns


Update Your Own Model:

Extract from README and push directly


```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --task-type "text-generation" \
  --apply
```

**Update Someone Else's Model (Full Workflow):**

Step 1: ALWAYS check for existing PRs first


```bash
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "other-username/their-model"
```

Step 2: If NO open PRs exist, proceed with creating one


```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "other-username/their-model" \
  --create-pr
```
uv run scripts/evaluation_manager.py extract-readme
--repo-id "other-username/their-model"
--create-pr

If open PRs DO exist:

如果存在打开的PR:

- Warn the user about existing PRs

- 警告用户存在现有PR

- Show them the PR URLs

- 向他们展示PR URL

- Do NOT create a new PR unless user explicitly confirms

- 除非用户明确确认,否则请勿创建新PR


**Import Fresh Benchmarks:**
```bash

**导入新基准测试:**
```bash

Step 1: Check for existing PRs

步骤1:检查现有PR

uv run scripts/evaluation_manager.py get-prs
--repo-id "anthropic/claude-sonnet-4"
uv run scripts/evaluation_manager.py get-prs
--repo-id "anthropic/claude-sonnet-4"

Step 2: If no PRs, import from Artificial Analysis

步骤2:如果没有PR,从Artificial Analysis导入

AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa
--creator-slug "anthropic"
--model-name "claude-sonnet-4"
--repo-id "anthropic/claude-sonnet-4"
--create-pr
undefined
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa
--creator-slug "anthropic"
--model-name "claude-sonnet-4"
--repo-id "anthropic/claude-sonnet-4"
--create-pr
undefined
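The Step 1 / Step 2 guard can be expressed as a small predicate. This is a hypothetical helper, not part of `evaluation_manager.py`; discussions are represented here as plain dicts for illustration:

```python
def should_create_pr(discussions, user_confirmed=False):
    """Return (ok_to_create, open_prs).

    Only allow creating a new PR when no PRs are currently open,
    or when the user has explicitly confirmed despite existing ones.
    """
    open_prs = [
        d for d in discussions
        if d.get("is_pr") and d.get("status") == "open"
    ]
    if open_prs and not user_confirmed:
        # Caller should warn the user and surface the existing PR URLs
        return False, open_prs
    return True, open_prs
```

A caller would print the URLs from `open_prs` when the first element of the result is `False`, matching the warning behavior described above.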

Troubleshooting


Issue: "No evaluation tables found in README"
  • Solution: Check that the README contains markdown tables with numeric scores
Issue: "Could not find model 'X' in transposed table"
  • Solution: The script will display the available models. Use `--model-name-override` with the exact name from the list
  • Example: `--model-name-override "**Olmo 3-32B**"`
Issue: "AA_API_KEY not set"
  • Solution: Set the environment variable or add it to a .env file
Issue: "Token does not have write access"
  • Solution: Ensure HF_TOKEN has write permissions for the repository
Issue: "Model not found in Artificial Analysis"
  • Solution: Verify that creator-slug and model-name match the API values
Issue: "Payment required for hardware"
  • Solution: Add a payment method to your Hugging Face account to use non-CPU hardware
Issue: "vLLM out of memory" or CUDA OOM
  • Solution: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU
Issue: "Model architecture not supported by vLLM"
  • Solution: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) to fall back to HuggingFace Transformers
Issue: "Trust remote code required"
  • Solution: Add the `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)
Issue: "Chat template not found"
  • Solution: Only use `--use-chat-template` for instruction-tuned models that include a chat template
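For automation, the fixes above can be keyed off substrings of the evaluation's stderr. This is a hypothetical lookup table, not part of the skill's scripts; the patterns and suggestions simply restate the troubleshooting entries:

```python
# Map common error substrings to the suggested fixes listed above
FIXES = {
    "CUDA out of memory": "reduce --gpu-memory-utilization or use --tensor-parallel-size",
    "not supported by vLLM": "retry with --backend hf or --backend accelerate",
    "trust remote code": "add the --trust-remote-code flag",
    "chat template": "drop --use-chat-template unless the model ships one",
}

def suggest_fix(stderr):
    """Return the first matching suggestion for an error message, else None."""
    lowered = stderr.lower()
    for pattern, fix in FIXES.items():
        if pattern.lower() in lowered:
            return fix
    return None
```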

Integration Examples


Python Script Integration:

```python
import subprocess

def update_model_evaluations(repo_id):
    """Update a model card with evaluations extracted from its README."""
    result = subprocess.run([
        "python", "scripts/evaluation_manager.py",
        "extract-readme",
        "--repo-id", repo_id,
        "--create-pr",
    ], capture_output=True, text=True)

    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
```