huggingface-community-evals


Overview


This skill is for running evaluations against models on the Hugging Face Hub on local hardware.

It covers:

- `inspect-ai` with local inference
- `lighteval` with local inference
- choosing between `vllm`, Hugging Face Transformers, and `accelerate`
- smoke tests, task selection, and backend fallback strategy

It does not cover:

- Hugging Face Jobs orchestration
- model-card or `model-index` edits
- README table extraction
- Artificial Analysis imports
- `.eval_results` generation or publishing
- PR creation or community-evals automation

If the user wants to run the same eval remotely on Hugging Face Jobs, hand off to the `hugging-face-jobs` skill and pass it one of the local scripts in this skill. If the user wants to publish results into the community evals workflow, stop after generating the evaluation run and hand off that publishing step to `~/code/community-evals`.

All paths below are relative to the directory containing this `SKILL.md`.

When To Use Which Script


| Use case | Script |
| --- | --- |
| Local `inspect-ai` eval on a Hub model via inference providers | `scripts/inspect_eval_uv.py` |
| Local GPU eval with `inspect-ai` using `vllm` or Transformers | `scripts/inspect_vllm_uv.py` |
| Local GPU eval with `lighteval` using `vllm` or `accelerate` | `scripts/lighteval_vllm_uv.py` |
| Extra command patterns | `examples/USAGE_EXAMPLES.md` |

Prerequisites


- Prefer `uv run` for local execution.
- Set `HF_TOKEN` for gated/private models.
- For local GPU runs, verify GPU access before starting:

```bash
uv --version
printenv HF_TOKEN >/dev/null
nvidia-smi
```

If `nvidia-smi` is unavailable, either:

- use `scripts/inspect_eval_uv.py` for lighter provider-backed evaluation, or
- hand off to the `hugging-face-jobs` skill if the user wants remote compute.
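The same fallback decision can be folded into a tiny helper. `pick_eval_path` is a hypothetical name sketching the logic above; it is not part of any script in this skill.

```python
import shutil


def pick_eval_path() -> str:
    """Decide which local eval path applies, per the preflight check above.

    If nvidia-smi is on PATH, local GPU runs are possible; otherwise fall
    back to the provider-backed script or a remote-jobs handoff.
    """
    if shutil.which("nvidia-smi"):
        return "local-gpu"  # inspect_vllm_uv.py / lighteval_vllm_uv.py
    return "provider-or-remote"  # inspect_eval_uv.py or hugging-face-jobs
```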

Core Workflow


1. Choose the evaluation framework.
   - Use `inspect-ai` when you want explicit task control and inspect-native flows.
   - Use `lighteval` when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks.
2. Choose the inference backend.
   - Prefer `vllm` for throughput on supported architectures.
   - Use Hugging Face Transformers (`--backend hf`) or `accelerate` as compatibility fallbacks.
3. Start with a smoke test.
   - `inspect-ai`: add `--limit 10` or similar.
   - `lighteval`: add `--max-samples 10`.
4. Scale up only after the smoke test passes.
5. If the user wants remote execution, hand off to `hugging-face-jobs` with the same script + args.
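Steps 1-3 can be sketched as a command builder. The script names match this skill's scripts, but treat the helper itself and its exact flag spellings as assumptions.

```python
def build_smoke_cmd(framework: str, backend: str, model: str) -> list:
    """Assemble a smoke-test command line per the workflow above."""
    if framework == "inspect-ai":
        return ["uv", "run", "scripts/inspect_vllm_uv.py",
                "--model", model, "--backend", backend, "--limit", "10"]
    if framework == "lighteval":
        return ["uv", "run", "scripts/lighteval_vllm_uv.py",
                "--model", model, "--backend", backend, "--max-samples", "10"]
    raise ValueError(f"unknown framework: {framework}")
```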

Quick Start


Option A: inspect-ai with the Inference Providers path


Best when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead.

```bash
uv run scripts/inspect_eval_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --limit 20
```

Use this path when:

- you want a quick local smoke test
- you do not need direct GPU control
- the task already exists in `inspect-evals`

Option B: inspect-ai on Local GPU


Best when you need to load the Hub model directly, use `vllm`, or fall back to Transformers for unsupported architectures.

Local GPU:

```bash
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task gsm8k \
  --limit 20
```

Transformers fallback:

```bash
uv run scripts/inspect_vllm_uv.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --backend hf \
  --trust-remote-code \
  --limit 20
```

Option C: lighteval on Local GPU


Best when the task is naturally expressed as a `lighteval` task string, especially Open LLM Leaderboard style benchmarks.

Local GPU:

```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \
  --max-samples 20 \
  --use-chat-template
```

`accelerate` fallback:

```bash
uv run scripts/lighteval_vllm_uv.py \
  --model microsoft/phi-2 \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate \
  --trust-remote-code \
  --max-samples 20
```

Remote Execution Boundary


This skill intentionally stops at local execution and backend selection. If the user wants to:

- run these scripts on Hugging Face Jobs
- pick remote hardware
- pass secrets to remote jobs
- schedule recurring runs
- inspect / cancel / monitor jobs

then switch to the `hugging-face-jobs` skill and pass it one of these scripts plus the chosen arguments.

Task Selection


`inspect-ai` examples:

- `mmlu`
- `gsm8k`
- `hellaswag`
- `arc_challenge`
- `truthfulqa`
- `winogrande`
- `humaneval`

`lighteval` task strings use `suite|task|num_fewshot`:

- `leaderboard|mmlu|5`
- `leaderboard|gsm8k|5`
- `leaderboard|arc_challenge|25`
- `lighteval|hellaswag|0`

Multiple `lighteval` tasks can be comma-separated in `--tasks`.
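A comma-separated `--tasks` value can be validated with a few lines. This parser is illustrative, not part of lighteval's own API, and it assumes the three-field form shown above (some lighteval versions accept an extra field, which this sketch does not handle).

```python
def parse_lighteval_tasks(tasks: str) -> list:
    """Split a comma-separated --tasks value into (suite, task, num_fewshot)."""
    parsed = []
    for spec in tasks.split(","):
        parts = spec.strip().split("|")
        if len(parts) != 3:
            raise ValueError(f"bad task spec: {spec!r}")
        suite, task, fewshot = parts
        parsed.append((suite, task, int(fewshot)))
    return parsed
```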

Backend Selection


- Prefer `inspect_vllm_uv.py --backend vllm` for fast GPU inference on supported architectures.
- Use `inspect_vllm_uv.py --backend hf` when `vllm` does not support the model.
- Prefer `lighteval_vllm_uv.py --backend vllm` for throughput on supported models.
- Use `lighteval_vllm_uv.py --backend accelerate` as the compatibility fallback.
- Use `inspect_eval_uv.py` when Inference Providers already cover the model and you do not need direct GPU control.
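The fallback order above can be sketched as data; `next_backend` is a hypothetical helper, not part of either framework.

```python
# Preferred-first backend order per framework, per the bullets above.
FALLBACK_ORDER = {
    "inspect-ai": ["vllm", "hf"],          # --backend hf = Transformers
    "lighteval": ["vllm", "accelerate"],
}


def next_backend(framework: str, failed: str):
    """Return the compatibility fallback to try after `failed`, or None."""
    order = FALLBACK_ORDER[framework]
    i = order.index(failed)
    return order[i + 1] if i + 1 < len(order) else None
```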

Hardware Guidance


| Model size | Suggested local hardware |
| --- | --- |
| < 3B | consumer GPU / Apple Silicon / small dev GPU |
| 3B - 13B | stronger local GPU |
| 13B+ | high-memory local GPU, or hand off to `hugging-face-jobs` |

For smoke tests, prefer cheaper local runs plus `--limit` or `--max-samples`.
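Read as code, the table is a simple mapping. The 3B/13B cut points follow the table; how exact boundary values fall is an assumption of this sketch.

```python
def hardware_tier(params_billions: float) -> str:
    """Suggest local hardware for a model of the given parameter count."""
    if params_billions < 3:
        return "consumer GPU / Apple Silicon / small dev GPU"
    if params_billions <= 13:
        return "stronger local GPU"
    return "high-memory local GPU or hand off to hugging-face-jobs"
```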

Troubleshooting


- CUDA or vLLM OOM:
  - reduce `--batch-size`
  - reduce `--gpu-memory-utilization`
  - switch to a smaller model for the smoke test
  - if necessary, hand off to `hugging-face-jobs`
- Model unsupported by `vllm`:
  - switch to `--backend hf` for `inspect-ai`
  - switch to `--backend accelerate` for `lighteval`
- Gated/private repo access fails: verify `HF_TOKEN`
- Custom model code required: add `--trust-remote-code`
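The OOM mitigation steps form a retry ladder; a minimal sketch, where the specific batch sizes and memory fractions are illustrative assumptions rather than tuned recommendations:

```python
def oom_retry_args(attempt: int) -> dict:
    """Progressively more conservative settings for OOM retries.

    attempt 0 is the first try; later attempts shrink the batch and
    reserve a smaller fraction of GPU memory.
    """
    ladder = [
        {"batch_size": 8, "gpu_memory_utilization": 0.90},
        {"batch_size": 4, "gpu_memory_utilization": 0.75},
        {"batch_size": 1, "gpu_memory_utilization": 0.60},
    ]
    return ladder[min(attempt, len(ladder) - 1)]
```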

Examples


See:

- `examples/USAGE_EXAMPLES.md` for local command patterns
- `scripts/inspect_eval_uv.py`
- `scripts/inspect_vllm_uv.py`
- `scripts/lighteval_vllm_uv.py`