# huggingface-community-evals
## Overview
This skill is for running evaluations of models from the Hugging Face Hub on local hardware.
It covers:
- `inspect-ai` with local inference
- `lighteval` with local inference
- choosing between `vllm`, Hugging Face Transformers, and `accelerate`
- smoke tests, task selection, and backend fallback strategy

It does not cover:

- Hugging Face Jobs orchestration
- model-card or `model-index` edits
- README table extraction
- Artificial Analysis imports
- `.eval_results` generation or publishing
- PR creation or community-evals automation

If the user wants to run the same eval remotely on Hugging Face Jobs, hand off to the `hugging-face-jobs` skill and pass it one of the local scripts in this skill.

If the user wants to publish results into the community evals workflow, stop after generating the evaluation run and hand off that publishing step to `~/code/community-evals`.

All paths below are relative to the directory containing this `SKILL.md`.
## When To Use Which Script
| Use case | Script |
|---|---|
| Local eval of a Hub model via inference providers | `scripts/inspect_eval_uv.py` |
| Local GPU eval with `inspect-ai` | `scripts/inspect_vllm_uv.py` |
| Local GPU eval with `lighteval` | `scripts/lighteval_vllm_uv.py` |
| Extra command patterns | `examples/USAGE_EXAMPLES.md` |
## Prerequisites
- Prefer `uv run` for local execution.
- Set `HF_TOKEN` for gated/private models.
- For local GPU runs, verify GPU access before starting:

```bash
uv --version
printenv HF_TOKEN >/dev/null
nvidia-smi
```

If `nvidia-smi` is unavailable, either:

- use `scripts/inspect_eval_uv.py` for lighter provider-backed evaluation, or
- hand off to the `hugging-face-jobs` skill if the user wants remote compute.
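A minimal preflight sketch that combines these checks and prints the fallback options when no GPU is available (script and skill names come from this document; the exact messages are illustrative):

```bash
#!/usr/bin/env bash
# Preflight for local GPU evals: check uv, HF_TOKEN, and GPU visibility.
set -euo pipefail

uv --version >/dev/null

if ! printenv HF_TOKEN >/dev/null; then
  echo "note: HF_TOKEN is unset; gated/private models will fail to download" >&2
fi

if ! command -v nvidia-smi >/dev/null || ! nvidia-smi >/dev/null 2>&1; then
  echo "no usable GPU detected; options:" >&2
  echo "  - scripts/inspect_eval_uv.py (provider-backed, lighter)" >&2
  echo "  - hand off to the hugging-face-jobs skill for remote compute" >&2
  exit 1
fi

echo "GPU preflight passed"
```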
## Core Workflow
- Choose the evaluation framework.
  - Use `inspect-ai` when you want explicit task control and inspect-native flows.
  - Use `lighteval` when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks.
- Choose the inference backend.
  - Prefer `vllm` for throughput on supported architectures.
  - Use Hugging Face Transformers (`--backend hf`) or `accelerate` as compatibility fallbacks.
- Start with a smoke test (see the sketch after this list).
  - `inspect-ai`: add `--limit 10` or similar.
  - `lighteval`: add `--max-samples 10`.
- Scale up only after the smoke test passes.
- If the user wants remote execution, hand off to `hugging-face-jobs` with the same script + args.
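A sketch of the smoke-test-then-scale pattern on the provider-backed path (this assumes omitting `--limit` makes the script run the full task, which matches typical inspect-ai behavior but is worth verifying against the script):

```bash
# Smoke test: 10 samples only.
uv run scripts/inspect_eval_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --limit 10

# Full run: drop --limit once the smoke test passes.
uv run scripts/inspect_eval_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu
```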
## Quick Start
### Option A: `inspect-ai` via Inference Providers
Best when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead.

```bash
uv run scripts/inspect_eval_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --limit 20
```

Use this path when:

- you want a quick local smoke test
- you do not need direct GPU control
- the task already exists in `inspect-evals`
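Llama models are gated, so this path needs `HF_TOKEN` in the environment (per Prerequisites); a sketch with a placeholder token:

```bash
# HF_TOKEN must belong to an account with access to the gated repo.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx  # placeholder, not a real token
uv run scripts/inspect_eval_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --limit 20
```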
### Option B: `inspect-ai` on Local GPU
Best when you need to load the Hub model directly, use `vllm`, or fall back to Transformers for unsupported architectures.

Local GPU:

```bash
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task gsm8k \
  --limit 20
```

Transformers fallback:

```bash
uv run scripts/inspect_vllm_uv.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --backend hf \
  --trust-remote-code \
  --limit 20
```
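To browse the resulting eval logs, `inspect-ai` ships a local viewer; a sketch, assuming logs were written to the default `./logs` directory:

```bash
# Opens a local web UI over inspect-ai eval logs.
uv run --with inspect-ai inspect view --log-dir ./logs
```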
### Option C: `lighteval` on Local GPU
Best when the task is naturally expressed as a `lighteval` task string, especially Open LLM Leaderboard style benchmarks.

Local GPU:

```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \
  --max-samples 20 \
  --use-chat-template
```

`accelerate` fallback:

```bash
uv run scripts/lighteval_vllm_uv.py \
  --model microsoft/phi-2 \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate \
  --trust-remote-code \
  --max-samples 20
```
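`--use-chat-template` is for instruct/chat models; base models are typically run without it. A sketch on the base variant (the model choice is illustrative):

```bash
# Base (non-instruct) model: omit --use-chat-template.
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-3B \
  --tasks "leaderboard|gsm8k|5" \
  --max-samples 20
```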
## Remote Execution Boundary
This skill intentionally stops at local execution and backend selection.
If the user wants to:
- run these scripts on Hugging Face Jobs
- pick remote hardware
- pass secrets to remote jobs
- schedule recurring runs
- inspect / cancel / monitor jobs
then switch to the `hugging-face-jobs` skill and pass it one of these scripts plus the chosen arguments.
## Task Selection
`inspect-ai` tasks include `mmlu`, `gsm8k`, `hellaswag`, `arc_challenge`, `truthfulqa`, `winogrande`, and `humaneval`.

`lighteval` tasks use the `suite|task|num_fewshot` format, e.g. `leaderboard|mmlu|5`, `leaderboard|gsm8k|5`, `leaderboard|arc_challenge|25`, `lighteval|hellaswag|0`.

Multiple tasks can be comma-separated in the `lighteval` `--tasks` argument.
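A sketch of a comma-separated multi-task smoke test mixing the two suites listed above (few-shot counts follow the examples shown; tune them for your benchmark):

```bash
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --tasks "leaderboard|mmlu|5,leaderboard|arc_challenge|25,lighteval|hellaswag|0" \
  --max-samples 10 \
  --use-chat-template
```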
## Backend Selection
- Prefer `inspect_vllm_uv.py --backend vllm` for fast GPU inference on supported architectures.
- Use `inspect_vllm_uv.py --backend hf` when `vllm` does not support the model.
- Prefer `lighteval_vllm_uv.py --backend vllm` for throughput on supported models.
- Use `lighteval_vllm_uv.py --backend accelerate` as the compatibility fallback.
- Use `inspect_eval_uv.py` when Inference Providers already cover the model and you do not need direct GPU control.
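The fallback strategy as a shell sketch: attempt the `vllm` backend and retry on `--backend hf` if the run fails (deciding on exit code alone is a simplification; read the error first in practice):

```bash
# Try vllm first; fall back to the Transformers backend on failure.
MODEL=microsoft/phi-2
if ! uv run scripts/inspect_vllm_uv.py --model "$MODEL" --task mmlu --limit 10; then
  echo "vllm backend failed; retrying with --backend hf" >&2
  uv run scripts/inspect_vllm_uv.py \
    --model "$MODEL" --task mmlu --backend hf --trust-remote-code --limit 10
fi
```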
## Hardware Guidance
| Model size | Suggested local hardware |
|---|---|
| Small | consumer GPU / Apple Silicon / small dev GPU |
| Mid-size | stronger local GPU |
| Large | high-memory local GPU or hand off to `hugging-face-jobs` |

For smoke tests, prefer cheaper local runs plus `--limit` or `--max-samples`.
## Troubleshooting
- CUDA or vLLM OOM:
  - reduce `--batch-size`
  - reduce `--gpu-memory-utilization`
  - switch to a smaller model for the smoke test
  - if necessary, hand off to `hugging-face-jobs`
- Model unsupported by `vllm`:
  - switch to `--backend hf` for `inspect-ai`
  - switch to `--backend accelerate` for `lighteval`
- Gated/private repo access fails:
  - verify `HF_TOKEN`
- Custom model code required:
  - add `--trust-remote-code`
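A sketch applying the OOM mitigations together on the vLLM path (flag names from the list above; the values are illustrative starting points, not tuned recommendations):

```bash
# Rerun the smoke test with a smaller batch and a lower vLLM memory fraction.
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task gsm8k \
  --batch-size 4 \
  --gpu-memory-utilization 0.7 \
  --limit 10
```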
## Examples
See:

- `examples/USAGE_EXAMPLES.md` for local command patterns
- `scripts/inspect_eval_uv.py`
- `scripts/inspect_vllm_uv.py`
- `scripts/lighteval_vllm_uv.py`