evaluating-llms-harness

lm-evaluation-harness - LLM Benchmarking

Quick start

lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.

Installation:

bash

pip install lm-eval

Evaluate any HuggingFace model:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag \
  --device cuda:0 \
  --batch_size 8

View available tasks:

bash

lm_eval --tasks list
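
When the same run is repeated for several models, it can help to assemble the command line in code. The following is a minimal Python sketch, not part of lm-eval itself: `build_lm_eval_command` is a hypothetical helper, and it only emits flags that appear in this guide. It builds the argument list without executing anything.

```python
import shlex


def build_lm_eval_command(pretrained, tasks, device="cuda:0", batch_size=8,
                          num_fewshot=None, output_path=None):
    """Return an lm_eval invocation as an argument list (nothing is executed)."""
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),
        "--device", device,
        "--batch_size", str(batch_size),
    ]
    # Optional flags are only added when the caller sets them
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


quick_start = build_lm_eval_command(
    "meta-llama/Llama-2-7b-hf", ["mmlu", "gsm8k", "hellaswag"]
)
print(shlex.join(quick_start))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting problems with model paths.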

Common workflows

Workflow 1: Standard benchmark evaluation

Evaluate a model on core benchmarks (MMLU, GSM8K, HumanEval).

Copy this checklist:

Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results

Step 1: Choose benchmark suite

Core reasoning benchmarks:

MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice
GSM8K - Grade school math word problems
HellaSwag - Common sense reasoning
TruthfulQA - Truthfulness and factuality
ARC (AI2 Reasoning Challenge) - Science questions

Code benchmarks:

HumanEval - Python code generation (164 problems)
MBPP (Mostly Basic Python Problems) - Python coding

Standard suite (recommended for model releases):

bash

--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge

Step 2: Configure model

HuggingFace model:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size auto  # Auto-detect optimal batch size

Quantized model (4-bit/8-bit):

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
  --tasks mmlu \
  --device cuda:0

Custom checkpoint:

bash

lm_eval --model hf \
  --model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
  --tasks mmlu \
  --device cuda:0
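
The `--model_args` value in the commands above is a comma-separated list of `key=value` pairs. As a rough illustration of that format, here is a simplified Python sketch with two hypothetical helpers, `parse_model_args` and `format_model_args`. It is not lm-eval's own parser and it assumes that values contain no commas.

```python
def parse_model_args(model_args):
    """Split 'key=value,key=value' into a dict of strings."""
    parsed = {}
    for pair in model_args.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        parsed[key.strip()] = value.strip()
    return parsed


def format_model_args(args):
    """Inverse of parse_model_args: join a dict back into one string."""
    return ",".join(f"{key}={value}" for key, value in args.items())


args = parse_model_args("pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16")
args["load_in_4bit"] = "True"  # add the quantization option shown above
print(format_model_args(args))
```

Keeping the options in a dict makes it easy to vary one setting (for example `dtype`) across runs while leaving the rest unchanged.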

Step 3: Run evaluation

bash

# Full MMLU evaluation (57 subjects)
# 5-shot is the standard setting; --log_samples saves individual predictions
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size 8 \
  --output_path results/ \
  --log_samples

# Multiple benchmarks at once
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
  --num_fewshot 5 \
  --batch_size 8 \
  --output_path results/llama2-7b-eval.json

**Step 4: Analyze results**

Results saved to `results/llama2-7b-eval.json`:

```json
{
  "results": {
    "mmlu": {
      "acc": 0.459,
      "acc_stderr": 0.004
    },
    "gsm8k": {
      "exact_match": 0.142,
      "exact_match_stderr": 0.006
    },
    "hellaswag": {
      "acc_norm": 0.765,
      "acc_norm_stderr": 0.004
    }
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Llama-2-7b-hf",
    "num_fewshot": 5
  }
}
```
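
Each metric in the results comes with a standard error, which is what lets you judge whether two scores really differ. The sketch below is one way to turn the example above into readable lines with an approximate 95% interval (1.96 × stderr). It assumes the simplified key layout shown in the example; real output files may name metric keys differently depending on the lm-eval version, so treat `summarize` as a hypothetical helper to adapt.

```python
EXAMPLE_RESULTS = {
    "mmlu": {"acc": 0.459, "acc_stderr": 0.004},
    "gsm8k": {"exact_match": 0.142, "exact_match_stderr": 0.006},
    "hellaswag": {"acc_norm": 0.765, "acc_norm_stderr": 0.004},
}


def summarize(results):
    """Return one 'task metric: value +/- half-width' line per metric."""
    lines = []
    for task, metrics in results.items():
        for name, value in metrics.items():
            if name.endswith("_stderr"):
                continue  # stderr entries are attached to their metric below
            stderr = metrics.get(f"{name}_stderr")
            if stderr is None:
                lines.append(f"{task} {name}: {value:.3f}")
            else:
                half_width = 1.96 * stderr  # normal-approximation 95% interval
                lines.append(f"{task} {name}: {value:.3f} +/- {half_width:.3f}")
    return lines


for line in summarize(EXAMPLE_RESULTS):
    print(line)
```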

Workflow 2: Track training progress

Evaluate checkpoints during training.

Training Progress Tracking:
- [ ] Step 1: Set up periodic evaluation
- [ ] Step 2: Choose quick benchmarks
- [ ] Step 3: Automate evaluation
- [ ] Step 4: Plot learning curves

Step 1: Set up periodic evaluation

Evaluate every N training steps:

bash

#!/bin/bash

# eval_checkpoint.sh
# Usage: ./eval_checkpoint.sh <checkpoint_dir> <step>

CHECKPOINT_DIR=$1
STEP=$2

# 0-shot for speed
lm_eval --model hf \
  --model_args "pretrained=$CHECKPOINT_DIR/checkpoint-$STEP" \
  --tasks gsm8k,hellaswag \
  --num_fewshot 0 \
  --batch_size 16 \
  --output_path "results/step-$STEP.json"


**Step 2: Choose quick benchmarks**

Fast benchmarks for frequent evaluation:
- **HellaSwag**: ~10 minutes on 1 GPU
- **GSM8K**: ~5 minutes
- **PIQA**: ~2 minutes

Avoid for frequent eval (too slow):
- **MMLU**: ~2 hours (57 subjects)
- **HumanEval**: Requires code execution
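
To see why the choice of benchmarks matters, it helps to add up the evaluation time over a whole training run. The following Python sketch uses the approximate per-run timings listed above; `eval_overhead_minutes` is a hypothetical helper, and the step counts in the usage lines are made-up illustration values.

```python
# Approximate minutes per evaluation run on one GPU, taken from the lists above
BENCHMARK_MINUTES = {"hellaswag": 10, "gsm8k": 5, "piqa": 2, "mmlu": 120}


def eval_overhead_minutes(total_steps, eval_interval, tasks):
    """Total minutes spent evaluating if `tasks` run every `eval_interval` steps."""
    num_evals = total_steps // eval_interval
    minutes_per_eval = sum(BENCHMARK_MINUTES[task] for task in tasks)
    return num_evals * minutes_per_eval


# Illustrative run: 10,000 training steps, evaluated every 1,000 steps
quick = eval_overhead_minutes(10_000, 1_000, ["gsm8k", "hellaswag"])
slow = eval_overhead_minutes(10_000, 1_000, ["mmlu"])
print(f"quick suite: {quick} min, MMLU: {slow} min")
```

With these numbers the quick suite costs 150 minutes over the run, while MMLU alone would cost 1,200 minutes.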

**Step 3: Automate evaluation**

Integrate with training script:

```python
import os

# In the training loop (model, step, and eval_interval come from the training script)
if step % eval_interval == 0:
    # eval_checkpoint.sh expects <checkpoint_dir>/checkpoint-<step>
    model.save_pretrained(f"checkpoints/checkpoint-{step}")

    # Run evaluation
    os.system(f"./eval_checkpoint.sh checkpoints {step}")
```


Or use PyTorch Lightning callbacks:

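```python
import os

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        step = trainer.global_step
        checkpoint_path = f"checkpoints/step-{step}"

        # Save checkpoint in Hugging Face format so `--model hf` can load it.
        # Assumes the LightningModule keeps its Hugging Face model in `pl_module.model`;
        # trainer.save_checkpoint() would write a Lightning .ckpt file instead.
        pl_module.model.save_pretrained(checkpoint_path)

        # Run lm-eval
        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
```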
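
If evaluation runs separately from training (for example as a cron job), it is useful to know which checkpoints still lack results. Below is a small Python sketch of that bookkeeping. It assumes the directory layout used by `eval_checkpoint.sh` in Step 1 (`<checkpoint_dir>/checkpoint-<step>` and `results/step-<step>.json`); `pending_steps` is a hypothetical helper, not part of lm-eval.

```python
import re
from pathlib import Path


def pending_steps(checkpoint_dir, results_dir):
    """Return sorted steps that have a checkpoint directory but no result file."""
    saved = set()
    for path in Path(checkpoint_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if match and path.is_dir():
            saved.add(int(match.group(1)))

    evaluated = set()
    for path in Path(results_dir).glob("step-*.json"):
        match = re.fullmatch(r"step-(\d+)\.json", path.name)
        if match:
            evaluated.add(int(match.group(1)))

    # Integer sort, so step 1000 comes after step 200
    return sorted(saved - evaluated)
```

Each returned step can then be passed to `./eval_checkpoint.sh <checkpoint_dir> <step>`, so re-running the job never evaluates the same checkpoint twice.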

Step 4: Plot learning curves

python

import glob
import json
import re

import matplotlib.pyplot as plt

# Task and metric to plot; the task must be one passed to --tasks in eval_checkpoint.sh
TASK, METRIC = "hellaswag", "acc"

def step_of(path):
    # "results/step-1000.json" -> 1000
    return int(re.search(r"step-(\d+)\.json$", path).group(1))

# Load all results, ordered by training step rather than by file name
steps = []
scores = []
for file in sorted(glob.glob("results/step-*.json"), key=step_of):
    with open(file) as f:
        data = json.load(f)
    steps.append(step_of(file))
    scores.append(data["results"][TASK][METRIC])

# Plot
plt.plot(steps, scores)
plt.xlabel("Training Step")
plt.ylabel(f"{TASK} {METRIC}")
plt.title("Training Progress")
plt.savefig("training_curve.png")

Workflow 3: Compare multiple models

Run the same benchmark suite on several models and compare the results.

Model Comparison:
- [ ] Step 1: Define model list
- [ ] Step 2: Run evaluations
- [ ] Step 3: Generate comparison table

Step 1: Define model list

bash

# models.txt: one model ID per line
cat > models.txt <<'EOF'
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2
EOF


**Step 2: Run evaluations**

```bash
#!/bin/bash
# eval_all_models.sh

TASKS="mmlu,gsm8k,hellaswag,truthfulqa"

while IFS= read -r model; do
  echo "Evaluating $model"

  # Extract model name for output file (slashes become dashes)
  model_name=$(echo "$model" | sed 's/\//-/g')

  lm_eval --model hf \
    --model_args "pretrained=$model,dtype=bfloat16" \
    --tasks "$TASKS" \
    --num_fewshot 5 \
    --batch_size auto \
    --output_path "results/$model_name.json"
done < models.txt
```


**Step 3: Generate comparison table**

```python
import json

import pandas as pd

# Hugging Face model IDs, as listed in models.txt
models = [
    "meta-llama/Llama-2-7b-hf",
    "meta-llama/Llama-2-13b-hf",
    "mistralai/Mistral-7B-v0.1",
    "microsoft/phi-2"
]

tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]

results = []
for model in models:
    # eval_all_models.sh replaces "/" with "-" in the output file name
    file_name = model.replace("/", "-")
    with open(f"results/{file_name}.json") as f:
        data = json.load(f)
    row = {"Model": model}
    for task in tasks:
        # Get primary metric for each task
        metrics = data["results"][task]
        if "acc" in metrics:
            row[task.upper()] = f"{metrics['acc']:.3f}"
        elif "exact_match" in metrics:
            row[task.upper()] = f"{metrics['exact_match']:.3f}"
    results.append(row)

df = pd.DataFrame(results)
print(df.to_markdown(index=False))  # to_markdown requires the tabulate package
```

Output:

| Model                  | MMLU  | GSM8K | HELLASWAG | TRUTHFULQA |
|------------------------|-------|-------|-----------|------------|
| meta-llama/Llama-2-7b  | 0.459 | 0.142 | 0.765     | 0.391      |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801     | 0.430      |
| mistralai/Mistral-7B   | 0.626 | 0.395 | 0.812     | 0.428      |
| microsoft/phi-2        | 0.560 | 0.613 | 0.682     | 0.447      |
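
A table like this is often condensed into a single ranking. The Python sketch below does so with an unweighted mean over the four columns, using the numbers from the table above. The unweighted mean is only one possible aggregation and is an assumption here: the tasks use different metrics and have different difficulty, so per-task scores remain the more reliable comparison.

```python
# Scores copied from the comparison table above
SCORES = {
    "meta-llama/Llama-2-7b": {"MMLU": 0.459, "GSM8K": 0.142, "HELLASWAG": 0.765, "TRUTHFULQA": 0.391},
    "meta-llama/Llama-2-13b": {"MMLU": 0.549, "GSM8K": 0.287, "HELLASWAG": 0.801, "TRUTHFULQA": 0.430},
    "mistralai/Mistral-7B": {"MMLU": 0.626, "GSM8K": 0.395, "HELLASWAG": 0.812, "TRUTHFULQA": 0.428},
    "microsoft/phi-2": {"MMLU": 0.560, "GSM8K": 0.613, "HELLASWAG": 0.682, "TRUTHFULQA": 0.447},
}


def mean_score(task_scores):
    """Unweighted mean over all tasks of one model."""
    return sum(task_scores.values()) / len(task_scores)


# Highest mean first
ranking = sorted(SCORES, key=lambda model: mean_score(SCORES[model]), reverse=True)
for model in ranking:
    print(f"{model}: {mean_score(SCORES[model]):.3f}")
```

Note how phi-2 ranks first on the mean largely because of its GSM8K score, even though it has the lowest HellaSwag score of the four. That is the kind of detail a single aggregate number hides.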

Workflow 4: Evaluate with vLLM (faster inference)

Use vLLM backend for 5-10x faster evaluation.

vLLM Evaluation:
- [ ] Step 1: Install vLLM
- [ ] Step 2: Configure vLLM backend
- [ ] Step 3: Run evaluation

Step 1: Install vLLM

bash

pip install vllm

Step 2: Configure vLLM backend

bash

lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
  --tasks mmlu \
  --batch_size auto

Step 3: Run evaluation

vLLM is 5-10× faster than standard HuggingFace:

bash

# Standard HF: ~2 hours for MMLU on 7B model
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --batch_size 8

# vLLM: ~15-20 minutes for MMLU on 7B model
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
  --tasks mmlu \
  --batch_size auto
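
As a quick sanity check, the timings quoted in this workflow can be converted into a speedup factor. This short Python sketch uses only the approximate figures given here (about 2 hours with the HuggingFace backend, 15-20 minutes with vLLM); actual numbers will depend on hardware and settings.

```python
def speedup(baseline_minutes, faster_minutes):
    """How many times faster the second run is than the baseline."""
    return baseline_minutes / faster_minutes


HF_MMLU_MINUTES = 120          # ~2 hours with the standard HuggingFace backend
VLLM_MMLU_MINUTES = (15, 20)   # ~15-20 minutes with vLLM

best_case = speedup(HF_MMLU_MINUTES, min(VLLM_MMLU_MINUTES))
worst_case = speedup(HF_MMLU_MINUTES, max(VLLM_MMLU_MINUTES))
print(f"vLLM speedup: {worst_case:.0f}x to {best_case:.0f}x")
```

The result, roughly 6x to 8x, sits inside the 5-10× range stated for this workflow.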

When to use vs alternatives

Use lm-evaluation-harness when:

Benchmarking models for academic papers
Comparing model quality across standard tasks
Tracking training progress
Reporting standardized metrics (everyone uses the same prompts)
Running evaluations that need to be reproducible

Use alternatives instead:

HELM (Stanford): Broader evaluation (fairness, efficiency, calibration)
AlpacaEval: Instruction-following evaluation with LLM judges
MT-Bench: Conversational multi-turn evaluation
Custom scripts: Domain-specific evaluation

Common issues

Issue: Evaluation too slow

Use vLLM backend:

bash

lm_eval --model vllm \
  --model_args pretrained=model-name,tensor_parallel_size=2

Or reduce fewshot examples:

bash

--num_fewshot 0  # Instead of 5

Or evaluate subset of MMLU:

bash

--tasks mmlu_stem  # Only STEM subjects

Issue: Out of memory

Reduce batch size:

bash

--batch_size 1  # Or --batch_size auto

Use quantization:

bash

--model_args pretrained=model-name,load_in_8bit=True

Enable CPU offloading:

bash

--model_args pretrained=model-name,device_map=auto,offload_folder=offload

Issue: Different results than reported

Check fewshot count:

bash

--num_fewshot 5  # Most papers use 5-shot

Check exact task name:

bash

--tasks mmlu  # Not mmlu_direct or mmlu_fewshot

Verify model and tokenizer match:

bash

--model_args pretrained=model-name,tokenizer=same-model-name
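
When two runs disagree, comparing their saved settings is usually the fastest way to find the cause. The Python sketch below diffs two `config` blocks of the simplified shape shown in the Workflow 1 results example. `config_diff` is a hypothetical helper, and the second config is a made-up illustration of a run that used 0-shot by mistake.

```python
def config_diff(config_a, config_b):
    """Return {key: (value_a, value_b)} for every setting that differs."""
    keys = sorted(set(config_a) | set(config_b))
    return {
        key: (config_a.get(key), config_b.get(key))
        for key in keys
        if config_a.get(key) != config_b.get(key)
    }


reported = {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Llama-2-7b-hf",
    "num_fewshot": 5,
}
mine = dict(reported, num_fewshot=0)  # illustrative mismatch

print(config_diff(reported, mine))
```

An empty result means the recorded settings match, so the difference must come from something not stored in the config, such as library versions or hardware.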

Issue: HumanEval not executing code

Install execution dependencies:

bash

pip install human-eval

Enable code execution:

bash

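# Flag and variable names as used by recent lm-eval releases; confirm with `lm_eval --help`
HF_ALLOW_CODE_EVAL=1 lm_eval --model hf \
  --model_args pretrained=model-name \
  --tasks humaneval \
  --confirm_run_unsafe_code  # Required for tasks that execute generated code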

Advanced topics

Benchmark descriptions: See references/benchmark-guide.md for detailed description of all 60+ tasks, what they measure, and interpretation.

Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.

API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models.

Multi-GPU strategies: See references/distributed-eval.md for data parallel and tensor parallel evaluation.

Hardware requirements

GPU: NVIDIA (CUDA 11.8+); CPU also works but is very slow
VRAM:
- 7B model: 16GB (bf16) or 8GB (8-bit)
- 13B model: 28GB (bf16) or 14GB (8-bit)
- 70B model: Requires multi-GPU or quantization
Time (7B model, single A100):
- HellaSwag: 10 minutes
- GSM8K: 5 minutes
- MMLU (full): 2 hours
- HumanEval: 20 minutes
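
The VRAM figures above follow from a simple rule of thumb: weight memory is roughly the parameter count times the bytes per parameter. The Python sketch below works through that arithmetic. It counts weights only, so the gap between its output and the figures listed above is presumably headroom for activations and cache; that reading is an assumption, not something this guide states.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"bf16": 2, "8-bit": 1}


def weight_memory_gb(params_billion, precision):
    """Approximate GB needed for the weights alone (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


for size in (7, 13, 70):
    bf16 = weight_memory_gb(size, "bf16")
    int8 = weight_memory_gb(size, "8-bit")
    print(f"{size}B model: {bf16} GB (bf16), {int8} GB (8-bit)")
```

For a 70B model the weights alone come to 140 GB in bf16, which is why that size needs multiple GPUs or quantization.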

Resources

GitHub: https://github.com/EleutherAI/lm-evaluation-harness
Docs: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs
Task library: 60+ tasks including MMLU, GSM8K, HumanEval, TruthfulQA, HellaSwag, ARC, WinoGrande, etc.
Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this harness)
