evaluating-llms-harness
lm-evaluation-harness - LLM Benchmarking
Quick start
lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.
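"Standardized prompts" means every model is scored against identical few-shot templates, so results are comparable across runs. Schematically, few-shot prompt assembly looks like the following sketch (a simplified illustration, not the harness's actual template code):

```python
# Simplified sketch of few-shot prompt assembly (NOT the harness's actual templates)
def build_prompt(exemplars, question, num_fewshot):
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars[:num_fewshot])
    prefix = f"{shots}\n\n" if shots else ""
    return f"{prefix}Q: {question}\nA:"

exemplars = [("What is 2+2?", "4"), ("What is 3+3?", "6")]
prompt = build_prompt(exemplars, "What is 5+5?", num_fewshot=2)
print(prompt.count("Q:"))  # → 3 (two exemplars plus the target question)
```

Because every model sees the same prompt string, score differences reflect the model rather than prompt engineering.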
Installation:
```bash
pip install lm-eval
```

Evaluate any HuggingFace model:

```bash
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag \
--device cuda:0 \
--batch_size 8
```

View available tasks:

```bash
lm_eval --tasks list
```
Common workflows
Workflow 1: Standard benchmark evaluation
Evaluate model on core benchmarks (MMLU, GSM8K, HumanEval).
Copy this checklist:
Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results

**Step 1: Choose benchmark suite**
Core reasoning benchmarks:
- MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice
- GSM8K - Grade school math word problems
- HellaSwag - Common sense reasoning
- TruthfulQA - Truthfulness and factuality
- ARC (AI2 Reasoning Challenge) - Science questions
Code benchmarks:
- HumanEval - Python code generation (164 problems)
- MBPP (Mostly Basic Python Problems) - Python coding
Standard suite (recommended for model releases):
```bash
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge
```

**Step 2: Configure model**
HuggingFace model:
```bash
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
--tasks mmlu \
--device cuda:0 \
--batch_size auto  # Auto-detect optimal batch size
```

Quantized model (4-bit/8-bit):
```bash
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
--tasks mmlu \
--device cuda:0
```

Custom checkpoint:
```bash
lm_eval --model hf \
--model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
--tasks mmlu \
--device cuda:0
```

**Step 3: Run evaluation**
```bash
# Full MMLU evaluation (57 subjects), 5-shot (standard)
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/ \
--log_samples  # Save individual predictions
```
Multiple benchmarks at once:

```bash
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
--num_fewshot 5 \
--batch_size 8 \
--output_path results/llama2-7b-eval.json
```
**Step 4: Analyze results**
Results saved to `results/llama2-7b-eval.json`:
```json
{
"results": {
"mmlu": {
"acc": 0.459,
"acc_stderr": 0.004
},
"gsm8k": {
"exact_match": 0.142,
"exact_match_stderr": 0.006
},
"hellaswag": {
"acc_norm": 0.765,
"acc_norm_stderr": 0.004
}
},
"config": {
"model": "hf",
"model_args": "pretrained=meta-llama/Llama-2-7b-hf",
"num_fewshot": 5
}
}
```
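When comparing two runs, the `*_stderr` fields indicate whether a score gap is meaningful. A rough two-sample z-test sketch using only the stdlib (the scores below are illustrative, in the harness's output shape):

```python
# Rough check: is the gap between two runs larger than the combined stderr?
def significantly_different(a, b, a_err, b_err, z=1.96):
    """Two-sample z-test sketch using the reported *_stderr values."""
    return abs(a - b) > z * (a_err**2 + b_err**2) ** 0.5

# Illustrative numbers in the harness's output shape
base = {"results": {"mmlu": {"acc": 0.459, "acc_stderr": 0.004}}}
tuned = {"results": {"mmlu": {"acc": 0.475, "acc_stderr": 0.004}}}

a, ae = base["results"]["mmlu"]["acc"], base["results"]["mmlu"]["acc_stderr"]
b, be = tuned["results"]["mmlu"]["acc"], tuned["results"]["mmlu"]["acc_stderr"]
print(significantly_different(a, b, ae, be))  # → True (gap 0.016 > ~0.011)
```

Gaps smaller than roughly twice the combined stderr are usually noise, not improvement.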
Workflow 2: Track training progress
Evaluate checkpoints during training.
Training Progress Tracking:
- [ ] Step 1: Set up periodic evaluation
- [ ] Step 2: Choose quick benchmarks
- [ ] Step 3: Automate evaluation
- [ ] Step 4: Plot learning curves

**Step 1: Set up periodic evaluation**
Evaluate every N training steps:
```bash
#!/bin/bash
# eval_checkpoint.sh
CHECKPOINT_DIR=$1
STEP=$2

# 0-shot for speed
lm_eval --model hf \
--model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP \
--tasks gsm8k,hellaswag \
--num_fewshot 0 \
--batch_size 16 \
--output_path results/step-$STEP.json
```
**Step 2: Choose quick benchmarks**
Fast benchmarks for frequent evaluation:
- **HellaSwag**: ~10 minutes on 1 GPU
- **GSM8K**: ~5 minutes
- **PIQA**: ~2 minutes
Avoid for frequent eval (too slow):
- **MMLU**: ~2 hours (57 subjects)
- **HumanEval**: Requires code execution
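Picking an eval cadence is a budget question: the back-of-envelope below estimates the fraction of wall-clock time spent evaluating (the benchmark timings are the rough figures listed above; `minutes_per_step` is an assumed training speed):

```python
# Back-of-envelope: fraction of wall-clock spent on periodic evaluation
def eval_overhead(eval_minutes, interval_steps, minutes_per_step):
    train_minutes = interval_steps * minutes_per_step
    return eval_minutes / (train_minutes + eval_minutes)

# HellaSwag (~10 min) + GSM8K (~5 min) every 1000 steps at ~0.06 min/step
print(f"{eval_overhead(15, 1000, 0.06):.1%}")  # → 20.0%
```

If the overhead exceeds a few percent, either evaluate less often or swap in faster benchmarks.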
**Step 3: Automate evaluation**
Integrate with training script:
```python
import os

# In training loop
if step % eval_interval == 0:
    # Save to checkpoints/checkpoint-<step>, matching eval_checkpoint.sh's
    # $CHECKPOINT_DIR/checkpoint-$STEP path
    model.save_pretrained(f"checkpoints/checkpoint-{step}")
    # Run evaluation (eval_checkpoint.sh takes <checkpoint_dir> <step>)
    os.system(f"./eval_checkpoint.sh checkpoints {step}")
```
Or use PyTorch Lightning callbacks:
```python
import os

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        step = trainer.global_step
        checkpoint_path = f"checkpoints/step-{step}"
        # Save checkpoint
        trainer.save_checkpoint(checkpoint_path)
        # Run lm-eval
        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
```

**Step 4: Plot learning curves**
```python
import glob
import json

import matplotlib.pyplot as plt

# Load all results
steps = []
mmlu_scores = []
for file in sorted(glob.glob("results/step-*.json")):
    with open(file) as f:
        data = json.load(f)
    step = int(file.split("-")[1].split(".")[0])
    steps.append(step)
    mmlu_scores.append(data["results"]["mmlu"]["acc"])

# Plot
plt.plot(steps, mmlu_scores)
plt.xlabel("Training Step")
plt.ylabel("MMLU Accuracy")
plt.title("Training Progress")
plt.savefig("training_curve.png")
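Note that a plain `sorted()` over filenames orders `step-1000` before `step-200` lexicographically; a numeric sort key avoids out-of-order curves. A small stdlib sketch:

```python
import re

# step-1000 sorts before step-200 lexicographically; sort by the number instead
def step_of(path):
    return int(re.search(r"step-(\d+)", path).group(1))

files = ["results/step-1000.json", "results/step-200.json", "results/step-50.json"]
print(sorted(files, key=step_of))
# → ['results/step-50.json', 'results/step-200.json', 'results/step-1000.json']
```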
Workflow 3: Compare multiple models
Benchmark suite for model comparison.
Model Comparison:
- [ ] Step 1: Define model list
- [ ] Step 2: Run evaluations
- [ ] Step 3: Generate comparison table

**Step 1: Define model list**
```bash
# models.txt
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2
```

**Step 2: Run evaluations**

```bash
#!/bin/bash
# eval_all_models.sh
TASKS="mmlu,gsm8k,hellaswag,truthfulqa"
while read model; do
echo "Evaluating $model"
# Extract model name for output file
model_name=$(echo $model | sed 's/\//-/g')
lm_eval --model hf \
--model_args pretrained=$model,dtype=bfloat16 \
--tasks $TASKS \
--num_fewshot 5 \
--batch_size auto \
--output_path results/$model_name.json
done < models.txt
```
**Step 3: Generate comparison table**
```python
import json

import pandas as pd

models = [
    "meta-llama/Llama-2-7b-hf",
    "meta-llama/Llama-2-13b-hf",
    "mistralai/Mistral-7B-v0.1",
    "microsoft/phi-2",
]
tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]
results = []
for model in models:
    # Result files were written with "/" replaced by "-" (see eval_all_models.sh)
    filename = model.replace("/", "-")
    with open(f"results/{filename}.json") as f:
        data = json.load(f)
    row = {"Model": model}
    for task in tasks:
        # Get primary metric for each task
        metrics = data["results"][task]
        if "acc" in metrics:
            row[task.upper()] = f"{metrics['acc']:.3f}"
        elif "exact_match" in metrics:
            row[task.upper()] = f"{metrics['exact_match']:.3f}"
    results.append(row)

df = pd.DataFrame(results)
print(df.to_markdown(index=False))
```

Output:
| Model | MMLU | GSM8K | HELLASWAG | TRUTHFULQA |
|------------------------|-------|-------|-----------|------------|
| meta-llama/Llama-2-7b | 0.459 | 0.142 | 0.765 | 0.391 |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801 | 0.430 |
| mistralai/Mistral-7B | 0.626 | 0.395 | 0.812 | 0.428 |
| microsoft/phi-2        | 0.560 | 0.613 | 0.682     | 0.447      |
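One rough way to collapse the table above into a single ordering is an unweighted mean across tasks. Treat this as a heuristic only, since the tasks use different metrics and difficulty scales; the sketch below uses the scores from the table:

```python
# Unweighted mean across tasks (heuristic: metrics differ between tasks)
scores = {
    "meta-llama/Llama-2-7b":  [0.459, 0.142, 0.765, 0.391],
    "meta-llama/Llama-2-13b": [0.549, 0.287, 0.801, 0.430],
    "mistralai/Mistral-7B":   [0.626, 0.395, 0.812, 0.428],
    "microsoft/phi-2":        [0.560, 0.613, 0.682, 0.447],
}
mean = {m: sum(v) / len(v) for m, v in scores.items()}
ranking = sorted(mean, key=mean.get, reverse=True)
print(ranking[0])  # → microsoft/phi-2 (its GSM8K score lifts the mean)
```

Note how the averaging choice changes the story: Mistral-7B wins three of four tasks, yet phi-2 tops the unweighted mean.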
Workflow 4: Evaluate with vLLM (faster inference)
Use vLLM backend for 5-10x faster evaluation.
vLLM Evaluation:
- [ ] Step 1: Install vLLM
- [ ] Step 2: Configure vLLM backend
- [ ] Step 3: Run evaluation

**Step 1: Install vLLM**
```bash
pip install vllm
```

**Step 2: Configure vLLM backend**
```bash
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
--tasks mmlu \
--batch_size auto
```

**Step 3: Run evaluation**
vLLM is 5-10× faster than standard HuggingFace:
```bash
# Standard HF: ~2 hours for MMLU on a 7B model
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--batch_size 8

# vLLM: ~15-20 minutes for MMLU on a 7B model
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
--tasks mmlu \
--batch_size auto
```
When to use vs alternatives
Use lm-evaluation-harness when:
- Benchmarking models for academic papers
- Comparing model quality across standard tasks
- Tracking training progress
- Reporting standardized metrics (everyone uses same prompts)
- Need reproducible evaluation
Use alternatives instead:
- HELM (Stanford): Broader evaluation (fairness, efficiency, calibration)
- AlpacaEval: Instruction-following evaluation with LLM judges
- MT-Bench: Conversational multi-turn evaluation
- Custom scripts: Domain-specific evaluation
Common issues
Issue: Evaluation too slow
Use vLLM backend:
```bash
lm_eval --model vllm \
--model_args pretrained=model-name,tensor_parallel_size=2
```

Or reduce fewshot examples:
```bash
--num_fewshot 0  # Instead of 5
```

Or evaluate a subset of MMLU:
```bash
--tasks mmlu_stem  # Only STEM subjects
```

Issue: Out of memory
Reduce batch size:
```bash
--batch_size 1  # Or --batch_size auto
```

Use quantization:
```bash
--model_args pretrained=model-name,load_in_8bit=True
```

Enable CPU offloading:
```bash
--model_args pretrained=model-name,device_map=auto,offload_folder=offload
```

Issue: Different results than reported
Check fewshot count:
```bash
--num_fewshot 5  # Most papers use 5-shot
```

Check exact task name:
```bash
--tasks mmlu  # Not mmlu_direct or mmlu_fewshot
```

Verify model and tokenizer match:
```bash
--model_args pretrained=model-name,tokenizer=same-model-name
```

Issue: HumanEval not executing code
Install execution dependencies:
```bash
pip install human-eval
```

Enable code execution:
```bash
lm_eval --model hf \
--model_args pretrained=model-name \
--tasks humaneval \
--allow_code_execution  # Required for HumanEval
```

Advanced topics
Benchmark descriptions: See references/benchmark-guide.md for detailed description of all 60+ tasks, what they measure, and interpretation.
Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.
API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models.
Multi-GPU strategies: See references/distributed-eval.md for data parallel and tensor parallel evaluation.
Hardware requirements
- GPU: NVIDIA (CUDA 11.8+), works on CPU (very slow)
- VRAM:
- 7B model: 16GB (bf16) or 8GB (8-bit)
- 13B model: 28GB (bf16) or 14GB (8-bit)
- 70B model: Requires multi-GPU or quantization
- Time (7B model, single A100):
- HellaSwag: 10 minutes
- GSM8K: 5 minutes
- MMLU (full): 2 hours
- HumanEval: 20 minutes
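The VRAM figures above follow roughly from parameter count × bytes per parameter, plus headroom for activations and the KV cache. A rough estimator (the 15% overhead factor is an assumption, not a measured constant):

```python
# Rough VRAM estimate: weights x bytes/param, plus ~15% overhead (assumed)
def vram_gb(params_billion, bytes_per_param, overhead=1.15):
    return params_billion * bytes_per_param * overhead

print(round(vram_gb(7, 2)))  # bf16 7B  → ~16 GB
print(round(vram_gb(7, 1)))  # 8-bit 7B → ~8 GB
```

Actual usage varies with sequence length and batch size, so leave a few GB of margin beyond the estimate.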
Resources
- GitHub: https://github.com/EleutherAI/lm-evaluation-harness
- Docs: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs
- Task library: 60+ tasks including MMLU, GSM8K, HumanEval, TruthfulQA, HellaSwag, ARC, WinoGrande, etc.
- Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this harness)