**BigCode Evaluation Harness - Code Model Benchmarking**

**Quick Start**

BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).
Installation:

```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
```

Evaluate on HumanEval:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations
```

View available tasks:

```bash
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
```

**Common Workflows**

**Workflow 1: Standard Code Benchmark Evaluation**

Evaluate a model on the core code benchmarks (HumanEval, MBPP, HumanEval+).

Checklist (Code Benchmark Evaluation):
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results

**Step 1: Choose benchmark suite**

Python code generation (most common):
  • HumanEval: 164 handwritten problems, function completion
  • HumanEval+: same 164 problems with 80× more tests (stricter)
  • MBPP: 500 crowd-sourced problems, entry-level difficulty
  • MBPP+: 399 curated problems with 35× more tests

Multi-language (18 languages):
  • MultiPL-E: HumanEval/MBPP translated to C++, Java, JavaScript, Go, Rust, etc.

Advanced:
  • APPS: 10,000 problems (introductory/interview/competition)
  • DS-1000: 1,000 data science problems across 7 libraries

**Step 2: Configure model and generation**

```bash
# Standard HuggingFace model
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution
```

```bash
# Quantized model (4-bit)
accelerate launch main.py \
  --model codellama/CodeLlama-34b-hf \
  --tasks humaneval \
  --load_in_4bit \
  --max_length_generation 512 \
  --allow_code_execution
```

```bash
# Custom/private model
accelerate launch main.py \
  --model /path/to/my-code-model \
  --tasks humaneval \
  --trust_remote_code \
  --use_auth_token \
  --allow_code_execution
```

**Step 3: Run evaluation**

```bash
# Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.8 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path results/starcoder2-humaneval.json
```

**Step 4: Analyze results**

Results in `results/starcoder2-humaneval.json`:
```json
{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}
```
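
These pass@k values follow the standard unbiased estimator from the Codex paper (Chen et al., 2021): sample n generations per problem, count the c that pass, and estimate the probability that at least one of k draws is correct. A minimal standalone sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): probability that at least
    one of k samples drawn from n generations passes, given c passing."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With n=200 samples and c=80 passing, pass@1 reduces to c/n = 0.4
print(pass_at_k(200, 80, 1))
```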

**Workflow 2: Multi-Language Evaluation (MultiPL-E)**

Evaluate code generation across 18 programming languages.

Checklist (Multi-Language Evaluation):
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages

**Step 1: Generate solutions on host**

```bash
# Generate without execution (safe)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --max_length_generation 650 \
  --temperature 0.8 \
  --n_samples 50 \
  --batch_size 50 \
  --generation_only \
  --save_generations \
  --save_generations_path generations_multi.json
```

**Step 2: Evaluate in Docker container**

```bash
# Pull the MultiPL-E Docker image
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

# Run evaluation inside the container
docker run -v $(pwd)/generations_multi.json:/app/generations.json:ro \
  -it evaluation-harness-multiple python3 main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --load_generations_path /app/generations.json \
  --allow_code_execution \
  --n_samples 50
```

**Supported languages**: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
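
**Step 3: Compare across languages**

A minimal sketch for the final checklist step, assuming the metrics for all tasks were written to one JSON file keyed by task name (as in the Workflow 1 Step 4 example; the path below is hypothetical):

```python
import json

# Hypothetical metrics file produced via --metric_output_path
with open("results/starcoder2-multiple.json") as f:
    data = json.load(f)

# Print pass@1 per MultiPL-E language task, skipping the "config" entry
for task, metrics in sorted(data.items()):
    if task.startswith("multiple-"):
        print(f"{task:<15} pass@1 = {metrics['pass@1']:.3f}")
```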

**Workflow 3: Instruction-Tuned Model Evaluation**

Evaluate chat/instruction-tuned models with the prompt formatting they expect.

Checklist (Instruction Model Evaluation):
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation

**Step 1: Choose instruction tasks**
  • instruct-humaneval: HumanEval with instruction prompts
  • humanevalsynthesize-{lang}: HumanEvalPack synthesis tasks

**Step 2: Configure instruction tokens**

```bash
# For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks instruct-humaneval \
  --instruction_tokens "<s>[INST],</s>,[/INST]" \
  --max_length_generation 512 \
  --allow_code_execution
```
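
The value passed to `--instruction_tokens` is a comma-separated triple (user-turn start token, end token, assistant-turn start token); the value above maps onto CodeLlama's `<s>[INST] ... [/INST]` chat template.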

**Step 3: HumanEvalPack for instruction models**

```bash
# Test code synthesis (HumanEvalPack covers 6 languages)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks humanevalsynthesize-python,humanevalsynthesize-js \
  --prompt instruct \
  --max_length_generation 512 \
  --allow_code_execution
```

**Workflow 4: Compare Multiple Models**

Run a fixed benchmark suite across several models for side-by-side comparison.

**Step 1: Create evaluation script**

```bash
#!/bin/bash
# eval_models.sh

MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

mkdir -p results  # ensure the output directory exists

for model in "${MODELS[@]}"; do
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"
  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done
```

**Step 2: Generate comparison table**

```python
import json
import pandas as pd

models = ["bigcode-starcoder2-7b", "codellama-CodeLlama-7b-hf", "deepseek-ai-deepseek-coder-6.7b-base"]
results = []

for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
        results.append({
            "Model": model,
            "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
            "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}"
        })

df = pd.DataFrame(results)
print(df.to_markdown(index=False))
```
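
Running the shell script and then this snippet prints a markdown comparison table; note that pandas' `DataFrame.to_markdown` requires the `tabulate` package (`pip install tabulate`).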

**When to Use vs Alternatives**

Use BigCode Evaluation Harness when:
  • Evaluating code generation models specifically
  • Need multi-language evaluation (18 languages via MultiPL-E)
  • Testing functional correctness with unit tests (pass@k)
  • Benchmarking for BigCode/HuggingFace leaderboards
  • Evaluating fill-in-the-middle (FIM) capabilities
Use alternatives instead:
  • lm-evaluation-harness: General LLM benchmarks (MMLU, GSM8K, HellaSwag)
  • EvalPlus: Stricter HumanEval+/MBPP+ with more test cases
  • SWE-bench: Real-world GitHub issue resolution
  • LiveCodeBench: Contamination-free, continuously updated problems
  • CodeXGLUE: Code understanding tasks (clone detection, defect prediction)

**Supported Benchmarks**

| Benchmark | Problems | Languages | Metric | Use Case |
|---|---|---|---|---|
| HumanEval | 164 | Python | pass@k | Standard code completion |
| HumanEval+ | 164 | Python | pass@k | Stricter evaluation (80× tests) |
| MBPP | 500 | Python | pass@k | Entry-level problems |
| MBPP+ | 399 | Python | pass@k | Stricter evaluation (35× tests) |
| MultiPL-E | 164×18 | 18 languages | pass@k | Multi-language evaluation |
| APPS | 10,000 | Python | pass@k | Competition-level |
| DS-1000 | 1,000 | Python | pass@k | Data science (pandas, numpy, etc.) |
| HumanEvalPack | 164×3×6 | 6 languages | pass@k | Synthesis/fix/explain |
| Mercury | 1,889 | Python | Efficiency | Computational efficiency |

**Common Issues**

**Issue: Different results than reported in papers**

Check these factors:

```bash
# 1. Verify n_samples (need 200 for accurate pass@k)
--n_samples 200

# 2. Check temperature (0.2 for greedy-ish, 0.8 for sampling)
--temperature 0.8

# 3. Verify task name matches exactly
--tasks humaneval  # Not "human_eval" or "HumanEval"

# 4. Check max_length_generation
--max_length_generation 512  # Increase for longer problems
```

**Issue: CUDA out of memory**

```bash
# Use quantization
--load_in_8bit
# OR
--load_in_4bit

# Reduce batch size
--batch_size 1

# Set memory limit
--max_memory_per_gpu "20GiB"
```

**Issue: Code execution hangs or times out**

Use Docker for safe execution:
```bash
# Generate on host (no execution)
--generation_only --save_generations

# Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...
```

**Issue: Low scores on instruction models**

Ensure proper instruction formatting:
```bash
# Use instruction-specific tasks
--tasks instruct-humaneval

# Set instruction tokens for your model
--instruction_tokens "<s>[INST],</s>,[/INST]"
```

**Issue: MultiPL-E language failures**

Use the dedicated Docker image:
```bash
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
```

**Command Reference**

| Argument | Default | Description |
|---|---|---|
| `--model` | - | HuggingFace model ID or local path |
| `--tasks` | - | Comma-separated task names |
| `--n_samples` | 1 | Samples per problem (200 for pass@k) |
| `--temperature` | 0.2 | Sampling temperature |
| `--max_length_generation` | 512 | Max tokens (prompt + generation) |
| `--batch_size` | 1 | Batch size per GPU |
| `--allow_code_execution` | False | Enable code execution (required) |
| `--generation_only` | False | Generate without evaluation |
| `--load_generations_path` | - | Load pre-generated solutions |
| `--save_generations` | False | Save generated code |
| `--metric_output_path` | results.json | Output file for metrics |
| `--load_in_8bit` | False | 8-bit quantization |
| `--load_in_4bit` | False | 4-bit quantization |
| `--trust_remote_code` | False | Allow custom model code |
| `--precision` | fp32 | Model precision (fp32/fp16/bf16) |

**Hardware Requirements**

| Model Size | VRAM (fp16) | VRAM (4-bit) | Time (HumanEval, n=200) |
|---|---|---|---|
| 7B | 14GB | 6GB | ~30 min (A100) |
| 13B | 26GB | 10GB | ~1 hour (A100) |
| 34B | 68GB | 20GB | ~2 hours (A100) |

**Resources**
