evaluating-code-models

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

NPX Install

npx skill4agent add orchestra-research/ai-research-skills evaluating-code-models

SKILL.md Content

BigCode Evaluation Harness - Code Model Benchmarking

Quick Start

BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).
Installation:
bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
Evaluate on HumanEval:
bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations
View available tasks:
bash
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"

Common Workflows

Workflow 1: Standard Code Benchmark Evaluation

Evaluate model on core code benchmarks (HumanEval, MBPP, HumanEval+).
Checklist:
Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results
Step 1: Choose benchmark suite
Python code generation (most common):
  • HumanEval: 164 handwritten problems, function completion
  • HumanEval+: Same 164 problems with 80× more tests (stricter)
  • MBPP: 500 crowd-sourced problems, entry-level difficulty
  • MBPP+: 399 curated problems with 35× more tests
Multi-language (18 languages):
  • MultiPL-E: HumanEval/MBPP translated to C++, Java, JavaScript, Go, Rust, etc.
Advanced:
  • APPS: 10,000 problems (introductory/interview/competition)
  • DS-1000: 1,000 data science problems across 7 libraries
Step 2: Configure model and generation
bash
# Standard HuggingFace model
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution

# Quantized model (4-bit)
accelerate launch main.py \
  --model codellama/CodeLlama-34b-hf \
  --tasks humaneval \
  --load_in_4bit \
  --max_length_generation 512 \
  --allow_code_execution

# Custom or private model (--use_auth_token is only needed for private Hub repos)
accelerate launch main.py \
  --model /path/to/my-code-model \
  --tasks humaneval \
  --trust_remote_code \
  --use_auth_token \
  --allow_code_execution
Step 3: Run evaluation
bash
# Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.8 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path results/starcoder2-humaneval.json
Step 4: Analyze results
Results in results/starcoder2-humaneval.json:
json
{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}
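The pass@k values above come from the standard unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that pass the tests. The harness computes this internally; the sketch below is not its code, only a minimal illustration of the formula, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n (n_samples >= k)")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Benchmark-level pass@k: the per-problem estimates averaged."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

The `k <= n` check is why a run with `--n_samples 20` cannot report pass@100.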

Workflow 2: Multi-Language Evaluation (MultiPL-E)

Evaluate code generation across 18 programming languages.
Checklist:
Multi-Language Evaluation:
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages
Step 1: Generate solutions on host
bash
# Generate without execution (safe)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --max_length_generation 650 \
  --temperature 0.8 \
  --n_samples 50 \
  --batch_size 50 \
  --generation_only \
  --save_generations \
  --save_generations_path generations_multi.json
Step 2: Evaluate in Docker container
bash
# Pull the MultiPL-E Docker image
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

# Run evaluation inside container (use the same image name that was pulled above)
docker run -v "$(pwd)/generations_multi.json:/app/generations.json:ro" \
  -it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --load_generations_path /app/generations.json \
  --allow_code_execution \
  --n_samples 50
Supported languages: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
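For the final checklist step (comparing across languages), a small script can rank the per-language scores. This is a sketch that assumes the metrics JSON has the same shape as the Workflow 1 example, with one key per task name such as `multiple-py`. The function name and the scores in `example_metrics` are made-up placeholders, not measured results.

```python
def pass_at_1_by_task(metrics: dict, prefix: str = "multiple-") -> list[tuple[str, float]]:
    """Return (language, pass@1) pairs for tasks starting with `prefix`, best first."""
    rows = [
        (task[len(prefix):], scores["pass@1"])
        for task, scores in metrics.items()
        # Skip non-task entries such as "config"
        if task.startswith(prefix) and isinstance(scores, dict) and "pass@1" in scores
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder numbers for illustration only (assumed file shape, invented scores)
example_metrics = {
    "multiple-py": {"pass@1": 0.30},
    "multiple-js": {"pass@1": 0.25},
    "multiple-java": {"pass@1": 0.28},
    "multiple-cpp": {"pass@1": 0.27},
    "config": {"model": "bigcode/starcoder2-7b", "n_samples": 50},
}

if __name__ == "__main__":
    for language, score in pass_at_1_by_task(example_metrics):
        print(f"{language:<6} {score:.3f}")
```

In practice, replace `example_metrics` with the dictionary loaded from the metrics file via `json.load`.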

Workflow 3: Instruction-Tuned Model Evaluation

Evaluate chat/instruction models with proper formatting.
Checklist:
Instruction Model Evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation
Step 1: Choose instruction tasks
  • instruct-humaneval: HumanEval with instruction prompts
  • humanevalsynthesize-{lang}: HumanEvalPack synthesis tasks
Step 2: Configure instruction tokens
bash
# For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks instruct-humaneval \
  --instruction_tokens "<s>[INST],</s>,[/INST]" \
  --max_length_generation 512 \
  --allow_code_execution
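The `--instruction_tokens` value is a comma-separated triple. The sketch below shows how such a triple could wrap an instruction, assuming the order is user token, end token, assistant token and that the prompt is a plain concatenation. Both the order and the concatenation are assumptions about the harness, and `parse_instruction_tokens` and `build_prompt` are hypothetical helper names.

```python
def parse_instruction_tokens(arg: str) -> tuple[str, str, str]:
    """Split 'user,end,assistant' into its three parts."""
    parts = arg.split(",")
    if len(parts) != 3:
        raise ValueError("expected exactly three comma-separated tokens")
    return parts[0], parts[1], parts[2]


def build_prompt(instruction: str, context: str, tokens: tuple[str, str, str]) -> str:
    """Assumed layout: user token, instruction, end token, assistant token, code context."""
    user_token, end_token, assistant_token = tokens
    return user_token + instruction + end_token + assistant_token + context


if __name__ == "__main__":
    tokens = parse_instruction_tokens("<s>[INST],</s>,[/INST]")
    print(build_prompt("Write a function that adds two numbers.", "def add(a, b):", tokens))
```

Printing the assembled prompt for one problem is a quick way to check that the tokens match the format the model was fine-tuned on.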
Step 3: HumanEvalPack for instruction models
bash
# Test code synthesis (HumanEvalPack covers 6 languages; this run uses Python and JavaScript)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks humanevalsynthesize-python,humanevalsynthesize-js \
  --prompt instruct \
  --max_length_generation 512 \
  --allow_code_execution

Workflow 4: Compare Multiple Models

Benchmark suite for model comparison.
Step 1: Create evaluation script
bash
#!/bin/bash
# eval_models.sh

MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

mkdir -p results  # make sure the output directory exists

for model in "${MODELS[@]}"; do
  # Turn "org/name" into "org-name" so it is safe to use as a file name
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"

  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done
Step 2: Generate comparison table
python
import json
import pandas as pd

models = ["bigcode-starcoder2-7b", "codellama-CodeLlama-7b-hf", "deepseek-ai-deepseek-coder-6.7b-base"]
results = []

for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
        results.append({
            "Model": model,
            "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
            "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}"
        })

df = pd.DataFrame(results)
# DataFrame.to_markdown requires the optional "tabulate" package
print(df.to_markdown(index=False))

When to Use vs Alternatives

Use BigCode Evaluation Harness when:
  • Evaluating code generation models specifically
  • Need multi-language evaluation (18 languages via MultiPL-E)
  • Testing functional correctness with unit tests (pass@k)
  • Benchmarking for BigCode/HuggingFace leaderboards
  • Evaluating fill-in-the-middle (FIM) capabilities
Use alternatives instead:
  • lm-evaluation-harness: General LLM benchmarks (MMLU, GSM8K, HellaSwag)
  • EvalPlus: Stricter HumanEval+/MBPP+ with more test cases
  • SWE-bench: Real-world GitHub issue resolution
  • LiveCodeBench: Contamination-free, continuously updated problems
  • CodeXGLUE: Code understanding tasks (clone detection, defect prediction)

Supported Benchmarks

| Benchmark | Problems | Languages | Metric | Use Case |
|---|---|---|---|---|
| HumanEval | 164 | Python | pass@k | Standard code completion |
| HumanEval+ | 164 | Python | pass@k | Stricter evaluation (80× tests) |
| MBPP | 500 | Python | pass@k | Entry-level problems |
| MBPP+ | 399 | Python | pass@k | Stricter evaluation (35× tests) |
| MultiPL-E | 164×18 | 18 languages | pass@k | Multi-language evaluation |
| APPS | 10,000 | Python | pass@k | Competition-level |
| DS-1000 | 1,000 | Python | pass@k | Data science (pandas, numpy, etc.) |
| HumanEvalPack | 164×3×6 | 6 languages | pass@k | Synthesis/fix/explain |
| Mercury | 1,889 | Python | Efficiency | Computational efficiency |

Common Issues

Issue: Different results than reported in papers
Check these factors:
bash
# 1. Verify n_samples (pass@k needs n_samples >= k; 200 is the usual choice when reporting pass@100)
--n_samples 200

# 2. Check temperature (low, e.g. 0.2, is typical for pass@1; higher, e.g. 0.8, for pass@10/pass@100)
--temperature 0.8

# 3. Verify task name matches exactly
--tasks humaneval  # Not "human_eval" or "HumanEval"

# 4. Check max_length_generation
--max_length_generation 512  # Increase for longer problems
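When two runs disagree, comparing their recorded settings is a quick first check. This sketch assumes each metrics file carries a `config` block like the one in the Workflow 1 example. `config_diff` is a hypothetical helper, not part of the harness.

```python
def config_diff(run_a: dict, run_b: dict) -> dict:
    """Return {setting: (value_in_a, value_in_b)} for settings that differ."""
    cfg_a = run_a.get("config", {})
    cfg_b = run_b.get("config", {})
    return {
        key: (cfg_a.get(key), cfg_b.get(key))
        for key in sorted(set(cfg_a) | set(cfg_b))
        if cfg_a.get(key) != cfg_b.get(key)
    }


if __name__ == "__main__":
    # Illustrative configs only; load real ones with json.load
    mine = {"config": {"temperature": 0.2, "n_samples": 20}}
    reference = {"config": {"temperature": 0.8, "n_samples": 200}}
    for key, (a, b) in config_diff(mine, reference).items():
        print(f"{key}: {a} vs {b}")
```

A setting missing from one file shows up as `None` on that side.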
Issue: CUDA out of memory
bash
# Use quantization
--load_in_8bit
# OR
--load_in_4bit

# Reduce batch size
--batch_size 1

# Set memory limit
--max_memory_per_gpu "20GiB"
Issue: Code execution hangs or times out
Use Docker for safe execution:
bash
# Generate on host (no execution)
--generation_only --save_generations

# Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...
Issue: Low scores on instruction models
Ensure proper instruction formatting:
bash
# Use instruction-specific tasks
--tasks instruct-humaneval

# Set instruction tokens for your model
--instruction_tokens "<s>[INST],</s>,[/INST]"
Issue: MultiPL-E language failures
Use the dedicated Docker image:
bash
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

Command Reference

| Argument | Default | Description |
|---|---|---|
| --model | - | HuggingFace model ID or local path |
| --tasks | - | Comma-separated task names |
| --n_samples | 1 | Samples per problem (200 for pass@k) |
| --temperature | 0.2 | Sampling temperature |
| --max_length_generation | 512 | Max tokens (prompt + generation) |
| --batch_size | 1 | Batch size per GPU |
| --allow_code_execution | False | Enable code execution (required for benchmarks that run generated code) |
| --generation_only | False | Generate without evaluation |
| --load_generations_path | - | Load pre-generated solutions |
| --save_generations | False | Save generated code |
| --metric_output_path | evaluation_results.json | Output file for metrics |
| --load_in_8bit | False | 8-bit quantization |
| --load_in_4bit | False | 4-bit quantization |
| --trust_remote_code | False | Allow custom model code |
| --precision | fp32 | Model precision (fp32/fp16/bf16) |

Hardware Requirements

| Model Size | VRAM (fp16) | VRAM (4-bit) | Time (HumanEval, n=200) |
|---|---|---|---|
| 7B | 14GB | 6GB | ~30 min (A100) |
| 13B | 26GB | 10GB | ~1 hour (A100) |
| 34B | 68GB | 20GB | ~2 hours (A100) |
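The fp16 column follows from simple arithmetic: parameters times bytes per parameter. The sketch below reproduces that weight-only estimate, using decimal gigabytes. The 4-bit figures in the table are higher than the weight-only value, presumably because they allow for runtime overhead, so treat this as a lower bound. The helper name and the `BYTES_PER_PARAM` mapping are illustrative.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "8bit": 1.0, "4bit": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes); excludes activations and KV cache."""
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision}")
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (7, 13, 34):
        fp16 = weight_memory_gb(size, "fp16")
        four_bit = weight_memory_gb(size, "4bit")
        print(f"{size}B: fp16 {fp16:.1f} GB, 4-bit {four_bit:.1f} GB (weights only)")
```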

Resources