Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ other benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry-standard harness from the BigCode Project, used by the Hugging Face leaderboards.
```bash
npx skill4agent add davila7/claude-code-templates evaluating-code-models
```

Install the harness from source and configure Accelerate:

```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
```

Run a quick smoke-test evaluation:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations
```

List every available task:

```bash
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
```

Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results

```bash
# Standard HuggingFace model
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution
```
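Running with `--n_samples 200` matters because pass@k is typically computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021). A minimal sketch of that formula, independent of the harness:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the chance that at least one of k
    samples drawn without replacement from n generations (c of which
    are correct) passes the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 71 of them correct
print(round(pass_at_k(200, 71, 1), 3))  # 0.355 (equals c/n for k=1)
```

For k=1 this reduces to c/n, which is why pass@1 only stabilizes once n is large relative to k.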
```bash
# Quantized model (4-bit)
accelerate launch main.py \
  --model codellama/CodeLlama-34b-hf \
  --tasks humaneval \
  --load_in_4bit \
  --max_length_generation 512 \
  --allow_code_execution
```
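The quantization flags trade precision for memory. As a rough, weights-only rule of thumb for sizing a GPU (my own back-of-envelope helper, not part of the harness):

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Weights-only VRAM estimate in GB (10^9 bytes). Activations,
    KV cache, and quantization overhead add several GB on top."""
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_vram_gb(7, 16))  # 14.0 -> a 7B model in fp16
print(weight_vram_gb(34, 4))  # 17.0 -> a 34B model in 4-bit
```

These weights-only numbers sit a few GB below the measured figures in the hardware table at the end, which is the expected overhead.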
```bash
# Custom/private model
accelerate launch main.py \
  --model /path/to/my-code-model \
  --tasks humaneval \
  --trust_remote_code \
  --use_auth_token \
  --allow_code_execution
```

```bash
# Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.8 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path results/starcoder2-humaneval.json
```

Example `results/starcoder2-humaneval.json`:

```json
{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}
```

Multi-Language Evaluation:
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages

```bash
# Generate without execution (safe on the host machine)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --max_length_generation 650 \
  --temperature 0.8 \
  --n_samples 50 \
  --batch_size 50 \
  --generation_only \
  --save_generations \
  --save_generations_path generations_multi.json
```

```bash
# Pull the MultiPL-E Docker image
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple

# Run evaluation inside the container
docker run -v $(pwd)/generations_multi.json:/app/generations.json:ro \
  -it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --load_generations_path /app/generations.json \
  --allow_code_execution \
  --n_samples 50
```

Instruction Model Evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation

```bash
# For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks instruct-humaneval \
  --instruction_tokens "<s>[INST],</s>,[/INST]" \
  --max_length_generation 512 \
  --allow_code_execution
```

```bash
# Test code synthesis with HumanEvalPack (covers 6 languages)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks humanevalsynthesize-python,humanevalsynthesize-js \
  --prompt instruct \
  --max_length_generation 512 \
  --allow_code_execution
```

To benchmark several models in one pass:

```bash
#!/bin/bash
# eval_models.sh
MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

for model in "${MODELS[@]}"; do
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"
  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done
```

Then collect the metric files into a comparison table:

```python
import json

import pandas as pd

models = [
    "bigcode-starcoder2-7b",
    "codellama-CodeLlama-7b-hf",
    "deepseek-ai-deepseek-coder-6.7b-base",
]
results = []
for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
    results.append({
        "Model": model,
        "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
        "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}",
    })

df = pd.DataFrame(results)
print(df.to_markdown(index=False))
```

| Benchmark | Problems | Languages | Metric | Use Case |
|---|---|---|---|---|
| HumanEval | 164 | Python | pass@k | Standard code completion |
| HumanEval+ | 164 | Python | pass@k | Stricter evaluation (80× tests) |
| MBPP | 500 | Python | pass@k | Entry-level problems |
| MBPP+ | 399 | Python | pass@k | Stricter evaluation (35× tests) |
| MultiPL-E | 164×18 | 18 languages | pass@k | Multi-language evaluation |
| APPS | 10,000 | Python | pass@k | Competition-level |
| DS-1000 | 1,000 | Python | pass@k | Data science (pandas, numpy, etc.) |
| HumanEvalPack | 164×3×6 | 6 languages | pass@k | Synthesis/fix/explain |
| Mercury | 1,889 | Python | Efficiency | Computational efficiency |
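For the MultiPL-E row above, Step 3 of the multi-language workflow (comparing across languages) amounts to ranking per-task scores once evaluation finishes. A sketch with illustrative placeholder numbers, assuming the metric file uses one top-level key per task like the single-task output shown earlier:

```python
# Placeholder scores; in practice, load them with json.load() from the
# file written by --metric_output_path (one top-level key per task)
metrics = {
    "multiple-py": {"pass@1": 0.331},
    "multiple-js": {"pass@1": 0.305},
    "multiple-java": {"pass@1": 0.287},
    "multiple-cpp": {"pass@1": 0.262},
}

# Rank languages by pass@1, best first
for task, scores in sorted(metrics.items(), key=lambda kv: -kv[1]["pass@1"]):
    lang = task.removeprefix("multiple-")
    print(f"{lang:5s} pass@1 = {scores['pass@1']:.3f}")
```

A large gap between Python and the other languages usually reflects training-data mix rather than task difficulty.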
Troubleshooting:

Scores look too low or noisy:

```bash
# 1. Verify n_samples (need 200 for accurate pass@k)
--n_samples 200
# 2. Check temperature (0.2 for greedy-ish, 0.8 for sampling)
--temperature 0.8
# 3. Verify the task name matches exactly
--tasks humaneval  # Not "human_eval" or "HumanEval"
# 4. Check max_length_generation
--max_length_generation 512  # Increase for longer problems
```

Out of GPU memory:

```bash
# Use quantization
--load_in_8bit
# OR
--load_in_4bit
# Reduce batch size
--batch_size 1
# Set a per-GPU memory limit
--max_memory_per_gpu "20GiB"
```

Untrusted generated code must not run on the host:

```bash
# Generate on host (no execution)
--generation_only --save_generations
# Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...
```

Instruction-tuned model scores poorly:

```bash
# Use instruction-specific tasks
--tasks instruct-humaneval
# Set instruction tokens for your model
--instruction_tokens "<s>[INST],</s>,[/INST]"
```

MultiPL-E Docker image missing:

```bash
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
```

| Argument | Default | Description |
|---|---|---|
| `--model` | - | HuggingFace model ID or local path |
| `--tasks` | - | Comma-separated task names |
| `--n_samples` | 1 | Samples per problem (200 for pass@k) |
| `--temperature` | 0.2 | Sampling temperature |
| `--max_length_generation` | 512 | Max tokens (prompt + generation) |
| `--batch_size` | 1 | Batch size per GPU |
| `--allow_code_execution` | False | Enable code execution (required) |
| `--generation_only` | False | Generate without evaluation |
| `--load_generations_path` | - | Load pre-generated solutions |
| `--save_generations` | False | Save generated code |
| `--metric_output_path` | results.json | Output file for metrics |
| `--load_in_8bit` | False | 8-bit quantization |
| `--load_in_4bit` | False | 4-bit quantization |
| `--trust_remote_code` | False | Allow custom model code |
| `--precision` | fp32 | Model precision (fp32/fp16/bf16) |
| Model Size | VRAM (fp16) | VRAM (4-bit) | Time (HumanEval, n=200) |
|---|---|---|---|
| 7B | 14GB | 6GB | ~30 min (A100) |
| 13B | 26GB | 10GB | ~1 hour (A100) |
| 34B | 68GB | 20GB | ~2 hours (A100) |
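The timings in the table scale with the total number of generations, so a quick sanity check before launching (plain arithmetic, nothing harness-specific) helps budget GPU time:

```python
# HumanEval with the full pass@k settings used above
problems, n_samples, batch_size = 164, 200, 50

total_generations = problems * n_samples       # samples to produce
batches = -(-total_generations // batch_size)  # ceil division
print(total_generations, batches)              # 32800 656
```

Halving `n_samples` roughly halves wall-clock time, at the cost of a noisier pass@k estimate.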