evaluating-code-models
BigCode Evaluation Harness - Code Model Benchmarking
Quick Start
BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).
Installation:
```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
```
Evaluate on HumanEval:
```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations
```
View available tasks:
```bash
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
```
Common Workflows
Workflow 1: Standard Code Benchmark Evaluation
Evaluate a model on the core code benchmarks (HumanEval, MBPP, HumanEval+).
Checklist:
Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results

**Step 1: Choose benchmark suite**
Python code generation (most common):
- HumanEval: 164 handwritten problems, function completion
- HumanEval+: Same 164 problems with 80× more tests (stricter)
- MBPP: 500 crowd-sourced problems, entry-level difficulty
- MBPP+: 399 curated problems with 35× more tests
Multi-language (18 languages):
- MultiPL-E: HumanEval/MBPP translated to C++, Java, JavaScript, Go, Rust, etc.
Advanced:
- APPS: 10,000 problems (introductory/interview/competition)
- DS-1000: 1,000 data science problems across 7 libraries
**Step 2: Configure model and generation**
```bash
# Standard HuggingFace model
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --do_sample True \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution

# Quantized model (4-bit)
accelerate launch main.py \
  --model codellama/CodeLlama-34b-hf \
  --tasks humaneval \
  --load_in_4bit \
  --max_length_generation 512 \
  --allow_code_execution

# Custom/private model
accelerate launch main.py \
  --model /path/to/my-code-model \
  --tasks humaneval \
  --trust_remote_code \
  --use_auth_token \
  --allow_code_execution
```
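When the same flags are reused across several runs, it can help to assemble the command programmatically. The following is a minimal sketch and not part of the harness: `build_eval_command` is a hypothetical helper that turns a flag dictionary into the argument list used above, emitting boolean `True` values as bare flags.

```python
from typing import Dict, List, Union

FlagValue = Union[str, int, float, bool]


def build_eval_command(model: str, tasks: List[str], flags: Dict[str, FlagValue]) -> List[str]:
    """Build an `accelerate launch main.py ...` argument list (hypothetical helper)."""
    cmd = ["accelerate", "launch", "main.py", "--model", model, "--tasks", ",".join(tasks)]
    for name, value in flags.items():
        if value is True:
            # Switch-style flag such as --allow_code_execution
            cmd.append(f"--{name}")
        elif value is False:
            # Disabled switch: leave it out entirely
            continue
        else:
            # Valued flag; pass the string "True" for flags written as `--do_sample True`
            cmd.extend([f"--{name}", str(value)])
    return cmd


cmd = build_eval_command(
    "bigcode/starcoder2-7b",
    ["humaneval"],
    {
        "max_length_generation": 512,
        "temperature": 0.2,
        "n_samples": 200,
        "batch_size": 50,
        "allow_code_execution": True,
        "load_in_4bit": False,
    },
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.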
**Step 3: Run evaluation**
```bash
# Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.8 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path results/starcoder2-humaneval.json
```
**Step 4: Analyze results**
Results are saved in `results/starcoder2-humaneval.json`, for example:
```json
{
  "humaneval": {
    "pass@1": 0.354,
    "pass@10": 0.521,
    "pass@100": 0.689
  },
  "config": {
    "model": "bigcode/starcoder2-7b",
    "temperature": 0.8,
    "n_samples": 200
  }
}
```
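For reference, pass@k is commonly computed with the unbiased estimator from the Codex paper: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below illustrates that formula. The function names are illustrative, and this is not the harness's own implementation.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the number of correct samples per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# With k=1 the estimator reduces to the fraction of correct samples
print(pass_at_k(n=200, c=50, k=1))  # 0.25
```

This is why `--n_samples` should be at least as large as the largest k reported: pass@100 cannot be estimated from 20 samples.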
Workflow 2: Multi-Language Evaluation (MultiPL-E)
Evaluate code generation across 18 programming languages.
Checklist:
Multi-Language Evaluation:
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages

**Step 1: Generate solutions on host**
```bash
# Generate without execution (safe)
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --max_length_generation 650 \
  --temperature 0.8 \
  --n_samples 50 \
  --batch_size 50 \
  --generation_only \
  --save_generations \
  --save_generations_path generations_multi.json
```
**Step 2: Evaluate in Docker container**
```bash
# Pull the MultiPL-E Docker image and tag it with the short name used below
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
docker tag ghcr.io/bigcode-project/evaluation-harness-multiple evaluation-harness-multiple

# Run evaluation inside container
# Mount the file that Step 1 actually wrote; depending on the harness version,
# the task name may be appended to the path given in --save_generations_path
docker run -v "$(pwd)/generations_multi.json:/app/generations.json:ro" \
  -it evaluation-harness-multiple python3 main.py \
  --model bigcode/starcoder2-7b \
  --tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
  --load_generations_path /app/generations.json \
  --allow_code_execution \
  --n_samples 50
```
**Supported languages**: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
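Step 3 of the checklist, comparing across languages, can be done on the metrics file. The sketch below assumes the metrics JSON has the same shape as the HumanEval example in Workflow 1 (one key per task plus a `config` entry). The helper name `rank_languages` and the sample numbers are made up for illustration.

```python
from typing import Dict, List, Tuple


def rank_languages(results: Dict[str, dict], metric: str = "pass@1") -> List[Tuple[str, float]]:
    """Return (language, score) pairs for MultiPL-E tasks, best first."""
    scores = []
    for task, metrics in results.items():
        # MultiPL-E task names in this document look like "multiple-py"
        if task.startswith("multiple-") and metric in metrics:
            scores.append((task.split("-", 1)[1], metrics[metric]))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder numbers, not real benchmark results
example = {
    "multiple-py": {"pass@1": 0.30},
    "multiple-js": {"pass@1": 0.25},
    "multiple-java": {"pass@1": 0.28},
    "config": {"model": "bigcode/starcoder2-7b"},
}

for lang, score in rank_languages(example):
    print(f"{lang:<6}{score:.3f}")
```

In practice `example` would be replaced by `json.load(...)` on the file written via `--metric_output_path`.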
Workflow 3: Instruction-Tuned Model Evaluation
Evaluate chat/instruction models with proper formatting.
Checklist:
Instruction Model Evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation

**Step 1: Choose instruction tasks**
- instruct-humaneval: HumanEval with instruction prompts
- humanevalsynthesize-{lang}: HumanEvalPack synthesis tasks
**Step 2: Configure instruction tokens**
```bash
# For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks instruct-humaneval \
  --instruction_tokens "<s>[INST],</s>,[/INST]" \
  --max_length_generation 512 \
  --allow_code_execution
```
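The `--instruction_tokens` value is a single comma-separated string. As a rough sketch, assuming the three fields are the user marker, the end-of-instruction marker, and the assistant marker in that order, and that the prompt is their concatenation around the instruction, the formatting can be pictured as follows. Both functions are hypothetical stand-ins, not the harness's code.

```python
from typing import Tuple


def parse_instruction_tokens(spec: str) -> Tuple[str, str, str]:
    """Split "user,end,assistant" into its three markers (assumed field order)."""
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated tokens, got {len(parts)}")
    return parts[0], parts[1], parts[2]


def make_instruction_prompt(instruction: str, context: str, spec: str) -> str:
    """Wrap an instruction in the markers, then append the code context to complete."""
    user, end, assistant = parse_instruction_tokens(spec)
    return user + instruction + end + assistant + context


prompt = make_instruction_prompt(
    "Write a function that adds two numbers.",
    "def add(a, b):\n",
    "<s>[INST],</s>,[/INST]",
)
print(prompt)
```

Printing the assembled prompt for one problem is a quick way to check that the markers match the format the model was trained on.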
**Step 3: HumanEvalPack for instruction models**
```bash
# Test code synthesis (two of HumanEvalPack's 6 languages shown)
accelerate launch main.py \
  --model codellama/CodeLlama-7b-Instruct-hf \
  --tasks humanevalsynthesize-python,humanevalsynthesize-js \
  --prompt instruct \
  --max_length_generation 512 \
  --allow_code_execution
```
Workflow 4: Compare Multiple Models
Run the same benchmark suite over several models to compare them.
**Step 1: Create evaluation script**
```bash
#!/bin/bash
# eval_models.sh

MODELS=(
  "bigcode/starcoder2-7b"
  "codellama/CodeLlama-7b-hf"
  "deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"

# Make sure the output directory exists before metrics are written
mkdir -p results

for model in "${MODELS[@]}"; do
  # Replace "/" so the model ID is usable as a filename
  model_name=$(echo "$model" | tr '/' '-')
  echo "Evaluating $model"
  accelerate launch main.py \
    --model "$model" \
    --tasks "$TASKS" \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 20 \
    --allow_code_execution \
    --metric_output_path "results/${model_name}.json"
done
```
**Step 2: Generate comparison table**
```python
import json

import pandas as pd  # DataFrame.to_markdown() also needs the `tabulate` package

# File stems produced by eval_models.sh ("/" in the model ID replaced with "-")
models = ["bigcode-starcoder2-7b", "codellama-CodeLlama-7b-hf", "deepseek-ai-deepseek-coder-6.7b-base"]

results = []
for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
    results.append({
        "Model": model,
        "HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
        "MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}",
    })

df = pd.DataFrame(results)
print(df.to_markdown(index=False))
```
When to Use vs Alternatives
Use BigCode Evaluation Harness when:
- Evaluating code generation models specifically
- Need multi-language evaluation (18 languages via MultiPL-E)
- Testing functional correctness with unit tests (pass@k)
- Benchmarking for BigCode/HuggingFace leaderboards
- Evaluating fill-in-the-middle (FIM) capabilities
Use alternatives instead:
- lm-evaluation-harness: General LLM benchmarks (MMLU, GSM8K, HellaSwag)
- EvalPlus: Stricter HumanEval+/MBPP+ with more test cases
- SWE-bench: Real-world GitHub issue resolution
- LiveCodeBench: Contamination-free, continuously updated problems
- CodeXGLUE: Code understanding tasks (clone detection, defect prediction)
Supported Benchmarks
| Benchmark | Problems | Languages | Metric | Use Case |
|---|---|---|---|---|
| HumanEval | 164 | Python | pass@k | Standard code completion |
| HumanEval+ | 164 | Python | pass@k | Stricter evaluation (80× tests) |
| MBPP | 500 | Python | pass@k | Entry-level problems |
| MBPP+ | 399 | Python | pass@k | Stricter evaluation (35× tests) |
| MultiPL-E | 164×18 | 18 languages | pass@k | Multi-language evaluation |
| APPS | 10,000 | Python | pass@k | Competition-level |
| DS-1000 | 1,000 | Python | pass@k | Data science (pandas, numpy, etc.) |
| HumanEvalPack | 164×3×6 | 6 languages | pass@k | Synthesis/fix/explain |
| Mercury | 1,889 | Python | Efficiency | Computational efficiency |
Common Issues
**Issue: Different results than reported in papers**
Check these factors:
```bash
# 1. Verify n_samples (need 200 for accurate pass@k)
--n_samples 200

# 2. Check temperature (0.2 for near-greedy sampling, 0.8 for more diverse sampling)
--temperature 0.8

# 3. Verify task name matches exactly
--tasks humaneval  # Not "human_eval" or "HumanEval"

# 4. Check max_length_generation
--max_length_generation 512  # Increase for longer problems
```
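These checks can be automated against a saved metrics file. The following is a minimal sketch, assuming the file contains a `config` object like the example in Workflow 1. `find_config_mismatches` and the expected values are illustrative, not taken from any particular paper.

```python
from typing import Any, Dict, Tuple


def find_config_mismatches(
    config: Dict[str, Any], expected: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing setting to (actual, expected); missing keys show as None."""
    return {
        key: (config.get(key), want)
        for key, want in expected.items()
        if config.get(key) != want
    }


# Hypothetical values standing in for results["config"] and a paper's reported setup
saved_config = {"model": "bigcode/starcoder2-7b", "temperature": 0.2, "n_samples": 20}
paper_settings = {"temperature": 0.8, "n_samples": 200}

for key, (actual, want) in find_config_mismatches(saved_config, paper_settings).items():
    print(f"{key}: run used {actual}, expected {want}")
```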
**Issue: CUDA out of memory**
```bash
# Use quantization
--load_in_8bit
# OR
--load_in_4bit

# Reduce batch size
--batch_size 1

# Set memory limit
--max_memory_per_gpu "20GiB"
```
**Issue: Code execution hangs or times out**
Use Docker for safe execution:
```bash
# Generate on host (no execution)
--generation_only --save_generations

# Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...
```
**Issue: Low scores on instruction models**
Ensure proper instruction formatting:
```bash
# Use instruction-specific tasks
--tasks instruct-humaneval

# Set instruction tokens for your model
--instruction_tokens "<s>[INST],</s>,[/INST]"
```
**Issue: MultiPL-E language failures**
Use the dedicated Docker image:
```bash
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
```
Command Reference
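| Argument | Default | Description |
|---|---|---|
| `--model` | - | HuggingFace model ID or local path |
| `--tasks` | - | Comma-separated task names |
| `--n_samples` | 1 | Samples per problem (200 for pass@k) |
| `--temperature` | 0.2 | Sampling temperature |
| `--max_length_generation` | 512 | Max tokens (prompt + generation) |
| `--batch_size` | 1 | Batch size per GPU |
| `--allow_code_execution` | False | Enable code execution (required to compute pass@k) |
| `--generation_only` | False | Generate without evaluation |
| `--load_generations_path` | - | Load pre-generated solutions |
| `--save_generations` | False | Save generated code |
| `--metric_output_path` | evaluation_results.json | Output file for metrics |
| `--load_in_8bit` | False | 8-bit quantization |
| `--load_in_4bit` | False | 4-bit quantization |
| `--trust_remote_code` | False | Allow custom model code |
| `--precision` | fp32 | Model precision (fp32/fp16/bf16) |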
Hardware Requirements
| Model Size | VRAM (fp16) | VRAM (4-bit) | Time (HumanEval, n=200) |
|---|---|---|---|
| 7B | 14GB | 6GB | ~30 min (A100) |
| 13B | 26GB | 10GB | ~1 hour (A100) |
| 34B | 68GB | 20GB | ~2 hours (A100) |
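The fp16 column follows from the parameter count: each parameter takes 2 bytes, so the weights alone need roughly 2 GB per billion parameters. The sketch below reproduces that arithmetic. It counts weights only, which is presumably why the 4-bit figures in the table are higher than the raw estimate (activations, KV cache, and quantization overhead are not modeled here).

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


for size in (7, 13, 34):
    fp16 = weight_memory_gb(size, "fp16")
    four_bit = weight_memory_gb(size, "4bit")
    print(f"{size}B: fp16 ~{fp16:.0f} GB, 4-bit weights ~{four_bit:.1f} GB")
```

Treat the result as a lower bound when choosing `--batch_size` and `--max_memory_per_gpu`.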
Resources
- GitHub: https://github.com/bigcode-project/bigcode-evaluation-harness
- Documentation: https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs
- BigCode Leaderboard: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
- HumanEval Dataset: https://huggingface.co/datasets/openai/openai_humaneval
- MultiPL-E: https://github.com/nuprl/MultiPL-E