# NeMo Evaluator SDK - Enterprise LLM Benchmarking
## Quick Start
NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).
**Installation:**

```bash
pip install nemo-evaluator-launcher
```

**Basic evaluation:**

```bash
export NGC_API_KEY=nvapi-your-key-here
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: ./results
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
evaluation:
  tasks:
    - name: ifeval
EOF
nemo-evaluator-launcher run --config-dir . --config-name config
```
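Before launching, it can save a failed run to sanity-check the config. A minimal sketch (not part of the SDK; key names taken from the example config above) that flags missing fields:

```python
def check_config(cfg: dict) -> list:
    """Return human-readable problems; an empty list means the config looks sane."""
    problems = []
    endpoint = cfg.get("target", {}).get("api_endpoint", {})
    for key in ("model_id", "url", "api_key_name"):
        if key not in endpoint:
            problems.append(f"target.api_endpoint.{key} is missing")
    tasks = cfg.get("evaluation", {}).get("tasks", [])
    if not tasks:
        problems.append("evaluation.tasks is empty")
    for i, task in enumerate(tasks):
        if "name" not in task:
            problems.append(f"evaluation.tasks[{i}] has no name")
    return problems

# Mirrors the YAML written above; parse the real file with a YAML library.
cfg = {
    "target": {"api_endpoint": {
        "model_id": "meta/llama-3.1-8b-instruct",
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "api_key_name": "NGC_API_KEY",
    }},
    "evaluation": {"tasks": [{"name": "ifeval"}]},
}
print(check_config(cfg))  # -> [] (no problems)
```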
## Common Workflows

### Workflow 1: Standard Model Evaluation
**Checklist:**

- [ ] Configure API endpoint (NVIDIA Build or self-hosted)
- [ ] Select benchmarks (MMLU, GSM8K, IFEval, HumanEval)
- [ ] Run evaluation
- [ ] Check results

**Step 1: Configure endpoint**

For NVIDIA Build:
```yaml
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```

For self-hosted (vLLM, TRT-LLM):
```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""
```

**Step 2: Select benchmarks**
```yaml
evaluation:
  tasks:
    - name: ifeval              # Instruction following
    - name: gpqa_diamond        # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval           # Code generation
```

**Step 3: Run and check results**
```bash
nemo-evaluator-launcher run --config-dir . --config-name config
nemo-evaluator-launcher status <invocation_id>
cat results/<invocation_id>/<task>/artifacts/results.yml
```
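`results.yml` schemas differ between harnesses, so a generic flattener is handy when comparing scores across runs. A sketch assuming only that the file parses to nested mappings with scalar leaves (the `ifeval` metric names below are illustrative, not the harness's actual keys):

```python
def flatten_metrics(node, prefix=""):
    """Yield (dotted_path, value) for every scalar leaf in a nested mapping."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten_metrics(value, f"{prefix}{key}.")
    else:
        yield prefix.rstrip("."), node

# Hypothetical parsed results.yml -- load the real file with a YAML library.
example = {"ifeval": {"scores": {"strict_accuracy": 0.78, "loose_accuracy": 0.83}}}
for path, score in flatten_metrics(example):
    print(f"{path}: {score}")
# ifeval.scores.strict_accuracy: 0.78
# ifeval.scores.loose_accuracy: 0.83
```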
### Workflow 2: Slurm HPC Evaluation
```yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_
execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
```
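With a vLLM deployment, the parallelism settings typically need to fill the GPU allocation exactly: tensor_parallel_size × data_parallel_size matching nodes × gpus_per_node (2 × 4 = 1 × 8 in the config above). A quick pre-submit check, as a convenience sketch rather than launcher functionality:

```python
def gpus_match(nodes, gpus_per_node, tensor_parallel_size, data_parallel_size):
    """True when the model replicas (TP x DP) exactly fill the allocated GPUs."""
    return tensor_parallel_size * data_parallel_size == nodes * gpus_per_node

# Values from the Slurm config above: 2 * 4 == 1 * 8
print(gpus_match(nodes=1, gpus_per_node=8,
                 tensor_parallel_size=2, data_parallel_size=4))  # -> True
```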
### Workflow 3: Model Comparison
```bash
# Same config, different models
nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct
nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3

# Export results
nemo-evaluator-launcher export <id> --dest mlflow
nemo-evaluator-launcher export <id> --dest wandb
```
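The per-model commands above are easy to generate in a loop when sweeping more than two models. A sketch (model IDs taken from the examples above; execute each argv with `subprocess.run` or print it for a shell):

```python
def sweep_commands(models, config_name="base_eval", config_dir="."):
    """Build one launcher argv per model, overriding only the model id."""
    return [
        ["nemo-evaluator-launcher", "run",
         "--config-dir", config_dir, "--config-name", config_name,
         "-o", f"target.api_endpoint.model_id={model}"]
        for model in models
    ]

for cmd in sweep_commands(["meta/llama-3.1-8b-instruct",
                           "mistralai/mistral-7b-instruct-v0.3"]):
    print(" ".join(cmd))  # or subprocess.run(cmd, check=True)
```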
## Supported Harnesses
| Harness | Tasks | Categories |
|---|---|---|
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |
## CLI Reference
All commands are subcommands of `nemo-evaluator-launcher`; only `run`, `status`, and `export` appear in the workflows above, so verify the remaining subcommand names with `--help`.

| Command | Description |
|---|---|
| `run --config-dir <dir> --config-name <name>` | Execute evaluation with config |
| `status <invocation_id>` | Check job status |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <id> --dest <destination>` | Export results (mlflow/wandb/local) |
| `kill <invocation_id>` | Terminate running job |
## When to Use vs Alternatives
**Use NeMo Evaluator when:**
- Need 100+ benchmarks from 18+ harnesses
- Running on Slurm HPC clusters
- Requiring reproducible containerized evaluation
- Evaluating against OpenAI-compatible APIs
**Use an alternative when:**
- lm-evaluation-harness: Simpler local evaluation
- bigcode-evaluation-harness: Code-only benchmarks
- HELM: Broader evaluation (fairness, efficiency)
## Common Issues
**Container pull fails:** configure NGC credentials

```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```

**Task requires an env var:** add it to the task config

```yaml
tasks:
  - name: gpqa_diamond
    env_vars:
      HF_TOKEN: HF_TOKEN
```

**Increase parallelism or limit samples** (append to the run command):

```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```
## Requirements
- Python 3.10-3.13
- Docker (for local execution)
- NGC API Key (for NVIDIA Build)
- HF_TOKEN (for some benchmarks)