nemo-evaluator


NeMo Evaluator SDK - Enterprise LLM Benchmarking


Quick Start


NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).
Installation:

```bash
pip install nemo-evaluator-launcher
```

Basic evaluation:

```bash
export NGC_API_KEY=nvapi-your-key-here

cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

nemo-evaluator-launcher run --config-dir . --config-name config
```

Common Workflows


Workflow 1: Standard Model Evaluation


Checklist:
- [ ] Configure API endpoint (NVIDIA Build or self-hosted)
- [ ] Select benchmarks (MMLU, GSM8K, IFEval, HumanEval)
- [ ] Run evaluation
- [ ] Check results
Step 1: Configure endpoint

For NVIDIA Build:

```yaml
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```

For self-hosted (vLLM, TRT-LLM):

```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""
```
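Whichever endpoint the target points at, it must speak the OpenAI-compatible chat-completions protocol. A minimal sketch of the request body such an endpoint has to accept (payload construction only, no HTTP call is made; the prompt text is illustrative):

```python
import json

# Request body for a POST to the configured chat/completions URL.
payload = {
    "model": "my-model",  # matches target.api_endpoint.model_id
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 16,
}
print(json.dumps(payload, indent=2))
```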
Step 2: Select benchmarks

```yaml
evaluation:
  tasks:
    - name: ifeval           # Instruction following
    - name: gpqa_diamond     # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval        # Code generation
```

Step 3: Run and check results

```bash
nemo-evaluator-launcher run --config-dir . --config-name config
nemo-evaluator-launcher status <invocation_id>
cat results/<invocation_id>/<task>/artifacts/results.yml
```
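Each task writes its artifacts under the `results/<invocation_id>/<task>/artifacts/` layout shown above. A minimal sketch that collects the per-task result files (the directory layout comes from the docs; the demo uses a mock layout, and the `score` key inside the mock file is a placeholder, not the real `results.yml` schema):

```python
from pathlib import Path
import tempfile

def find_result_files(results_root: str, invocation_id: str) -> dict:
    """Map task name -> path of that task's results.yml artifact."""
    base = Path(results_root) / invocation_id
    return {
        task_dir.name: task_dir / "artifacts" / "results.yml"
        for task_dir in sorted(base.iterdir())
        if (task_dir / "artifacts" / "results.yml").is_file()
    }

# Demo with a mock layout (no real evaluation run needed):
root = tempfile.mkdtemp()
for task in ("ifeval", "humaneval"):
    artifacts = Path(root, "abc123", task, "artifacts")
    artifacts.mkdir(parents=True)
    (artifacts / "results.yml").write_text("score: 0.0\n")

print(sorted(find_result_files(root, "abc123")))  # ['humaneval', 'ifeval']
```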

Workflow 2: Slurm HPC Evaluation


```yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8

deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
```
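One easy misconfiguration here: the deployment's parallelism should account for the GPUs Slurm allocates (in the config above, 2 TP x 4 DP = 1 node x 8 GPUs). A quick sanity check, assuming the model is sharded across tensor_parallel_size x data_parallel_size GPU workers:

```python
# Values mirror the execution/deployment sections of the config above.
execution = {"nodes": 1, "gpus_per_node": 8}
deployment = {"tensor_parallel_size": 2, "data_parallel_size": 4}

allocated = execution["nodes"] * execution["gpus_per_node"]
used = deployment["tensor_parallel_size"] * deployment["data_parallel_size"]
assert used == allocated, f"config uses {used} GPUs but Slurm allocates {allocated}"
print("parallelism config OK:", used, "GPUs")
```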

Workflow 3: Model Comparison


Same config, different models:

```bash
nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct
nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3
```

Export results:

```bash
nemo-evaluator-launcher export <id> --dest mlflow
nemo-evaluator-launcher export <id> --dest wandb
```
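Once both runs are exported, a side-by-side comparison can be as simple as merging two score maps. A sketch with placeholder scores (the numbers below are illustrative, not real benchmark results):

```python
# Placeholder per-task scores for two hypothetical runs.
llama = {"ifeval": 0.80, "humaneval": 0.62}
mistral = {"ifeval": 0.74, "humaneval": 0.55}

# Print a delta table over the union of evaluated tasks.
for task in sorted(set(llama) | set(mistral)):
    a, b = llama.get(task), mistral.get(task)
    print(f"{task:12s} llama={a:.2f} mistral={b:.2f} delta={a - b:+.2f}")
```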

Supported Harnesses


| Harness | Tasks | Categories |
|---|---|---|
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |

CLI Reference


| Command | Description |
|---|---|
| `run` | Execute evaluation with config |
| `status <id>` | Check job status |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <id>` | Export results (mlflow/wandb/local) |
| `kill <id>` | Terminate running job |

When to Use vs Alternatives


Use NeMo Evaluator when:
  • Need 100+ benchmarks from 18+ harnesses
  • Running on Slurm HPC clusters
  • Requiring reproducible containerized evaluation
  • Evaluating against OpenAI-compatible APIs
Use alternatives instead:
  • lm-evaluation-harness: Simpler local evaluation
  • bigcode-evaluation-harness: Code-only benchmarks
  • HELM: Broader evaluation (fairness, efficiency)

Common Issues


Container pull fails: Configure NGC credentials

```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```

Task requires env var: Add it to the task config

```yaml
tasks:
  - name: gpqa_diamond
    env_vars:
      HF_TOKEN: HF_TOKEN
```

Tune parallelism, or limit samples for a quick smoke test:

```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```

Requirements


  • Python 3.10-3.13
  • Docker (for local execution)
  • NGC API Key (for NVIDIA Build)
  • HF_TOKEN (for some benchmarks)
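The list above can be turned into a quick preflight script. A minimal sketch (the version range and variable names come from the docs; a PATH lookup stands in for a full Docker health check):

```python
import os
import shutil
import sys

# Each check maps a requirement to whether this environment satisfies it.
checks = {
    "python 3.10-3.13": (3, 10) <= sys.version_info[:2] <= (3, 13),
    "docker on PATH": shutil.which("docker") is not None,
    "NGC_API_KEY set": bool(os.environ.get("NGC_API_KEY")),
    "HF_TOKEN set": bool(os.environ.get("HF_TOKEN")),
}
for name, ok in checks.items():
    print(f"{'OK ' if ok else '-- '}{name}")
```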