Use when evaluating LLMs, running benchmarks like MMLU/HumanEval/GSM8K, setting up evaluation pipelines, or asking about "NeMo Evaluator", "LLM benchmarking", "model evaluation", "MMLU", "HumanEval", "GSM8K", "benchmark harnesses"
## Quick Start

```bash
npx skill4agent add eyadsibai/ltk nemo-evaluator
pip install nemo-evaluator-launcher
export NGC_API_KEY=nvapi-your-key-here

cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: ./results
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
evaluation:
  tasks:
    - name: ifeval
EOF

nemo-evaluator-launcher run --config-dir . --config-name config
```

- [ ] Configure API endpoint (NVIDIA Build or self-hosted)
- [ ] Select benchmarks (MMLU, GSM8K, IFEval, HumanEval)
- [ ] Run evaluation
- [ ] Check results

## Target Configuration

NVIDIA Build endpoint:

```yaml
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```

Self-hosted endpoint:

```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""
```

## Task Selection

```yaml
evaluation:
  tasks:
    - name: ifeval              # Instruction following
    - name: gpqa_diamond        # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval           # Code generation
```

## Run and Monitor

```bash
nemo-evaluator-launcher run --config-dir . --config-name config
nemo-evaluator-launcher status <invocation_id>
cat results/<invocation_id>/<task>/artifacts/results.yml
```

## Slurm Execution with vLLM Deployment

```yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_
execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
```

## Compare Models and Export

```bash
# Same config, different models
nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct
nemo-evaluator-launcher run --config-dir . --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3

# Export results
nemo-evaluator-launcher export <id> --dest mlflow
nemo-evaluator-launcher export <id> --dest wandb
```

## Supported Harnesses

| Harness | Tasks | Categories |
|---|---|---|
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |
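
Tasks from different harnesses can be mixed in a single `evaluation.tasks` list, as the quick-start example already does. A sketch using task names from this document (verify exact identifiers against the launcher's benchmark listing, since naming can vary by version):

```yaml
evaluation:
  tasks:
    - name: ifeval        # lm-evaluation-harness
    - name: gpqa_diamond  # simple-evals
    - name: humaneval     # bigcode-evaluation-harness
```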
## CLI Commands

| Command | Description |
|---|---|
| `run --config-dir . --config-name <config>` | Execute evaluation with config |
| `status <invocation_id>` | Check job status |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <invocation_id> --dest <dest>` | Export results (mlflow/wandb/local) |
| `kill <invocation_id>` | Terminate running job |
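
For long-running jobs, the status command can be polled in a loop. A minimal sketch with the launcher call stubbed out so it runs self-contained; replace the stub body with the real `nemo-evaluator-launcher status <invocation_id>` call (and note the actual output format may differ from this sketch's assumption):

```shell
# Stub standing in for `nemo-evaluator-launcher status <invocation_id>`.
# It always reports "success" so this sketch terminates immediately;
# swap in the real status call for actual use.
check_status() {
  echo "success"
}

id="demo-invocation"
# Poll until the job is no longer running.
while [ "$(check_status "$id")" = "running" ]; do
  sleep 30
done
echo "job $id finished: $(check_status "$id")"
```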
## Gated Access

Log in to NGC for gated containers:

```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```

Gated tasks such as gpqa_diamond also require a Hugging Face token:

```yaml
tasks:
  - name: gpqa_diamond
    env_vars:
      HF_TOKEN: HF_TOKEN
```

## Useful Overrides

Raise request parallelism or cap the sample count for a quick smoke test:

```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```
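
These overrides compose with the `target.api_endpoint.model_id` override shown earlier, all stacked on one `run` invocation. A dry-run sketch that assembles the full command into a string and prints it (run `$CMD` instead of echoing it to execute for real):

```shell
# Assemble a launcher invocation with stacked -o overrides.
# This only prints the command; it does not execute it.
MODEL="meta/llama-3.1-8b-instruct"
CMD="nemo-evaluator-launcher run --config-dir . --config-name config"
CMD="$CMD -o target.api_endpoint.model_id=$MODEL"
CMD="$CMD -o +evaluation.nemo_evaluator_config.config.params.parallelism=8"
CMD="$CMD -o +evaluation.nemo_evaluator_config.config.params.limit_samples=100"
echo "$CMD"
```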