launching-evals
NeMo Evaluator Skill
Quick Reference
nemo-evaluator-launcher CLI
```bash
# Run evaluation
uv run nemo-evaluator-launcher run --config <path.yaml>
uv run nemo-evaluator-launcher run --config <path.yaml> -t <a_single_task_to_be_run_by_name>
uv run nemo-evaluator-launcher run --config <path.yaml> -t <task_name_1> -t <task_name_2> ...
uv run nemo-evaluator-launcher run --config <path.yaml> -o evaluation.nemo_evaluator_config.config.params.limit_samples=10 ...

# Preview the resolved config and the sbatch script without running the evaluation
uv run nemo-evaluator-launcher run --config <path.yaml> --dry-run

# Check status (--json for machine-readable output)
uv run nemo-evaluator-launcher status <invocation_id> --json

# Get evaluation run info (output paths, slurm job IDs, cluster hostname, etc.)
uv run nemo-evaluator-launcher info <invocation_id>

# Copy just the logs (quick, good for debugging)
uv run nemo-evaluator-launcher info <invocation_id> --copy-logs ./evaluation-results/

# For artifacts: use `nel info` to discover paths. If remote, SSH to explore and rsync what you need.
# If local, just read directly from the paths shown by `nel info`.
ssh <user>@<hostname> "ls <artifacts_path>/"
rsync -avzP <user>@<hostname>:<artifacts_path>/{results.yml,eval_factory_metrics.json,config.yml} ./evaluation-results/<invocation_id>.<job_index>/artifacts/

# Resume a failed/interrupted run (re-sbatches existing run.sub in the original run directory)
uv run nemo-evaluator-launcher resume <invocation_id>

# List past runs
uv run nemo-evaluator-launcher ls runs --since 1d

# List available evaluation tasks (by default, only shows tasks from the latest released containers)
uv run nemo-evaluator-launcher ls tasks
uv run nemo-evaluator-launcher ls tasks --from_container nvcr.io/nvidia/eval-factory/simple-evals:26.03
```
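The rsync example in the quick reference writes into a per-job directory under `./evaluation-results/`. Older rsync versions do not create missing parent directories, so it can help to create the destination first. The following is a minimal sketch under that assumption; `artifact_dest` is a hypothetical helper name and is not part of the launcher.

```shell
# Build and create the local destination used by the rsync example.
# Layout taken from that example: <root>/<invocation_id>.<job_index>/artifacts/
# NOTE: artifact_dest is a hypothetical helper, not a launcher command.
artifact_dest() {
  local root="$1" invocation_id="$2" job_index="$3"
  local dest="${root}/${invocation_id}.${job_index}/artifacts"
  # Create the full path up front so rsync has an existing target directory.
  mkdir -p "$dest" || return 1
  printf '%s/\n' "$dest"
}

# Usage (variables are placeholders you would set yourself):
#   dest="$(artifact_dest ./evaluation-results "$INVOCATION_ID" "$JOB_INDEX")"
#   rsync -avzP "$REMOTE_USER@$REMOTE_HOST:$ARTIFACTS_PATH/results.yml" "$dest"
```

The helper only prepares the local side; the remote path still has to come from `nel info`.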
Workflow
The complete evaluation workflow is divided into the following steps you should follow IN ORDER.
- Create or modify a config using the `nel-assistant` skill. If the user provides a past run, use its `config.yml` artifact as a starting point.
- Run the evaluation. See `references/run-evaluation.md` when executing this step.
- Monitor progress (MANDATORY after every `nel run`): poll status repeatedly until SUCCESS/FAILED. See `references/check-progress.md`.
- Post-run actions (when terminal state reached):
  - When the evaluation status is `SUCCESS`, analyze the results. See `references/analyze-results.md` when executing this step.
  - When the evaluation status is `FAILED`, debug the failed run. See `references/debug-failed-runs.md` when executing this step.
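The monitoring step (poll until SUCCESS or FAILED) can be sketched as a small loop. This is an illustration only: the JSON schema of `status --json` is not described here, so the sketch takes a user-supplied command that prints a single status word. Both `poll_until_terminal` and the `get_status` command named in the usage comment are hypothetical.

```shell
# Poll a status command until it reports a terminal state.
# Usage: poll_until_terminal <interval_seconds> <command> [args...]
# The command must print exactly one status word (e.g. RUNNING, SUCCESS, FAILED).
# NOTE: hypothetical helper; how the status word is extracted from
# `nemo-evaluator-launcher status <invocation_id> --json` is left to the caller.
poll_until_terminal() {
  local interval="$1"
  shift
  local status
  while true; do
    status="$("$@")"
    case "$status" in
      SUCCESS) echo "SUCCESS"; return 0 ;;
      FAILED)  echo "FAILED";  return 1 ;;
    esac
    # Any other state is treated as non-terminal: wait and poll again.
    sleep "$interval"
  done
}

# Usage (get_status is a placeholder you would write yourself):
#   if poll_until_terminal 60 get_status "$INVOCATION_ID"; then
#     echo "analyze results"
#   else
#     echo "debug failed run"
#   fi
```

The loop has no upper bound on the number of polls, which matches "poll repeatedly until SUCCESS/FAILED"; add a timeout if an unbounded wait is not acceptable.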
Key Facts
- Benchmark-specific info learned during launching/analyzing evals should be added to `references/benchmarks/`.
- PPP = Slurm account (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `coreai_dlalgo_compeval` → `coreai_dlalgo_llm`).
- Slurm job pairs: NEL (nemo-evaluator-launcher) submits paired Slurm jobs: a RUNNING job plus a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs; they are expected and necessary.
- HF cache requirement: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. Before running a model on a new cluster, always ask the user if the model is already cached there. If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface hf download <model>`. Without this, vLLM will fail with `LocalEntryNotFoundError`.
- `data_parallel_size` is per node: `dp_size=1` with `num_nodes=8` means 8 model instances total (one per node), load-balanced by haproxy. Do NOT interpret `dp_size` as the global replica count.
- `payload_modifier` interceptor: The `params_to_remove` list (e.g. `[max_tokens, max_completion_tokens]`) strips those fields from the outgoing payload, intentionally lifting output length limits so reasoning models can think as long as they need.
- Auto-export git workaround: The export container (`python:3.12-slim`) lacks `git`. When installing the launcher from a git URL, set `auto_export.launcher_install_cmd` to install git first (e.g., `apt-get update -qq && apt-get install -qq -y git && pip install "nemo-evaluator-launcher[all] @ git+...#subdirectory=packages/nemo-evaluator-launcher"`).
- Do NOT use `nemo-evaluator-launcher export --dest local`: it only writes a summary JSON (`processed_results.json`), it does NOT copy actual logs or artifacts despite accepting `--copy_logs` and `--copy-artifacts` flags. `nel info --copy-artifacts` works but copies everything (very slow for large benchmarks). Preferred approach: use `nel info` to discover paths. If local, read directly; if remote, SSH to explore and rsync only what you need. Note that `nel info` prints standard artifacts but benchmarks produce additional artifacts in subdirs, so explore to find them.
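The per-node reading of `data_parallel_size` in Key Facts comes down to one multiplication. The sketch below assumes the total instance count is `dp_size * num_nodes`, which is consistent with the stated example (1 per node across 8 nodes gives 8). `total_model_instances` is a hypothetical helper name, and the second call is an illustrative configuration that does not come from this document.

```shell
# Total model instances when data_parallel_size is interpreted per node.
# Assumption: total = dp_size * num_nodes (matches the dp_size=1, num_nodes=8 example).
# NOTE: total_model_instances is a hypothetical helper for illustration.
total_model_instances() {
  local dp_size="$1" num_nodes="$2"
  echo $(( dp_size * num_nodes ))
}

total_model_instances 1 8   # prints 8
total_model_instances 2 8   # prints 16 (hypothetical configuration)
```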
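For the HF cache requirement in Key Facts, a quick check on the cluster can supplement asking the user. The sketch below assumes the default Hugging Face hub cache layout, where a model `<org>/<name>` is stored under `$HF_HOME/hub/models--<org>--<name>/snapshots/`. Both function names are hypothetical, and the check only confirms that some snapshot directory exists, not that the download completed.

```shell
# Print the hub cache directory for a model ID such as <org>/<name>.
# Assumes the default layout: <HF_HOME>/hub/models--<org>--<name>
# NOTE: hypothetical helpers, not part of huggingface_hub or the launcher.
hf_model_cache_dir() {
  local hf_home="$1" model_id="$2"
  # Slashes in the model ID become "--" in the cache directory name.
  printf '%s/hub/models--%s\n' "$hf_home" "$(printf '%s' "$model_id" | sed 's|/|--|g')"
}

# Succeed if the model has at least one entry under snapshots/.
hf_model_is_cached() {
  local dir
  dir="$(hf_model_cache_dir "$1" "$2")"
  [ -d "$dir/snapshots" ] && [ -n "$(ls -A "$dir/snapshots" 2>/dev/null)" ]
}

# Usage (MODEL_ID is a placeholder):
#   hf_model_is_cached "$HF_HOME" "$MODEL_ID" || echo "model not cached; download it first"
```

A partially downloaded snapshot would still pass this check, so treat a positive result as a hint and confirm with the user when in doubt.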