launching-evals


NeMo Evaluator Skill

Quick Reference

nemo-evaluator-launcher CLI


Run evaluation

uv run nemo-evaluator-launcher run --config <path.yaml>
uv run nemo-evaluator-launcher run --config <path.yaml> -t <a_single_task_to_be_run_by_name>
uv run nemo-evaluator-launcher run --config <path.yaml> -t <task_name_1> -t <task_name_2> ...
uv run nemo-evaluator-launcher run --config <path.yaml> -o evaluation.nemo_evaluator_config.config.params.limit_samples=10 ...

Preview the resolved config and the sbatch script without running the evaluation

uv run nemo-evaluator-launcher run --config <path.yaml> --dry-run

Check status (--json for machine-readable output)

uv run nemo-evaluator-launcher status <invocation_id> --json

Get evaluation run info (output paths, slurm job IDs, cluster hostname, etc.)

uv run nemo-evaluator-launcher info <invocation_id>

Copy just the logs (quick, good for debugging)

uv run nemo-evaluator-launcher info <invocation_id> --copy-logs ./evaluation-results/

For artifacts: use nel info to discover paths. If remote, SSH to explore and rsync what you need.

If local, just read directly from the paths shown by nel info.

ssh <user>@<hostname> "ls <artifacts_path>/"

rsync -avzP <user>@<hostname>:<artifacts_path>/{results.yml,eval_factory_metrics.json,config.yml} ./evaluation-results/<invocation_id>.<job_index>/artifacts/


Resume a failed/interrupted run (re-sbatches existing run.sub in the original run directory)

uv run nemo-evaluator-launcher resume <invocation_id>

List past runs

uv run nemo-evaluator-launcher ls runs --since 1d

List available evaluation tasks (by default, only shows tasks from the latest released containers)

uv run nemo-evaluator-launcher ls tasks
uv run nemo-evaluator-launcher ls tasks --from_container nvcr.io/nvidia/eval-factory/simple-evals:26.03

Workflow

The complete evaluation workflow is divided into the following steps you should follow IN ORDER.
  1. Create or modify a config using the nel-assistant skill. If the user provides a past run, use its config.yml artifact as a starting point.
  2. Run the evaluation. See references/run-evaluation.md when executing this step.
  3. Monitor progress (MANDATORY after every nel run): poll status repeatedly until SUCCESS/FAILED. See references/check-progress.md.
  4. Post-run actions (when terminal state reached):
    1. When the evaluation status is SUCCESS, analyze the results. See references/analyze-results.md when executing this step.
    2. When the evaluation status is FAILED, debug the failed run. See references/debug-failed-runs.md when executing this step.
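
The polling in step 3 could be wrapped in a small helper. The following is a minimal sketch, not the launcher's own mechanism: the function name poll_until_terminal is hypothetical, and it assumes the status output contains the state names as plain words (SUCCESS, FAILED, RUNNING, PENDING). The actual --json schema is not described here and may differ, so check references/check-progress.md before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: poll a status command until the run reaches a terminal state.
# ASSUMPTION: the status output contains the state names as plain words
# (SUCCESS, FAILED, RUNNING, PENDING); the real --json schema may differ.
poll_until_terminal() {
  local interval="$1"; shift      # seconds to wait between polls
  local out
  while true; do
    out="$("$@" 2>&1)" || true    # keep polling even if a single call errors
    # Any failed job makes the whole run FAILED.
    if grep -q 'FAILED' <<<"$out"; then
      echo "FAILED"
      return 1
    fi
    # Report SUCCESS only once no job is still queued or running.
    if grep -q 'SUCCESS' <<<"$out" && ! grep -Eq 'RUNNING|PENDING' <<<"$out"; then
      echo "SUCCESS"
      return 0
    fi
    sleep "$interval"
  done
}

# Usage (substitute a real invocation id):
# poll_until_terminal 60 uv run nemo-evaluator-launcher status <invocation_id> --json
```

The status command is passed in as arguments so the same loop works with nel or the full uv run form, and the exit code (0 for SUCCESS, 1 for FAILED) can drive the choice between the two post-run actions in step 4.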

Key Facts

  • Benchmark-specific info learned during launching/analyzing evals should be added to references/benchmarks/.
  • PPP = Slurm account (the account field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., coreai_dlalgo_compeval, coreai_dlalgo_llm).
  • Slurm job pairs: NEL (nemo-evaluator-launcher) submits paired Slurm jobs: a RUNNING job plus a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs; they are expected and necessary.
  • HF cache requirement: For configs with HF_HUB_OFFLINE=1, models must be pre-downloaded to the HF cache on each cluster before launching. Before running a model on a new cluster, always ask the user if the model is already cached there. If not, on the cluster login node run
    python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub
    then
    HF_HOME=/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface hf download <model>
    Without this, vLLM will fail with LocalEntryNotFoundError.
  • data_parallel_size is per node: dp_size=1 with num_nodes=8 means 8 model instances total (one per node), load-balanced by haproxy. Do NOT interpret dp_size as the global replica count.
  • payload_modifier interceptor: The params_to_remove list (e.g. [max_tokens, max_completion_tokens]) strips those fields from the outgoing payload, intentionally lifting output length limits so reasoning models can think as long as they need.
  • Auto-export git workaround: The export container (python:3.12-slim) lacks git. When installing the launcher from a git URL, set auto_export.launcher_install_cmd to install git first, e.g.
    apt-get update -qq && apt-get install -qq -y git && pip install "nemo-evaluator-launcher[all] @ git+...#subdirectory=packages/nemo-evaluator-launcher"
  • Do NOT use nemo-evaluator-launcher export --dest local: it only writes a summary JSON (processed_results.json) and does NOT copy actual logs or artifacts, despite accepting the --copy_logs and --copy-artifacts flags. nel info --copy-artifacts works but copies everything (very slow for large benchmarks). Preferred approach: use nel info to discover paths; if local, read directly; if remote, SSH to explore and rsync only what you need. Note that nel info prints standard artifacts, but benchmarks produce additional artifacts in subdirs, so explore to find them.
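
For the HF cache requirement above, a quick check on the login node can complement asking the user. This is a sketch under stated assumptions: the helper names hf_cache_path and is_model_cached are hypothetical, and it assumes the default Hugging Face hub layout, where a model org/name is stored under $HF_HOME/hub/models--org--name/snapshots/. If HF_HUB_CACHE is set to a different location on the cluster, the path will differ.

```shell
#!/usr/bin/env bash
# Sketch: check whether a model appears in the Hugging Face cache.
# ASSUMPTION: default hub layout, $HF_HOME/hub/models--<org>--<name>/snapshots/<rev>/.

# Print the cache directory a model id (org/name) maps to.
hf_cache_path() {
  local hf_home="$1" model="$2"
  # Each "/" in the model id becomes "--" in the directory name.
  printf '%s/hub/models--%s\n' "$hf_home" "${model//\//--}"
}

# Succeed if at least one snapshot directory exists for the model.
is_model_cached() {
  local dir
  dir="$(hf_cache_path "$1" "$2")"
  [ -d "$dir/snapshots" ] && [ -n "$(ls -A "$dir/snapshots" 2>/dev/null)" ]
}

# Usage (substitute the real cache root and model id):
# is_model_cached /lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface <model> \
#   && echo "cached" || echo "not cached: run hf download first"
```

A present snapshot directory does not guarantee that every file was downloaded, so treat a positive result as a hint and still confirm with the user when in doubt.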
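
The per-node meaning of data_parallel_size in the key facts above comes down to one multiplication. As a worked example (the function name total_model_instances is hypothetical and exists only for illustration):

```shell
#!/usr/bin/env bash
# Sketch: data_parallel_size counts replicas per node, so the global
# number of model instances is dp_size multiplied by num_nodes.
total_model_instances() {
  local dp_size="$1" num_nodes="$2"
  echo $(( dp_size * num_nodes ))
}

total_model_instances 1 8   # prints 8: one instance on each of 8 nodes
```

By the same arithmetic, dp_size=2 with num_nodes=4 also yields 8 instances, two per node.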