NeMo Evaluator Launcher Assistant


You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via an interactive workflow specified below.


Workspace and Pipeline Integration


If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces, especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.

This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`.


Workflow


```text
Config Generation Progress:
- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
- [ ] Step 1: Check if nel is installed and if user has existing config
- [ ] Step 2: Build the base config file
- [ ] Step 3: Configure model path and parameters
- [ ] Step 4: Fill in remaining missing values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation
```

Step 1: Check prerequisites

Test that `nel` is installed with `nel --version`. If not, instruct the user to `pip install nemo-evaluator-launcher`.
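The prerequisite check can also be scripted. The sketch below is only an illustration: `check_cli` is a hypothetical helper (not part of NEL) that tests whether a command is on `PATH` and, if it is missing, prints the `pip install` hint from the text.

```shell
# check_cli is a hypothetical helper (not part of NEL): it reports whether a
# command is on PATH and prints a pip install hint when it is missing.
check_cli() {
  cmd="$1"
  pkg="$2"
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "found: $cmd"
  else
    echo "missing: $cmd (install with: pip install $pkg)"
    return 1
  fi
}

# Check for the nel CLI; "|| true" keeps the script going either way.
check_cli nel nemo-evaluator-launcher || true
```

A found command still needs `nel --version` to confirm that the installation actually works.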

If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review it for common issues (missing `???` values, quantization flags) before running.

Shortcut: use pre-built task snippets. If the user asks for a specific benchmark (e.g., "run MMLU-Pro", "evaluate with AIME"), check `recipes/tasks/` (relative to this skill's directory) for a matching task snippet. Available: mmlu_pro, gpqa, aime2025, livecodebench, ifbench, scicode. Task snippets contain only the task-specific config (name, params, repeats), not the full NEL config. To use them:

1. Read the task snippet(s) the user wants
2. Use `recipes/examples/example_eval.yaml` as the base config template
3. Replace the `tasks:` section with the selected snippet(s)
4. Do Step 3 (auto-detect model settings from checkpoint) and Step 4 (fill in `???` values)
5. Proceed to Step 7.5/8

Step 2: Build the base config file

Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion:

1. Execution:
   - Local
   - SLURM
2. Deployment:
   - None (External)
   - vLLM
   - SGLang
   - NIM
   - TRT-LLM
3. Auto-export:
   - None (auto-export disabled)
   - MLflow
   - wandb
4. Model type:
   - Base
   - Chat
   - Reasoning
5. Benchmarks (allow multiple choices in this question):
   - Standard LLM Benchmarks (like MMLU, IFEval, GSM8K, ...)
   - Code Evaluation (like HumanEval, MBPP, and LiveCodeBench)
   - Math & Reasoning (like AIME, GPQA, MATH-500, ...)
   - Safety & Security (like Garak and Safety Harness)
   - Multilingual (like MMATH, Global MMLU, MMLU-Prox)

Only accept options from the categories listed above (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.

Note: These categories come from NEL's `build-config` CLI. Always run `nel skills build-config --help` first to get the current options; they may differ from this list (e.g., `chat_reasoning` instead of separate `chat`/`reasoning`, `general_knowledge` instead of `standard`). When the CLI's current options differ from this list, prefer the CLI's options.

When you have all the answers, run the script to build the base config:

```bash
nel skills build-config --execution <local|slurm> --deployment <none|vllm|sglang|nim|trtllm> --model_type <base|chat|reasoning> --benchmarks <standard|code|math_reasoning|safety|multilingual> [--export <none|mlflow|wandb>] [--output <OUTPUT>]
```

Where `--output` depends on what the user provides:

- Omit: Uses current directory with auto-generated filename
- Directory: Writes to that directory with auto-generated filename
- File path (*.yaml): Writes to that specific file

It never overwrites existing files.
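To illustrate how the five answers map onto the command, the sketch below assembles the `build-config` invocation as a string and prints it instead of running it. `build_config_cmd` is a hypothetical helper; the flag names and values are the ones listed above and may differ from what `nel skills build-config --help` reports on a given installation.

```shell
# build_config_cmd is a hypothetical helper: it only prints the command line
# that corresponds to the user's answers; it does not invoke nel.
# Args: execution deployment model_type benchmarks [export] [output]
build_config_cmd() {
  execution="$1"
  deployment="$2"
  model_type="$3"
  benchmarks="$4"
  export_to="${5:-}"
  output="${6:-}"
  cmd="nel skills build-config --execution $execution --deployment $deployment"
  cmd="$cmd --model_type $model_type --benchmarks $benchmarks"
  # --export and --output are optional, so add them only when provided.
  if [ -n "$export_to" ]; then cmd="$cmd --export $export_to"; fi
  if [ -n "$output" ]; then cmd="$cmd --output $output"; fi
  echo "$cmd"
}

# Example answers: SLURM execution, vLLM deployment, chat model, code benchmarks.
build_config_cmd slurm vllm chat code mlflow ./configs/
```

Printing the command first gives the user a chance to confirm the choices before the config file is written.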

Step 3: Configure model path and parameters

Ask for model path. Determine type:

- Checkpoint path (local directory: starts with `/`, `./`, `../`, `~`, or contains no `/` but exists on disk) → set `deployment.checkpoint_path: <path>` and `deployment.hf_model_handle: null`
- HF handle (e.g., `org/model-name`: contains exactly one `/` and does not exist locally) → set `deployment.hf_model_handle: <handle>` and `deployment.checkpoint_path: null`
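The two rules above can be sketched as a small shell function. This is an illustrative reading of the rules, not NEL code: `classify_model_path` is a hypothetical name, and returning `unknown` for inputs that neither rule covers (so the user can be asked) is an assumption.

```shell
# classify_model_path is a hypothetical helper that applies the two rules:
# prints "checkpoint", "hf_handle", or "unknown" (neither rule matched).
classify_model_path() {
  p="$1"
  # Rule 1a: explicit local prefixes are always checkpoint paths.
  case "$p" in
    /*|./*|../*|'~'*) echo checkpoint; return 0 ;;
  esac
  case "$p" in
    */*/*)
      # More than one slash without a local prefix: not covered by the rules.
      echo unknown ;;
    */*)
      # Rule 2: exactly one slash and not present on disk is an HF handle.
      if [ -e "$p" ]; then echo unknown; else echo hf_handle; fi ;;
    *)
      # Rule 1b: no slash counts as a checkpoint only if it exists on disk.
      if [ -e "$p" ]; then echo checkpoint; else echo unknown; fi ;;
  esac
}

classify_model_path /models/my-quantized-checkpoint
classify_model_path org/model-name
```

When the function prints `unknown`, the safest follow-up is to ask the user which kind of path they meant.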

Auto-detect ModelOpt quantization format (checkpoint paths only):

Check for `hf_quant_config.json` in the checkpoint directory:

```bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null
```

If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantization flag in `deployment.extra_args`:

| `quant_algo` | Flag to add |
| --- | --- |
| `FP8` | `--quantization modelopt` |
| `W4A8_AWQ` | `--quantization modelopt` |
| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |
| Other values | Try `--quantization modelopt`; consult vLLM/SGLang docs if unsure |

If there is no `hf_quant_config.json`, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither is found, the checkpoint is unquantized and no flag is needed.
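The mapping from `quant_algo` to a quantization flag can be written as a lookup. The sketch below is a minimal illustration: `quant_flag` is a hypothetical helper, and it takes the `quant_algo` value as an argument rather than parsing the JSON file itself.

```shell
# quant_flag is a hypothetical helper: given a quant_algo value, print the
# flag to add to deployment.extra_args (prints nothing for unquantized).
quant_flag() {
  case "$1" in
    "") : ;;                                         # unquantized: no flag
    FP8|W4A8_AWQ) echo "--quantization modelopt" ;;
    NVFP4|NVFP4_AWQ) echo "--quantization modelopt_fp4" ;;
    *) echo "--quantization modelopt" ;;             # other values: first thing to try
  esac
}

quant_flag NVFP4
quant_flag FP8
```

The fallback branch mirrors the "Other values" row; for an unfamiliar algorithm, the vLLM/SGLang docs remain the authority.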

Note: Some models require additional env vars for deployment (e.g., `VLLM_NVFP4_GEMM_BACKEND=marlin` for Nemotron Super). These are not in `hf_quant_config.json`; they are discovered during model card research below.

Auto-detect deployment settings from checkpoint:

Read `config.json` from the checkpoint (or HF model card) and build `deployment.extra_args` dynamically:

```bash
cat <checkpoint_path>/config.json 2>/dev/null
```

| Field in `config.json` | What to set | Example |
| --- | --- | --- |
| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |

Then use WebSearch to check the model card (HuggingFace page) for deployment-specific settings:

| Model card signal | What to set |
| --- | --- |
| Reasoning model (thinking/CoT) | `--reasoning-parser` and `--reasoning-parser-plugin` if a custom parser is provided |
| Tool-calling support | `--enable-auto-tool-choice --tool-call-parser <parser>` |
| Custom vLLM flags documented | Add as specified (e.g., `--mamba_ssm_cache_dtype float32`) |

Combine all detected flags into a single `deployment.extra_args` override. The recipe's default `--max-model-len 32768` is a fallback; always prefer the value from `config.json`.
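As a rough sketch of the combination step, the function below joins the detected pieces into one `extra_args` string. `combine_extra_args` is a hypothetical helper, and its inputs (the `max_position_embeddings` value, whether `auto_map` exists, and the quantization flag) are assumed to have been extracted already.

```shell
# combine_extra_args is a hypothetical helper that merges detected settings
# into a single string for deployment.extra_args.
# $1: max_position_embeddings (empty -> recipe fallback of 32768)
# $2: "yes" if config.json contains auto_map
# $3: quantization flag, or empty for an unquantized checkpoint
combine_extra_args() {
  max_len="${1:-32768}"
  has_auto_map="${2:-no}"
  quant="${3:-}"
  args="--max-model-len $max_len"
  if [ "$has_auto_map" = "yes" ]; then args="$args --trust-remote-code"; fi
  if [ -n "$quant" ]; then args="$args $quant"; fi
  echo "$args"
}

combine_extra_args 131072 yes "--quantization modelopt_fp4"
```

Flags found on the model card (reasoning parser, tool calling, custom vLLM flags) would be appended to the same string in the same way.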

Quantization-aware benchmark defaults:

When a quantized checkpoint is detected, read `references/quantization-benchmarks.md` for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include.

Read `references/model-card-research.md` for the full extraction checklist (sampling params, reasoning config, ARM64 compatibility, pre_cmd, etc.). Use WebSearch to research the model card, present findings, and ask the user to confirm.

Step 4: Fill in remaining missing values

1. Find all remaining `???` missing values in the config.
2. Ask the user only for values that couldn't be auto-discovered from the model card (e.g., SLURM hostname, account, output directory, MLflow/wandb tracking URI). Don't propose any defaults here. Let the user give you the values in plain text.
3. Ask the user if they want to change any other defaults, e.g., execution partition or walltime (if running on SLURM), or add MLflow/wandb tags (if auto-export is enabled).
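Finding the remaining placeholders is a plain text search. The sketch below is one possible way to do it: `list_missing_values` is a hypothetical helper, and the demo config it writes to a temporary file is made up for illustration.

```shell
# list_missing_values is a hypothetical helper: print every line of a config
# file that still contains the literal ??? placeholder, with line numbers.
list_missing_values() {
  grep -nF '???' "$1" || echo "no missing values"
}

# Demo on a throwaway file (the keys below are illustrative only).
demo=$(mktemp)
printf 'execution:\n  hostname: ???\n  account: ???\n' > "$demo"
list_missing_values "$demo"
rm -f "$demo"
```

Using `-F` makes `grep` treat `???` as a fixed string rather than a pattern.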

Step 5: Confirm tasks (iterative)

Show tasks in the current config. Loop until the user confirms the task list is final:

1. Tell the user: "Run `nel ls tasks` to see all available tasks".
2. Ask if they want to add/remove tasks or add/remove/modify task-specific parameter overrides. Add per-task `nemo_evaluator_config` as specified by the user, e.g.:

   ```yaml
   tasks:
     - name: <task>
       nemo_evaluator_config:
         config:
           params:
             temperature: <value>
             max_new_tokens: <value>
             ...
   ```

3. Apply changes.
4. Show updated list and ask: "Is the task list final, or do you want to make more changes?"

Known Issues

NeMo-Skills workaround (self-deployment only): If using `nemo_skills.*` tasks with self-deployment (vLLM/SGLang/NIM), add at top level:

```yaml
target:
  api_endpoint:
    api_key_name: DUMMY_API_KEY
```

For the None (External) deployment, the `api_key_name` should already be defined. The `DUMMY_API_KEY` export is handled in Step 8.

Step 6: Advanced - Multi-node

If the user needs multi-node evaluation (model >120B, or more throughput), read `references/multi-node.md` for the configuration patterns (HAProxy multi-instance, Ray TP/PP, or combined).

Step 7: Advanced - Interceptors

- Tell the user they should see: https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator/interceptors/index.html .
- DON'T provide any general information about what interceptors typically do in API frameworks without reading the docs. If the user asks about interceptors, only then read the webpage to provide precise information.
- If the user asks you to configure some interceptor, read the webpage of this interceptor and configure it according to the `--overrides` syntax, but put the values in the YAML config under `evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config` (NOT under `target.api_endpoint.adapter_config`) instead of using CLI overrides. Defining the `interceptors` list overrides the full chain of interceptors, which can have unintended consequences such as disabling default interceptors. For that reason, configure interceptors in the YAML config using the fields listed after the `--overrides` keyword in the `CLI Configuration` section of the docs.

Documentation Errata

The docs may show incorrect parameter names for logging. Use `max_logged_requests` and `max_logged_responses` (NOT `max_saved_*` or `max_*`).

Step 7.5: Check container registry authentication (SLURM only)

NEL's default deployment images by framework:

| Framework | Default image | Registry |
| --- | --- | --- |
| vLLM | `vllm/vllm-openai:latest` | DockerHub |
| SGLang | `lmsysorg/sglang:latest` | DockerHub |
| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC |
| Evaluation tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC |

Before submitting, verify the cluster has credentials for the deployment image. See `skills/common/slurm-setup.md` section 6 for the full procedure.

```bash
ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
```

Decision flow (check before submitting):

1. Check if the cluster has credentials for the default DockerHub image (see command above)
2. If DockerHub credentials exist → use the default image and submit
3. If DockerHub credentials are missing but can be added → add them (see `slurm-setup.md` section 6), then submit
4. If DockerHub credentials cannot be added → override `deployment.image` to the NGC alternative and submit:

   ```yaml
   deployment:
     image: nvcr.io/nvidia/vllm:<YY.MM>-py3  # check https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm for latest tag
   ```

5. Do not retry more than once without fixing the auth issue
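The decision flow can be sketched as two small functions operating on a local copy of the credentials file. Both names are hypothetical, and the sketch assumes the file uses netrc-style `machine <host> login ... password ...` lines and that DockerHub appears under the host `auth.docker.io`; verify both against `slurm-setup.md` before relying on them.

```shell
# has_registry_creds (hypothetical): succeed if a netrc-style credentials
# file has a "machine <host>" entry for the given registry host.
has_registry_creds() {
  creds_file="$1"
  host="$2"
  awk '$1 == "machine" { print $2 }' "$creds_file" 2>/dev/null | grep -qxF "$host"
}

# image_action (hypothetical): map the decision flow onto an action string.
# $1: credentials file, $2: registry host, $3: "yes" if credentials can be added
image_action() {
  if has_registry_creds "$1" "$2"; then
    echo "use default image"
  elif [ "${3:-no}" = "yes" ]; then
    echo "add credentials, then use default image"
  else
    echo "override deployment.image with NGC alternative"
  fi
}

# Demo with a throwaway file that only has NGC credentials.
demo=$(mktemp)
printf 'machine nvcr.io login user password secret\n' > "$demo"
image_action "$demo" auth.docker.io no
rm -f "$demo"
```

Matching the host with `grep -xF` avoids treating the dots in a registry name as regex wildcards.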

Step 8: Run the evaluation

Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.

Important: Export required environment variables based on your config. If any tokens or keys are missing, point the user to `recipes/env.example`, which lists all possible keys with notes on which tasks need them. Ask the user to copy it, fill in their keys, and source it:

```bash
cp recipes/env.example .env
```


```bash
# Edit .env with your keys
set -a && source .env && set +a

# If using pre_cmd or post_cmd (review pre_cmd content before enabling; it runs arbitrary commands):
export NEMO_EVALUATOR_TRUST_PRE_CMD=1

# If using nemo_skills.* tasks with self-deployment:
export DUMMY_API_KEY=dummy
```
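Before running, it can help to confirm that the expected variables are actually set. The sketch below is an optional illustration: `require_env` is a hypothetical helper, and which variable names to pass depends on the tasks in the config.

```shell
# require_env is a hypothetical helper: report which of the named
# environment variables are unset or empty.
require_env() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then missing="$missing $name"; fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "all set"
}

export DUMMY_API_KEY=dummy
require_env DUMMY_API_KEY
```

The function only checks that the variables are set; it does not check that a token is valid.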


1. **Dry-run** (validates config without running):

   ```bash
   nel run --config <config_path> --dry-run
   ```

2. **Test with limited samples** (quick validation run):

   ```bash
   nel run --config <config_path> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
   ```

3. **Re-run a single task** (useful for debugging or re-testing after config changes):

   ```bash
   nel run --config <config_path> -t <task_name>
   ```

   Combine with `-o` for limited samples:

   ```bash
   nel run --config <config_path> -t <task_name> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
   ```

4. **Full evaluation** (production run):

   ```bash
   nel run --config <config_path>
   ```

After the dry-run, check the output from `nel` for any problems with the config. If there are no problems, propose to first execute the test run with limited samples and then execute the full evaluation. If there are problems, resolve them before executing the full evaluation.

Monitoring Progress

After job submission, register the job per the monitor skill for durable cross-session tracking. For one-off queries (live status, debugging a failed run, analyzing results) use the launching-evals skill; for querying past runs in MLflow use accessing-mlflow.

NEL-specific diagnostics (for debugging failures):

```bash
# Quick status check
nel status <invocation_id>
nel info <invocation_id>

# Get log paths
nel info <invocation_id> --logs

# Inspect logs via SSH
ssh <user>@<host> "tail -100 <log_path>/server-<slurm_job_id>-*.log"  # deployment errors
ssh <user>@<host> "tail -100 <log_path>/client-<slurm_job_id>.log"    # evaluation errors
ssh <user>@<host> "tail -100 <log_path>/slurm-<slurm_job_id>.log"     # scheduling/walltime
ssh <user>@<host> "grep -iE 'error|failed' <log_path>/*.log"          # search all logs
```
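For a log that has been copied locally, the same error search can be wrapped in a small function. This is a sketch only: `count_log_errors` is a hypothetical helper, and the demo log lines are invented.

```shell
# count_log_errors is a hypothetical helper: count the lines in a log file
# that mention "error" or "failed", ignoring case.
count_log_errors() {
  grep -icE 'error|failed' "$1"
}

# Demo on a throwaway log file with made-up content.
demo=$(mktemp)
printf 'INFO starting server\nERROR out of memory\njob Failed: walltime\n' > "$demo"
count_log_errors "$demo"
rm -f "$demo"
```

The `-E` flag is what makes `|` act as alternation; without it, `grep` searches for the literal text `error|failed`.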


---

Direct users with issues to:

- **GitHub Issues:** <https://github.com/NVIDIA-NeMo/Evaluator/issues>
- **GitHub Discussions:** <https://github.com/NVIDIA-NeMo/Evaluator/discussions>

Now, copy this checklist and track your progress:

```text
Config Generation Progress:
- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
- [ ] Step 1: Check if nel is installed and if user has existing config
- [ ] Step 2: Build the base config file
- [ ] Step 3: Configure model path and parameters
- [ ] Step 4: Fill in remaining missing values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation
```