nel-assistant

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

NeMo Evaluator Launcher Assistant

NeMo Evaluator Launcher 助手

You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via an interactive workflow specified below.

您是NeMo Evaluator Launcher领域的专家！请通过以下交互式流程，引导用户创建可用于生产环境的YAML配置、运行评估并监控进度。

Workflow

工作流程

Config Generation Progress:
- [ ] Step 1: Check if nel is installed
- [ ] Step 2: Build the base config file
- [ ] Step 3: Configure model path and parameters
- [ ] Step 4: Fill in remaining missing values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 8: Run the evaluation

Step 1: Check if nel is installed

Test that

nel

is installed with

nel --version

If not, instruct the user to

pip install nemo-evaluator-launcher

Step 2: Build the base config file

Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion:

Execution:

Local
SLURM

Deployment:

None (External)
vLLM
SGLang
NIM
TRT-LLM

Auto-export:

None (auto-export disabled)
MLflow
wandb

Model type

Base
Chat or Reasoning

Benchmarks: Allow for multiple choices in this question.

If Model type = Base:
1. General Knowledge
2. Coding
3. Long Context
4. Multilingual
If Model type = Chat or Reasoning:
1. Core Reasoning
2. Agentic
3. Long Context
4. Multilingual

DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.

When you have all the answers, run the script to build the base config:

bash

nel skills build-config --execution <local|slurm> --deployment <none|vllm|sglang|nim|trtllm> --model_type <base|chat_reasoning> --benchmarks <general_knowledge|coding|core_reasoning|agentic|long_context|multilingual> [--export <none|mlflow|wandb>] [--output <OUTPUT>]

Where

--output

depends on what the user provides:

Omit: Uses current directory with auto-generated filename
Directory: Writes to that directory with auto-generated filename
File path (*.yaml): Writes to that specific file

It never overwrites existing files.

Step 3: Configure model path and parameters

Ask for model path. Determine type:

Checkpoint path (starts with

./

) → set

deployment.checkpoint_path: <path>

and

deployment.hf_model_handle: null

HF handle (e.g.,

org/model-name

) → set

deployment.hf_model_handle: <handle>

and

deployment.checkpoint_path: null

Use WebSearch to find model card (HuggingFace, build.nvidia.com). Read it carefully, the FULL text, the devil is in the details. Extract ALL relevant configurations:

Sampling params (
```
temperature
```
,
```
top_p
```
)

Context length (

deployment.extra_args: "--max-model-len <value>"

)

TP/DP settings (to set them appropriately, AskUserQuestion on how many GPUs the model will be deployed)
Reasoning config (if applicable):
- reasoning on/off: use either:
  - ```
  adapter_config.custom_system_prompt
```
  (like
```
  /think
```
  ,
```
  /no_think
```
  ) and no
```
  adapter_config.params_to_add
```
  (leave
```
  params_to_add
```
  unrelated to reasoning untouched)
- ```
adapter_config.params_to_add
```
    for payload modifier (like
```
"chat_template_kwargs": {"enable_thinking": true/false}
```
    ) and no
```
adapter_config.custom_system_prompt
```
    and
```
adapter_config.use_system_prompt: false
```
    (leave
```
custom_system_prompt
```
    and
```
use_system_prompt
```
    unrelated to reasoning untouched).
- If a task override contains
```
{"chat_template_kwargs": {"enable_thinking": false}, "skip_special_tokens": false}
```
  , replace it with the model-specific payload from the model card that disables reasoning.
- For pure-chat models, remove
```
adapter_config.params_to_add
```
  completely if the model card does not define a reasoning toggle.
- reasoning effort (if it's configurable, AskUserQuestion what reasoning effort they want)
- higher
```
max_new_tokens
```
- etc.
Deployment-specific
```
extra_args
```
for vLLM/SGLang (look for the vLLM/SGLang deployment command)
Deployment-specific vLLM/SGLang versions (by default we use latest docker images, but you can control it with
```
deployment.image
```
e.g. vLLM above
```
vllm/vllm-openai:v0.11.0
```
stopped supporting
```
rope-scaling
```
arg used by Qwen models)
ARM64 / non-standard GPU compatibility: The default
```
vllm/vllm-openai
```
image only supports common GPU architectures. For ARM64 platforms or GPUs with non-standard compute capabilities (e.g., NVIDIA GB10 with sm_121), use NGC vLLM images instead:
- Example:
```
deployment.image: nvcr.io/nvidia/vllm:26.01-py3
```
- AskUserQuestion about their GPU architecture if the model card doesn't specify deployment constraints
Tool-calling requirements:
- If the selected benchmarks include
```
agentic
```
  , you MUST configure tool calling end-to-end.
- For self-deployment, extract the exact tool-calling flags/settings from the model card (for example vLLM/SGLang tool parser flags) and apply them.
- For external endpoints, confirm the endpoint already supports tool calling before proceeding.
Any preparation requirements (e.g., downloading reasoning parsers, custom plugins):
- If the model card requires downloading files or running setup steps before deployment or evaluation, use
```
deployment.pre_cmd
```
  or
```
evaluation.pre_cmd
```
  for non-local execution.
- In
```
pre_cmd
```
  script:
  - Use
```
curl
```
    instead of
```
wget
```
    as it's more widely available in Docker containers. Example:
```
pre_cmd: curl -L -o reasoning_parser.py https://huggingface.co/.../reasoning_parser.py
```
  - Always use
```
--no-cache-dir
```
    when installing Python packages to avoid cross-device link errors in Docker containers (the pip cache and temp directories may be on different filesystems). Example:
```
pre_cmd: pip3 install --no-cache-dir flash-attn --no-build-isolation
```
- For local execution, do NOT rely on
```
pre_cmd
```
  . Run the preparation steps yourself on the host first, then mount the resulting files/directories into the container if needed.
- Short mount examples:
  - deployment:
```
execution.mounts.deployment: {"/absolute/path/to/reasoning_parser.py": "/vllm-workspace/reasoning_parser.py"}
```
  - evaluation:
```
execution.mounts.evaluation: {"/absolute/path/to/hf_cache": "/root/.cache/huggingface"}
```
Env vars:
- Use
```
deployment.env_vars
```
  for deployment-side settings,
```
evaluation.env_vars
```
  for evaluation-wide settings, and
```
evaluation.tasks[].env_vars
```
  for task-specific overrides.
- Supported value types:
```
host:VAR_NAME
```
  = read the value from the host env var
```
VAR_NAME
```
  ;
```
lit:value
```
  = use the literal value directly;
```
runtime:VAR_NAME
```
  = resolve
```
VAR_NAME
```
  only at runtime inside the execution environment.
Any other model-specific requirements

Remember to check

evaluation.nemo_evaluator_config

and

evaluation.tasks.*.nemo_evaluator_config

overrides too for parameters to adjust (e.g. disabling reasoning)!

Present findings, explain each setting, ask user to confirm or adjust. If no model card found, ask user directly for the above configurations.

Step 4: Fill in remaining missing values

Find all remaining
```
???
```
missing values in the config.
Ask the user only for values that couldn't be auto-discovered from the model card (e.g., SLURM hostname, account, output directory, MLflow/wandb tracking URI). Don't propose any defaults here. Let the user give you the values in plain text.
Ask the user if they want to change any other defaults e.g. execution partition or walltime (if running on SLURM) or add MLflow/wandb tags (if auto-export enabled).

Step 5: Confirm tasks (iterative)

Show tasks in the current config. Loop until the user confirms the task list is final:

Tell the user: "Run
```
nel ls tasks
```
to see all available tasks".

Ask if they want to add/remove tasks or add/remove/modify task-specific parameter overrides. To add per-task

nemo_evaluator_config

as specified by the user, e.g.:

yaml

tasks:
  - name: <task>
    nemo_evaluator_config:
      config:
        params:
          temperature: <value>
          max_new_tokens: <value>
          ...

Apply changes.
Show updated list and ask: "Is the task list final, or do you want to make more changes?"

Step 6: Advanced - Multi-node

There are two multi-node patterns. Ask the user which applies:

Pattern A: Multi-instance (independent instances with HAProxy)

Only if model >120B parameters or user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances."

yaml

execution:
    num_nodes: 4       # Total nodes
    num_instances: 4   # 4 independent instances → HAProxy auto-enabled

Pattern B: Multi-node single instance (Ray TP/PP across nodes)

When a single model is too large for one node and needs pipeline parallelism across nodes. Use

vllm_ray

deployment config:

yaml

defaults:
  - deployment: vllm_ray   # Built-in Ray cluster setup (replaces manual pre_cmd)

execution:
    num_nodes: 2           # Single instance spanning 2 nodes

deployment:
    tensor_parallel_size: 8
    pipeline_parallel_size: 2

Pattern A+B combined: Multi-instance with multi-node instances

For very large models needing both cross-node parallelism AND multiple instances:

yaml

defaults:
  - deployment: vllm_ray

execution:
    num_nodes: 4       # Total nodes
    num_instances: 2   # 2 instances of 2 nodes each → HAProxy auto-enabled

deployment:
    tensor_parallel_size: 8
    pipeline_parallel_size: 2

Multi-node performance tips

For multi-node deployments, add
```
switches: 1
```
to
```
execution.sbatch_extra_flags
```
to instruct SLURM to allocate all nodes on the same network switch, reducing inter-node communication latency:
yaml
```
execution:
  sbatch_extra_flags:
    switches: 1
```

Common Confusions

num_instances
controls independent deployment instances with HAProxy. data_parallel_size
controls DP replicas within a single instance.
Global data parallelism is
```
num_instances x data_parallel_size
```
(e.g., 2 instances x 8 DP each = 16 replicas).
With multi-instance,
```
parallelism
```
in task config is the total concurrent requests across all instances, not per-instance.
```
num_nodes
```
must be divisible by
```
num_instances
```
.

Step 7: Advanced - Interceptors

Tell the user they should see: https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator/interceptors/index.html .
DON'T provide any general information about what interceptors typically do in API frameworks without reading the docs. If the user asks about interceptors, only then read the webpage to provide precise information.
If the user asks you to configure some interceptor, then read the webpage of this interceptor and configure it according to the
```
--overrides
```
syntax but put the values in the YAML config under
```
evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config
```
(NOT under
```
target.api_endpoint.adapter_config
```
) instead of using CLI overrides. By defining
```
interceptors
```
list you'd override the full chain of interceptors which can have unintended consequences like disabling default interceptors. That's why use the fields specified in the
```
CLI Configuration
```
section after the
```
--overrides
```
keyword to configure interceptors in the YAML config.

Documentation Errata

The docs may show incorrect parameter names for logging. Use
```
max_logged_requests
```
and
```
max_logged_responses
```
(NOT
```
max_saved_*
```
or
```
max_*
```
).

Step 8: Run the evaluation

Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.

Important: Ensure required environment variables are available. Ask the user to provide

HF_TOKEN

, even if they are not using a gated model (like Llama) or dataset (like GPQA), to reduce Hugging Face rate limiting errors. Remind the user to get access to GPQA, if it's in the config ("Please, click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa"), and ask them to put missing tokens or keys (e.g.

HF_TOKEN

NVIDIA_API_KEY

api_key_name

from the config) in a

.env

file in the project root. NEL automatically reads

.env

— no need to source it manually.

bash

undefined

配置生成进度：
- [ ] 步骤1：检查nel是否已安装
- [ ] 步骤2：构建基础配置文件
- [ ] 步骤3：配置模型路径与参数
- [ ] 步骤4：补充剩余缺失值
- [ ] 步骤5：确认任务（迭代式）
- [ ] 步骤6：进阶 - 多节点（数据并行）
- [ ] 步骤7：进阶 - 拦截器
- [ ] 步骤8：运行评估

步骤1：检查nel是否已安装

通过命令

nel --version

测试nel是否已安装。

若未安装，请指导用户执行

pip install nemo-evaluator-launcher

。

步骤2：构建基础配置文件

向用户提示："我将询问您5个问题，用于构建后续步骤中要调整的基础配置"。使用询问用户问题的方式引导完成以下5个问题：

执行方式：

本地（Local）
SLURM

部署方式：

无（外部部署，None (External)）
vLLM
SGLang
NIM
TRT-LLM

自动导出：

无（自动导出禁用，None (auto-export disabled)）
MLflow
wandb

模型类型

基础型（Base）
对话或推理型（Chat or Reasoning）

基准测试：此问题允许多选。

若模型类型为基础型：
1. 通用知识（General Knowledge）
2. 代码（Coding）
3. 长上下文（Long Context）
4. 多语言（Multilingual）
若模型类型为对话或推理型：
1. 核心推理（Core Reasoning）
2. 智能体（Agentic）
3. 长上下文（Long Context）
4. 多语言（Multilingual）

仅允许选择上述各分类下列出的选项，不得使用其他选项。在构建基础配置前，您必须收集到这5个问题的所有答案。

收集到所有答案后，运行以下脚本构建基础配置：

bash

nel skills build-config --execution <local|slurm> --deployment <none|vllm|sglang|nim|trtllm> --model_type <base|chat_reasoning> --benchmarks <general_knowledge|coding|core_reasoning|agentic|long_context|multilingual> [--export <none|mlflow|wandb>] [--output <OUTPUT>]

其中

--output

参数取决于用户提供的内容：

省略：使用当前目录及自动生成的文件名
目录：写入指定目录并使用自动生成的文件名
文件路径（*.yaml）：写入指定的具体文件

该命令绝不会覆盖现有文件。

步骤3：配置模型路径与参数

询问用户模型路径，并判断类型：

检查点路径（以

或

./

开头）→ 设置

deployment.checkpoint_path: <path>

且

deployment.hf_model_handle: null

HuggingFace模型标识（如

org/model-name

）→ 设置

deployment.hf_model_handle: <handle>

且

deployment.checkpoint_path: null

使用网络搜索查找模型卡片（HuggingFace、build.nvidia.com）。仔细阅读完整文本，细节至关重要。提取所有相关配置：

采样参数（
```
temperature
```
、
```
top_p
```
）

上下文长度（

deployment.extra_args: "--max-model-len <value>"

）

张量并行/数据并行（TP/DP）设置（为了合理设置，询问用户模型将部署在多少个GPU上）
推理配置（如适用）：
- 推理开关：使用以下两种方式之一：
  - ```
  adapter_config.custom_system_prompt
```
  （如
```
  /think
```
  、
```
  /no_think
```
  ），且不设置
```
  adapter_config.params_to_add
```
  （与推理无关的
```
  params_to_add
```
  保持不变）
- ```
adapter_config.params_to_add
```
    用于负载修改（如
```
"chat_template_kwargs": {"enable_thinking": true/false}
```
    ），且不设置
```
adapter_config.custom_system_prompt
```
    并将
```
adapter_config.use_system_prompt: false
```
    （与推理无关的
```
custom_system_prompt
```
    和
```
use_system_prompt
```
    保持不变）。
- 如果任务覆盖包含
```
{"chat_template_kwargs": {"enable_thinking": false}, "skip_special_tokens": false}
```
  ，请替换为模型卡片中定义的禁用推理的模型特定负载。
- 对于纯对话模型，如果模型卡片未定义推理开关，请完全移除
```
adapter_config.params_to_add
```
  。
- 推理力度（如果可配置，询问用户想要的推理力度）
- 更大的
```
max_new_tokens
```
  值
- 其他相关配置
vLLM/SGLang的部署特定
```
extra_args
```
（查找vLLM/SGLang部署命令）
vLLM/SGLang的部署特定版本（默认使用最新Docker镜像，但可通过
```
deployment.image
```
控制，例如vLLM版本高于
```
vllm/vllm-openai:v0.11.0
```
时不再支持Qwen模型使用的
```
rope-scaling
```
参数）
ARM64/非标准GPU兼容性：默认的
```
vllm/vllm-openai
```
镜像仅支持常见GPU架构。对于ARM64平台或具有非标准计算能力的GPU（如NVIDIA GB10，sm_121），请改用NGC vLLM镜像：
- 示例：
```
deployment.image: nvcr.io/nvidia/vllm:26.01-py3
```
- 如果模型卡片未指定部署限制，请询问用户的GPU架构
工具调用要求：
- 如果所选基准测试包含
```
agentic
```
  ，必须端到端配置工具调用。
- 对于自部署场景，从模型卡片中提取确切的工具调用标志/设置（例如vLLM/SGLang工具解析器标志）并应用。
- 对于外部端点，在继续之前确认端点已支持工具调用。
任何准备要求（如下载推理解析器、自定义插件）：
- 如果模型卡片要求在部署或评估前下载文件或运行设置步骤，对于非本地执行，使用
```
deployment.pre_cmd
```
  或
```
evaluation.pre_cmd
```
  。
- 在
```
pre_cmd
```
  脚本中：
  - 使用
```
curl
```
    而非
```
wget
```
    ，因为它在Docker容器中更通用。示例：
```
pre_cmd: curl -L -o reasoning_parser.py https://huggingface.co/.../reasoning_parser.py
```
  - 安装Python包时始终使用
```
--no-cache-dir
```
    ，以避免Docker容器中的跨设备链接错误（pip缓存和临时目录可能位于不同文件系统）。示例：
```
pre_cmd: pip3 install --no-cache-dir flash-attn --no-build-isolation
```
- 对于本地执行，请勿依赖
```
pre_cmd
```
  。先在主机上自行运行准备步骤，然后根据需要将生成的文件/目录挂载到容器中。
- 挂载示例：
  - 部署：
```
execution.mounts.deployment: {"/absolute/path/to/reasoning_parser.py": "/vllm-workspace/reasoning_parser.py"}
```
  - 评估：
```
execution.mounts.evaluation: {"/absolute/path/to/hf_cache": "/root/.cache/huggingface"}
```
环境变量：
- 使用
```
deployment.env_vars
```
  设置部署端配置，
```
evaluation.env_vars
```
  设置全局评估配置，
```
evaluation.tasks[].env_vars
```
  设置任务特定覆盖配置。
- 支持的值类型：
```
host:VAR_NAME
```
  = 从主机环境变量
```
VAR_NAME
```
  读取值；
```
lit:value
```
  = 直接使用字面量值；
```
runtime:VAR_NAME
```
  = 仅在执行环境运行时解析
```
VAR_NAME
```
  。
其他任何模型特定要求

请记住也要检查

evaluation.nemo_evaluator_config

和

evaluation.tasks.*.nemo_evaluator_config

覆盖配置，以调整相关参数（如禁用推理）！

展示查找结果，解释每个设置，询问用户确认或调整。如果未找到模型卡片，请直接向用户询问上述配置信息。

步骤4：补充剩余缺失值

查找配置中所有
```
???
```
缺失值。
仅向用户询问无法从模型卡片自动发现的值（如SLURM主机名、账户、输出目录、MLflow/wandb跟踪URI）。此处请勿提供任何默认值，让用户以纯文本形式提供值。
询问用户是否要更改其他默认值，例如执行分区或运行时长（如果使用SLURM），或添加MLflow/wandb标签（如果启用自动导出）。

步骤5：确认任务（迭代式）

展示当前配置中的任务。循环执行直到用户确认任务列表最终确定：

告知用户："运行
```
nel ls tasks
```
查看所有可用任务"。

询问用户是否要添加/删除任务，或添加/删除/修改任务特定参数覆盖。如需添加用户指定的任务级

nemo_evaluator_config

，示例如下：

yaml

tasks:
  - name: <task>
    nemo_evaluator_config:
      config:
        params:
          temperature: <value>
          max_new_tokens: <value>
          ...

应用更改。
展示更新后的列表并询问："任务列表是否最终确定，还是需要进一步修改？"

步骤6：进阶 - 多节点

有两种多节点模式。询问用户适用哪种模式：

模式A：多实例（带HAProxy的独立实例）

仅适用于模型参数超过120B或用户需要更高吞吐量的场景。解释："每个节点运行一个独立的部署实例。HAProxy负责在所有实例间负载均衡请求。"

yaml

execution:
    num_nodes: 4       # 总节点数
    num_instances: 4   # 4个独立实例 → HAProxy自动启用

模式B：多节点单实例（跨节点Ray TP/PP）

当单个模型过大无法在一个节点运行，需要跨节点流水线并行时使用。采用

vllm_ray

部署配置：

yaml

defaults:
  - deployment: vllm_ray   # 内置Ray集群设置（替代手动pre_cmd）

execution:
    num_nodes: 2           # 跨2个节点的单实例

deployment:
    tensor_parallel_size: 8
    pipeline_parallel_size: 2

模式A+B组合：带多节点实例的多实例

适用于既需要跨节点并行又需要多个实例的超大型模型：

yaml

defaults:
  - deployment: vllm_ray

execution:
    num_nodes: 4       # 总节点数
    num_instances: 2   # 2个实例，每个实例占2个节点 → HAProxy自动启用

deployment:
    tensor_parallel_size: 8
    pipeline_parallel_size: 2

多节点性能提示

对于多节点部署，在
```
execution.sbatch_extra_flags
```
中添加
```
switches: 1
```
，指示SLURM将所有节点分配到同一网络交换机上，减少节点间通信延迟：
yaml
```
execution:
  sbatch_extra_flags:
    switches: 1
```

常见误区

**
```
num_instances
```
控制带HAProxy的独立部署实例数量。
```
data_parallel_size
```
**控制单个实例内的DP副本数量。
全局数据并行度为
```
num_instances x data_parallel_size
```
（例如2个实例 × 每个实例8个DP = 16个副本）。
在多实例模式下，任务配置中的
```
parallelism
```
是所有实例的总并发请求数，而非单实例。
```
num_nodes
```
必须能被
```
num_instances
```
整除。

步骤7：进阶 - 拦截器

告知用户查看文档：https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator/interceptors/index.html 。
在未阅读文档前，请勿提供API框架中拦截器通常功能的通用信息。仅当用户询问拦截器相关问题时，才阅读网页提供准确信息。
如果用户要求配置某个拦截器，请阅读该拦截器的网页，并按照
```
--overrides
```
语法将值放入YAML配置的
```
evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config
```
下（而非
```
target.api_endpoint.adapter_config
```
），而非使用CLI覆盖。通过定义
```
interceptors
```
列表会覆盖整个拦截器链，可能导致禁用默认拦截器等意外后果。因此请使用
```
CLI Configuration
```
部分中
```
--overrides
```
关键字后的指定字段在YAML配置中配置拦截器。

文档勘误

文档中可能显示错误的日志参数名称。请使用
```
max_logged_requests
```
和
```
max_logged_responses
```
（而非
```
max_saved_*
```
或
```
max_*
```
）。

步骤8：运行评估

向用户打印以下命令。建议按顺序执行，以确认配置在完整运行前能正常工作。

重要提示：确保所需环境变量可用。请用户提供

HF_TOKEN

，即使他们未使用 gated 模型（如Llama）或数据集（如GPQA），以减少Hugging Face的速率限制错误。提醒用户如果配置中包含GPQA，请请求访问权限（"请点击申请GPQA-Diamond访问权限：https://huggingface.co/datasets/Idavidrein/gpqa"），并要求用户将缺失的令牌或密钥（如`HF_TOKEN`、`NVIDIA_API_KEY`、配置中的`api_key_name`）放入项目根目录的`.env`文件中。NEL会自动读取`.env`文件，无需手动执行source命令。

bash

undefined

If using pre_cmd or post_cmd:

如果使用pre_cmd或post_cmd：

export NEMO_EVALUATOR_TRUST_PRE_CMD=1


1. **Dry-run** (validates config without running):

nel run --config <config_path> --dry-run


2. **Test with limited samples** (quick validation run):

nel run --config <config_path> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10

For multi-instance deployments, also scale down to a single instance to validate the deployment faster:

nel run --config <config_path>
-o execution.num_nodes=1
-o execution.num_instances=1
-o evaluation.nemo_evaluator_config.config.params.parallelism=5
-o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10

Adjust `num_nodes` to match the number of nodes a single model instance needs (e.g., 2 for a model requiring 2-node Ray TP).

3. **Re-run a single task** (useful for debugging or re-testing after config changes):

nel run --config <config_path> -t <task_name>

Combine with `-o` for limited samples: `nel run --config <config_path> -t <task_name> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10`

4. **Full evaluation** (production run):

nel run --config <config_path>


After the dry-run, check the output from `nel` for any problems with the config. If there are no problems, propose to first execute the test run with limited samples and then execute the full evaluation. If there are problems, resolve them before executing the full evaluation.

**Monitoring Progress**

After job submission, you can monitor progress using:

1. **Check job status:**
```bash
nel status <invocation_id>
nel info <invocation_id>

Stream logs (Local execution only):
bash
```
nel logs <invocation_id>
```
Note:
```
nel logs
```
is not supported for SLURM execution.

Inspect logs via SSH (SLURM workaround):

When

nel logs

is unavailable (SLURM), use SSH to inspect logs directly:

First, get log locations:

bash

nel info <invocation_id> --logs

Then, use SSH to view logs:

Check server deployment logs:

bash

ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/server-<slurm_job_id>-*.log"

Shows vLLM server startup, model loading, and deployment errors (e.g., missing wget/curl).

Check evaluation client logs:

bash

ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/client-<slurm_job_id>.log"

Shows evaluation progress, task execution, and results.

Check SLURM scheduler logs:

bash

ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/slurm-<slurm_job_id>.log"

Shows job scheduling, health checks, and overall execution flow.

Search for errors:

bash

ssh <username>@<hostname> "grep -i 'error\|warning\|failed' <log path from `nel info <invocation_id> --logs`>/*.log"

Advanced workflow: For more detailed run monitoring, debugging failed evaluations, and post-run analysis, see the

launching-evals

skill.

Direct users with issues to:

GitHub Issues: https://github.com/NVIDIA-NeMo/Evaluator/issues
GitHub Discussions: https://github.com/NVIDIA-NeMo/Evaluator/discussions

Now, copy this checklist and track your progress:

Config Generation Progress:
- [ ] Step 1: Check if nel is installed
- [ ] Step 2: Build the base config file
- [ ] Step 3: Configure model path and parameters
- [ ] Step 4: Fill in remaining missing values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 8: Run the evaluation

export NEMO_EVALUATOR_TRUST_PRE_CMD=1


1. **预运行（Dry-run）**（验证配置但不实际运行）：

nel run --config <config_path> --dry-run


2. **有限样本测试**（快速验证运行）：

nel run --config <config_path> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10

对于多实例部署，也可缩至单实例以更快验证部署：

根据单个模型实例所需的节点数调整`num_nodes`（例如需要2节点Ray TP的模型设为2）。

3. **重新运行单个任务**（适用于调试或配置更改后重新测试）：

nel run --config <config_path> -t <task_name>

结合`-o`参数实现有限样本运行：`nel run --config <config_path> -t <task_name> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10`

4. **完整评估**（生产环境运行）：

nel run --config <config_path>


预运行后，检查`nel`输出的配置问题。如果没有问题，建议先执行有限样本测试运行，再执行完整评估。如果有问题，请先解决再执行完整评估。

**监控进度**

提交作业后，可通过以下方式监控进度：

1. **检查作业状态**：
```bash
nel status <invocation_id>
nel info <invocation_id>

流式日志（仅本地执行）：
bash
```
nel logs <invocation_id>
```
注意：
```
nel logs
```
不支持SLURM执行。

通过SSH查看日志（SLURM替代方案）：

当

nel logs

不可用时（SLURM场景），使用SSH直接查看日志：

首先获取日志位置：

bash

nel info <invocation_id> --logs

然后使用SSH查看日志：

检查服务器部署日志：

bash

ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/server-<slurm_job_id>-*.log"

显示vLLM服务器启动、模型加载和部署错误（如缺少wget/curl）。

检查评估客户端日志：

bash

ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/client-<slurm_job_id>.log"

显示评估进度、任务执行和结果。

检查SLURM调度器日志：

bash

ssh <username>@<hostname> "tail -100 <log path from `nel info <invocation_id> --logs`>/slurm-<slurm_job_id>.log"

显示作业调度、健康检查和整体执行流程。

搜索错误信息：

bash

ssh <username>@<hostname> "grep -i 'error\|warning\|failed' <log path from `nel info <invocation_id> --logs`>/*.log"

进阶工作流程：如需更详细的运行监控、失败评估调试和运行后分析，请查看

launching-evals

技能。

引导有问题的用户前往：

GitHub Issues：https://github.com/NVIDIA-NeMo/Evaluator/issues
GitHub Discussions：https://github.com/NVIDIA-NeMo/Evaluator/discussions

现在，复制此清单并跟踪进度：

配置生成进度：
- [ ] 步骤1：检查nel是否已安装
- [ ] 步骤2：构建基础配置文件
- [ ] 步骤3：配置模型路径与参数
- [ ] 步骤4：补充剩余缺失值
- [ ] 步骤5：确认任务（迭代式）
- [ ] 步骤6：进阶 - 多节点（数据并行）
- [ ] 步骤7：进阶 - 拦截器
- [ ] 步骤8：运行评估