nemotron-customize

Purpose

Use this skill to turn a model-customization request into a repo-native Nemotron step pipeline. It plans the step DAG, validates artifact wiring, and creates only the YAML configs needed to run existing steps.

Use it only for inspecting, configuring, validating, running, or submitting existing Nemotron steps or multi-step training/customization pipelines. If the request is a frontend, dashboard, visualization, generic ML-advice, billing/access, or unrelated coding task, stop with a short scope note and do not inspect the step catalog or edit files in that turn.

Security Notes

This skill may use `Write` to create or modify YAML/README files and `Bash` to run repository commands. Confirm with the user before file writes or shell execution. Keep Bash usage scoped to repo-safe commands such as `uv run nemotron steps ...`, `python -m pytest ...`, `git status/diff`, and targeted validation commands. Never run environment dumps (`env`, `printenv`, broad `export`) or commands that expose secret values.
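The Bash scoping rule above can be pictured as an allowlist check. The following is a minimal sketch, not part of the skill: the function name `is_repo_safe`, the metacharacter filter, and the prefix-matching strategy are all assumptions made for illustration. Only the command names come from the text, and "targeted validation commands" are left out because the text does not enumerate them.

```python
import shlex

# Prefixes copied from the security note; the matching logic is an assumption.
ALLOWED_PREFIXES = (
    ("uv", "run", "nemotron", "steps"),
    ("python", "-m", "pytest"),
    ("git", "status"),
    ("git", "diff"),
)
# Environment dumps named in the security note.
BLOCKED_COMMANDS = {"env", "printenv", "export"}
# Reject chaining, piping, redirection, and substitution outright.
SHELL_METACHARS = set(";|&$`<>")


def is_repo_safe(command: str) -> bool:
    """Return True only for a single command that starts with an allowed prefix."""
    if any(ch in SHELL_METACHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return False
    if not tokens or tokens[0] in BLOCKED_COMMANDS:
        return False
    return any(tuple(tokens[: len(prefix)]) == prefix for prefix in ALLOWED_PREFIXES)
```

A check like this would run before asking the user to confirm execution; it is deliberately conservative and blocks every `export`, not only broad ones.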

Requirements

Checkout of this Nemotron repo with `src/nemotron/steps/` present.
Invoke from the repo root. All paths in this document are repo-root-relative.
User-provided model, data, hardware, backend, and output constraints before writing configs.
Backend credentials only when the selected step needs them (translation, W&B, hosted endpoints).

Limitations

Does not invent new catalog steps when an existing one fits.
New Python/shell code only in Explorer mode after the gap is explicit.
Post-training deployment-only requests are out of scope.

Invocation: `/nemotron-customize`. The repo under `src/nemotron/steps/` is the source of truth; this skill orchestrates and does not duplicate per-step knowledge.

Priority order: (1) reuse existing repo code, CLIs, recipes, steps, runners, and configs; (2) add YAML configs for the user's request; (3) generate new Python/shell only when the repo cannot satisfy the request, and name the gap first.

For a command request: verify repo root, read the step catalog, read the selected `step.toml`, verify the requested config exists, read the active env TOML for any remote profile, then emit the complete command. Do not guess `--batch` profiles from examples or naming conventions.

Quick Decision Tree

AutoModel vs Megatron-Bridge: small GPU count, Hugging Face model, LoRA/PEFT, or OpenAI-style chat JSONL → AutoModel path (`sft/automodel` or the matching PEFT AutoModel step). Large distributed training, packed Parquet/binidx data, or full fine-tuning → Megatron-Bridge, but verify against `hardware.md` and the category README first.
BYOB / MCQ benchmark inputs route to `byob/mcq`, NOT `translate/nemo_curator`. BYOB preserves the multiple-choice schema (question, choices, answer); the translate path would flatten or strip those fields. Trigger on phrases like "BYOB benchmark", "MCQ", "evaluation benchmark Parquet", "multiple-choice prep".
Curate then translate: when the user says "curate and translate", "filter then translate", or "prep data before translating", chain `curate/nemo_curator` (filter raw JSONL) → `translate/nemo_curator` (translate curated JSONL). Do not skip the curate stage.
Checkpoint conversion: route "Megatron to HF", "HF export", "convert checkpoint", or "iter_* to safetensors" to `convert/megatron_to_hf`; route "HF to Megatron" imports to `convert/hf_to_megatron`. Use a concrete `iter_*` source for Megatron exports.
Existing endpoint or checkpoint eval: route hosted endpoint smoke tests and benchmark requests to `eval/model_eval`; use `tiny_chat` for hosted chat smoke and `default` for Megatron checkpoint evaluation.
No env TOML profile present: do not invent Lepton or `--batch` profiles; ask the user or fall back to local execution.
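Part of the decision tree can be sketched as a phrase-to-step lookup. This is an illustration only: the step IDs and trigger phrases come from the tree above, while the `ROUTES` table, the `route` function, and the lowercase substring matching are assumptions. The AutoModel vs Megatron-Bridge choice and the eval route are left out because they depend on hardware and data format rather than on phrasing.

```python
# Trigger phrases and step IDs are taken from the decision tree above.
# BYOB/MCQ is listed first so benchmark inputs never fall into the translate path.
ROUTES = [
    (("byob", "mcq", "multiple-choice", "evaluation benchmark parquet"),
     ["byob/mcq"]),
    (("curate and translate", "filter then translate", "prep data before translating"),
     ["curate/nemo_curator", "translate/nemo_curator"]),
    (("megatron to hf", "hf export", "convert checkpoint", "to safetensors"),
     ["convert/megatron_to_hf"]),
    (("hf to megatron",),
     ["convert/hf_to_megatron"]),
]


def route(request: str) -> list[str]:
    """Return the ordered step chain for a request, or [] when nothing matches."""
    text = request.lower()
    for phrases, steps in ROUTES:
        if any(phrase in text for phrase in phrases):
            return steps
    return []  # no match: ask the user instead of guessing
```

An empty result corresponds to the "Ambiguous → Ask" outcome rather than a default step.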

Required inputs before finalizing configs or commands:

`model`, `input_path`, `output_dir`, hardware/GPU count, backend/env profile, and any needed API key environment variable name such as `HF_TOKEN` or an evaluator key.
For translation commands, also collect `server.url`, target/source languages, and the runtime-visible input/output paths.
For BYOB, collect benchmark/source document path, stage (`prepare`, `generate`, `translate`, or `all`), target/source languages when translating, and output directory.
For conversion, collect source checkpoint path, output path, model/config source, and whether the source is HF, Megatron `iter_*`, or LoRA adapter.
For eval, collect endpoint URL/model ID or checkpoint path, task IDs, endpoint type, API-key environment variable name, and sample limit.
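One way to enforce the list above is to compute which inputs are still missing before any config is written. The sketch below is hypothetical: `model`, `input_path`, `output_dir`, and `server.url` are names from the text, whereas `gpu_count`, `env_profile`, `source_lang`, `target_lang`, and the eval keys are placeholder names invented for this example.

```python
# Base inputs required for every request (key names partly hypothetical).
BASE_INPUTS = ("model", "input_path", "output_dir", "gpu_count", "env_profile")

# Extra inputs per request kind; only two kinds are shown for brevity.
EXTRA_INPUTS = {
    "translate": ("server.url", "source_lang", "target_lang"),
    "eval": ("task_ids", "endpoint_type", "api_key_env", "sample_limit"),
}


def missing_inputs(kind: str, provided: dict) -> list[str]:
    """List required keys that are absent or empty, in declaration order."""
    required = BASE_INPUTS + EXTRA_INPUTS.get(kind, ())
    return [key for key in required if not provided.get(key)]
```

A non-empty result means the next reply should ask for those values instead of emitting a config or command.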

Response shape for recommendations: `Decision`, `Why`, `Required inputs`, `Config/command`, `Avoid`, and `Next step`. Always call out the stack to avoid when the user's constraints make it a poor fit.

How information is split (and where to find it)

Question	Look here
What does step X consume / produce / parameterize?	`src/nemotron/steps/<cat>/<X>/step.toml`
When/why pick step X over its siblings?	`src/nemotron/steps/<cat>/<X>/README.md`
Which step in category C should I pick?	`src/nemotron/steps/<cat>/README.md`
What runner code does step X use?	`src/nemotron/steps/<cat>/<X>/step.py` → `src/nemotron/steps/_runners/`
Cross-step constraint (tokenizer lock, sequence packing, data quality, ...)	`src/nemotron/steps/patterns/<id>.md`
Artifact compatibility / `is_a` hierarchy	`src/nemotron/steps/types.toml`
GPU memory / parallelism heuristics	`src/nemotron/steps/hardware.md`
Library API extracts for exceptional code generation	`references/context/index.toml` → `references/context/<pack>.txt`
Project scaffold rules, only when repo code cannot support the request	`references/act/PROJECT.md`
Per-stage code rules, only when repo code cannot support the request	`references/act/STAGE.md`

If two sources cover the same point, the deeper, more specific one wins (`step.toml` > category `README.md` > this file).

Instructions

Pipeline workflow (≥2 stages): Orient → Plan → Act → Verify. Discover candidate steps, propose a DAG with validated artifact wiring, wait for approval, create the minimal YAML configs, and re-check before reporting done. This is not general ML advice: `src/nemotron/steps/` is the source of truth.

Single-step command flow:

Confirm the repo root has `pyproject.toml` and `src/nemotron/steps/`.
Run `uv run nemotron steps list --json` when available; otherwise read `src/nemotron/steps/STEPS.md`.
Read the selected step's `step.toml` and the requested checked-in config.
For remote execution, read `NEMOTRON_ENV_FILE` or a repo-root `env*.toml` and pick an actual section whose profile matches the step.
Emit the full command in one reply; then add brief rationale for the config/profile choices. For translation, also read `src/nemotron/steps/translate/README.md` and return `Decision`, `Config`, `Run`, `Output`, `Env`.
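The first step of the flow above, confirming the repo root, could look roughly like the following. It is a sketch under the assumption that a plain existence check is enough; the helper name `preflight` is hypothetical and not part of the repo.

```python
from pathlib import Path


def preflight(repo_root: Path) -> list[str]:
    """Return the required repo-root-relative paths that do not exist."""
    required = ("pyproject.toml", "src/nemotron/steps")
    return [rel for rel in required if not (repo_root / rel).exists()]
```

An empty list means the flow can continue; otherwise the missing paths should be named and the flow stopped.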

Source tiers for command answers: Verified (CLI + manifest + config + env + dry-run all succeeded), Repo-grounded (manifest/config/env read, no dry-run), Blocked (a required repo file or env TOML is missing; name it and stop before guessing).
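The three tiers reduce to a small decision. The sketch below assumes two boolean inputs, which is a simplification of the conditions listed above; the function name `source_tier` is hypothetical.

```python
def source_tier(files_read: bool, dry_run_ok: bool) -> str:
    """Map what was actually checked onto the tier label for a command answer."""
    if not files_read:
        # A required repo file or env TOML is missing: name it and stop.
        return "Blocked"
    return "Verified" if dry_run_ok else "Repo-grounded"
```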

Canonical commands:

```bash
uv run nemotron steps run <step_id> -c <config-or-path> --dry-run
uv run nemotron steps run <step_id> -c <config-or-path> --dry-run --batch <profile>
uv run nemotron steps run <step_id> -c <config-or-path> --batch <profile>
```
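The three canonical forms differ only in two optional flags, so they can be assembled from one helper. This is an illustrative sketch: `build_steps_command` is a hypothetical name, and it uses only the flags shown above (`-c`, `--dry-run`, `--batch`). The profile must come from the active env TOML and is never guessed.

```python
from typing import Optional


def build_steps_command(
    step_id: str,
    config: str,
    *,
    dry_run: bool = True,
    batch: Optional[str] = None,
) -> list[str]:
    """Assemble the argv list for one of the canonical command forms."""
    cmd = ["uv", "run", "nemotron", "steps", "run", step_id, "-c", config]
    if dry_run:
        cmd.append("--dry-run")
    if batch is not None:
        # Profile name as read from the env TOML.
        cmd += ["--batch", batch]
    return cmd
```

Defaulting `dry_run` to `True` means a real run has to be requested explicitly.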

Workflow

Four phases, in order: Orient → Plan → Act → Verify. Never skip Verify. For detailed phase checklists and Explorer-mode implementation rules, read `references/WORKFLOW.md`.

Operational Nuances

Smoke configs (`tiny.yaml`, `tiny_chat.yaml`) are wiring tests, not quality evidence.
`${art:...}` references belong in recipe-backed configs; standalone YAML uses plain paths.
Keep pretraining `bin/idx` data and `blend.json` from the same Nemotron release.

Examples

Single step: read manifest + config + env profile, then return a complete `uv run nemotron steps run <step_id> -c <config> --dry-run` command.
Translate (one-shot command): for "translate EN → <lang>" requests, collect `server.url`, `model`, source/target language, `api_key_env`, and runtime-visible input/output paths first, then emit the full command in one reply (do not split across turns):
```bash
uv run nemotron steps run translate/nemo_curator \
  -c <translate-config.yaml> \
  --batch <env-profile-from-env.toml>
```
Curate then translate: chain `curate/nemo_curator` → `translate/nemo_curator`. The curate stage produces filtered JSONL that becomes the translate stage input. Both steps need YAML overlays; wire curate's `output_dir` to translate's `input_glob`.
BYOB benchmark prep: route MCQ Parquet inputs through `byob/mcq`, not `translate/nemo_curator`, so the multiple-choice schema is preserved.
SFT pipeline: plan the DAG (`data_prep` → `sft/megatron_bridge` or `sft/automodel`), validate artifact edges via `types.toml`, then create the YAML overlays.
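Validating artifact edges, as in the SFT and curate-then-translate examples above, amounts to walking an `is_a` hierarchy. The sketch below uses a made-up hierarchy: the type names in `IS_A` are hypothetical, and in the repo the real data would come from `types.toml` and each `step.toml`, whose schemas are not shown in this document.

```python
# Hypothetical child -> parent mapping; real values would be read from types.toml.
IS_A = {"curated_jsonl": "jsonl", "jsonl": "dataset"}


def is_compatible(produced: str, consumed: str) -> bool:
    """True if the produced type is the consumed type or one of its descendants."""
    seen = set()
    current = produced
    while current is not None and current not in seen:
        if current == consumed:
            return True
        seen.add(current)  # guard against accidental cycles
        current = IS_A.get(current)
    return False


def invalid_edges(edges):
    """edges: (upstream_step, produced_type, downstream_step, consumed_type) tuples."""
    return [(up, down) for up, produced, down, consumed in edges
            if not is_compatible(produced, consumed)]
```

Any pair returned by `invalid_edges` is a reason to change the DAG before writing configs.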

Two modes

Catalog mode — a step exists

Fast path: `STEPS.md → category/README.md → step.toml → step.py → adapt YAML config`. Use whenever the user's request maps to a step in the catalog.

Explorer mode — no repo path supports it

Use only after confirming no existing step, runner, recipe, CLI, or YAML config surface can satisfy the request. Follow `references/WORKFLOW.md`.

Choosing a mode

User says	Mode
"SFT with Megatron-Bridge / AutoModel"	Catalog
"DPO / RLVR / GRPO / RLHF"	Catalog: `rl/nemo_rl/*`
"Synthesize preference / SFT data"	Catalog: `sdg/data_designer`
"Translate EN → <lang> for training data"	Catalog: `translate/nemo_curator`
"Curate and translate" / "filter then translate"	Catalog chain: `curate/nemo_curator` → `translate/nemo_curator`
"Curate web text"	Catalog: `curate/nemo_curator`
"BYOB benchmark" / "MCQ benchmark prep"	Catalog: `byob/mcq` (preserves MCQ schema)
"Train with X exotic backend"	Explorer or ask
Post-training deployment-only request	Out of scope; redirect to a more appropriate workflow.
Ambiguous	Ask

Boundaries

Do: build pipelines from existing steps and cite `step.toml` directly; reuse repo CLIs/runners/recipes first; adapt configs (don't copy `default.yaml` blindly); ask about hardware/data/backend/output path; surface tradeoffs (Megatron-Bridge vs AutoModel, full FT vs LoRA); present the plan and wait for approval.

Don't: invent steps; skip Plan for pipelines ≥2 stages; generate Python or shell when YAML suffices; import modules outside the step's reference code; add monitoring/W&B unless asked; tune parallelism beyond `hardware.md` and `[[strategies]]`; assume GPU count; generate Slurm/Airflow/Kubeflow wrappers; handle non-training requests in this skill; modify `src/nemotron/steps/`; restate per-step rules here (link the step's `README.md` instead).

Troubleshooting

Situation	Action
Artifact types do not chain	Recheck `types.toml`; change the DAG before writing configs.
Remote profile unclear / `--batch` ambiguous	Read the active env TOML; do not guess.
Config key unclear	Read the step config, `step.py`, and shared runner before editing.
Strategy points to a missing skill file	Skip the load; use the `then:` text and flag the plan with `WARNING: <topic> docs unavailable`.
Hardware too small	Show `[[models]]` `min_gpus`; suggest smaller model → AutoModel → LoRA.
Two failed Act attempts	Stop, explain what was tried and what failed, ask the user how to proceed.
No existing repo path matches	Check libraries cited in `step.toml [reference]`. If supported, use Explorer mode; otherwise ask.
