nemotron-customize

Purpose

Use this skill to turn a model-customization request into a repo-native Nemotron step pipeline. It plans the step DAG, validates artifact wiring, and creates only the YAML configs needed to run existing steps.
Use it only for inspecting, configuring, validating, running, or submitting existing Nemotron steps or multi-step training/customization pipelines. If the request is a frontend, dashboard, visualization, generic ML-advice, billing/access, or unrelated coding task, stop with a short scope note and do not inspect the step catalog or edit files in that turn.

Security Notes

This skill may use `Write` to create or modify YAML/README files and `Bash` to run repository commands. Confirm with the user before file writes or shell execution. Keep Bash usage scoped to repo-safe commands such as `uv run nemotron steps ...`, `python -m pytest ...`, `git status/diff`, and targeted validation commands. Never run environment dumps (`env`, `printenv`, broad `export`) or commands that expose secret values.
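The scoping rule above can be sketched as a deny-by-default pre-check. This is only an illustration under stated assumptions: the function name `is_repo_safe`, the prefix list, and the metacharacter filter are hypothetical and are not part of the repo.

```python
import shlex

# Command prefixes this section names as repo-safe (illustrative, not exhaustive).
ALLOWED_PREFIXES = [
    ["uv", "run", "nemotron", "steps"],
    ["python", "-m", "pytest"],
    ["git", "status"],
    ["git", "diff"],
]

# Characters that allow chaining or substitution, which could smuggle in
# a second command such as `env` after an allowed one.
SHELL_METACHARACTERS = set(";&|`$<>\n")


def is_repo_safe(command: str) -> bool:
    """Return True only if the command starts with an allowed prefix."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    tokens = shlex.split(command)
    if not tokens:
        return False
    # Anything not on the allowlist (env, printenv, a broad export, ...)
    # is rejected by default.
    return any(tokens[: len(prefix)] == prefix for prefix in ALLOWED_PREFIXES)
```

A check like this does not replace confirming with the user; it only narrows what is proposed in the first place.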

Requirements

  • Checkout of this Nemotron repo with `src/nemotron/steps/` present.
  • Invoke from the repo root. All paths in this document are repo-root-relative.
  • User-provided model, data, hardware, backend, and output constraints before writing configs.
  • Backend credentials only when the selected step needs them (translation, W&B, hosted endpoints).

Limitations

  • Does not invent new catalog steps when an existing one fits.
  • New Python/shell code only in Explorer mode after the gap is explicit.
  • Post-training deployment-only requests are out of scope.
Invocation: `/nemotron-customize`. The repo under `src/nemotron/steps/` is the source of truth; this skill orchestrates and does not duplicate per-step knowledge.
Priority order: (1) reuse existing repo code, CLIs, recipes, steps, runners, and configs; (2) add YAML configs for the user's request; (3) generate new Python/shell only when the repo cannot satisfy the request, and name the gap first.
For a command request: verify repo root, read the step catalog, read the selected `step.toml`, verify the requested config exists, read the active env TOML for any remote profile, then emit the complete command. Do not guess `--batch` profiles from examples or naming conventions.

Quick Decision Tree

  • AutoModel vs Megatron-Bridge: small GPU count, Hugging Face model, LoRA/PEFT, or OpenAI-style chat JSONL → AutoModel path (`sft/automodel` or the matching PEFT AutoModel step). Large distributed training, packed Parquet/binidx data, or full fine-tuning → Megatron-Bridge, but verify against `hardware.md` and the category README first.
  • BYOB / MCQ benchmark inputs route to `byob/mcq`, NOT `translate/nemo_curator`. BYOB preserves the multiple-choice schema (question, choices, answer); the translate path would flatten or strip those fields. Trigger on phrases like "BYOB benchmark", "MCQ", "evaluation benchmark Parquet", "multiple-choice prep".
  • Curate then translate: when the user says "curate and translate", "filter then translate", or "prep data before translating", chain `curate/nemo_curator` (filter raw JSONL) → `translate/nemo_curator` (translate curated JSONL). Do not skip the curate stage.
  • Checkpoint conversion: route "Megatron to HF", "HF export", "convert checkpoint", or "iter_* to safetensors" to `convert/megatron_to_hf`; route "HF to Megatron" imports to `convert/hf_to_megatron`. Use a concrete `iter_*` source for Megatron exports.
  • Existing endpoint or checkpoint eval: route hosted endpoint smoke tests and benchmark requests to `eval/model_eval`; use `tiny_chat` for hosted chat smoke and `default` for Megatron checkpoint evaluation.
  • No env TOML profile present: do not invent Lepton or `--batch` profiles; ask the user or fall back to local execution.
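Part of the decision tree above can be sketched as a phrase-to-step lookup. This is a rough illustration, not the skill's actual routing: the `route` function and the naive substring matching are assumptions, and only the trigger phrases and step ids come from the bullets above.

```python
from typing import List, Optional, Tuple

# (trigger phrases, step chain). Order matters: BYOB/MCQ is checked first so
# benchmark inputs never fall through to the translate path.
ROUTES: List[Tuple[Tuple[str, ...], List[str]]] = [
    (("byob", "mcq", "multiple-choice", "evaluation benchmark parquet"),
     ["byob/mcq"]),
    (("curate and translate", "filter then translate",
      "prep data before translating"),
     ["curate/nemo_curator", "translate/nemo_curator"]),
    (("megatron to hf", "hf export", "convert checkpoint", "to safetensors"),
     ["convert/megatron_to_hf"]),
    (("hf to megatron",), ["convert/hf_to_megatron"]),
]


def route(request: str) -> Optional[List[str]]:
    """Return the step chain for a request, or None when nothing matches."""
    text = request.lower()
    for phrases, steps in ROUTES:
        if any(phrase in text for phrase in phrases):
            return steps
    return None  # Nothing matched: ask the user rather than guess.
```

Returning `None` instead of a default keeps the "ask rather than invent" rule visible in the code.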
Required inputs before finalizing configs or commands:
  • `model`, `input_path`, `output_dir`, hardware/GPU count, backend/env profile, and any needed API key environment variable name such as `HF_TOKEN` or an evaluator key.
  • For translation commands, also collect `server.url`, target/source languages, and the runtime-visible input/output paths.
  • For BYOB, collect benchmark/source document path, stage (`prepare`, `generate`, `translate`, or `all`), target/source languages when translating, and output directory.
  • For conversion, collect source checkpoint path, output path, model/config source, and whether the source is HF, Megatron `iter_*`, or LoRA adapter.
  • For eval, collect endpoint URL/model ID or checkpoint path, task IDs, endpoint type, API-key environment variable name, and sample limit.
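The input checklist above could be enforced with a small helper before any config is written. This is a simplified sketch: the key names (`gpu_count`, `env_profile`, `source_lang`, and so on) are placeholders chosen for illustration, and the real keys come from each step's `step.toml`.

```python
# Inputs every request needs, condensed from the first bullet above.
COMMON = ["model", "input_path", "output_dir", "gpu_count", "env_profile"]

# Extra inputs per request kind (hypothetical key names).
EXTRA = {
    "translate": ["server.url", "source_lang", "target_lang"],
    "byob": ["stage"],
    "convert": ["source_checkpoint", "source_format"],
    "eval": ["task_ids", "endpoint_type", "api_key_env", "sample_limit"],
}


def missing_inputs(kind: str, provided: dict) -> list:
    """List required inputs that are absent or empty, in checklist order."""
    required = COMMON + EXTRA.get(kind, [])
    return [key for key in required if provided.get(key) in (None, "")]
```

An empty result means the request is ready for config writing; anything else is a list of questions to ask the user.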
Response shape for recommendations: `Decision`, `Why`, `Required inputs`, `Config/command`, `Avoid`, and `Next step`. Always call out the stack to avoid when the user's constraints make it a poor fit.

How information is split (and where to find it)

| Question | Look here |
| --- | --- |
| What does step X consume / produce / parameterize? | `src/nemotron/steps/<cat>/<X>/step.toml` |
| When/why pick step X over its siblings? | `src/nemotron/steps/<cat>/<X>/README.md` |
| Which step in category C should I pick? | `src/nemotron/steps/<cat>/README.md` |
| What runner code does step X use? | `src/nemotron/steps/<cat>/<X>/step.py`, `src/nemotron/steps/_runners/` |
| Cross-step constraint (tokenizer lock, sequence packing, data quality, ...) | `src/nemotron/steps/patterns/<id>.md` |
| Artifact compatibility / `is_a` hierarchy | `src/nemotron/steps/types.toml` |
| GPU memory / parallelism heuristics | `src/nemotron/steps/hardware.md` |
| Library API extracts for exceptional code generation | `references/context/index.toml`, `references/context/<pack>.txt` |
| Project scaffold rules, only when repo code cannot support the request | `references/act/PROJECT.md` |
| Per-stage code rules, only when repo code cannot support the request | `references/act/STAGE.md` |
If two sources address the same point, the deeper, more specific one wins (`step.toml` > category `README.md` > this file).
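The precedence rule can be sketched as an ordered lookup. The helper names `precedence` and `pick_answer` are hypothetical; only the path layout and the ordering are taken from this section.

```python
from pathlib import PurePosixPath

STEPS_ROOT = PurePosixPath("src/nemotron/steps")


def precedence(category: str, step: str) -> list:
    """Sources for one step, most specific first."""
    return [
        STEPS_ROOT / category / step / "step.toml",
        STEPS_ROOT / category / "README.md",
    ]


def pick_answer(category: str, step: str, answers: dict, fallback: str) -> str:
    """Return what the most specific source says.

    `answers` maps a source path (as a string) to its statement;
    `fallback` stands for what this file says.
    """
    for path in precedence(category, step):
        if str(path) in answers:
            return answers[str(path)]
    return fallback
```

For example, `precedence("sft", "automodel")[0]` is `src/nemotron/steps/sft/automodel/step.toml`, so a statement found there overrides the category README and this file.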

Instructions

Pipeline workflow (≥2 stages): Orient → Plan → Act → Verify. Discover candidate steps, propose a DAG with validated artifact wiring, wait for approval, create the minimal YAML configs, and re-check before reporting done. This is not general ML advice; `src/nemotron/steps/` is the source of truth.
Single-step command flow:
  1. Confirm the repo root has `pyproject.toml` and `src/nemotron/steps/`.
  2. Run `uv run nemotron steps list --json` when available; otherwise read `src/nemotron/steps/STEPS.md`.
  3. Read the selected step's `step.toml` and the requested checked-in config.
  4. For remote execution, read `NEMOTRON_ENV_FILE` or a repo-root `env*.toml` and pick an actual section whose profile matches the step.
  5. Emit the full command in one reply; then add brief rationale for the config/profile choices. For translation, also read `src/nemotron/steps/translate/README.md` and return `Decision`, `Config`, `Run`, `Output`, `Env`.
Source tiers for command answers — Verified (CLI + manifest + config + env + dry-run all succeeded), Repo-grounded (manifest/config/env read, no dry-run), Blocked (a required repo file or env TOML is missing — name it and stop before guessing).
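The three source tiers can be read as a small classification. The sketch below is one interpretation, not repo code: the function name and boolean flags are hypothetical, and it treats the env TOML as always required, which only holds for remote execution.

```python
def source_tier(cli_ok: bool, manifest_read: bool, config_read: bool,
                env_read: bool, dry_run_ok: bool) -> str:
    """Classify how well grounded a command answer is."""
    # A missing repo file or env TOML blocks the answer outright.
    if not (manifest_read and config_read and env_read):
        return "Blocked"
    # Everything was read, and both the CLI and the dry run succeeded.
    if cli_ok and dry_run_ok:
        return "Verified"
    # Files were read, but the command was never exercised end to end.
    return "Repo-grounded"
```

Checking for "Blocked" first matches the instruction to name the missing file and stop before guessing.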
Canonical commands:
```bash
uv run nemotron steps run <step_id> -c <config-or-path> --dry-run
uv run nemotron steps run <step_id> -c <config-or-path> --dry-run --batch <profile>
uv run nemotron steps run <step_id> -c <config-or-path> --batch <profile>
```
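The three canonical forms differ only in `--dry-run` and `--batch`, so they can be assembled from one helper. This is a sketch: `build_command` is a hypothetical name, and it only builds the argument list without executing anything.

```python
from typing import List, Optional


def build_command(step_id: str, config: str, dry_run: bool = True,
                  batch_profile: Optional[str] = None) -> List[str]:
    """Assemble the argv for one of the canonical command forms above."""
    argv = ["uv", "run", "nemotron", "steps", "run", step_id, "-c", config]
    if dry_run:
        argv.append("--dry-run")
    if batch_profile is not None:
        # The profile must come from the active env TOML, never from a guess.
        argv += ["--batch", batch_profile]
    return argv
```

Defaulting `dry_run` to `True` means a real submission has to be requested explicitly.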

Workflow

Four phases, in order: Orient → Plan → Act → Verify. Never skip Verify. For detailed phase checklists and Explorer-mode implementation rules, read `references/WORKFLOW.md`.

Operational Nuances

  • Smoke configs (`tiny.yaml`, `tiny_chat.yaml`) are wiring tests, not quality evidence.
  • `${art:...}` references belong in recipe-backed configs; standalone YAML uses plain paths.
  • Keep pretraining `bin/idx` data and `blend.json` from the same Nemotron release.
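The `${art:...}` rule lends itself to a quick lint over standalone YAML text. The sketch below is an assumption-laden illustration: `find_art_refs` is a hypothetical helper, and the reference name `data` in the usage note is invented, since the source only shows the `${art:...}` form.

```python
import re

# Matches any ${art:...} reference, whatever appears after the colon.
ART_REF = re.compile(r"\$\{art:[^}]*\}")


def find_art_refs(yaml_text: str) -> list:
    """Return every ${art:...} reference found in the given config text."""
    return ART_REF.findall(yaml_text)
```

Running it on `"input_path: ${art:data}\n"` returns one match, while a standalone config that uses plain paths returns an empty list.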

Examples

  • Single step: read manifest + config + env profile, then return a complete `uv run nemotron steps run <step_id> -c <config> --dry-run` command.
  • Translate (one-shot command): for "translate EN → <lang>" requests, collect `server.url`, `model`, source/target language, `api_key_env`, and runtime-visible input/output paths first, then emit the full command in one reply (do not split across turns):
    ```bash
    uv run nemotron steps run translate/nemo_curator \
      -c <translate-config.yaml> \
      --batch <env-profile-from-env.toml>
    ```
  • Curate then translate: chain `curate/nemo_curator` → `translate/nemo_curator`. The curate stage produces filtered JSONL that becomes the translate stage input. Both steps need YAML overlays; wire curate's `output_dir` to translate's `input_glob`.
  • BYOB benchmark prep: route MCQ Parquet inputs through `byob/mcq`, not `translate/nemo_curator`, so the multiple-choice schema is preserved.
  • SFT pipeline: plan the DAG (`data_prep` → `sft/megatron_bridge` or `sft/automodel`), validate artifact edges via `types.toml`, then create the YAML overlays.
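The curate-to-translate wiring in the list above amounts to deriving one overlay value from another. This is a minimal sketch under assumptions: the configs are modeled as plain dicts, the helper name is hypothetical, and the `*.jsonl` pattern is a guess at how curated files are named.

```python
def wire_curate_to_translate(curate_cfg: dict, translate_cfg: dict,
                             pattern: str = "*.jsonl") -> dict:
    """Return a copy of the translate overlay that reads curate's output."""
    wired = dict(translate_cfg)  # Leave the caller's overlay untouched.
    output_dir = curate_cfg["output_dir"].rstrip("/")
    wired["input_glob"] = output_dir + "/" + pattern
    return wired
```

With `output_dir` set to `/data/curated/`, the translate overlay gets `input_glob` equal to `/data/curated/*.jsonl`. The actual file layout should be confirmed against the step's `step.toml` before relying on a pattern like this.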

Two modes

Catalog mode — a step exists

Fast path: `STEPS.md → category/README.md → step.toml → step.py → adapt YAML config`. Use whenever the user's request maps to a step in the catalog.

Explorer mode — no repo path supports it

Use only after confirming no existing step, runner, recipe, CLI, or YAML config surface can satisfy the request. Follow `references/WORKFLOW.md`.

Choosing a mode

| User says | Mode |
| --- | --- |
| "SFT with Megatron-Bridge / AutoModel" | Catalog |
| "DPO / RLVR / GRPO / RLHF" | Catalog: `rl/nemo_rl/*` |
| "Synthesize preference / SFT data" | Catalog: `sdg/data_designer` |
| "Translate EN → <lang> for training data" | Catalog: `translate/nemo_curator` |
| "Curate and translate" / "filter then translate" | Catalog chain: `curate/nemo_curator` → `translate/nemo_curator` |
| "Curate web text" | Catalog: `curate/nemo_curator` |
| "BYOB benchmark" / "MCQ benchmark prep" | Catalog: `byob/mcq` (preserves MCQ schema) |
| "Train with X exotic backend" | Explorer or ask |
| Post-training-only request | Out of scope; redirect to a more appropriate workflow. |
| Ambiguous | Ask |

Boundaries

Do: build pipelines from existing steps and cite `step.toml` directly; reuse repo CLIs/runners/recipes first; adapt configs (don't copy `default.yaml` blindly); ask about hardware/data/backend/output path; surface tradeoffs (Megatron-Bridge vs AutoModel, full FT vs LoRA); present the plan and wait for approval.
Don't: invent steps; skip Plan for pipelines ≥2 stages; generate Python or shell when YAML suffices; import modules outside the step's reference code; add monitoring/W&B unless asked; tune parallelism beyond `hardware.md` and `[[strategies]]`; assume GPU count; generate Slurm/Airflow/Kubeflow wrappers; handle non-training requests in this skill; modify `src/nemotron/steps/`; restate per-step rules here (link the step's `README.md` instead).

Troubleshooting

| Situation | Action |
| --- | --- |
| Artifact types do not chain | Recheck `types.toml`; change the DAG before writing configs. |
| Remote profile unclear / `--batch` ambiguous | Read the active env TOML; do not guess. |
| Config key unclear | Read the step config, `step.py`, and shared runner before editing. |
| Strategy points to a missing skill file | Skip the load; use the `then:` text and flag the plan with `WARNING: <topic> docs unavailable`. |
| Hardware too small | Show `[[models]]` `min_gpus`; suggest smaller model → AutoModel → LoRA. |
| Two failed Act attempts | Stop, explain what was tried and what failed, ask the user how to proceed. |
| No existing repo path matches | Check libraries cited in `step.toml [reference]`. If supported, use Explorer mode; otherwise ask. |
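For the "artifact types do not chain" case, the check against an `is_a` hierarchy can be sketched as a walk up parent links. Everything here is illustrative: the actual schema of `types.toml` is not shown in this document, so the child-to-parent mapping and the type names are invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical is_a table: child type -> parent type.
IS_A: Dict[str, str] = {
    "curated_jsonl": "jsonl",
    "translated_jsonl": "jsonl",
}


def is_compatible(produced: str, consumed: str, is_a: Dict[str, str]) -> bool:
    """True if `produced` is `consumed` or one of its descendants."""
    seen = set()
    current = produced
    while current is not None and current not in seen:
        if current == consumed:
            return True
        seen.add(current)  # Guards against cycles in a malformed table.
        current = is_a.get(current)
    return False


def broken_edges(edges: List[Tuple[str, str]],
                 is_a: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the (produced, consumed) pairs that do not chain."""
    return [edge for edge in edges if not is_compatible(edge[0], edge[1], is_a)]
```

A non-empty result from `broken_edges` is the signal to change the DAG before writing any configs.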