digital-health-clinical-asr-finetune

<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: Apache-2.0 -->

Clinical ASR Flywheel — Stage 4 (Fine-tune)


⚠ Agent: read this entire SKILL.md before answering. The Critical-workflow-rules section, the base-model table (§4c), the stock-NeMo-SFT recipe (§4d), and the cycle-N+1 decision table (§4e) are all load-bearing: the do-not-SFT bases and broken-adapter warnings live there.
Agent: this file is self-contained. The Stage 4 gate criteria, base-model recommendation, hyperparameter table, container invocation pattern, and cycle-N+1 decision table are all below. Do not run file-discovery commands or open `references/stage4-finetune.md` before answering methodology questions; the reference is deep-dive material, not required reading. Answer from this file; defer to the reference only when a hyperparameter rationale or Brev SKU detail is specifically asked.
You are the adapt-and-measure stage. The user arrives from `/digital-health-clinical-asr-eval` with a manifest, a baseline KER number, and the decision-tree's recommendation that fine-tuning is worth the GPU time. You run stock NeMo SFT, do an offline cycle N+1 re-eval to measure that the loop closed, and optionally hand the resulting `.nemo` to `/riva-asr-custom` for production serving.
The cycle KER from offline eval is the measurement that closes the loop. Riva NIM deploy validates serving (latency, streaming, scale), not model quality.
Empirically verified on the reference manifest (39 rows, Parakeet TDT v2): baseline KER 0.513 → 0.128 after 3 epochs of stock SFT (−75% relative). Drug names: 0.857 → 0.214. Conditions: 0.500 → 0.000. Procedures: 0.250 → 0.000.

Critical workflow rules (apply on every activation)


Surface these facts in any response, even if the user asks a narrow question:
  1. Read this entire SKILL.md before answering. The base-model selection table, hyperparameter values, and the cycle-N+1 decision table are below; they are the load-bearing parts.
  2. Verified result: Parakeet TDT v2 with the recipe in §4d achieves KER 0.513 → 0.128 (−75% relative) in 3 epochs on the reference manifest. Cite this when the user asks whether SFT will help.
  3. Recipe is `/opt/NeMo/examples/asr/speech_to_text_finetune.py` inside `nvcr.io/nvidia/nemo:25.11.01`. Stock script, no patches, no custom adapter logic. The adapter-mixin path is broken on TDT/RNNT decoders (72 NaN tensors at any LR); do not propose it.
  4. Recommended base is `nvidia/parakeet-tdt-0.6b-v2`. The full base-model table is in §4c.
  5. Do NOT fine-tune `nvidia/nemotron-speech-streaming-en-0.6b`. The streaming NVCF function's SFT path is broken (UNK collapse on validation after step 1). For streaming serving at deploy time, Riva chunks a non-streaming base just fine. Warn the user proactively if they propose it.
  6. Gate the recommendation. Stage 4 only fires when priority-category KER > 0.3 and the manifest has ≥ 100 rows (≥ 5 per priority category). Below those thresholds, route back to `/digital-health-clinical-asr-build` to grow the manifest first.

Purpose


Run stock NeMo SFT (no custom adapter logic, no patches) in `nvcr.io/nvidia/nemo:25.11.01` against a term-aware row-disjoint train/val split, produce a `.nemo` model, and re-eval offline as cycle N+1. Decide based on the cycle-N → cycle-N+1 KER delta whether to keep the model, grow the manifest, or accept that fine-tuning didn't help. Optionally hand the `.nemo` to `/riva-asr-custom` for NIM deploy.

When to use this skill


Activate on user phrases like:
  • "Fine-tune ASR on my clinical vocabulary"
  • "Improve ASR on medication names"
  • "We have a KER of 0.4, can we fine-tune?"
  • "Run SFT on my Parakeet TDT base"
  • "Train a clinical ASR adapter"
  • "Compare cycle 1 vs cycle 2 KER"
  • "Deploy my fine-tuned model as a NIM" (this skill prepares the `.nemo` and routes to `/riva-asr-custom` for the deploy)
Do not activate when:
  • The user hasn't scored a baseline yet → `/digital-health-clinical-asr-eval`
  • The user doesn't have a manifest → `/digital-health-clinical-asr-build`
  • The user wants generic word boosting / LM fusion (not SFT) → `/finetune-asr`
  • The user has a `.nemo` and only wants to deploy → `/riva-asr-custom`

Prerequisites


  • A cycle-N manifest + cycle-N eval result from `/digital-health-clinical-asr-eval`. The priority-category KER must be > 0.3 (Stage 4 gate). The manifest should have ≥ 100 rows total, and ≥ 5 rows per priority `entity_category`, for a believable post-tune signal.
  • A CUDA host: 24 GB VRAM is comfortable for Parakeet TDT 0.6B at `batch_size=4` with `bf16-mixed`; 16 GB works with smaller batch. No local GPU? Use Brev; the recommended SKU is L40S 48 GB.
  • The NeMo container: `nvcr.io/nvidia/nemo:25.11.01`. Pull once: `docker pull nvcr.io/nvidia/nemo:25.11.01`.
  • NVIDIA Container Toolkit + Docker, covered by `/riva-nim-setup` if not already installed.
  • A train/val split stratified by `entity_category` (recipe sketch in Step 4b below).
  • `/riva-asr-custom` installed if you intend to deploy. Pure-research SFT runs without it.
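The Stage 4 gate in the first prerequisite can be checked mechanically before any GPU time is spent. The following is a minimal sketch, not part of the skill itself: the function name `stage4_gate`, its arguments, and the toy category names are hypothetical, and it assumes a single priority-category KER number plus manifest rows that carry an `entity_category` key.

```python
from collections import Counter

KER_GATE = 0.3          # priority-category KER must exceed this
MIN_ROWS = 100          # minimum manifest size
MIN_ROWS_PER_CAT = 5    # minimum rows per priority category


def stage4_gate(rows, priority_ker, priority_categories):
    """Return (ok, reasons); `reasons` lists every gate criterion that is not met."""
    counts = Counter(r["entity_category"] for r in rows)
    reasons = []
    if priority_ker <= KER_GATE:
        reasons.append(f"priority-category KER {priority_ker:.3f} is not above {KER_GATE}")
    if len(rows) < MIN_ROWS:
        reasons.append(f"manifest has {len(rows)} rows, needs at least {MIN_ROWS}")
    for cat in priority_categories:
        if counts[cat] < MIN_ROWS_PER_CAT:
            reasons.append(f"category {cat!r} has {counts[cat]} rows, needs at least {MIN_ROWS_PER_CAT}")
    return (not reasons, reasons)


# Synthetic rows with hypothetical category names, for illustration only.
rows = [{"entity_category": "drug"}] * 90 + [{"entity_category": "condition"}] * 40
ok, reasons = stage4_gate(rows, priority_ker=0.42, priority_categories=["drug", "condition"])
small_ok, small_reasons = stage4_gate(rows[:39], 0.513, ["drug", "condition"])
print(ok, reasons)
print(small_ok, small_reasons)
```

When the gate fails, the returned reasons can be shown to the user alongside the route back to `/digital-health-clinical-asr-build`.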

Instructions


4a. Provision a GPU host (skip if you already have one)


Stage 4 needs a CUDA host with ≥ 16 GB VRAM (24 GB comfortable). If you have a local one that fits, skip this section. If not, use Brev, NVIDIA's per-second-billed GPU host service. Recommended SKU: L40S 48 GB.
Cost disclosure: surface this to the user before any `brev create`. L40S 48 GB runs $1.50/hr at time of writing; a 3-epoch SFT run on a 100-row manifest finishes in 15–30 minutes ($0.40–$0.75 of compute). The real risk is forgetting to stop the instance: overnight idle on L40S is ~$36, a week of idle is ~$250. Mitigations: (a) always wrap the workflow in a script that ends with `brev stop`; (b) set a calendar reminder when you start; (c) `brev delete` instead of `brev stop` if you don't need to keep the disk (`stop` keeps disk at $0.10/GB-month; 200 GB ≈ $20/month of latent cost). Confirm the user accepts the per-hour cost shape and the idle risk before spinning anything up.
Full setup walkthrough (CLI install via download-then-run rather than curl-pipe, SKU choice, disk sizing, SSH config) is in `references/stage4-finetune.md` (§Brev provisioning).
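The cost figures above follow from two rates, and the arithmetic can be reproduced when the rates change. This is a small sketch using only the prices quoted in this section; the function names are hypothetical, and the ~$36 idle figure is reproduced here as 24 idle hours at the quoted rate.

```python
GPU_RATE_PER_HOUR = 1.50        # L40S 48 GB, USD/hr, as quoted at time of writing
DISK_RATE_PER_GB_MONTH = 0.10   # USD per GB-month while an instance is stopped


def gpu_cost(hours, rate=GPU_RATE_PER_HOUR):
    """Compute cost in USD for `hours` of billed GPU time."""
    return hours * rate


def disk_cost_per_month(disk_gb, rate=DISK_RATE_PER_GB_MONTH):
    """Monthly cost in USD of keeping `disk_gb` of disk on a stopped instance."""
    return disk_gb * rate


run_low, run_high = gpu_cost(15 / 60), gpu_cost(30 / 60)   # 15 to 30 minute SFT run
idle_day = gpu_cost(24)                                     # 24 idle hours
idle_week = gpu_cost(24 * 7)                                # one idle week
stopped_disk = disk_cost_per_month(200)                     # 200 GB disk kept by `brev stop`

print(f"SFT run:        ${run_low:.2f} to ${run_high:.2f}")
print(f"24 h idle:      ${idle_day:.2f}")
print(f"1 week idle:    ${idle_week:.2f}")
print(f"Stopped disk:   ${stopped_disk:.2f}/month")
```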
Short happy-path once the CLI is installed. Do not run `brev create` until the user has explicitly typed `YES` at the confirmation prompt below. The gate is mandatory, not advisory, because everything after it bills against the user's account by the second:
```bash
brev login                                  # browser auth

# Mandatory cost-confirmation gate: do NOT skip or auto-answer this.
# Dollar signs are escaped so the shell does not expand $1, $3, $2 as positional parameters.
echo "About to provision: digital-health-clinical-asr-sft on L40S 48 GB."
echo "Cost shape: ~\$1.50/hr while running; ~\$36/night if left idle; ~\$20/mo disk if you 'stop' instead of 'delete'."
read -rp "Type YES to provision (anything else cancels): " confirm
[ "$confirm" = "YES" ] || { echo "Cancelled: no GPU instance was created."; exit 1; }

brev create digital-health-clinical-asr-sft \
  --gpu l40s:1 --image ubuntu-22-04-cuda-12-4 --disk 200gi
brev ssh-config                             # writes ~/.ssh/config entries
rsync -avz ./cycle1/ digital-health-clinical-asr-sft:~/cycle1/
brev shell digital-health-clinical-asr-sft  # drops into the instance
nvidia-smi                                  # confirm GPU
docker pull nvcr.io/nvidia/nemo:25.11.01    # ~12 GB, once per instance
```

When done, **always halt billing**: `brev stop digital-health-clinical-asr-sft` (keeps disk) or `brev delete digital-health-clinical-asr-sft` (frees it). For path rewriting laptop → Brev → NeMo container, see `references/container-paths.md`.

4b. Term-aware train/val split


Row-disjoint, stratified by `entity_category`, default val fraction 0.2.
The same `term` may appear on both sides via different rows (different voice, context, noise). That's expected and desirable: it measures acoustic + contextual robustness on the trained vocabulary, which is the standard ASR adaptation metric.
Singleton categories (one row total) get forced to train with a warning. If any priority category has < 5 rows, bail to `/digital-health-clinical-asr-build`; held-out validation will be too noisy to attribute movement.
Sketch:
```python
# After loading manifest.jsonl into a list of dicts `rows`:
from collections import defaultdict
import random

random.seed(42)  # fixed seed so the split is reproducible across cycles

# Group rows by category so the split is stratified.
by_cat = defaultdict(list)
for r in rows:
    by_cat[r["entity_category"]].append(r)

train, val = [], []
for cat, cat_rows in by_cat.items():
    random.shuffle(cat_rows)
    if len(cat_rows) < 2:
        # A singleton category cannot be held out; keep it in train.
        train.extend(cat_rows)
        print(f"warning: singleton category {cat}, forced to train")
        continue
    n_val = max(1, int(0.2 * len(cat_rows)))  # at least one val row per category
    val.extend(cat_rows[:n_val])
    train.extend(cat_rows[n_val:])
```

Write `train.jsonl` and `validation.jsonl` alongside the manifest. **These are the inputs to `speech_to_text_finetune.py`.**
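Writing the two files and confirming the split is row-disjoint could look like the sketch below. The helper names are hypothetical, the toy rows stand in for the `train` and `val` lists produced by the split, and it assumes `audio_filepath` (the standard NeMo manifest key) uniquely identifies a row. Note that the same term appears on both sides through different rows, which is the intended behaviour.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, rows):
    """Write one JSON object per line, the layout NeMo manifests use."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def assert_row_disjoint(train, val, key="audio_filepath"):
    """Raise if any row (identified by `key`) sits on both sides of the split."""
    overlap = {r[key] for r in train} & {r[key] for r in val}
    if overlap:
        raise ValueError(f"{len(overlap)} rows appear in both train and val")


# Toy rows with hypothetical paths; same term, different recordings.
train = [
    {"audio_filepath": "audio/row1.wav", "text": "metformin", "entity_category": "drug"},
    {"audio_filepath": "audio/row2.wav", "text": "metformin", "entity_category": "drug"},
]
val = [{"audio_filepath": "audio/row3.wav", "text": "metformin", "entity_category": "drug"}]

assert_row_disjoint(train, val)
out_dir = Path(tempfile.mkdtemp())  # stand-in for the manifest directory
write_jsonl(out_dir / "train.jsonl", train)
write_jsonl(out_dir / "validation.jsonl", val)
written = [json.loads(line)
           for line in (out_dir / "train.jsonl").read_text(encoding="utf-8").splitlines()]
```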

4c. Choose the base model


| Base | SFT viability | Notes |
| --- | --- | --- |
| `nvidia/parakeet-tdt-0.6b-v2` | Empirically verified (KER 0.513 → 0.128 in 3 epochs, −75% relative) | NVIDIA's current English ASR default. Stock NeMo SFT recipe works end-to-end. Recommended. |
| `nvidia/nemotron-speech-streaming-en-0.6b` | Don't use for SFT | NVCF function is streaming-only; SFT path unreliable (UNK collapse on validation after first training step). For streaming serving, Riva chunks a non-streaming base just fine. |

Other Parakeet/Conformer bases (1.1B, CTC, RNNT, `stt_en_conformer_ctc_large`) + decoder → NIM container mapping: `references/stage4-finetune.md`. If the user asks to fine-tune Nemotron Speech Streaming, warn about the collapse and recommend Parakeet TDT v2.

4d. Stock NeMo SFT


In the NeMo container, invoke `/opt/NeMo/examples/asr/speech_to_text_finetune.py` directly. No custom adapter logic. No patches. The stock NeMo SFT script is the verified working recipe.
Hyperparameters (verified on Parakeet TDT v2, 39-row manifest):
init_from_pretrained_model: nvidia/parakeet-tdt-0.6b-v2
precision:                  bf16-mixed       # required for TDT numerical stability
lr:                         3e-4             # CosineAnnealing schedule
warmup_steps:               5                # tiny manifest; bump to 500 at production scale
epochs:                     3                # smoke; 10-30 for production
batch_size:                 4                # fits 16 GB VRAM; raise to 16 on L40S 48 GB
gradient_clip_val:          1.0              # defensive
Container invocation: `docker run --gpus all --rm -it -v "$PWD:/workspace" nvcr.io/nvidia/nemo:25.11.01 python /opt/NeMo/examples/asr/speech_to_text_finetune.py` with `model.train_ds.manifest_filepath=/workspace/train.jsonl`, `model.validation_ds.manifest_filepath=/workspace/validation.jsonl`, `init_from_pretrained_model=nvidia/parakeet-tdt-0.6b-v2`, and the hyperparameter overrides from the table above. Full docker-run line with config-path / config-name flags: `references/stage4-finetune.md` §Container invocation.
Manifest paths inside the container. Host paths (e.g. `$HOME/…`) don't resolve in `/workspace`. Rewrite snippet: `references/container-paths.md`.
The training run writes `adapted_model.nemo` and a `training_run_info.json` summary. Both go into a per-cycle subdirectory of the user's choice (e.g. `cycle<N>/models/<run>/`; the layout doesn't matter as long as it's consistent across cycles).
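To illustrate the path rewrite described above, here is a minimal sketch. It is not the snippet from `references/container-paths.md`; the function name and the host layout are hypothetical, and it assumes each manifest row stores its audio location under the standard NeMo `audio_filepath` key and that the host directory is the one mounted at `/workspace`.

```python
import json


def rewrite_manifest_paths(lines, host_root, container_root="/workspace"):
    """Rewrite each row's audio_filepath from the host mount root to the container root."""
    host_root = host_root.rstrip("/")
    out = []
    for line in lines:
        row = json.loads(line)
        path = row["audio_filepath"]
        # Only touch paths that actually live under the mounted host directory.
        if path == host_root or path.startswith(host_root + "/"):
            row["audio_filepath"] = container_root + path[len(host_root):]
        out.append(json.dumps(row, ensure_ascii=False))
    return out


# Hypothetical layout: /home/alice/cycle1 is the directory passed as -v "$PWD:/workspace".
host_lines = [
    '{"audio_filepath": "/home/alice/cycle1/audio/row1.wav", "text": "metoprolol", "entity_category": "drug"}',
    '{"audio_filepath": "/data/other/row2.wav", "text": "stent", "entity_category": "procedure"}',
]
container_lines = rewrite_manifest_paths(host_lines, "/home/alice/cycle1")
for line in container_lines:
    print(line)
```

Rows outside the mounted directory are left unchanged, so they surface as missing files at training time rather than being silently pointed at a wrong location.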

4e. Offline cycle N+1 eval — close the loop


Re-transcribe the cycle's audio with the fine-tuned `.nemo` using NeMo's offline `transcribe()`. No Riva needed: this is measurement, not serving. NeMo's offline path runs the same encoder + decoder graph the Riva NIM eventually serves.
Sketch:
```python
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.restore_from("adapted_model.nemo")
hyps = model.transcribe(["audio/row1.wav", "audio/row2.wav", ...])
```
Score the same four metrics (WER/CER/KER/SER) and the same five-section leaderboard the eval skill produces. Write them as `leaderboard_cycle<N+1>.md`. Compare against `leaderboard_cycle<N>.md`.
Decision table, cycle-N+1 vs cycle-N:

| Result | Action |
| --- | --- |
| KER dropped meaningfully on targeted categories (e.g. drug KER −20% or more, relative) | ✅ Keep the `.nemo`. Update the leaderboard. Advance to Step 4f if you want to deploy. |
| KER moved a little, you wanted more | Loop back to `/digital-health-clinical-asr-build`, expand the manifest. Tiny manifests rarely benefit from hyperparameter tweaks; signal density beats LR sweeps. |
| KER got worse | Overfit on a tiny manifest. Bail to `/digital-health-clinical-asr-build` and grow before retraining. Don't tune harder on the same data. |
| No measurable change | Some categories may already be in the base model's vocab. Sanity-check per-category numbers before concluding training "didn't help." |
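The decision table can be expressed as a small function over the relative KER change. This is a sketch under stated assumptions: the function names and return labels are hypothetical, the −20% keep threshold is the example figure from the table, and the `noise_band` that defines "no measurable change" is an illustrative value that this skill does not specify.

```python
def relative_change(before, after):
    """Relative KER change from cycle N to cycle N+1; negative means improvement."""
    if before == 0:
        raise ValueError("cycle-N KER is 0; relative change is undefined")
    return (after - before) / before


def cycle_decision(before, after, keep_threshold=-0.20, noise_band=0.02):
    """Map a cycle-N -> cycle-N+1 KER pair onto the four rows of the decision table."""
    delta = relative_change(before, after)
    if delta <= keep_threshold:
        return "keep"            # keep the .nemo, update the leaderboard
    if abs(delta) <= noise_band:
        return "sanity_check"    # no measurable change: inspect per-category numbers
    if delta > 0:
        return "overfit"         # KER got worse: grow the manifest before retraining
    return "grow"                # small gain: expand the manifest


# Overall and drug-name figures from the verified reference run.
overall = relative_change(0.513, 0.128)
drugs = relative_change(0.857, 0.214)
print(f"overall: {overall:+.1%}, drugs: {drugs:+.1%}")
print(cycle_decision(0.513, 0.128))
```

Running the decision per priority category, rather than on the overall KER only, matches the table's emphasis on targeted categories.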

4f. (Optional) Deploy as a Riva NIM


Hand the `.nemo` to `/riva-asr-custom`. Pass the source architecture explicitly: `/riva-asr-custom` can't reliably detect CTC vs RNNT vs TDT from the `.nemo` alone, and the wrong NIM container produces a broken RMIR with no clear error:

| Source decoder | `riva-build` flag | NIM container family |
| --- | --- | --- |
| Conformer-CTC | `decoder=greedy_ctc` | `parakeet-*-ctc-*` |
| Conformer-RNNT | `decoder=nemo` | `parakeet-rnnt-*` |
| Conformer-TDT (default) | `decoder=nemo` | `parakeet-tdt-*` |
| Cache-Aware RNNT (Nemotron streaming) | `decoder=nemo` | `nemotron-streaming-*` ⚠ SFT broken on this base, see Limitations |

After deploy: re-run `/digital-health-clinical-asr-eval` against the new endpoint (`ASR_ENDPOINT=localhost:50051`) to validate that production-serving numbers match offline numbers. Any divergence is in Riva preprocessing or `riva-build` flags, not the model. Route to `/riva-asr-custom`.
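Because the architecture has to be passed explicitly, a training driver may want to record it next to the `.nemo` and look up the matching deploy settings. The lookup below is a sketch that copies the decoder mapping from this section; the dictionary keys, the function name, and the returned field names are hypothetical and are not part of `/riva-asr-custom`.

```python
# Source decoder -> (riva-build flag, NIM container family), copied from the mapping above.
DECODER_TO_NIM = {
    "ctc": ("decoder=greedy_ctc", "parakeet-*-ctc-*"),
    "rnnt": ("decoder=nemo", "parakeet-rnnt-*"),
    "tdt": ("decoder=nemo", "parakeet-tdt-*"),
    "cache_aware_rnnt": ("decoder=nemo", "nemotron-streaming-*"),
}
SFT_BROKEN = {"cache_aware_rnnt"}  # Nemotron streaming base: do not fine-tune


def deploy_hint(source_decoder):
    """Return the riva-build flag and NIM family for an explicitly named decoder."""
    key = source_decoder.lower()
    if key not in DECODER_TO_NIM:
        raise ValueError(f"unknown decoder {source_decoder!r}; expected one of {sorted(DECODER_TO_NIM)}")
    flag, family = DECODER_TO_NIM[key]
    return {"riva_build_flag": flag, "nim_family": family, "sft_broken": key in SFT_BROKEN}


hint = deploy_hint("TDT")
print(hint)
```

Failing on an unknown decoder name, instead of falling back to a default, avoids the silent wrong-container case described above.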

Examples


Scenario A: gate met. User: "Drug KER 0.42, 130 rows. SFT?" → Yes (gate cleared). Use `parakeet-tdt-0.6b-v2` (verified 0.513 → 0.128). No local GPU? Step 4a (Brev) → 4b (split) → 4d (stock SFT) → 4e (offline re-eval). If cycle-2 drug KER drops ≥ 20% relative, keep the `.nemo`; otherwise back to `/digital-health-clinical-asr-build`.
Scenario B: Nemotron Streaming. User: "SFT `nvidia/nemotron-speech-streaming-en-0.6b`?" → No (UNK collapse). Substitute `parakeet-tdt-0.6b-v2`. Riva chunks non-streaming bases for streaming serving, so the base doesn't need to be streaming-native.
Scenario C: cycle 2 KER unchanged. User: "KER barely moved." → Back to `/digital-health-clinical-asr-build`. Signal density beats LR sweeps. If `magpie_g2p` rows are bad but `merriam-webster` rows are good, the gap is pronunciation coverage; see `/digital-health-clinical-asr-build` Step 2d.

Artifacts produced


  • `train.jsonl`, `validation.jsonl`: term-aware split (Step 4b)
  • `adapted_model.nemo`: fine-tuned model (Step 4d)
  • `training_run_info.json`: hyperparameters, dataset stats, end-of-train metrics
  • `offline_hyps.jsonl`: cycle-N+1 transcription hypotheses (Step 4e)
  • `leaderboard_cycle<N+1>.md`: cycle-N+1 five-section leaderboard
  • (optional, after Step 4f) a deployed NIM endpoint (delegated to `/riva-asr-custom`)

Troubleshooting


  • Stage 4 training collapses to all-UNK after first step → you're on the cache-aware streaming RNNT base (`nemotron-speech-streaming-en-0.6b`). Route to `nvidia/parakeet-tdt-0.6b-v2` (the recommended default) or `nvidia/stt_en_conformer_ctc_large` (legacy fallback). The streaming RNNT SFT path is broken; do not retry with different hyperparameters.
  • Manifest paths don't resolve inside the NeMo container → host paths (e.g. `$HOME/…`) need rewriting to `/workspace/…`. See `references/container-paths.md` for the rewrite snippet.
  • Cycle N+1 KER unchanged from cycle N → on `parakeet-tdt-0.6b-v2` with the recipe above, this almost always means manifest signal density is too low. Grow the manifest first; don't sweep LR. (If you're on an older adapter-style recipe instead of stock SFT, the adapter weights may not have moved off zero-init; switch to stock SFT.)
  • Cycle N+1 KER got worse → overfit on a tiny manifest. Bail to `/digital-health-clinical-asr-build` and grow.
  • Riva-served numbers diverge from offline numbers → the gap is in Riva preprocessing or `riva-build` flags, not the model. Route to `/riva-asr-custom`.
  • `bf16-mixed` precision errors → pre-Ampere GPUs (Turing, Volta) don't support BF16. Drop to `fp32` and reduce `batch_size`. Use `fp16-mixed` only if `fp32` is too slow; fp16 with TDT decoders can produce NaN losses, so check loss curves early.
  • OOM during training on 24 GB GPU → drop `batch_size` to 2 and raise `accumulate_grad_batches` to 2, which keeps the effective batch size at 4.

Limitations


  • Adapter-style SFT on TDT/RNNT decoders is broken. Empirically confirmed: an earlier LinearAdapter-mixin recipe produces 72 NaN tensors at any LR on TDT and RNNT decoders. Resolved by switching to NeMo's stock full-model SFT (`speech_to_text_finetune.py`), which is what this skill recommends. Do not attempt adapter SFT on TDT/RNNT bases.
  • Don't SFT `nemotron-speech-streaming-en-0.6b`. The streaming-only NVCF function's SFT path is unreliable (UNK collapse). For streaming serving at deploy time, Riva chunks a non-streaming base.
  • Tiny manifests overfit fast. Below ~100 rows total or ~5 rows per priority category, cycle-N+1 numbers are noisy. Grow before trusting a small KER drop.
  • English-only by default. The base-model table is en-US-specific. Other locales need a different base + a re-validated SFT recipe.
  • No turn-key driver. The user writes their own training-driver layout: output paths, run naming, leaderboard re-rendering. The methodology and recipes transfer; exact cycle-1 numbers depend on the user's manifest.

Next steps


  • Deploy the `.nemo` as a NIM: `/riva-asr-custom` (pass the source architecture explicitly).
  • Grow the manifest for cycle N+2: `/digital-health-clinical-asr-build`.
  • Re-score the cycle: `/digital-health-clinical-asr-eval` (against the new endpoint or the new `.nemo` directly).
  • Lateral for word boosting / LM fusion / non-clinical SFT recipes: `/finetune-asr`.

References


  • `references/stage4-finetune.md`: base-model selection table, hyperparameter rationale, decoder → NIM container mapping, decision tree comparing cycle-N+1 to cycle-N
  • `references/container-paths.md`: host → `/workspace/` path rewriting for cross-host manifest portability (laptop ↔ Brev ↔ NeMo container)