digital-health-clinical-asr-finetune
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
Clinical ASR Flywheel — Stage 4 (Fine-tune)
⚠ Agent: read this entire SKILL.md before answering. The Critical-workflow-rules section, the base-model table (§4c), the stock-NeMo-SFT recipe (§4d), and the cycle-N+1 decision table (§4e) are all load-bearing: the do-not-SFT bases and broken-adapter warnings live there.

Agent: this file is self-contained. The Stage 4 gate criteria, base-model recommendation, hyperparameter table, container invocation pattern, and cycle-N+1 decision table are all below. Do not run file-discovery commands or open `references/stage4-finetune.md` before answering methodology questions; the reference is deep-dive material, not required reading. Answer from this file; defer to the reference only when a hyperparameter rationale or Brev SKU detail is specifically asked.

You are the adapt-and-measure stage. The user arrives from `/digital-health-clinical-asr-eval` with a manifest, a baseline KER number, and the decision-tree's recommendation that fine-tuning is worth the GPU time. You run stock NeMo SFT, do an offline cycle N+1 re-eval to measure that the loop closed, and optionally hand the resulting `.nemo` to `/riva-asr-custom` for production serving.

The cycle KER from offline eval is the measurement that closes the loop. Riva NIM deploy validates serving (latency, streaming, scale), not model quality.

Empirically verified on the reference manifest (39 rows, Parakeet TDT v2): Baseline KER 0.513 → after 3 epochs of stock SFT: 0.128 (−75% relative). Drug names: 0.857 → 0.214. Conditions: 0.500 → 0.000. Procedures: 0.250 → 0.000.
Critical workflow rules (apply on every activation)
Surface these facts in any response, even if the user asks a narrow question:
- Read this entire SKILL.md before answering. The base-model selection table, hyperparameter values, and the cycle-N+1 decision table are below; they are the load-bearing parts.
- Verified result: Parakeet TDT v2 with the recipe in §4d achieves KER 0.513 → 0.128 (−75% relative) in 3 epochs on the reference manifest. Cite this when the user asks whether SFT will help.
- Recipe is `/opt/NeMo/examples/asr/speech_to_text_finetune.py` inside `nvcr.io/nvidia/nemo:25.11.01`. Stock script, no patches, no custom adapter logic. The adapter-mixin path is broken on TDT/RNNT decoders (72 NaN tensors at any LR); do not propose it.
- Recommended base is `nvidia/parakeet-tdt-0.6b-v2`. The full base-model table is in §4c.
- Do NOT fine-tune `nvidia/nemotron-speech-streaming-en-0.6b`. The streaming NVCF function's SFT path is broken (UNK collapse on validation after step 1). For streaming serving at deploy time, Riva chunks a non-streaming base just fine. Warn the user proactively if they propose it.
- Gate the recommendation. Stage 4 only fires when priority-category KER > 0.3 and the manifest has ≥ 100 rows (≥ 5 per priority category). Below those thresholds, route back to `/digital-health-clinical-asr-build` to grow the manifest first.
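The Stage 4 gate described in the last rule can be written as a small check. This is a minimal sketch, not part of the skill's tooling: the function name `stage4_gate`, its arguments, and the reading of "priority-category KER > 0.3" as "at least one priority category exceeds the threshold" are assumptions made for illustration. Only the thresholds themselves come from this file.

```python
from collections import Counter


def stage4_gate(rows, priority_ker, min_ker=0.3, min_rows=100, min_rows_per_category=5):
    """Return (ok, reasons) for the Stage 4 gate.

    rows: manifest rows, each a dict with an "entity_category" key.
    priority_ker: dict mapping each priority category to its baseline KER.
    """
    reasons = []
    counts = Counter(row["entity_category"] for row in rows)

    # Total manifest size.
    if len(rows) < min_rows:
        reasons.append(f"manifest has {len(rows)} rows, need >= {min_rows}")

    # Per-category coverage for the categories the user cares about.
    for category in priority_ker:
        if counts[category] < min_rows_per_category:
            reasons.append(
                f"{category}: {counts[category]} rows, need >= {min_rows_per_category}"
            )

    # Assumption: the gate clears when any priority category exceeds min_ker.
    if not any(ker > min_ker for ker in priority_ker.values()):
        reasons.append(f"no priority-category KER above {min_ker}")

    return (not reasons, reasons)


if __name__ == "__main__":
    rows = [{"entity_category": "drug"}] * 130
    print(stage4_gate(rows, {"drug": 0.42}))
```

When the gate fails, the `reasons` list gives the user something concrete to fix in `/digital-health-clinical-asr-build` before returning to Stage 4.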
Purpose
Run stock NeMo SFT (no custom adapter logic, no patches) in `nvcr.io/nvidia/nemo:25.11.01` against a term-aware row-disjoint train/val split, produce a `.nemo` model, and re-eval offline as cycle N+1. Decide based on the cycle-N → cycle-N+1 KER delta whether to keep the model, grow the manifest, or accept that fine-tuning didn't help. Optionally hand the `.nemo` to `/riva-asr-custom` for NIM deploy.
When to use this skill
Activate on user phrases like:
- "Fine-tune ASR on my clinical vocabulary"
- "Improve ASR on medication names"
- "We have a KER of 0.4, can we fine-tune?"
- "Run SFT on my Parakeet TDT base"
- "Train a clinical ASR adapter"
- "Compare cycle 1 vs cycle 2 KER"
- "Deploy my fine-tuned model as a NIM" (this skill prepares the `.nemo` and routes to `/riva-asr-custom` for the deploy)

Do not activate when:
- The user hasn't scored a baseline yet → `/digital-health-clinical-asr-eval`
- The user doesn't have a manifest → `/digital-health-clinical-asr-build`
- The user wants generic word boosting / LM fusion (not SFT) → `/finetune-asr`
- The user has a `.nemo` and only wants to deploy → `/riva-asr-custom`
Prerequisites
- A cycle-N manifest + cycle-N eval result from `/digital-health-clinical-asr-eval`. The priority-category KER must be > 0.3 (Stage 4 gate). The manifest should have ≥ 100 rows total, and ≥ 5 rows per priority `entity_category`, for a believable post-tune signal.
- A CUDA host: 24 GB VRAM is comfortable for Parakeet TDT 0.6B at `batch_size=4` with `bf16-mixed`; 16 GB works with a smaller batch. No local GPU? Use Brev; the recommended SKU is L40S 48 GB.
- The NeMo container: `nvcr.io/nvidia/nemo:25.11.01`. Pull once: `docker pull nvcr.io/nvidia/nemo:25.11.01`.
- NVIDIA Container Toolkit + Docker, covered by `/riva-nim-setup` if not already installed.
- A train/val split stratified by `entity_category` (recipe sketch in Step 4b below).
- `/riva-asr-custom` installed if you intend to deploy. Pure-research SFT runs without it.
Instructions
4a. Provision a GPU host (skip if you already have one)

Stage 4 needs a CUDA host with ≥ 16 GB VRAM (24 GB comfortable). If you have a local one that fits, skip this section. If not, use Brev, NVIDIA's per-second-billed GPU host service. Recommended SKU: L40S 48 GB.

Cost disclosure: surface this to the user before any `brev create`. L40S 48 GB runs ~$1.50/hr at time of writing; a 3-epoch SFT run on a 100-row manifest finishes in 15–30 minutes ($0.40–$0.75 of compute). The real risk is forgetting to stop the instance: overnight idle on L40S is ~$36, a week of idle is ~$250. Mitigations: (a) always wrap the workflow in a script that ends with `brev stop`; (b) set a calendar reminder when you start; (c) use `brev delete` instead of `brev stop` if you don't need to keep the disk (`stop` keeps disk at $0.10/GB-month, so 200 GB ≈ $20/month of latent cost). Confirm the user accepts the per-hour cost shape and the idle risk before spinning anything up.

Full setup walkthrough (CLI install via download-then-run rather than curl-pipe, SKU choice, disk sizing, SSH config) is in `references/stage4-finetune.md` (§Brev provisioning).

Short happy-path once the CLI is installed. Do not run `brev create` until the user has explicitly typed `YES` at the confirmation prompt below. The gate is mandatory, not advisory, because everything after it bills against the user's account by the second:

```bash
brev login                                   # browser auth

# Mandatory cost-confirmation gate: do NOT skip or auto-answer this.
# Dollar signs are escaped so the shell does not expand $1, $3, $2 inside the double quotes.
echo "About to provision: digital-health-clinical-asr-sft on L40S 48 GB."
echo "Cost shape: ~\$1.50/hr while running; ~\$36/night if left idle; ~\$20/mo disk if you 'stop' instead of 'delete'."
read -rp "Type YES to provision (anything else cancels): " confirm
[ "$confirm" = "YES" ] || { echo "Cancelled: no GPU instance was created."; exit 1; }

brev create digital-health-clinical-asr-sft \
  --gpu l40s:1 --image ubuntu-22-04-cuda-12-4 --disk 200gi
brev ssh-config                              # writes ~/.ssh/config entries
rsync -avz ./cycle1/ digital-health-clinical-asr-sft:~/cycle1/
brev shell digital-health-clinical-asr-sft   # drops into the instance
nvidia-smi                                   # confirm GPU
docker pull nvcr.io/nvidia/nemo:25.11.01     # ~12 GB, once per instance
```

When done, **always halt billing**: `brev stop digital-health-clinical-asr-sft` (keeps disk) or `brev delete digital-health-clinical-asr-sft` (frees it). For path rewriting laptop → Brev → NeMo container, see `references/container-paths.md`.
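The cost figures in the disclosure follow from simple arithmetic, which can be reproduced before quoting them to a user. The sketch below assumes the approximate rates stated in this section (~$1.50/hr for L40S 48 GB, $0.10/GB-month for a stopped disk); the function names are illustrative, and actual Brev pricing may differ.

```python
HOURLY_RATE_USD = 1.50         # approximate L40S 48 GB rate quoted in this section
DISK_USD_PER_GB_MONTH = 0.10   # stopped-instance disk rate quoted in this section


def running_cost(hours, hourly_rate=HOURLY_RATE_USD):
    """Cost of an instance that is up (training or idle) for `hours`."""
    return hours * hourly_rate


def stopped_disk_cost(disk_gb, months=1.0, rate=DISK_USD_PER_GB_MONTH):
    """Cost of keeping the disk of a stopped (not deleted) instance."""
    return disk_gb * months * rate


if __name__ == "__main__":
    print(f"15-minute run:   ${running_cost(15 / 60):.2f}")
    print(f"30-minute run:   ${running_cost(30 / 60):.2f}")
    print(f"24 hours idle:   ${running_cost(24):.2f}")
    print(f"7 days idle:     ${running_cost(7 * 24):.2f}")
    print(f"200 GB stopped:  ${stopped_disk_cost(200):.2f} per month")
```

At this rate the ~$36 idle figure corresponds to a full 24 hours of uptime, so a forgotten instance costs about that much per day until it is stopped or deleted.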
4b. Term-aware train/val split

Row-disjoint, stratified by `entity_category`, default val fraction 0.2.

The same `term` may appear on both sides via different rows (different voice, context, noise). That's expected and desirable: it measures acoustic + contextual robustness on the trained vocabulary, which is the standard ASR adaptation metric.

Singleton categories (one row total) get forced to train with a warning. If any priority category has < 5 rows, bail to `/digital-health-clinical-asr-build`, because held-out validation will be too noisy to attribute movement.

Sketch:

```python
# After loading manifest.jsonl into a list of dicts `rows`:
from collections import defaultdict
import random

random.seed(42)  # fixed seed so the split is reproducible across cycles

# Group rows by category so each category is split independently.
by_cat = defaultdict(list)
for r in rows:
    by_cat[r["entity_category"]].append(r)

train, val = [], []
for cat, cat_rows in by_cat.items():
    random.shuffle(cat_rows)
    if len(cat_rows) < 2:
        # A singleton cannot be held out without removing it from training.
        train.extend(cat_rows)
        print(f"warning: singleton category {cat}, forced to train")
        continue
    n_val = max(1, int(0.2 * len(cat_rows)))  # at least one validation row
    val.extend(cat_rows[:n_val])
    train.extend(cat_rows[n_val:])
```

Write `train.jsonl` and `validation.jsonl` alongside the manifest. **These are the inputs to `speech_to_text_finetune.py`.**
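Writing the two files and confirming that the split really is row-disjoint could look like the sketch below. The helper names `write_jsonl` and `assert_row_disjoint` are hypothetical, and the check assumes each manifest row carries a unique `audio_filepath` field (the usual NeMo manifest key, but an assumption about this particular manifest).

```python
import json


def write_jsonl(rows, path):
    """Write one JSON object per line, the layout NeMo manifests use."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def assert_row_disjoint(train, val, key="audio_filepath"):
    """Raise if any row (identified by `key`) appears on both sides of the split."""
    overlap = {r[key] for r in train} & {r[key] for r in val}
    if overlap:
        raise ValueError(f"{len(overlap)} rows appear in both train and val")


if __name__ == "__main__":
    # Tiny stand-in rows; a real run would use the `train` and `val` lists from the split.
    train = [{"audio_filepath": "audio/row1.wav", "entity_category": "drug"}]
    val = [{"audio_filepath": "audio/row2.wav", "entity_category": "drug"}]
    assert_row_disjoint(train, val)
    write_jsonl(train, "train.jsonl")
    write_jsonl(val, "validation.jsonl")
```

The disjointness check is keyed on the row, not on the term, so the same term appearing on both sides through different recordings passes, which matches the intent of the split.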
4c. Choose the base model

| Base | SFT viability | Notes |
|---|---|---|
| `nvidia/parakeet-tdt-0.6b-v2` | ✅ Empirically verified (KER 0.513 → 0.128 in 3 epochs, −75% relative) | NVIDIA's current English ASR default. Stock NeMo SFT recipe works end-to-end. Recommended. |
| `nvidia/nemotron-speech-streaming-en-0.6b` | ❌ Don't use for SFT | NVCF function is streaming-only; SFT path unreliable (UNK collapse on validation after first training step). For streaming serving, Riva chunks a non-streaming base just fine. |

Other Parakeet/Conformer bases (1.1B, CTC, RNNT, `stt_en_conformer_ctc_large`) + decoder → NIM container mapping: `references/stage4-finetune.md`. If the user asks to fine-tune Nemotron Speech Streaming, warn about the collapse and recommend Parakeet TDT v2.
4d. Stock NeMo SFT

In the NeMo container, invoke `/opt/NeMo/examples/asr/speech_to_text_finetune.py` directly. No custom adapter logic. No patches. The stock NeMo SFT script is the verified working recipe.

Hyperparameters (verified on Parakeet TDT v2, 39-row manifest):

```
init_from_pretrained_model: nvidia/parakeet-tdt-0.6b-v2
precision: bf16-mixed          # required for TDT numerical stability
lr: 3e-4                       # CosineAnnealing schedule
warmup_steps: 5                # tiny manifest; bump to 500 at production scale
epochs: 3                      # smoke; 10-30 for production
batch_size: 4                  # fits 16 GB VRAM; raise to 16 on L40S 48 GB
gradient_clip_val: 1.0         # defensive
```

Container invocation: `docker run --gpus all --rm -it -v "$PWD:/workspace" nvcr.io/nvidia/nemo:25.11.01 python /opt/NeMo/examples/asr/speech_to_text_finetune.py` with `model.train_ds.manifest_filepath=/workspace/train.jsonl`, `model.validation_ds.manifest_filepath=/workspace/validation.jsonl`, `init_from_pretrained_model=nvidia/parakeet-tdt-0.6b-v2`, and the hyperparameter overrides listed above. Full docker-run line with config-path / config-name flags: `references/stage4-finetune.md` §Container invocation.

Manifest paths inside the container. Host paths (e.g. `$HOME/…`) don't resolve in `/workspace`. Rewrite snippet: `references/container-paths.md`.

The training run writes `adapted_model.nemo` and a `training_run_info.json` summary. Both go into a per-cycle subdirectory of the user's choice (e.g. `cycle<N>/models/<run>/`; the layout doesn't matter as long as it's consistent across cycles).
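Because the container only sees the directory mounted at `/workspace`, audio paths recorded on the host have to be re-rooted before training. The sketch below illustrates the idea; it is not the snippet from `references/container-paths.md`. The function name `rewrite_audio_paths`, the `audio_filepath` key, and the example host directory are assumptions for illustration.

```python
from pathlib import PurePosixPath


def rewrite_audio_paths(rows, host_root, container_root="/workspace", key="audio_filepath"):
    """Return copies of `rows` with `key` re-rooted from host_root to container_root.

    Raises ValueError if a path lies outside host_root, since such a file
    would not be visible through the bind mount.
    """
    host_root = PurePosixPath(host_root)
    container_root = PurePosixPath(container_root)
    rewritten = []
    for row in rows:
        relative = PurePosixPath(row[key]).relative_to(host_root)
        rewritten.append({**row, key: str(container_root / relative)})
    return rewritten


if __name__ == "__main__":
    # Hypothetical host layout: the mounted directory is /home/user/cycle1.
    rows = [{"audio_filepath": "/home/user/cycle1/audio/row1.wav", "text": "example"}]
    print(rewrite_audio_paths(rows, "/home/user/cycle1"))
```

Failing loudly on paths outside the mount is deliberate: a silently unreadable audio file would otherwise surface much later as a data-loading error inside the training run.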
4e. Offline cycle N+1 eval: close the loop

Re-transcribe the cycle's audio with the fine-tuned `.nemo` using NeMo's offline `transcribe()`. No Riva needed; this is measurement, not serving. NeMo's offline path runs the same encoder + decoder graph the Riva NIM eventually serves.

Sketch:

```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned checkpoint produced in Step 4d.
model = nemo_asr.models.ASRModel.restore_from("adapted_model.nemo")

# One path per manifest row; the two shown here are placeholders.
audio_paths = ["audio/row1.wav", "audio/row2.wav"]
hyps = model.transcribe(audio_paths)
```

Score the same four metrics (WER/CER/KER/SER) and the same five-section leaderboard the eval skill produces. Write them as `leaderboard_cycle<N+1>.md`. Compare against `leaderboard_cycle<N>.md`.

Decision table, cycle-N+1 vs cycle-N:

| Result | Action |
|---|---|
| KER dropped meaningfully on targeted categories (e.g. drug KER −20% or more, relative) | ✅ Keep the `.nemo` |
| KER moved a little, you wanted more | Loop back to `/digital-health-clinical-asr-build` |
| KER got worse | Overfit on a tiny manifest. Bail to `/digital-health-clinical-asr-build` |
| No measurable change | Some categories may already be in the base model's vocab. Sanity-check per-category numbers before concluding training "didn't help." |
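The relative change used in the decision table is (after − before) / before, and the table itself maps onto a short function. This is a sketch under stated assumptions: the function names are illustrative, the −20% keep threshold comes from the table's example, and the 1% band used to call a result "no measurable change" is an assumed value that this file does not specify.

```python
def relative_delta(before, after):
    """Relative KER change from cycle N to cycle N+1 (negative means improvement)."""
    if before <= 0:
        raise ValueError("baseline KER must be positive to compute a relative change")
    return (after - before) / before


def cycle_decision(before, after, keep_at=-0.20, no_change_band=0.01):
    """Map a per-category KER pair onto the four rows of the decision table."""
    delta = relative_delta(before, after)
    if abs(delta) < no_change_band:   # assumed band; tune to the manifest's noise level
        return "no_change"            # sanity-check per-category numbers
    if delta <= keep_at:
        return "keep"                 # keep the .nemo
    if delta < 0:
        return "grow"                 # moved a little: grow the manifest
    return "worse"                    # overfit: grow the manifest


if __name__ == "__main__":
    # Reference-manifest numbers quoted earlier in this file.
    print(round(relative_delta(0.513, 0.128), 2))
    print(cycle_decision(0.857, 0.214))
```

Run it per category rather than only on the aggregate, since an unchanged overall KER can hide one category improving while another regresses.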
4f. (Optional) Deploy as a Riva NIM
Hand the `.nemo` to `/riva-asr-custom`. Pass the source architecture explicitly: `/riva-asr-custom` can't reliably detect CTC vs RNNT vs TDT from the `.nemo` alone, and the wrong NIM container produces a broken RMIR with no clear error:

| Source decoder | | NIM container family |
|---|---|---|
| Conformer-CTC | | |
| Conformer-RNNT | | |
| Conformer-TDT (default) | | |
| Cache-Aware RNNT (Nemotron streaming) | | |

After deploy: re-run `/digital-health-clinical-asr-eval` against the new endpoint (`ASR_ENDPOINT=localhost:50051`) to validate that production-serving numbers match offline numbers. Any divergence is in Riva preprocessing or `riva-build` flags, not the model. Route to `/riva-asr-custom`.
Examples

Scenario A: gate met. User: "Drug KER 0.42, 130 rows. SFT?" → Yes (gate cleared). Use `parakeet-tdt-0.6b-v2` (verified 0.513 → 0.128). No local GPU? Step 4a (Brev) → 4b (split) → 4d (stock SFT) → 4e (offline re-eval). If cycle-2 drug KER drops ≥ 20% relative, keep the `.nemo`; otherwise back to `/digital-health-clinical-asr-build`.

Scenario B: Nemotron Streaming. User: "SFT `nvidia/nemotron-speech-streaming-en-0.6b`?" → No (UNK collapse). Substitute `parakeet-tdt-0.6b-v2`. Riva chunks non-streaming bases for streaming serving, so the base doesn't need to be streaming-native.

Scenario C: cycle 2 KER unchanged. User: "KER barely moved." → Back to `/digital-health-clinical-asr-build`. Signal density beats LR sweeps. If `magpie_g2p` rows are bad but `merriam-webster` rows are good, the gap is pronunciation coverage: `/digital-health-clinical-asr-build` Step 2d.
Artifacts produced
- `train.jsonl`, `validation.jsonl`: term-aware split (Step 4b)
- `adapted_model.nemo`: fine-tuned model (Step 4d)
- `training_run_info.json`: hyperparameters, dataset stats, end-of-train metrics
- `offline_hyps.jsonl`: cycle-N+1 transcription hypotheses (Step 4e)
- `leaderboard_cycle<N+1>.md`: cycle-N+1 five-section leaderboard
- (optional, after Step 4f) a deployed NIM endpoint (delegated to `/riva-asr-custom`)
Troubleshooting
- Stage 4 training collapses to all-UNK after first step → you're on the cache-aware streaming RNNT base (`nemotron-speech-streaming-en-0.6b`). Route to `nvidia/parakeet-tdt-0.6b-v2` (the recommended default) or `nvidia/stt_en_conformer_ctc_large` (legacy fallback). The streaming RNNT SFT path is broken; do not retry with different hyperparameters.
- Manifest paths don't resolve inside the NeMo container → host paths (e.g. `$HOME/…`) need rewriting to `/workspace/…`. See `references/container-paths.md` for the rewrite snippet.
- Cycle N+1 KER unchanged from cycle N → on `parakeet-tdt-0.6b-v2` with the recipe above, this almost always means manifest signal density is too low. Grow the manifest first; don't sweep LR. (If you're on an older adapter-style recipe instead of stock SFT, the adapter weights may not have moved off zero-init; switch to stock SFT.)
- Cycle N+1 KER got worse → overfit on a tiny manifest. Bail to `/digital-health-clinical-asr-build` and grow.
- Riva-served numbers diverge from offline numbers → the gap is in Riva preprocessing or `riva-build` flags, not the model. Route to `/riva-asr-custom`.
- `bf16-mixed` precision errors → pre-Ampere GPUs (Turing, Volta) don't support BF16. Drop to `fp32` and reduce `batch_size`. Use `fp16-mixed` only if `fp32` is too slow; fp16 with TDT decoders can produce NaN losses, so check loss curves early.
- OOM during training on 24 GB GPU → drop `batch_size` to 2, raise `accumulate_grad_batches` to 2 to keep the effective batch size constant.
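The OOM advice above relies on the effective batch size staying constant when per-step batch size is traded for gradient accumulation. A minimal sketch of that arithmetic follows; the function name and the `num_devices` parameter are illustrative additions, not settings named in this file.

```python
def effective_batch_size(batch_size, accumulate_grad_batches=1, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return batch_size * accumulate_grad_batches * num_devices


if __name__ == "__main__":
    baseline = effective_batch_size(batch_size=4)
    after_oom_fix = effective_batch_size(batch_size=2, accumulate_grad_batches=2)
    # Both settings feed the same number of samples into each optimizer step.
    print(baseline, after_oom_fix)
```

Keeping this product fixed means the learning rate and warmup values from Step 4d need no retuning after the change; only peak memory per step goes down.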
Limitations
- Adapter-style SFT on TDT/RNNT decoders is broken. Empirically confirmed: an earlier LinearAdapter-mixin recipe produces 72 NaN tensors at any LR on TDT and RNNT decoders. Resolved by switching to NeMo's stock full-model SFT (`speech_to_text_finetune.py`), which is what this skill recommends. Do not attempt adapter SFT on TDT/RNNT bases.
- Don't SFT `nemotron-speech-streaming-en-0.6b`. The streaming-only NVCF function's SFT path is unreliable (UNK collapse). For streaming serving at deploy time, Riva chunks a non-streaming base.
- Tiny manifests overfit fast. Below ~100 rows total or ~5 rows per priority category, cycle-N+1 numbers are noisy. Grow before trusting a small KER drop.
- English-only by default. The base-model table is en-US-specific. Other locales need a different base + a re-validated SFT recipe.
- No turn-key driver. The user writes their own training-driver layout: output paths, run naming, leaderboard re-rendering. The methodology and recipes transfer; exact cycle-1 numbers depend on the user's manifest.
Next steps
- Deploy the `.nemo` as a NIM: `/riva-asr-custom` (pass the source architecture explicitly).
- Grow the manifest for cycle N+2: `/digital-health-clinical-asr-build`.
- Re-score the cycle: `/digital-health-clinical-asr-eval` (against the new endpoint or the new `.nemo` directly).
- Lateral for word boosting / LM fusion / non-clinical SFT recipes: `/finetune-asr`.
References
- `references/stage4-finetune.md`: base-model selection table, hyperparameter rationale, decoder → NIM container mapping, decision tree comparing cycle-N+1 to cycle-N
- `references/container-paths.md`: host → `/workspace/` path rewriting for cross-host manifest portability (laptop ↔ Brev ↔ NeMo container)