<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
Clinical ASR Flywheel — Stage 4 (Fine-tune)
⚠ Agent: read this entire SKILL.md before answering. The Critical-workflow-rules section, the base-model table (§4c), the stock-NeMo-SFT recipe (§4d), and the cycle-N+1 decision table (§4e) are all load-bearing — the do-not-SFT bases and broken-adapter warnings live there.
Agent: this file is self-contained. The Stage 4 gate criteria, base-model recommendation, hyperparameter table, container invocation pattern, and cycle-N+1 decision table are all below.
Do not run file-discovery commands or open `references/stage4-finetune.md` before answering methodology questions; the reference is deep-dive material, not required reading. Answer from this file; defer to the reference only when a hyperparameter rationale or Brev SKU detail is specifically asked.
You are the adapt-and-measure stage. The user arrives from `/digital-health-clinical-asr-eval` with a manifest, a baseline KER number, and the decision tree's recommendation that fine-tuning is worth the GPU time. You run stock NeMo SFT, do an offline cycle N+1 re-eval to confirm that the loop closed, and optionally hand the resulting `.nemo` model to the deploy skill for production serving.
The cycle KER from offline eval is the measurement that closes the loop. Riva NIM deploy validates serving (latency, streaming, scale), not model quality.
Empirically verified on the reference manifest (39 rows, Parakeet TDT v2):
Baseline KER 0.513 → after 3 epochs of stock SFT: 0.128 (-75% relative).
Drug names: 0.857 → 0.214. Conditions: 0.500 → 0.000. Procedures: 0.250 → 0.000.
Critical workflow rules (apply on every activation)
Surface these facts in any response, even if the user asks a narrow question:
- Read this entire SKILL.md before answering. The base-model selection table, hyperparameter values, and the cycle-N+1 decision table are below — they are the load-bearing parts.
- Verified result: Parakeet TDT v2 with the recipe in §4d achieves KER 0.513 → 0.128 (−75% relative) in 3 epochs on the reference manifest. Cite this when the user asks whether SFT will help.
- Recipe is `/opt/NeMo/examples/asr/speech_to_text_finetune.py` inside `nvcr.io/nvidia/nemo:25.11.01`. Stock script, no patches, no custom adapter logic. The adapter-mixin path is broken on TDT/RNNT decoders (72 NaN tensors at any LR); do not propose it.
- Recommended base is `nvidia/parakeet-tdt-0.6b-v2`. The full base-model table is in §4c.
- Do NOT fine-tune `nvidia/nemotron-speech-streaming-en-0.6b`. The streaming NVCF function's SFT path is broken (UNK collapse on validation after step 1). For streaming serving at deploy time, Riva chunks a non-streaming base just fine. Warn the user proactively if they propose it.
- Gate the recommendation. Stage 4 only fires when priority-category KER > 0.3 and the manifest has ≥ 100 rows (≥ 5 per priority category). Below those thresholds, route back to `/digital-health-clinical-asr-build` to grow the manifest first.
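The gate in the last rule can be sketched as a small check. This is a minimal illustration, assuming the caller already has the priority-category KER, the total row count, and per-category row counts; the function name `stage4_gate`, its arguments, and the example category counts are hypothetical and not part of any skill API.

```python
def stage4_gate(priority_ker, total_rows, rows_per_priority_category,
                ker_threshold=0.3, min_rows=100, min_rows_per_category=5):
    """Return (ok, reasons) for the Stage 4 gate: KER > 0.3, >= 100 rows, >= 5 per category."""
    reasons = []
    if not priority_ker > ker_threshold:
        reasons.append(f"priority-category KER {priority_ker} is not > {ker_threshold}")
    if total_rows < min_rows:
        reasons.append(f"manifest has {total_rows} rows, need >= {min_rows}")
    for cat, n in sorted(rows_per_priority_category.items()):
        if n < min_rows_per_category:
            reasons.append(f"category {cat!r} has {n} rows, need >= {min_rows_per_category}")
    return (not reasons, reasons)


# Hypothetical counts for illustration: drug KER 0.42 on a 130-row manifest.
ok, reasons = stage4_gate(0.42, 130, {"drug": 40, "condition": 50, "procedure": 40})
print(ok, reasons)
```

When the gate fails, the `reasons` list gives the user something concrete to fix in `/digital-health-clinical-asr-build` before retrying.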
Purpose
Run stock NeMo SFT (no custom adapter logic, no patches) in `nvcr.io/nvidia/nemo:25.11.01` against a term-aware, row-disjoint train/val split, produce a `.nemo` model, and re-eval offline as cycle N+1. Decide based on the cycle-N → cycle-N+1 KER delta whether to keep the model, grow the manifest, or accept that fine-tuning didn't help. Optionally hand the `.nemo` to the deploy skill for NIM deploy.
When to use this skill
Activate on user phrases like:
- "Fine-tune ASR on my clinical vocabulary"
- "Improve ASR on medication names"
- "We have a KER of 0.4, can we fine-tune?"
- "Run SFT on my Parakeet TDT base"
- "Train a clinical ASR adapter"
- "Compare cycle 1 vs cycle 2 KER"
- "Deploy my fine-tuned model as a NIM" (this skill prepares the and routes to for the deploy)
Do not activate when:
- The user hasn't scored a baseline yet → `/digital-health-clinical-asr-eval`
- The user doesn't have a manifest → `/digital-health-clinical-asr-build`
- The user wants generic word boosting / LM fusion (not SFT) →
- The user has a and only wants to deploy →
Prerequisites
- A cycle-N manifest + cycle-N eval result from `/digital-health-clinical-asr-eval`. The priority-category KER must be > 0.3 (Stage 4 gate). The manifest should have ≥ 100 rows total, and ≥ 5 rows per priority category, for a believable post-tune signal.
- A CUDA host — 24 GB VRAM is comfortable for Parakeet TDT 0.6B at with ; 16 GB works with smaller batch. No local GPU? Use Brev — recommended SKU is L40S 48 GB.
- The NeMo container: `nvcr.io/nvidia/nemo:25.11.01`. Pull once: `docker pull nvcr.io/nvidia/nemo:25.11.01`.
- NVIDIA Container Toolkit + Docker — covered by if not already installed.
- A train/val split stratified by `entity_category` (recipe sketch in Step 4b below).
- installed if you intend to deploy. Pure-research SFT runs without it.
Instructions
4a. Provision a GPU host (skip if you already have one)
Stage 4 needs a CUDA host with ≥ 16 GB VRAM (24 GB comfortable). If you have a local one that fits, skip this section. If not, use Brev — NVIDIA's per-second-billed GPU host service. Recommended SKU: L40S 48 GB.
Cost disclosure: surface this to the user before any `brev create`. L40S 48 GB runs ~$1.50/hr at time of writing; a 3-epoch SFT run on a 100-row manifest finishes in 15–30 minutes ($0.40–$0.75 of compute). The real risk is forgetting to stop the instance: 24 hours of idle on L40S is ~$36, and a week of idle is ~$250. Mitigations: (a) always wrap the workflow in a script that ends by stopping or deleting the instance; (b) set a calendar reminder when you start; (c) use `brev delete` instead of `brev stop` if you don't need to keep the disk (`brev stop` keeps disk at $0.10/GB-month, so 200 GB ≈ $20/month of latent cost). Confirm the user accepts the per-hour cost shape and the idle risk before spinning anything up.
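The figures in the cost disclosure follow from simple rate arithmetic. A minimal sketch, assuming the quoted rates ($1.50/hr compute, $0.10/GB-month disk) still hold; the function names are illustrative. Note that the ~$36 idle figure corresponds to a full 24 hours at the quoted rate.

```python
HOURLY_RATE_USD = 1.50             # L40S 48 GB, "at time of writing" per the text
DISK_RATE_USD_PER_GB_MONTH = 0.10  # disk retained by a stopped instance


def compute_cost(minutes, hourly_rate=HOURLY_RATE_USD):
    """Cost in USD of a running instance for the given number of minutes."""
    return hourly_rate * minutes / 60.0


def disk_cost_per_month(disk_gb, rate=DISK_RATE_USD_PER_GB_MONTH):
    """Monthly cost in USD of the disk kept by a stopped instance."""
    return rate * disk_gb


print(f"15 min run:  ${compute_cost(15):.2f}")
print(f"30 min run:  ${compute_cost(30):.2f}")
print(f"24 h idle:   ${compute_cost(24 * 60):.2f}")
print(f"7 days idle: ${compute_cost(7 * 24 * 60):.2f}")
print(f"200 GB disk: ${disk_cost_per_month(200):.2f}/month")
```

Re-run the numbers with the current rate before quoting them to a user, since the hourly price is a point-in-time value.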
Full setup walkthrough (CLI install via download-then-run rather than curl-pipe, SKU choice, disk sizing, SSH config) is in `references/stage4-finetune.md` (§Brev provisioning).
Short happy-path once the CLI is installed. Do not run `brev create` until the user has explicitly typed `YES` at the confirmation prompt below. The gate is mandatory, not advisory, because everything after it bills against the user's account by the second:
```bash
brev login                                  # browser auth

# Mandatory cost-confirmation gate: do NOT skip or auto-answer this.
echo "About to provision: digital-health-clinical-asr-sft on L40S 48 GB."
echo "Cost shape: ~\$1.50/hr while running; ~\$36/day if left idle; ~\$20/mo disk if you 'stop' instead of 'delete'."
read -rp "Type YES to provision (anything else cancels): " confirm
[ "$confirm" = "YES" ] || { echo "Cancelled: no GPU instance was created."; exit 1; }

brev create digital-health-clinical-asr-sft \
  --gpu l40s:1 --image ubuntu-22-04-cuda-12-4 --disk 200gi
brev ssh-config                             # writes ~/.ssh/config entries
rsync -avz ./cycle1/ digital-health-clinical-asr-sft:~/cycle1/
brev shell digital-health-clinical-asr-sft  # drops into the instance

# Run the next two commands inside the instance:
nvidia-smi                                  # confirm GPU
docker pull nvcr.io/nvidia/nemo:25.11.01    # ~12 GB, once per instance
```
When done, always halt billing: `brev stop digital-health-clinical-asr-sft` (keeps disk) or `brev delete digital-health-clinical-asr-sft` (frees it). For path rewriting laptop → Brev → NeMo container, see `references/container-paths.md`.
4b. Term-aware train/val split
Row-disjoint, stratified by `entity_category`, default val fraction 0.2.
The same term may appear on both sides via different rows (different voice, context, noise). That's expected and desirable: it measures acoustic + contextual robustness on the trained vocabulary, which is the standard ASR adaptation metric.
Singleton categories (one row total) get forced to train with a warning. If any priority category has < 5 rows, bail to `/digital-health-clinical-asr-build`; held-out validation will be too noisy to attribute movement.
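Sketch:
```python
import json
import random
from collections import defaultdict

# Load manifest.jsonl into a list of dicts, one per row.
with open("manifest.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

random.seed(42)  # fixed seed so the split is reproducible across cycles

# Group rows by category so each category is split independently.
by_cat = defaultdict(list)
for r in rows:
    by_cat[r["entity_category"]].append(r)

train, val = [], []
for cat, cat_rows in by_cat.items():
    random.shuffle(cat_rows)
    if len(cat_rows) < 2:
        # A singleton cannot be split; keep it in train and say so.
        train.extend(cat_rows)
        print(f"warning: singleton category {cat}, forced to train")
        continue
    n_val = max(1, int(0.2 * len(cat_rows)))  # at least one val row per category
    val.extend(cat_rows[:n_val])
    train.extend(cat_rows[n_val:])
```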
Write `train.jsonl` and `validation.jsonl` alongside the manifest. These are the inputs to `speech_to_text_finetune.py`.
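Writing the two split files could look like the sketch below. It assumes each row carries the standard NeMo manifest key `audio_filepath` and that this key identifies a row uniquely; the file names `train.jsonl` and `validation.jsonl` are taken from the invocation in §4d, and the helper names `write_jsonl` and `write_split` are hypothetical.

```python
import json
from pathlib import Path


def write_jsonl(rows, path):
    """Write one JSON object per line, the layout NeMo manifests use."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


def write_split(train, val, out_dir="."):
    """Write train.jsonl and validation.jsonl after checking the split is row-disjoint."""
    train_keys = {r["audio_filepath"] for r in train}
    val_keys = {r["audio_filepath"] for r in val}
    overlap = train_keys & val_keys
    if overlap:
        raise ValueError(f"split is not row-disjoint: {sorted(overlap)}")
    out_dir = Path(out_dir)
    return (write_jsonl(train, out_dir / "train.jsonl"),
            write_jsonl(val, out_dir / "validation.jsonl"))
```

The disjointness check is a cheap guard: a row leaking into both files would inflate the cycle-N+1 numbers without any visible error.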
4c. Choose the base model
| Base | SFT viability | Notes |
|---|---|---|
| `nvidia/parakeet-tdt-0.6b-v2` | ✅ Empirically verified (KER 0.513 → 0.128 in 3 epochs, −75% relative) | NVIDIA's current English ASR default. Stock NeMo SFT recipe works end-to-end. Recommended. |
| `nvidia/nemotron-speech-streaming-en-0.6b` | ❌ Don't use for SFT | NVCF function is streaming-only; SFT path unreliable (UNK collapse on validation after first training step). For streaming serving, Riva chunks a non-streaming base just fine. |

Other Parakeet/Conformer bases (1.1B, CTC, RNNT, `stt_en_conformer_ctc_large`) + decoder → NIM container mapping: `references/stage4-finetune.md`. If the user asks to fine-tune Nemotron Speech Streaming, warn about the collapse and recommend Parakeet TDT v2.
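The "warn and substitute" rule can be expressed as a small guard that runs before any training is launched. A minimal sketch, assuming only the two bases named in the table; the names `check_base`, `DO_NOT_SFT`, and `RECOMMENDED_BASE` are illustrative.

```python
RECOMMENDED_BASE = "nvidia/parakeet-tdt-0.6b-v2"

# Bases this skill says not to fine-tune, with the reason to show the user.
DO_NOT_SFT = {
    "nvidia/nemotron-speech-streaming-en-0.6b":
        "SFT path collapses to UNK on validation after the first training step",
}


def check_base(requested):
    """Return (base_to_use, warning_or_None) for a requested base model."""
    reason = DO_NOT_SFT.get(requested)
    if reason is not None:
        return RECOMMENDED_BASE, f"{requested}: {reason}; use {RECOMMENDED_BASE} instead"
    return requested, None


base, warning = check_base("nvidia/nemotron-speech-streaming-en-0.6b")
print(base)
print(warning)
```

Bases not listed in the table pass through unchanged here; their SFT viability still has to be looked up in the reference.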
4d. Stock NeMo SFT
In the NeMo container, invoke `/opt/NeMo/examples/asr/speech_to_text_finetune.py` directly. No custom adapter logic. No patches. The stock NeMo SFT script is the verified working recipe.
Hyperparameters (verified on Parakeet TDT v2, 39-row manifest):
```yaml
init_from_pretrained_model: nvidia/parakeet-tdt-0.6b-v2
precision: bf16-mixed      # required for TDT numerical stability
lr: 3e-4                   # CosineAnnealing schedule
warmup_steps: 5            # tiny manifest; bump to 500 at production scale
epochs: 3                  # smoke; 10-30 for production
batch_size: 4              # fits 16 GB VRAM; raise to 16 on L40S 48 GB
gradient_clip_val: 1.0     # defensive
```
Container invocation: `docker run --gpus all --rm -it -v "$PWD:/workspace" nvcr.io/nvidia/nemo:25.11.01 python /opt/NeMo/examples/asr/speech_to_text_finetune.py` with `model.train_ds.manifest_filepath=/workspace/train.jsonl`, `model.validation_ds.manifest_filepath=/workspace/validation.jsonl`, `init_from_pretrained_model=nvidia/parakeet-tdt-0.6b-v2`, and the hyperparameter overrides listed above. Full docker-run line with config-path / config-name flags: `references/stage4-finetune.md` §Container invocation.
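For a driver script, the invocation can be assembled as an argument list rather than a hand-edited string. This is a sketch under stated assumptions: only the three overrides quoted above are hard-coded; the Hydra key names for the remaining hyperparameters and the config-path / config-name flags are not confirmed here and are passed through unchanged via `extra_overrides`. The function name and the example host directory are hypothetical.

```python
import shlex

IMAGE = "nvcr.io/nvidia/nemo:25.11.01"
SCRIPT = "/opt/NeMo/examples/asr/speech_to_text_finetune.py"


def build_sft_command(workdir, extra_overrides=()):
    """Assemble the docker-run argument list for the stock SFT script."""
    overrides = [
        # These three overrides are the ones named in the text.
        "model.train_ds.manifest_filepath=/workspace/train.jsonl",
        "model.validation_ds.manifest_filepath=/workspace/validation.jsonl",
        "init_from_pretrained_model=nvidia/parakeet-tdt-0.6b-v2",
    ]
    # Hyperparameter overrides are appended verbatim; take their exact
    # key names from the reference, they are not asserted here.
    overrides.extend(extra_overrides)
    return [
        "docker", "run", "--gpus", "all", "--rm", "-it",
        "-v", f"{workdir}:/workspace",
        IMAGE, "python", SCRIPT, *overrides,
    ]


cmd = build_sft_command("/home/user/cycle1")  # hypothetical host directory
print(shlex.join(cmd))
```

The list form can be handed directly to `subprocess.run(cmd, check=True)`, which avoids shell-quoting mistakes in paths that contain spaces.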
Manifest paths inside the container. Host paths don't resolve inside the container, which only sees the `/workspace` mount. Rewrite snippet: `references/container-paths.md`.
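As an illustration of the rewrite (not the snippet from `references/container-paths.md`, which remains the authoritative version), a path mapper might look like the following. It assumes manifest rows use the standard NeMo key `audio_filepath` and that the host directory is mounted at `/workspace` as in the invocation above; the function names and example paths are hypothetical.

```python
import json


def to_container_path(host_path, host_root, container_root="/workspace"):
    """Map a host path under host_root to the same file under the container mount."""
    host_root = host_root.rstrip("/")
    if host_path != host_root and not host_path.startswith(host_root + "/"):
        raise ValueError(f"{host_path} is not under {host_root}")
    return container_root.rstrip("/") + host_path[len(host_root):]


def rewrite_manifest(src, dst, host_root, container_root="/workspace"):
    """Copy a JSONL manifest, rewriting each row's audio_filepath."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            row["audio_filepath"] = to_container_path(
                row["audio_filepath"], host_root, container_root)
            fout.write(json.dumps(row, ensure_ascii=False) + "\n")


print(to_container_path("/home/user/cycle1/audio/row1.wav", "/home/user/cycle1"))
```

Raising on paths outside the mounted root is deliberate: a silently unmapped path would only surface later as a file-not-found inside the training run.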
The training run writes the fine-tuned `.nemo` model and a run summary. Both go into a per-cycle subdirectory of the user's choice (the layout doesn't matter as long as it's consistent across cycles).
4e. Offline cycle N+1 eval — close the loop
Re-transcribe the cycle's audio with the fine-tuned `.nemo` using NeMo's offline `transcribe()`. No Riva needed: this is measurement, not serving. NeMo's offline path runs the same encoder + decoder graph the Riva NIM eventually serves.
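Sketch:
```python
import nemo.collections.asr as nemo_asr

# Restore the fine-tuned checkpoint written in Step 4d.
model = nemo_asr.models.ASRModel.restore_from("adapted_model.nemo")
# The two paths are placeholders; pass the full list of the cycle's audio files.
hyps = model.transcribe(["audio/row1.wav", "audio/row2.wav"])
```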
Score the same four metrics (WER/CER/KER/SER) and the same five-section leaderboard the eval skill produces. Write them as `leaderboard_cycle<N+1>.md`. Compare against `leaderboard_cycle<N>.md`.
Decision table, cycle-N+1 vs cycle-N:
| Result | Action |
|---|---|
| KER dropped meaningfully on targeted categories (e.g. drug KER −20% or more, relative) | ✅ Keep the `.nemo`. Update the leaderboard. Advance to Step 4f if you want to deploy. |
| KER moved a little, you wanted more | Loop back to `/digital-health-clinical-asr-build`, expand the manifest. Tiny manifests rarely benefit from hyperparameter tweaks; signal density beats LR sweeps. |
| KER got worse | Overfit on a tiny manifest. Bail to `/digital-health-clinical-asr-build` and grow before retraining. Don't tune harder on the same data. |
| No measurable change | Some categories may already be in the base model's vocab. Sanity-check per-category numbers before concluding training "didn't help." |
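The decision table can be read as a function of the relative KER change between cycles. A minimal sketch: the −20% keep threshold comes from the example in the first row, while the `noise_band` that separates "no measurable change" from "moved a little" is an assumption, since the text gives no number for it. Function and constant names are illustrative.

```python
KEEP = "keep: update leaderboard, optionally deploy (4f)"
SMALL_GAIN = "small gain: grow the manifest"
WORSE = "worse: grow the manifest before retraining"
NO_CHANGE = "no change: sanity-check per-category numbers"


def relative_change(ker_before, ker_after):
    """Relative KER change between cycles; negative means KER dropped."""
    if ker_before <= 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (ker_after - ker_before) / ker_before


def cycle_decision(ker_before, ker_after, keep_threshold=-0.20, noise_band=0.02):
    """Map a cycle-N -> cycle-N+1 KER pair onto one row of the decision table."""
    if ker_before == 0:
        # A zero baseline cannot improve; any increase is a regression.
        return NO_CHANGE if ker_after == 0 else WORSE
    change = relative_change(ker_before, ker_after)
    if change <= keep_threshold:
        return KEEP
    if abs(change) <= noise_band:  # assumed band, not from the source
        return NO_CHANGE
    return WORSE if change > 0 else SMALL_GAIN


# Verified reference result from this document: 0.513 -> 0.128.
print(round(relative_change(0.513, 0.128), 2))
print(cycle_decision(0.513, 0.128))
```

Apply it per category as well as overall: an aggregate "no change" can hide one category improving while another regresses.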
4f. (Optional) Deploy as a Riva NIM
Hand the `.nemo` to the deploy skill. Pass the source architecture explicitly: the deploy tooling can't reliably detect CTC vs RNNT vs TDT from the `.nemo` alone, and the wrong NIM container produces a broken RMIR with no clear error:
| Source decoder | flag | NIM container family |
|---|---|---|
| Conformer-CTC | | |
| Conformer-RNNT | | |
| Conformer-TDT (default) | | |
| Cache-Aware RNNT (Nemotron streaming) | | ⚠ SFT broken on this base, see Limitations |
After deploy: re-run `/digital-health-clinical-asr-eval` against the new endpoint (`ASR_ENDPOINT=localhost:50051`) to validate that production-serving numbers match offline numbers. Any divergence is in Riva preprocessing or deployment flags, not the model, so route it to the deploy skill.
Examples
Scenario A: gate met. User: "Drug KER 0.42, 130 rows. SFT?" → Yes (gate cleared). Base: `nvidia/parakeet-tdt-0.6b-v2` (verified 0.513 → 0.128). No local GPU? Step 4a (Brev) → 4b (split) → 4d (stock SFT) → 4e (offline re-eval). If cycle-2 drug KER drops ≥ 20% relative, keep the `.nemo`; otherwise back to `/digital-health-clinical-asr-build`.
Scenario B: Nemotron Streaming. User: "SFT `nvidia/nemotron-speech-streaming-en-0.6b`?" → No (UNK collapse). Substitute `nvidia/parakeet-tdt-0.6b-v2`. Riva chunks non-streaming bases for streaming serving, so the base doesn't need to be streaming-native.
Scenario C — cycle 2 KER unchanged. User:
"KER barely moved." → Back to
/digital-health-clinical-asr-build
. Signal density beats LR sweeps. If
rows are bad but
rows are good, the gap is pronunciation-coverage —
/digital-health-clinical-asr-build
Step 2d.
Artifacts produced
- `train.jsonl`, `validation.jsonl`: term-aware split (Step 4b)
- the fine-tuned `.nemo` model (Step 4d)
- a run summary: hyperparameters, dataset stats, end-of-train metrics
- cycle-N+1 transcription hypotheses (Step 4e)
- `leaderboard_cycle<N+1>.md`: cycle-N+1 five-section leaderboard
- (optional, after Step 4f) a deployed NIM endpoint (delegated to the deploy skill)
Troubleshooting
- Stage 4 training collapses to all-UNK after first step → you're on the cache-aware streaming RNNT base (`nemotron-speech-streaming-en-0.6b`). Route to `nvidia/parakeet-tdt-0.6b-v2` (the recommended default) or `nvidia/stt_en_conformer_ctc_large` (legacy fallback). The streaming RNNT SFT path is broken; do not retry with different hyperparameters.
- Manifest paths don't resolve inside the NeMo container → host paths need rewriting to container paths under `/workspace`. See `references/container-paths.md` for the rewrite snippet.
- Cycle N+1 KER unchanged from cycle N → on `nvidia/parakeet-tdt-0.6b-v2` with the recipe above, this almost always means manifest signal density is too low. Grow the manifest first; don't sweep LR. (If you're on an older adapter-style recipe instead of stock SFT, the adapter weights may not have moved off zero-init; switch to stock SFT.)
- Cycle N+1 KER got worse → overfit on a tiny manifest. Bail to `/digital-health-clinical-asr-build` and grow.
- Riva-served numbers diverge from offline numbers → the gap is in Riva preprocessing or deployment flags, not the model. Route to the deploy skill.
- bf16 precision errors → pre-Ampere GPUs (Turing, Volta) don't support BF16. Drop to fp32 and reduce the batch size. Use fp16 only if fp32 is too slow: fp16 with TDT decoders can produce NaN losses, so check loss curves early.
- OOM during training on 24 GB GPU → drop the batch size to 2 and raise gradient accumulation to 2 to keep the effective batch size constant.
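The OOM mitigation trades per-step batch size for gradient accumulation. The arithmetic is sketched below, assuming the two settings involved are the per-step batch size and the gradient-accumulation count and that training runs on a single GPU; the function names are illustrative.

```python
def effective_batch_size(per_step_batch, grad_accum_steps, num_gpus=1):
    """Number of samples contributing to each optimizer update."""
    return per_step_batch * grad_accum_steps * num_gpus


def accumulation_for(target_effective, per_step_batch, num_gpus=1):
    """Smallest accumulation count that reaches at least the target effective batch."""
    per_update = per_step_batch * num_gpus
    return -(-target_effective // per_update)  # ceiling division


# Recipe default: batch 4, no accumulation. After an OOM: batch 2, accumulate 2.
assert effective_batch_size(4, 1) == effective_batch_size(2, 2)
print(accumulation_for(4, 2))
```

Keeping the effective batch constant means the learning rate and warmup values from the recipe do not need to be retuned after the change.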
Limitations
- Adapter-style SFT on TDT/RNNT decoders is broken. Empirically confirmed: an earlier LinearAdapter-mixin recipe produced 72 NaN tensors at any LR on TDT and RNNT decoders. Resolved by switching to NeMo's stock full-model SFT (`speech_to_text_finetune.py`), which is what this skill recommends. Do not attempt adapter SFT on TDT/RNNT bases.
- Don't SFT `nemotron-speech-streaming-en-0.6b`. The streaming-only NVCF function's SFT path is unreliable (UNK collapse). For streaming serving at deploy time, Riva chunks a non-streaming base.
- Tiny manifests overfit fast. Below ~100 rows total or ~5 rows per priority category, cycle-N+1 numbers are noisy. Grow before trusting a small KER drop.
- English-only by default. The base-model table is en-US-specific. Other locales need a different base + a re-validated SFT recipe.
- No turn-key driver. The user writes their own training-driver layout — output paths, run naming, leaderboard re-rendering. The methodology and recipes transfer; exact cycle-1 numbers depend on the user's manifest.
Next steps
- Deploy the `.nemo` as a NIM via the deploy skill (pass the source architecture explicitly).
- Grow the manifest for cycle N+2: `/digital-health-clinical-asr-build`.
- Re-score the cycle: `/digital-health-clinical-asr-eval` (against the new endpoint or the new `.nemo` directly).
- Lateral for word boosting / LM fusion / non-clinical SFT recipes: .
References
- `references/stage4-finetune.md`: base-model selection table, hyperparameter rationale, decoder → NIM container mapping, decision tree comparing cycle-N+1 to cycle-N
- `references/container-paths.md`: host → container path rewriting for cross-host manifest portability (laptop ↔ Brev ↔ NeMo container)