Clinical ASR Flywheel — Stage 4 (Fine-tune)

⚠ Agent: read this entire SKILL.md before answering. The Critical-workflow-rules section, the base-model table (§4c), the stock-NeMo-SFT recipe (§4d), and the cycle-N+1 decision table (§4e) are all load-bearing — the do-not-SFT bases and broken-adapter warnings live there.

Agent: this file is self-contained. The Stage 4 gate criteria, base-model recommendation, hyperparameter table, container invocation pattern, and cycle-N+1 decision table are all below. Do not run file-discovery commands or open `references/stage4-finetune.md` before answering methodology questions: the reference is deep-dive material, not required reading. Answer from this file; defer to the reference only when a hyperparameter rationale or Brev SKU detail is specifically asked.

You are the adapt-and-measure stage. The user arrives from `/digital-health-clinical-asr-eval` with a manifest, a baseline KER number, and the decision tree's recommendation that fine-tuning is worth the GPU time. You run stock NeMo SFT, do an offline cycle N+1 re-eval to measure whether the loop closed, and optionally hand the resulting `.nemo` to `/riva-asr-custom` for production serving.

The cycle KER from offline eval is the measurement that closes the loop. Riva NIM deploy validates serving (latency, streaming, scale), not model quality.

Empirically verified on the reference manifest (39 rows, Parakeet TDT v2): baseline KER 0.513 → 0.128 after 3 epochs of stock SFT (−75% relative). Drug names: 0.857 → 0.214. Conditions: 0.500 → 0.000. Procedures: 0.250 → 0.000.
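
The relative figure quoted above can be re-derived from the before/after KER values. The sketch below is only the arithmetic; the helper name `relative_change` is illustrative and not part of any skill or NeMo API.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`; negative means KER dropped."""
    if before <= 0:
        raise ValueError("baseline KER must be positive")
    return (after - before) / before


# Before/after KER values quoted in the text for the reference manifest.
reported = {
    "overall": (0.513, 0.128),
    "drug names": (0.857, 0.214),
    "conditions": (0.500, 0.000),
    "procedures": (0.250, 0.000),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_change(before, after):+.0%}")
```

The same helper applies to any cycle-N → cycle-N+1 comparison, per category or overall.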

Critical workflow rules (apply on every activation)

Surface these facts in any response, even if the user asks a narrow question:

Read this entire SKILL.md before answering. The base-model selection table, hyperparameter values, and the cycle-N+1 decision table are below; they are the load-bearing parts.
Verified result: Parakeet TDT v2 with the recipe in §4d achieves KER 0.513 → 0.128 (−75% relative) in 3 epochs on the reference manifest. Cite this when the user asks whether SFT will help.
Recipe is `/opt/NeMo/examples/asr/speech_to_text_finetune.py` inside `nvcr.io/nvidia/nemo:25.11.01`. Stock script, no patches, no custom adapter logic. The adapter-mixin path is broken on TDT/RNNT decoders (72 NaN tensors at any LR); do not propose it.
Recommended base is `nvidia/parakeet-tdt-0.6b-v2`. The full base-model table is in §4c.
Do NOT fine-tune `nvidia/nemotron-speech-streaming-en-0.6b`. The streaming NVCF function's SFT path is broken (UNK collapse on validation after step 1). For streaming serving at deploy time, Riva chunks a non-streaming base just fine. Warn the user proactively if they propose it.
Gate the recommendation. Stage 4 only fires when priority-category KER > 0.3 and the manifest has ≥ 100 rows (≥ 5 per priority category). Below those thresholds, route back to `/digital-health-clinical-asr-build` to grow the manifest first.
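
The gate can be expressed as a small check. This is a minimal sketch, assuming manifest rows are dicts carrying an `entity_category` key (as in Step 4b) and that the priority-category KER arrives as a single number; `stage4_gate` and its parameters are hypothetical names, not part of the skill.

```python
from collections import Counter

KER_GATE = 0.3        # priority-category KER must exceed this
MIN_ROWS = 100        # minimum manifest size
MIN_ROWS_PER_CAT = 5  # minimum rows per priority category


def stage4_gate(priority_ker, rows, priority_categories):
    """Return (ok, reasons); `reasons` lists every threshold that was missed."""
    counts = Counter(r["entity_category"] for r in rows)
    reasons = []
    if priority_ker <= KER_GATE:
        reasons.append(f"priority KER {priority_ker:.3f} is not above {KER_GATE}")
    if len(rows) < MIN_ROWS:
        reasons.append(f"manifest has {len(rows)} rows, need at least {MIN_ROWS}")
    for cat in priority_categories:
        if counts[cat] < MIN_ROWS_PER_CAT:
            reasons.append(
                f"category {cat!r} has {counts[cat]} rows, need at least {MIN_ROWS_PER_CAT}"
            )
    return (not reasons, reasons)
```

An empty `reasons` list means Stage 4 may proceed; otherwise the reasons can be shown to the user alongside the route back to `/digital-health-clinical-asr-build`.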

Purpose

Run stock NeMo SFT (no custom adapter logic, no patches) in `nvcr.io/nvidia/nemo:25.11.01` against a term-aware, row-disjoint train/val split, produce a `.nemo` model, and re-eval offline as cycle N+1. Decide based on the cycle-N → cycle-N+1 KER delta whether to keep the model, grow the manifest, or accept that fine-tuning didn't help. Optionally hand the `.nemo` to `/riva-asr-custom` for NIM deploy.

When to use this skill

Activate on user phrases like:

"Fine-tune ASR on my clinical vocabulary"
"Improve ASR on medication names"
"We have a KER of 0.4, can we fine-tune?"
"Run SFT on my Parakeet TDT base"
"Train a clinical ASR adapter"
"Compare cycle 1 vs cycle 2 KER"
"Deploy my fine-tuned model as a NIM" (this skill prepares the `.nemo` and routes to `/riva-asr-custom` for the deploy)

Do not activate when:

The user hasn't scored a baseline yet → `/digital-health-clinical-asr-eval`
The user doesn't have a manifest → `/digital-health-clinical-asr-build`
The user wants generic word boosting / LM fusion (not SFT) → `/finetune-asr`
The user has a `.nemo` and only wants to deploy → `/riva-asr-custom`

Prerequisites

A cycle-N manifest + cycle-N eval result from `/digital-health-clinical-asr-eval`. The priority-category KER must be > 0.3 (Stage 4 gate). The manifest should have ≥ 100 rows total, and ≥ 5 rows per priority `entity_category`, for a believable post-tune signal.
A CUDA host: 24 GB VRAM is comfortable for Parakeet TDT 0.6B at `batch_size=4` with `bf16-mixed`; 16 GB works with a smaller batch. No local GPU? Use Brev; the recommended SKU is L40S 48 GB.

The NeMo container: `nvcr.io/nvidia/nemo:25.11.01`. Pull once: `docker pull nvcr.io/nvidia/nemo:25.11.01`
NVIDIA Container Toolkit + Docker, covered by `/riva-nim-setup` if not already installed.
A train/val split stratified by `entity_category` (recipe sketch in Step 4b below).
`/riva-asr-custom` installed if you intend to deploy. Pure-research SFT runs without it.

Instructions

4a. Provision a GPU host (skip if you already have one)

Stage 4 needs a CUDA host with ≥ 16 GB VRAM (24 GB comfortable). If you have a local one that fits, skip this section. If not, use Brev — NVIDIA's per-second-billed GPU host service. Recommended SKU: L40S 48 GB.

Cost disclosure: surface this to the user before any `brev create`. L40S 48 GB runs ~$1.50/hr at time of writing; a 3-epoch SFT run on a 100-row manifest finishes in 15–30 minutes (~$0.40–$0.75 of compute). The real risk is forgetting to stop the instance: overnight idle on L40S is ~$36, a week of idle is ~$250. Mitigations: (a) always wrap the workflow in a script that ends with `brev stop`; (b) set a calendar reminder when you start; (c) `brev delete` instead of `brev stop` if you don't need to keep the disk (`stop` keeps disk at $0.10/GB-month, so 200 GB ≈ $20/month of latent cost). Confirm the user accepts the per-hour cost shape and the idle risk before spinning anything up.
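
The cost figures above follow from simple rate arithmetic. The sketch below reproduces them under the rates quoted in the text, which may have changed since it was written; the function names are illustrative only.

```python
HOURLY_RATE_USD = 1.50         # L40S 48 GB rate quoted in the text
DISK_RATE_USD_GB_MONTH = 0.10  # disk rate for a stopped instance, per the text


def compute_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of keeping the instance running for `hours`."""
    return hours * rate


def disk_cost_per_month(disk_gb: float, rate: float = DISK_RATE_USD_GB_MONTH) -> float:
    """Monthly cost in USD of the disk kept by a stopped instance."""
    return disk_gb * rate


print(f"15-minute run:   ${compute_cost(0.25):.2f}")
print(f"30-minute run:   ${compute_cost(0.50):.2f}")
print(f"24 h idle:       ${compute_cost(24):.2f}")
print(f"7 days idle:     ${compute_cost(24 * 7):.2f}")
print(f"200 GB disk/mo:  ${disk_cost_per_month(200):.2f}")
```

At this rate the ~$36 idle figure corresponds to 24 hours of running time, so a shorter overnight gap costs proportionally less.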

Full setup walkthrough — CLI install (download-then-run, not curl-pipe), SKU choice, disk sizing, SSH config — is in

references/stage4-finetune.md

(§Brev provisioning).

Short happy-path once the CLI is installed. Do not run `brev create` until the user has explicitly typed `YES` at the confirmation prompt below. The gate is mandatory, not advisory, because everything after it bills against the user's account by the second:

```bash
brev login                                  # browser auth

# Mandatory cost-confirmation gate: do NOT skip or auto-answer this.
echo "About to provision: digital-health-clinical-asr-sft on L40S 48 GB."
echo "Cost shape: ~\$1.50/hr while running; ~\$36/night if left idle; ~\$20/mo disk if you 'stop' instead of 'delete'."
read -rp "Type YES to provision (anything else cancels): " confirm
[ "$confirm" = "YES" ] || { echo "Cancelled: no GPU instance was created."; exit 1; }

brev create digital-health-clinical-asr-sft \
  --gpu l40s:1 --image ubuntu-22-04-cuda-12-4 --disk 200gi
brev ssh-config                             # writes ~/.ssh/config entries
rsync -avz ./cycle1/ digital-health-clinical-asr-sft:~/cycle1/
brev shell digital-health-clinical-asr-sft            # drops into the instance
nvidia-smi                                  # confirm GPU
docker pull nvcr.io/nvidia/nemo:25.11.01    # ~12 GB, once per instance
```

When done, always halt billing: `brev stop digital-health-clinical-asr-sft` (keeps disk) or `brev delete digital-health-clinical-asr-sft` (frees it). For path rewriting laptop → Brev → NeMo container, see `references/container-paths.md`.

4b. Term-aware train/val split

Row-disjoint, stratified by `entity_category`, default val fraction 0.2.

The same `term` may appear on both sides via different rows (different voice, context, noise). That's expected and desirable: it measures acoustic + contextual robustness on the trained vocabulary, which is the standard ASR adaptation metric.

Singleton categories (one row total) get forced to train with a warning. If any priority category has < 5 rows, bail to `/digital-health-clinical-asr-build`, because held-out validation will be too noisy to attribute movement.

Sketch:

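```python
import json
import random
from collections import defaultdict

random.seed(42)  # fixed seed so the split is reproducible across cycles

# Load manifest.jsonl (one JSON object per line) into a list of dicts.
with open("manifest.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

# Group rows by category so each category is split independently.
by_cat = defaultdict(list)
for r in rows:
    by_cat[r["entity_category"]].append(r)

train, val = [], []
for cat, cat_rows in by_cat.items():
    random.shuffle(cat_rows)
    if len(cat_rows) < 2:
        # A singleton cannot be held out without removing it from training.
        train.extend(cat_rows)
        print(f"warning: singleton category {cat}, forced to train")
        continue
    n_val = max(1, int(0.2 * len(cat_rows)))  # at least one validation row
    val.extend(cat_rows[:n_val])
    train.extend(cat_rows[n_val:])
```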

Write `train.jsonl` and `validation.jsonl` alongside the manifest. These are the inputs to `speech_to_text_finetune.py`.
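
Writing the two files and confirming the split is row-disjoint could look like the sketch below. It assumes each row is identified by an `audio_filepath` key, the usual NeMo manifest field; if your manifest identifies rows differently, pass another `key`. Both helper names are illustrative.

```python
import json
from pathlib import Path


def write_jsonl(rows, path):
    """Write one JSON object per line (JSONL)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def shared_rows(train, val, key="audio_filepath"):
    """Return key values present on both sides; an empty set means row-disjoint."""
    return {r[key] for r in train} & {r[key] for r in val}
```

A typical use is to call `shared_rows(train, val)` first, stop if it is non-empty, and only then call `write_jsonl(train, "train.jsonl")` and `write_jsonl(val, "validation.jsonl")`. Shared terms across the two sides are fine; shared rows are not.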

4c. Choose the base model

| Base | SFT viability | Notes |
| --- | --- | --- |
| `nvidia/parakeet-tdt-0.6b-v2` | ✅ Empirically verified (KER 0.513 → 0.128 in 3 epochs, −75% relative) | NVIDIA's current English ASR default. Stock NeMo SFT recipe works end-to-end. Recommended. |
| `nvidia/nemotron-speech-streaming-en-0.6b` | ❌ Don't use for SFT | NVCF function is streaming-only; SFT path unreliable (UNK collapse on validation after first training step). For streaming serving, Riva chunks a non-streaming base just fine. |

Other Parakeet/Conformer bases (1.1B, CTC, RNNT, `stt_en_conformer_ctc_large`) + decoder → NIM container mapping: `references/stage4-finetune.md`. If the user asks to fine-tune Nemotron Speech Streaming, warn about the collapse and recommend Parakeet TDT v2.
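
A training driver can enforce the base-model rule before any GPU time is spent. The sketch below is one hypothetical way to do that; `check_base` and the two constants are illustrative names, and the deny-list holds only the base this document flags.

```python
RECOMMENDED_BASE = "nvidia/parakeet-tdt-0.6b-v2"

# Bases this document says not to fine-tune, with the stated reason.
DO_NOT_SFT = {
    "nvidia/nemotron-speech-streaming-en-0.6b":
        "SFT path unreliable (UNK collapse on validation after first training step)",
}


def check_base(base: str) -> str:
    """Return `base` if it may be fine-tuned; raise ValueError with a pointer otherwise."""
    reason = DO_NOT_SFT.get(base)
    if reason:
        raise ValueError(f"{base}: {reason}. Use {RECOMMENDED_BASE} instead.")
    return base
```

Bases not listed here pass through unchecked, so the table in `references/stage4-finetune.md` still applies to them.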

4d. Stock NeMo SFT

In the NeMo container, invoke `/opt/NeMo/examples/asr/speech_to_text_finetune.py` directly. No custom adapter logic. No patches. The stock NeMo SFT script is the verified working recipe.

Hyperparameters (verified on Parakeet TDT v2, 39-row manifest):

```yaml
init_from_pretrained_model: nvidia/parakeet-tdt-0.6b-v2
precision:                  bf16-mixed       # required for TDT numerical stability
lr:                         3e-4             # CosineAnnealing schedule
warmup_steps:               5                # tiny manifest; bump to 500 at production scale
epochs:                     3                # smoke; 10-30 for production
batch_size:                 4                # comfortable on 24 GB VRAM; lower on 16 GB, raise to 16 on L40S 48 GB
gradient_clip_val:          1.0              # defensive
```

Container invocation: `docker run --gpus all --rm -it -v "$PWD:/workspace" nvcr.io/nvidia/nemo:25.11.01 python /opt/NeMo/examples/asr/speech_to_text_finetune.py` with `model.train_ds.manifest_filepath=/workspace/train.jsonl`, `model.validation_ds.manifest_filepath=/workspace/validation.jsonl`, `init_from_pretrained_model=nvidia/parakeet-tdt-0.6b-v2`, and the hyperparameter overrides from the table above. Full docker-run line with config-path / config-name flags: `references/stage4-finetune.md` §Container invocation.
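
For scripted runs, the invocation can be assembled as an argument list rather than a hand-typed string. The sketch below uses only the image, script path, and three overrides quoted above. The hyperparameter key paths in `ASSUMED_HPARAM_OVERRIDES` are assumptions based on the usual NeMo/Lightning config layout and are not confirmed by this document, so check them against the script's YAML config. The config-path / config-name flags are left out because they live in the reference.

```python
import shlex

IMAGE = "nvcr.io/nvidia/nemo:25.11.01"
SCRIPT = "/opt/NeMo/examples/asr/speech_to_text_finetune.py"


def build_sft_command(workdir, base="nvidia/parakeet-tdt-0.6b-v2", extra_overrides=()):
    """Return the docker argv that mounts `workdir` at /workspace and runs the stock script."""
    overrides = [
        "model.train_ds.manifest_filepath=/workspace/train.jsonl",
        "model.validation_ds.manifest_filepath=/workspace/validation.jsonl",
        f"init_from_pretrained_model={base}",
        *extra_overrides,
    ]
    return [
        "docker", "run", "--gpus", "all", "--rm", "-it",
        "-v", f"{workdir}:/workspace",
        IMAGE, "python", SCRIPT, *overrides,
    ]


# ASSUMPTION: these key paths are illustrative, not taken from this document.
ASSUMED_HPARAM_OVERRIDES = [
    "trainer.precision=bf16-mixed",
    "trainer.max_epochs=3",
    "trainer.gradient_clip_val=1.0",
    "model.optim.lr=3e-4",
    "model.optim.sched.warmup_steps=5",
    "model.train_ds.batch_size=4",
]

cmd = build_sft_command("/path/to/cycle1", extra_overrides=ASSUMED_HPARAM_OVERRIDES)
print(shlex.join(cmd))  # printable form for review before running
```

Printing the command first gives the user a chance to review it; passing the list to `subprocess.run` avoids shell-quoting problems.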

Manifest paths inside the container. Host paths (e.g. `$HOME/…`) don't resolve inside the container; they must point under `/workspace`. Rewrite snippet: `references/container-paths.md`.
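
The rewrite itself is a prefix swap. The sketch below is not the snippet from `references/container-paths.md`; it is a hypothetical equivalent that assumes POSIX paths and that the host directory mounted at `/workspace` is known.

```python
from pathlib import PurePosixPath


def to_container_path(path, host_root, container_root="/workspace"):
    """Map a host path under `host_root` to the same location under `container_root`."""
    try:
        rel = PurePosixPath(path).relative_to(PurePosixPath(host_root))
    except ValueError:
        # Path is outside the mounted tree; leave it unchanged so the caller can spot it.
        return path
    return str(PurePosixPath(container_root) / rel)


print(to_container_path("/home/user/cycle1/audio/row1.wav", "/home/user/cycle1"))
```

Applying this to the audio path field of every manifest row before training keeps one manifest usable on the laptop, the Brev host, and inside the container.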

The training run writes `adapted_model.nemo` and a `training_run_info.json` summary. Both go into a per-cycle subdirectory of the user's choice (e.g. `cycle<N>/models/<run>/`; the layout doesn't matter as long as it's consistent across cycles).

4e. Offline cycle N+1 eval — close the loop

Re-transcribe the cycle's audio with the fine-tuned `.nemo` using NeMo's offline `transcribe()`. No Riva needed: this is measurement, not serving. NeMo's offline path runs the same encoder + decoder graph the Riva NIM eventually serves.

Sketch:

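```python
import nemo.collections.asr as nemo_asr

# Restore the fine-tuned checkpoint written in Step 4d.
model = nemo_asr.models.ASRModel.restore_from("adapted_model.nemo")

# Placeholder paths; list every audio file in the cycle's manifest.
audio_paths = ["audio/row1.wav", "audio/row2.wav"]
hyps = model.transcribe(audio_paths)
```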

Score the same four metrics (WER/CER/KER/SER) and the same five-section leaderboard the eval skill produces. Write them as `leaderboard_cycle<N+1>.md`. Compare against `leaderboard_cycle<N>.md`.
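
For orientation, WER is word-level edit distance divided by the number of reference words. The sketch below shows that calculation with deliberately simple normalization (lowercasing and whitespace splitting), which is an assumption: the eval skill may normalize differently, so use its scorer for leaderboard numbers and treat this only as a sanity check. KER, CER, and SER are not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference token
                cur[j - 1] + 1,          # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of `hypothesis` against `reference`."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

A split drug name such as "met forming" for "metformin" counts as one substitution plus one insertion, which is why term-level metrics matter alongside WER.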

Decision table — cycle-N+1 vs cycle-N:

| Result | Action |
| --- | --- |
| KER dropped meaningfully on targeted categories (e.g. drug KER −20% or more, relative) | ✅ Keep the `.nemo`. Update the leaderboard. Advance to Step 4f if you want to deploy. |
| KER moved a little, you wanted more | Loop back to `/digital-health-clinical-asr-build`, expand the manifest. Tiny manifests rarely benefit from hyperparameter tweaks; signal density beats LR sweeps. |
| KER got worse | Overfit on a tiny manifest. Bail to `/digital-health-clinical-asr-build` and grow before retraining. Don't tune harder on the same data. |
| No measurable change | Some categories may already be in the base model's vocab. Sanity-check per-category numbers before concluding training "didn't help." |
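
The decision table can be encoded so every cycle is judged the same way. In the sketch below the −20% keep threshold comes from the table; `noise_floor`, the cut-off for "no measurable change", is a hypothetical value chosen for illustration and should be set from your own run-to-run variance.

```python
def cycle_decision(prev_ker, new_ker, keep_threshold=-0.20, noise_floor=0.02):
    """Map a cycle-N -> cycle-N+1 KER pair to one of the four table outcomes."""
    if prev_ker <= 0:
        raise ValueError("cycle-N KER must be positive to compute a relative change")
    rel = (new_ker - prev_ker) / prev_ker
    if rel <= keep_threshold:
        return "keep the .nemo"
    if abs(rel) < noise_floor:
        return "no measurable change: sanity-check per-category numbers"
    if rel > 0:
        return "KER got worse: grow the manifest before retraining"
    return "KER moved a little: grow the manifest"
```

Run it per targeted category rather than only on the overall number, since an unchanged aggregate can hide a large gain in one category.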

4f. (Optional) Deploy as a Riva NIM

Hand the `.nemo` to `/riva-asr-custom`. Pass the source architecture explicitly: `/riva-asr-custom` can't reliably detect CTC vs RNNT vs TDT from the `.nemo` alone, and the wrong NIM container produces a broken RMIR with no clear error:

| Source decoder | `riva-build` flag | NIM container family |
| --- | --- | --- |
| Conformer-CTC | `decoder=greedy_ctc` | `parakeet-*-ctc-*` |
| Conformer-RNNT | `decoder=nemo` | `parakeet-rnnt-*` |
| Conformer-TDT (default) | `decoder=nemo` | `parakeet-tdt-*` |
| Cache-Aware RNNT (Nemotron streaming) | `decoder=nemo` | `nemotron-streaming-*` ⚠ SFT broken on this base, see Limitations |
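
Because the architecture has to be passed explicitly, a driver can refuse to guess. The sketch below covers only the flag column for the three fine-tunable decoders in the table; the function name and the short architecture keys are illustrative, not `riva-build` or `/riva-asr-custom` API.

```python
# Flag values copied from the table above, keyed by a short architecture name.
RIVA_BUILD_DECODER_FLAG = {
    "ctc": "decoder=greedy_ctc",
    "rnnt": "decoder=nemo",
    "tdt": "decoder=nemo",
}


def decoder_flag(architecture: str) -> str:
    """Return the table's flag for an explicitly named architecture; never infer it."""
    key = architecture.strip().lower()
    if key not in RIVA_BUILD_DECODER_FLAG:
        known = ", ".join(sorted(RIVA_BUILD_DECODER_FLAG))
        raise ValueError(f"unknown architecture {architecture!r}; expected one of: {known}")
    return RIVA_BUILD_DECODER_FLAG[key]
```

Raising on an unknown value is deliberate: a silent default would recreate the broken-RMIR failure the text warns about.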

After deploy: re-run `/digital-health-clinical-asr-eval` against the new endpoint (`ASR_ENDPOINT=localhost:50051`) to validate that production-serving numbers match offline numbers. Any divergence is in Riva preprocessing or `riva-build` flags, not the model. Route to `/riva-asr-custom`.

Examples

Scenario A: gate met. User: "Drug KER 0.42, 130 rows. SFT?" → Yes (gate cleared). Use `parakeet-tdt-0.6b-v2` (verified 0.513 → 0.128). No local GPU? Step 4a (Brev) → 4b (split) → 4d (stock SFT) → 4e (offline re-eval). If cycle-2 drug KER drops ≥ 20% relative, keep the `.nemo`; otherwise back to `/digital-health-clinical-asr-build`.

Scenario B: Nemotron Streaming. User: "SFT `nvidia/nemotron-speech-streaming-en-0.6b`?" → No (UNK collapse). Substitute `parakeet-tdt-0.6b-v2`. Riva chunks non-streaming bases for streaming serving, so the base doesn't need to be streaming-native.

Scenario C: cycle 2 KER unchanged. User: "KER barely moved." → Back to `/digital-health-clinical-asr-build`. Signal density beats LR sweeps. If `magpie_g2p` rows are bad but `merriam-webster` rows are good, the gap is pronunciation coverage: see `/digital-health-clinical-asr-build` Step 2d.

Artifacts produced

`train.jsonl`, `validation.jsonl`: term-aware split (Step 4b)
`adapted_model.nemo`: fine-tuned model (Step 4d)
`training_run_info.json`: hyperparameters, dataset stats, end-of-train metrics
`offline_hyps.jsonl`: cycle-N+1 transcription hypotheses (Step 4e)
`leaderboard_cycle<N+1>.md`: cycle-N+1 five-section leaderboard
(optional, after Step 4f) a deployed NIM endpoint (delegated to `/riva-asr-custom`)

Troubleshooting

Stage 4 training collapses to all-UNK after first step → you're on the cache-aware streaming RNNT base (`nemotron-speech-streaming-en-0.6b`). Route to `nvidia/parakeet-tdt-0.6b-v2` (the recommended default) or `nvidia/stt_en_conformer_ctc_large` (legacy fallback). The streaming RNNT SFT path is broken; do not retry with different hyperparameters.
Manifest paths don't resolve inside the NeMo container → host paths (e.g. `$HOME/…`) need rewriting to `/workspace/…`. See `references/container-paths.md` for the rewrite snippet.
Cycle N+1 KER unchanged from cycle N → on `parakeet-tdt-0.6b-v2` with the recipe above, this almost always means manifest signal density is too low. Grow the manifest first; don't sweep LR. (If you're on an older adapter-style recipe instead of stock SFT, the adapter weights may not have moved off zero-init; switch to stock SFT.)
Cycle N+1 KER got worse → overfit on a tiny manifest. Bail to `/digital-health-clinical-asr-build` and grow.
Riva-served numbers diverge from offline numbers → the gap is in Riva preprocessing or `riva-build` flags, not the model. Route to `/riva-asr-custom`.
`bf16-mixed` precision errors → pre-Ampere GPUs (Turing, Volta) lack native BF16 support. Drop to `fp32` and reduce `batch_size`. Use `fp16-mixed` only if `fp32` is too slow; fp16 with TDT decoders can produce NaN losses, so check loss curves early.
OOM during training on 24 GB GPU → drop `batch_size` to 2, raise `accumulate_grad_batches` to 2 to keep the effective batch size constant.
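
The OOM fix works because the effective batch size is the per-step batch multiplied by the number of accumulated steps (and by the device count on multi-GPU runs). A minimal sketch of that arithmetic, with illustrative function names:

```python
def effective_batch_size(batch_size, accumulate_grad_batches=1, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return batch_size * accumulate_grad_batches * num_devices


def accumulation_for(target_effective, new_batch_size, num_devices=1):
    """Accumulation steps needed to keep `target_effective` at a smaller per-step batch."""
    per_step = new_batch_size * num_devices
    if target_effective % per_step:
        raise ValueError("target effective batch is not a multiple of the per-step batch")
    return target_effective // per_step


# The recipe's batch_size=4 and the OOM fallback (batch_size=2, accumulation 2) match.
print(effective_batch_size(4), effective_batch_size(2, accumulate_grad_batches=2))
```

Keeping the effective batch constant means the learning rate and warmup from the recipe do not need to be retuned after the fallback.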

Limitations

Adapter-style SFT on TDT/RNNT decoders is broken. Empirically confirmed: an earlier LinearAdapter-mixin recipe produces 72 NaN tensors at any LR on TDT and RNNT decoders. Resolved by switching to NeMo's stock full-model SFT (`speech_to_text_finetune.py`), which is what this skill recommends. Do not attempt adapter SFT on TDT/RNNT bases.
Don't SFT `nemotron-speech-streaming-en-0.6b`. The streaming-only NVCF function's SFT path is unreliable (UNK collapse). For streaming serving at deploy time, Riva chunks a non-streaming base.
Tiny manifests overfit fast. Below ~100 rows total or ~5 rows per priority category, cycle-N+1 numbers are noisy. Grow before trusting a small KER drop.
English-only by default. The base-model table is en-US-specific. Other locales need a different base + a re-validated SFT recipe.
No turn-key driver. The user writes their own training-driver layout: output paths, run naming, leaderboard re-rendering. The methodology and recipes transfer; exact cycle-1 numbers depend on the user's manifest.

Next steps

Deploy the `.nemo` as a NIM: `/riva-asr-custom` (pass the source architecture explicitly).
Grow the manifest for cycle N+2: `/digital-health-clinical-asr-build`.
Re-score the cycle: `/digital-health-clinical-asr-eval` (against the new endpoint or the new `.nemo` directly).
Lateral for word boosting / LM fusion / non-clinical SFT recipes: `/finetune-asr`.

References

`references/stage4-finetune.md`: base-model selection table, hyperparameter rationale, decoder → NIM container mapping, decision tree comparing cycle-N+1 to cycle-N
`references/container-paths.md`: host → `/workspace/` path rewriting for cross-host manifest portability (laptop ↔ Brev ↔ NeMo container)
