digital-health-clinical-asr-eval
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
Clinical ASR Flywheel — Stage 3 (Eval)
⚠ Agent: read the Critical Workflow Rules section below before answering. This SKILL.md is self-contained: `assets/`, `evals/`, and `references/` are pointers, not load-bearing. Answer methodology questions from this file directly; only invoke tools when the user explicitly asks to execute against a real manifest.

You are the score-and-route stage. The user arrives with a NeMo-format `manifest.jsonl` (either from `/digital-health-clinical-asr-build` or carried in from elsewhere). You transcribe it via the chosen ASR NIM, score four metrics, produce a five-section leaderboard, and read the decision tree to decide whether the user should advance to `/digital-health-clinical-asr-finetune`, loop back to `/digital-health-clinical-asr-build`, or stop and harden the eval.

This skill does not generate audio. If the manifest is missing or empty, send the user back to `/digital-health-clinical-asr-build`.

Audio leaves your environment: disclose this to the user before any clip is sent
This stage transmits each manifest row's WAV file plus its reference text to an external NVIDIA service. Surface this before invoking the first ASR call:
| Service | What gets sent | When |
|---|---|---|
| NVIDIA NVCF Parakeet/Nemotron ASR | Every audio clip referenced by the manifest (raw PCM bytes), plus the reference transcript and the clinical-extension metadata for scoring | Step 3b, one call per manifest row |

The clips should be synthetic audio generated by Stage 2 (Magpie TTS over a user-curated term list), not real patient audio. Do not pass real ASR recordings, real patient encounters, or any PHI through this skill. Scoring then runs locally (pure-Python WER/CER/KER/SER, or `jiwer` if installed). The scoring step itself does not transmit anything; only the ASR step does.

Critical workflow rules (apply on every activation)
For methodology questions (leaderboard structure, KER definition, decision tree), answer from this file. Don't invoke tools, call other skills, or run scripts unless the user explicitly asks to execute against a real manifest. Surface these facts in any response:
- Off-ramp first. If the user is asking about something outside scoring, route and stop without running any workflow:
  - ASR model-catalog selection / comparison / alternative NIMs → `/riva-asr`
  - ASR auth (API keys, bearer tokens, function IDs) → `/riva-asr`
  - ASR gRPC protocol, streaming, batching, chunking, retries → `/riva-asr`
  - NIM deploy / `riva-build` / `riva-deploy` → `/riva-asr-custom`
  - NGC / Docker / NVIDIA Container Toolkit → `/riva-nim-setup`
  - No manifest yet → `/digital-health-clinical-asr-build`
  - Wants to fine-tune now with a known KER → `/digital-health-clinical-asr-finetune`
- Default ASR NIM is `nvidia/parakeet-tdt-0.6b-v2` (NVCF function-id `d3fe9151-442b-4204-a70d-5fcc597fd610`, offline gRPC). Env-var overrides: `ASR_MODEL_NAME` (leaderboard display name), `ASR_NVCF_FUNCTION_ID` (swap to a different hosted NIM, e.g. Whisper Large v3 `b702f636-…` while the Parakeet backend is faulting, or a fine-tuned NIM), `ASR_ENDPOINT` (self-hosted gRPC; takes precedence). Echo the chosen NIM and the resolved function-id back before spending API credits.
- ASR transcription is inlined in Step 3b (NVCF gRPC + `riva.client.ASRService.offline_recognize`, same auth pattern as Stage 1). For deeper protocol/auth questions, alternative NIM catalogs, or self-hosted Riva NIM configuration, defer to `/riva-asr`.
- KER is the headline. Per-row check: the flagged `term` words must appear in order, contiguous, adjacent in the normalized hypothesis. `cefazolin → cefa zolin` is a miss. Aggregate WER hides clinically dangerous failures; both are reported, KER is the gate.
- The by-`ipa_source` split is the most informative single number in the leaderboard. The `merriam-webster` vs `magpie_g2p` delta proves the SSML override pipeline is doing real work. Read it aloud to the user.
- Special-case routing. `merriam-webster` rows good, `magpie_g2p` rows bad → pronunciation-coverage gap, not a model gap. Route back to `/digital-health-clinical-asr-build` Step 2d. Do NOT recommend `/digital-health-clinical-asr-finetune` as a first response.
- Five-section leaderboard order. Headline (WER/CER/KER/SER) → KER by `entity_category` → KER by `ipa_source` → KER by `noise_level` → Per-term KER worst-first. The by-`ipa_source` section is mandatory; it is the proof the SSML pipeline works.

Purpose
Score a clinical-ASR manifest, produce a five-section KER leaderboard, and route the user via the post-eval decision tree. Methodology details (metric definitions, normalization, leaderboard order, special-case routing) live in Critical Workflow Rules above and Instructions below.
When to use this skill
Activate on user phrases like:
- "Score my ASR manifest"
- "What's the KER on Parakeet TDT v2?"
- "Run the eval on cycle-N"
- "Compare two ASR models on the clinical benchmark"
- "Generate the leaderboard"
- "I have a manifest.jsonl, how do I score it?"
- "Why is KER 0.4 when WER is 0.07?"
- "Should we fine-tune?" (this is the eval-side question — the post-eval decision tree lives in this skill)
Literal-keyword non-activation check: if the user's message contains any of `authenticate`, `API key`, `bearer`, `function ID`, `gRPC`, `streaming`, `chunking`, `batching`, `transcription retry`, `riva-build`, `riva-deploy`, `NIM deploy`, `NGC`, `Docker`, `Container Toolkit`, or asks "which ASR model is best" / "compare models" / "vendor differences", do NOT activate the scoring workflow. Apply Critical Workflow Rule #1 above to route to the right sibling skill and stop. This applies even if the user mentions "KER" or "eval" alongside the keyword.

Prerequisites
- A NeMo-format manifest with the clinical extension fields (`term`, `entity_category`, `ipa_source`, `voice_id`, `noise_level`, `context_type`). The schema is documented in the build skill's `references/manifest-schema.md`.
- `NVIDIA_API_KEY` exported (Stage 1 prerequisite still applies).
- `nvidia-riva-client` + `soundfile` installed (Stage 1 prerequisite). For self-hosted Riva NIM details, see `/riva-asr` Option B.
- Audio files actually present on disk: run the audio-existence pre-flight from the manifest-schema reference before spending API credits.
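
The audio-existence pre-flight named in the last prerequisite can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the check from `references/manifest-schema.md`: the function name `find_missing_audio` is hypothetical, and it assumes one JSON object per line with an `audio_filepath` key whose relative paths resolve against the current working directory.

```python
import json
import os


def find_missing_audio(manifest_path):
    """Return (line_number, audio_filepath) for rows whose audio is not on disk.

    Hypothetical helper: assumes a JSONL manifest with an `audio_filepath`
    key per row. Rows with an empty or absent path are reported as missing.
    """
    missing = []
    with open(manifest_path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines between rows
            row = json.loads(line)
            path = row.get("audio_filepath", "")
            if not path or not os.path.isfile(path):
                missing.append((line_no, path))
    return missing
```

An empty return value means every referenced clip exists; anything else is a reason to stop before the first ASR call.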
Instructions
3a. Pick the ASR NIM

Default: `nvidia/parakeet-tdt-0.6b-v2` via NVCF gRPC (offline), function-id `d3fe9151-442b-4204-a70d-5fcc597fd610`. This is NVIDIA's current English ASR recommendation: fastest/cheapest in the catalog, and supported in NeMo's stock SFT recipe so the Stage 3 baseline and a Stage 4 fine-tune ride the same model family.

Three runtime env-var override knobs (`ASR_MODEL_NAME` for leaderboard display, `ASR_NVCF_FUNCTION_ID` to swap to a different hosted NIM, `ASR_ENDPOINT` for self-hosted gRPC) plus the full alternate-NIM catalog (Parakeet TDT 1.1B, Parakeet CTC 1.1B, Whisper Large v3, Nemotron streaming) with function IDs and call-shape notes: `references/offline-asr-recipe.md`.

Echo the chosen NIM, the resolved function-id, and any env-var overrides to the user before spending API credits. A 200-row manifest on hosted Parakeet TDT v2 is cheap; an accidental run against the wrong model on a 1,000-row manifest is not.
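
The override precedence described in Step 3a can be illustrated as follows. This is a sketch under stated assumptions, not the `resolve_asr_config` function from `references/offline-asr-recipe.md`: the names `resolve_asr_config_sketch` and `describe_choice` are hypothetical, and the only behaviour taken from this file is the default model, the default function-id, and the rule that `ASR_ENDPOINT` takes precedence over the hosted path.

```python
import os

DEFAULT_MODEL_NAME = "nvidia/parakeet-tdt-0.6b-v2"
DEFAULT_FUNCTION_ID = "d3fe9151-442b-4204-a70d-5fcc597fd610"


def resolve_asr_config_sketch(env=None):
    """Resolve the three env-var overrides into one dict (hypothetical helper)."""
    env = os.environ if env is None else env
    endpoint = env.get("ASR_ENDPOINT") or None  # empty string counts as unset
    function_id = env.get("ASR_NVCF_FUNCTION_ID") or DEFAULT_FUNCTION_ID
    return {
        "model_name": env.get("ASR_MODEL_NAME") or DEFAULT_MODEL_NAME,
        # Self-hosted gRPC wins over the hosted NVCF function when both are set.
        "mode": "self-hosted" if endpoint else "nvcf",
        "endpoint": endpoint,
        "function_id": None if endpoint else function_id,
    }


def describe_choice(cfg):
    """Build the one-line echo to show the user before any API credits are spent."""
    if cfg["mode"] == "self-hosted":
        return f"ASR: {cfg['model_name']} via self-hosted gRPC at {cfg['endpoint']}"
    return f"ASR: {cfg['model_name']} via NVCF function-id {cfg['function_id']}"
```

Passing a plain dict instead of `os.environ` keeps the resolution testable without touching the real environment.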
3b. Transcribe
For each row in `manifest.jsonl`, transcribe `audio_filepath` and write `per_sample.json` (one JSON object per row, JSONL or a JSON array, caller's choice):

```json
{
  "audio_filepath": "...",
  "ref": "<row.text>",
  "hyp": "<asr output>",
  "term": "<row.term>",
  "entity_category": "<row.entity_category>",
  "ipa_source": "<row.ipa_source>",
  "voice_id": "<row.voice_id>",
  "noise_level": "<row.noise_level>",
  "context_type": "<row.context_type>"
}
```

Recipe (full Python in `references/offline-asr-recipe.md`): `transcribe_manifest(api_key, manifest_path, out_path, language_code="en-US")` opens an offline gRPC stream to NVCF (or to `ASR_ENDPOINT` if set for self-hosted Riva), calls `riva.client.ASRService.offline_recognize` per row (sentences in a clinical manifest are ≤ 30 s, so no streaming/batching is needed), and writes the JSONL above. Same `auth_for` shape as the Stage 1 setup smoke test. The agent harness passes `api_key` explicitly; the recipe reads the three env-var overrides (`ASR_NVCF_FUNCTION_ID`, `ASR_MODEL_NAME`, `ASR_ENDPOINT`) at the top so auditors see the knobs in one place.

Whisper fallback (when Parakeet's NVCF backend faults with `CUDA illegal-memory-access` from Triton) and self-hosted Riva NIM (`ASR_ENDPOINT=localhost:50051`) env-var patterns: see `references/offline-asr-recipe.md` (§Whisper fallback, §Self-hosted Riva NIM).

Resilience knobs deferred to the user. If NVCF returns `RESOURCE_EXHAUSTED` mid-batch, the loop raises on that row; re-run from the failing row. Streaming/batching/retry-with-backoff are out of scope; see `/riva-asr`.
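
The join between manifest metadata and ASR output, along with the "re-run from the failing row" behaviour, can be sketched without touching the Riva client at all. The code below is an illustration only: `build_per_sample` and its `transcribe` argument are hypothetical, with `transcribe` standing in for whatever callable wraps the real ASR call and returns a transcript string for one audio path.

```python
CLINICAL_FIELDS = (
    "term", "entity_category", "ipa_source",
    "voice_id", "noise_level", "context_type",
)


def build_per_sample(manifest_rows, transcribe, start_row=0):
    """Join each manifest row to its ASR hypothesis (hypothetical helper).

    `transcribe` is an assumed callable: audio path in, transcript string out.
    `start_row` lets a caller resume after a mid-batch failure.
    """
    samples = []
    for idx, row in enumerate(manifest_rows):
        if idx < start_row:
            continue  # already transcribed in an earlier run
        try:
            hyp = transcribe(row["audio_filepath"])
        except Exception as exc:
            # Surface the row index so the caller knows where to resume.
            raise RuntimeError(
                f"row {idx} failed; re-run with start_row={idx}"
            ) from exc
        sample = {
            "audio_filepath": row["audio_filepath"],
            "ref": row["text"],
            "hyp": hyp,
        }
        sample.update({field: row.get(field) for field in CLINICAL_FIELDS})
        samples.append(sample)
    return samples
```

Keeping the ASR call behind a plain callable means the same loop works for the hosted NVCF path, a self-hosted endpoint, or a stub in tests.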
3c. Score four metrics
For every row, compute:

| Metric | What it measures | Why we keep it |
|---|---|---|
| WER | Word error rate (Levenshtein on tokens, after normalization) | Industry standard; blunt instrument for clinical |
| CER | Character error rate | Catches near-misses on long compound names |
| KER ★ | Keyword error rate: did the flagged `term` come through intact in the hypothesis? | Headline clinical signal |
| SER | Sentence error rate (1 if any wrong, 0 if perfect) | Sanity bound; what the doctor experiences |

Normalization (apply to both `ref` and `hyp` before all four metrics):
- Lowercase.
- NFKD-normalize (ligatures and full-width forms → plain letters; smart quotes are not changed by NFKD and are removed by the punctuation step).
- Strip punctuation except hyphen.
- Collapse whitespace runs to a single space.

Inline scoring recipes, `normalize` / `edit_distance` / `wer` / `cer` / `ker` / `ser` (pure-Python, no `jiwer` dependency): see `references/scoring-recipes.md`. Aggregate across rows by taking `mean(per-row score)` for each metric.

Strict KER: `term` words must appear in order, adjacent in the normalized hypothesis. This is conservative: `cefazolin → cefa zolin` counts as a miss. That's the right call clinically, because a downstream pharmacy lookup will fail on the misspelled token.

KER does not punish surrounding errors. A row where the term is correct and the rest of the sentence is garbage still scores KER=0; the WER on that row will surface the broader problem separately.
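
The four metrics and the normalization steps can be sketched in pure Python as below. The function names match the ones listed for `references/scoring-recipes.md`, but the bodies are an independent illustration rather than that file's code. Assumptions: "punctuation" means Unicode categories starting with `P`, punctuation is deleted rather than replaced by a space, CER counts spaces as characters, and an empty reference scores 0 against an empty hypothesis and 1 otherwise.

```python
import unicodedata


def normalize(text):
    """Lowercase, NFKD-normalize, drop punctuation except '-', collapse spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    kept = [
        ch for ch in text
        if ch == "-" or not unicodedata.category(ch).startswith("P")
    ]
    return " ".join("".join(kept).split())


def edit_distance(a, b):
    """Levenshtein distance between two sequences (lists of tokens or strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def _rate(ref_units, hyp_units):
    if not ref_units:
        return 0.0 if not hyp_units else 1.0  # assumed convention for empty refs
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref, hyp):
    return _rate(normalize(ref).split(), normalize(hyp).split())


def cer(ref, hyp):
    return _rate(list(normalize(ref)), list(normalize(hyp)))


def ker(term, hyp):
    """Strict KER: 0.0 if the term's tokens appear contiguously and in order, else 1.0."""
    needle = normalize(term).split()
    hay = normalize(hyp).split()
    n = len(needle)
    hit = any(hay[i:i + n] == needle for i in range(len(hay) - n + 1))
    return 0.0 if hit else 1.0


def ser(ref, hyp):
    return 0.0 if normalize(ref) == normalize(hyp) else 1.0


def aggregate(per_row_scores):
    """Mean of per-row scores, as described for every metric."""
    return sum(per_row_scores) / len(per_row_scores)
```

With these definitions, `ker("cefazolin", "give cefa zolin now")` is a miss because no single hypothesis token equals `cefazolin`, while a hypothesis that gets the term right and everything else wrong still scores KER 0 and shows up only in WER.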
3d. Breakdowns + leaderboard
Write a five-section markdown leaderboard, in this order:
- Headline: overall WER, CER, KER, SER for the chosen model.
- KER by `entity_category`: drug vs procedure vs anatomy vs ... This is what the user actually cares about for deployment.
- KER by `ipa_source`: the most informative single number in the leaderboard. The delta between `merriam-webster` and `magpie_g2p` rows is the proof the SSML override pipeline is doing real work. Read this section aloud to the user.
- KER by `noise_level`: clinical environments are loud. `snr_5db` rows are closer to reality than `clean`.
- Per-term KER (worst first): these are your Stage 4 fine-tune targets.

A representative `ipa_source` split with the merriam-webster vs magpie_g2p delta interpretation: `references/scoring-recipes.md` §Representative ipa_source split. The delta tells the deployment story: if the user sees a wide gap and asks "should we fine-tune?", the answer is not yet; route them back to `/digital-health-clinical-asr-build`'s IPA QA pipeline (Step 2d). See the decision tree below.
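
The breakdown sections all reduce to the same operation: group per-row KER by one manifest field and take the mean. A minimal sketch follows, assuming each row is a dict that already carries a numeric `ker` value next to the clinical-extension fields; `ker_by`, `worst_terms`, and `render_section` are hypothetical names, and the table layout is one possible rendering rather than a prescribed format.

```python
from collections import defaultdict


def ker_by(rows, field):
    """Mean KER per distinct value of `field` (e.g. 'ipa_source')."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[field]].append(row["ker"])
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}


def worst_terms(rows, top_n=10):
    """Per-term KER sorted worst first; ties broken alphabetically."""
    per_term = ker_by(rows, "term")
    return sorted(per_term.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


def render_section(title, scores):
    """Render one leaderboard section as a small markdown table."""
    lines = [title, "", "| Group | KER |", "|---|---|"]
    for group, score in sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        lines.append(f"| {group} | {score:.3f} |")
    return "\n".join(lines)
```

Calling `ker_by(rows, "entity_category")`, `ker_by(rows, "ipa_source")`, and `ker_by(rows, "noise_level")` in that order, followed by `worst_terms(rows)`, yields sections two through five in the required sequence.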
Decision tree (after eval)
Read the priority-category KER (drug KER for most clinical workflows, procedure KER for surgical workflows) and route:

| KER on priority category | Recommend |
|---|---|
| > 0.3 | `/digital-health-clinical-asr-finetune` (manifest ≥ 100 rows) |
| 0.1 – 0.3 | Either expand the term list (back to `/digital-health-clinical-asr-build`) or … |
| < 0.1 | Strong baseline. Don't tune yet, since you'd be optimizing against a saturated metric. Push the eval harder: add voices, noise levels, contexts, adversarial terms. Loop back to `/digital-health-clinical-asr-build`. |

Special case: `merriam-webster` rows score well but `magpie_g2p` rows are bad. That's a pronunciation-hint coverage gap, not a model gap. Route back to `/digital-health-clinical-asr-build` Step 2d (IPA QA review), not to `/digital-health-clinical-asr-finetune`. Fine-tuning over a TTS-pronunciation gap teaches the ASR model to fit the TTS engine's mispronunciations rather than real speech, which is the wrong fix.
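
The routing logic above can be written down as a small function, which makes the order of checks explicit: the special case is tested before the KER bands. This is a sketch, not part of the skill's recipes. The function name `route_after_eval` is hypothetical, and the `good` and `bad` cut-offs that decide when `merriam-webster` rows "score well" and `magpie_g2p` rows "are bad" are assumptions, since this file gives no numeric definition for them.

```python
FINETUNE = "/digital-health-clinical-asr-finetune"
BUILD = "/digital-health-clinical-asr-build"
HARDEN = "stop-and-harden"  # hypothetical label for the < 0.1 band


def route_after_eval(priority_ker, mw_ker=None, g2p_ker=None, good=0.1, bad=0.3):
    """Return (route, reason) for a priority-category KER.

    `good` and `bad` are assumed thresholds for the special case only;
    the 0.1 and 0.3 band edges come from the decision table.
    """
    # Special case first: a pronunciation-coverage gap must not be
    # mistaken for a model gap, whatever the headline number says.
    if mw_ker is not None and g2p_ker is not None:
        if mw_ker < good and g2p_ker > bad:
            return BUILD, "pronunciation-coverage gap: revisit Step 2d (IPA QA review)"
    if priority_ker > 0.3:
        return FINETUNE, "real model gap on the priority category"
    if priority_ker >= 0.1:
        return BUILD, "middle band: expand the term list"
    return HARDEN, "strong baseline: harden the eval before tuning"
```

For the mixed result in Scenario B (0.05 on `merriam-webster` rows, 0.40 on `magpie_g2p` rows), the function returns the build route regardless of the priority-category value, which matches the special-case guidance.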
Examples
Scenario A: first eval on a fresh cycle-1 manifest. User: "I have `manifest.jsonl` with 200 clinical audio rows already, with `term` and `entity_category` fields. How do I score it?" → Skip Stage 2 entirely. Run the audio-existence pre-flight. Pick `parakeet-tdt-0.6b-v2` (default) and echo the choice + resolved function-id. Run the inlined Step 3b recipe (`transcribe_manifest(...)`). Score the four metrics. Produce the five-section leaderboard. Read the by-`ipa_source` split to the user. Apply the decision tree against drug KER.

Scenario B: interpreting a mixed result. User: "Eval shows KER 0.05 on rows tagged `merriam-webster` but 0.40 on rows tagged `magpie_g2p`. Should I fine-tune?" → No, this is the special case. The model is fine; the pronunciation hints aren't covering the long-tail terms. Route the user back to `/digital-health-clinical-asr-build` Step 2d to audition the `magpie_g2p` rows and append verified IPA to `pronunciation_overrides.csv`. Re-run Stage 3 after the rebuild before reconsidering Stage 4.

Artifacts produced
- `per_sample.json`: per-row transcription results with all clinical-extension fields preserved (the ASR `hyp` joined to the manifest's `ref` and metadata)
- `results.csv`: per-row WER/CER/KER/SER scores
- `leaderboard_cycle<N>.md`: five-section markdown report

(File names are user-chosen; the names above are conventions the rest of this skill assumes.)
Troubleshooting
- "No manifest found" → user skipped Stage 2. Route to `/digital-health-clinical-asr-build` or confirm `$MANIFEST_PATH`.
- All rows KER=1 → normalization mismatch between `ref` and `hyp`. Apply the four normalization steps to both sides.
- All rows KER=0 but WER high → likely misaligned manifest (audio row mismatch). Spot-check a few `(ref, hyp)` pairs by hand.
- `merriam-webster` low, `magpie_g2p` high → pronunciation-coverage gap. Route to `/digital-health-clinical-asr-build` Step 2d. Don't fine-tune; the model isn't the problem.
- Both `merriam-webster` and `magpie_g2p` high → real model gap. Stage 4 is the right route (manifest ≥ 100 rows).
- `clean` rows fine, `snr_5db` balloons → robustness gap; expand noise diversity via `/digital-health-clinical-asr-build`.
- Riva-NIM and offline NeMo results diverge → Riva preprocessing / `riva-build` flags. Route to `/riva-asr-custom`.
- `RESOURCE_EXHAUSTED` on large manifests → retry after 30 s; slice + re-run dropped rows. Built-in backoff: `/riva-asr`.
- `Auth.__init__() got 'ssl_cert'` / CUDA illegal-memory-access on Parakeet function ID: see `references/offline-asr-recipe.md` (ssl_root_cert rename + §Whisper fallback).

Anything else: identify the upstream owner. ASR protocol / NIM deploy → `/riva-asr`. Scoring → here.

Limitations
- English-only by default. Tokenization + normalization assume Latin script and en-US lexicon.
- Strict-contiguous KER is conservative. A near-miss like `cefa zolin` counts as a miss. That's intentional: pharmacy lookups fail on near-misses. Users wanting "soft" matching can switch to phoneme-level edit distance, which is a methodology extension, not a config tweak.
- One model per eval run. Comparing two models means running the eval twice and diffing the two `leaderboard_cycle<N>.md` files (or extending the recipe to write multi-model rows yourself).
- Hosted-only paths assumed. Self-hosted NIMs work but require `/riva-nim-setup` first.

Next steps
- Forward (KER > 0.3, manifest ≥ 100 rows): `/digital-health-clinical-asr-finetune`.
- Back to build (KER 0.1–0.3 on first eval, or `magpie_g2p` gap): `/digital-health-clinical-asr-build`.
- Stop (KER < 0.1): the eval is saturated. Harden it before declaring victory.
- Lateral for ASR protocol / auth / streaming / self-hosted NIM details: `/riva-asr`.

References
- `references/offline-asr-recipe.md`: full Step 3b Python recipe (`transcribe_manifest`, `resolve_asr_config`, `build_asr_auth`), function-ID catalog with call-shape notes, Whisper fallback, self-hosted Riva NIM setup
- `references/scoring-recipes.md`: pure-Python WER/CER/KER/SER scoring functions with the canonical 4-step normalization