digital-health-clinical-asr-eval

<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: Apache-2.0 -->

Clinical ASR Flywheel — Stage 3 (Eval)

⚠ Agent: read the Critical Workflow Rules section below before answering. This SKILL.md is self-contained: `evals/`, `references/`, and `assets/` are pointers, not load-bearing. Answer methodology questions from this file directly; only invoke tools when the user explicitly asks to execute against a real manifest.
You are the score-and-route stage. The user arrives with a NeMo-format `manifest.jsonl` (either from `/digital-health-clinical-asr-build` or carried in from elsewhere). You transcribe it via the chosen ASR NIM, score four metrics, produce a five-section leaderboard, and read the decision tree to decide whether the user should advance to `/digital-health-clinical-asr-finetune`, loop back to `/digital-health-clinical-asr-build`, or stop and harden the eval.
This skill does not generate audio. If the manifest is missing or empty, send the user back to `/digital-health-clinical-asr-build`.

Audio leaves your environment — disclose this to the user before any clip is sent

This stage transmits each manifest row's WAV file plus its reference text to an external NVIDIA service. Surface this before invoking the first ASR call:

| Service | What gets sent | When |
| --- | --- | --- |
| NVIDIA NVCF Parakeet/Nemotron ASR (`grpc.nvcf.nvidia.com`) | Every audio clip referenced by the manifest (raw PCM bytes), plus the reference transcript and the clinical-extension metadata for scoring | Step 3b, one call per manifest row |

The clips should be synthetic audio generated by Stage 2 (Magpie TTS over a user-curated term list), not real patient audio. Do not pass real ASR recordings, real patient encounters, or any PHI through this skill. Scoring then runs locally (pure-Python WER/CER/KER/SER, or `jiwer` if installed). The scoring step itself does not transmit anything; only the ASR step does.

Critical workflow rules (apply on every activation)

For methodology questions (leaderboard structure, KER definition, decision tree), answer from this file. Don't invoke tools, call other skills, or run scripts unless the user explicitly asks to execute against a real manifest. Surface these facts in any response:
  1. Off-ramp first. If the user is asking about something outside scoring, route and stop without running any workflow:
    • ASR model-catalog selection / comparison / alternative NIMs → `/riva-asr`
    • ASR auth (API keys, bearer tokens, function IDs) → `/riva-asr`
    • ASR gRPC protocol, streaming, batching, chunking, retries → `/riva-asr`
    • NIM deploy / `riva-build` / `riva-deploy` → `/riva-asr-custom`
    • NGC / Docker / NVIDIA Container Toolkit → `/riva-nim-setup`
    • No manifest yet → `/digital-health-clinical-asr-build`
    • Wants to fine-tune now with a known KER → `/digital-health-clinical-asr-finetune`
  2. Default ASR NIM is `nvidia/parakeet-tdt-0.6b-v2` (NVCF function-id `d3fe9151-442b-4204-a70d-5fcc597fd610`, offline gRPC). Env-var overrides: `ASR_MODEL_NAME` (leaderboard display name), `ASR_NVCF_FUNCTION_ID` (swap to a different hosted NIM, e.g. Whisper Large v3 `b702f636-…` while the Parakeet backend is faulting, or a fine-tuned NIM), `ASR_ENDPOINT` (self-hosted gRPC; takes precedence). Echo the chosen NIM and the resolved function-id back before spending API credits.
  3. ASR transcription is inlined in Step 3b (NVCF gRPC + `riva.client.ASRService.offline_recognize`, same auth pattern as Stage 1). For deeper protocol/auth questions, alternative NIM catalogs, or self-hosted Riva NIM configuration, defer to `/riva-asr`.
  4. KER is the headline. Per-row check: the flagged `term` words must appear in order and contiguously (adjacent) in the normalized hypothesis. `cefazolin → cefa zolin` is a miss. Aggregate WER hides clinically dangerous failures; both are reported, KER is the gate.
  5. The by-`ipa_source` split is the most informative single number in the leaderboard. The `merriam-webster` vs `magpie_g2p` delta proves the SSML override pipeline is doing real work. Read it aloud to the user.
  6. Special-case routing. `merriam-webster` rows good, `magpie_g2p` rows bad → pronunciation-coverage gap, not a model gap. Route back to `/digital-health-clinical-asr-build` Step 2d. Do NOT recommend `/digital-health-clinical-asr-finetune` as a first response.
  7. Five-section leaderboard order. Headline (WER/CER/KER/SER) → KER by `entity_category` → KER by `ipa_source` → KER by `noise_level` → Per-term KER worst-first. The by-`ipa_source` section is mandatory; it is the proof the SSML pipeline works.

Purpose

Score a clinical-ASR manifest, produce a five-section KER leaderboard, and route the user via the post-eval decision tree. Methodology details (metric definitions, normalization, leaderboard order, special-case routing) live in Critical Workflow Rules above and Instructions below.

When to use this skill

Activate on user phrases like:
  • "Score my ASR manifest"
  • "What's the KER on Parakeet TDT v2?"
  • "Run the eval on cycle-N"
  • "Compare two ASR models on the clinical benchmark"
  • "Generate the leaderboard"
  • "I have a manifest.jsonl, how do I score it?"
  • "Why is KER 0.4 when WER is 0.07?"
  • "Should we fine-tune?" (this is the eval-side question — the post-eval decision tree lives in this skill)
Literal-keyword non-activation check: if the user's message contains any of `authenticate`, `API key`, `bearer`, `function ID`, `gRPC`, `streaming`, `chunking`, `batching`, `transcription retry`, `riva-build`, `riva-deploy`, `NIM deploy`, `NGC`, `Docker`, `Container Toolkit`, or asks "which ASR model is best" / "compare models" / "vendor differences", do NOT activate the scoring workflow. Apply Critical Workflow Rule #1 above to route to the right sibling skill and stop. This applies even if the user mentions "KER" or "eval" alongside the keyword.

Prerequisites

  • A NeMo-format manifest with the clinical extension fields (`term`, `entity_category`, `ipa_source`, `voice_id`, `noise_level`, `context_type`). The schema is documented in the build skill's `references/manifest-schema.md`.
  • `NVIDIA_API_KEY` exported (Stage 1 prerequisite still applies).
  • `nvidia-riva-client` + `soundfile` installed (Stage 1 prerequisite). For self-hosted Riva NIM details, see `/riva-asr` Option B.
  • Audio files actually present on disk: run the audio-existence pre-flight from the manifest-schema reference before spending API credits.
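The last prerequisite can be pictured with a small pre-flight check. This is a minimal sketch, not the pre-flight from `references/manifest-schema.md`: the function name `preflight`, the problem-tuple shape, and the choice to resolve relative audio paths against the manifest's directory are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Fields this skill reads from each manifest row: the NeMo basics plus the
# clinical extension fields listed in the prerequisites.
REQUIRED_FIELDS = (
    "audio_filepath", "text", "term", "entity_category",
    "ipa_source", "voice_id", "noise_level", "context_type",
)


def preflight(manifest_path):
    """Return (line_number, problem) tuples; an empty list means ready to transcribe."""
    problems = []
    base = Path(manifest_path).parent
    rows_seen = 0
    with open(manifest_path, encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            rows_seen += 1
            row = json.loads(line)
            missing = [name for name in REQUIRED_FIELDS if not row.get(name)]
            if missing:
                problems.append((line_number, "missing fields: " + ", ".join(missing)))
            audio = row.get("audio_filepath")
            if audio:
                path = Path(audio)
                if not path.is_absolute():
                    # Assumption: relative paths are relative to the manifest file.
                    path = base / path
                if not path.is_file():
                    problems.append((line_number, "audio not found: " + audio))
    if rows_seen == 0:
        # A missing or empty manifest means the user goes back to the build stage.
        problems.append((0, "manifest is empty"))
    return problems
```

Running a check like this before Step 3b means a broken path is found locally, before any clip is sent and any API credit is spent.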

Instructions

3a. Pick the ASR NIM

Default: `nvidia/parakeet-tdt-0.6b-v2` via NVCF gRPC (offline), function-id `d3fe9151-442b-4204-a70d-5fcc597fd610`. This is NVIDIA's current English ASR recommendation: fastest/cheapest in the catalog, and supported in NeMo's stock SFT recipe, so the Stage 3 baseline and a Stage 4 fine-tune ride the same model family.
Three runtime env-var override knobs (`ASR_MODEL_NAME` for leaderboard display, `ASR_NVCF_FUNCTION_ID` to swap to a different hosted NIM, `ASR_ENDPOINT` for self-hosted gRPC) plus the full alternate-NIM catalog (Parakeet TDT 1.1B, Parakeet CTC 1.1B, Whisper Large v3, Nemotron streaming) with function IDs and call-shape notes are in `references/offline-asr-recipe.md`.
Echo the chosen NIM, the resolved function-id, and any env-var overrides to the user before spending API credits. A 200-row manifest on hosted Parakeet TDT v2 is cheap; an accidental run against the wrong model on a 1,000-row manifest is not.
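The precedence among the three knobs can be sketched as follows. This is an illustrative sketch only, not the `resolve_asr_config` helper from `references/offline-asr-recipe.md`: the names `resolve_config_sketch` and `describe`, the returned dictionary shape, and port `443` on the NVCF host are assumptions.

```python
import os

DEFAULT_MODEL_NAME = "nvidia/parakeet-tdt-0.6b-v2"
DEFAULT_FUNCTION_ID = "d3fe9151-442b-4204-a70d-5fcc597fd610"
NVCF_TARGET = "grpc.nvcf.nvidia.com:443"  # host from this skill; the port is an assumption

OVERRIDE_VARS = ("ASR_MODEL_NAME", "ASR_NVCF_FUNCTION_ID", "ASR_ENDPOINT")


def resolve_config_sketch(env=None):
    """Resolve which ASR target an eval run would hit, given env-var overrides."""
    env = os.environ if env is None else env
    overrides = {name: env[name] for name in OVERRIDE_VARS if env.get(name)}
    model_name = env.get("ASR_MODEL_NAME") or DEFAULT_MODEL_NAME
    endpoint = env.get("ASR_ENDPOINT")
    if endpoint:
        # Self-hosted gRPC takes precedence; no NVCF function-id applies.
        return {"model_name": model_name, "target": endpoint,
                "function_id": None, "hosted": False, "overrides": overrides}
    function_id = env.get("ASR_NVCF_FUNCTION_ID") or DEFAULT_FUNCTION_ID
    return {"model_name": model_name, "target": NVCF_TARGET,
            "function_id": function_id, "hosted": True, "overrides": overrides}


def describe(config):
    """One-line echo to show the user before any API credits are spent."""
    overrides = ", ".join(sorted(config["overrides"])) or "none"
    return (f"model={config['model_name']} target={config['target']} "
            f"function_id={config['function_id']} overrides={overrides}")


if __name__ == "__main__":
    print(describe(resolve_config_sketch()))
```

Passing the environment as a plain mapping keeps the resolution testable and makes the echoed line reflect exactly what the run will use.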

3b. Transcribe

For each row in `manifest.jsonl`, transcribe `audio_filepath` and write `per_sample.json` (one JSON object per row, JSONL or a JSON array, caller's choice):

```json
{
  "audio_filepath": "...",
  "ref": "<row.text>",
  "hyp": "<asr output>",
  "term": "<row.term>",
  "entity_category": "<row.entity_category>",
  "ipa_source": "<row.ipa_source>",
  "voice_id": "<row.voice_id>",
  "noise_level": "<row.noise_level>",
  "context_type": "<row.context_type>"
}
```

Recipe (full Python in `references/offline-asr-recipe.md`): `transcribe_manifest(api_key, manifest_path, out_path, language_code="en-US")` opens an offline gRPC stream to NVCF (or to `ASR_ENDPOINT` if set for self-hosted Riva), calls `riva.client.ASRService.offline_recognize` per row (sentences in a clinical manifest are ≤ 30 s, so no streaming/batching is needed), and writes the JSONL above. Same `auth_for` shape as the Stage 1 setup smoke test. The agent harness passes `api_key` explicitly; the recipe reads the three env-var overrides (`ASR_NVCF_FUNCTION_ID`, `ASR_MODEL_NAME`, `ASR_ENDPOINT`) at the top so auditors see the knobs in one place.
Whisper fallback (when Parakeet's NVCF backend faults with `CUDA illegal-memory-access` from Triton) and self-hosted Riva NIM (`ASR_ENDPOINT=localhost:50051`) env-var patterns: see `references/offline-asr-recipe.md` (§Whisper fallback, §Self-hosted Riva NIM).
Resilience knobs deferred to the user. If NVCF returns `RESOURCE_EXHAUSTED` mid-batch, the loop raises on that row; re-run from the failing row. Streaming/batching/retry-with-backoff are out of scope; see `/riva-asr`.
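The row-joining and "re-run from the failing row" behavior described above can be sketched without any network dependency. This is not the `transcribe_manifest` recipe: `write_per_sample`, its `recognize` callable (a stand-in for the real per-row ASR call), and the `start_row` parameter are hypothetical names introduced for illustration.

```python
import json

EXTENSION_FIELDS = ("term", "entity_category", "ipa_source",
                    "voice_id", "noise_level", "context_type")


def write_per_sample(manifest_path, out_path, recognize, start_row=0):
    """Join each manifest row to its ASR hypothesis and write JSONL.

    recognize: callable taking an audio file path and returning a transcript.
               In a real run this would wrap the ASR call; here it is injected.
    start_row: zero-based row index to resume from after a mid-batch failure.
    """
    with open(manifest_path, encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    mode = "a" if start_row else "w"  # append when resuming, overwrite otherwise
    written = 0
    with open(out_path, mode, encoding="utf-8") as out:
        for index, row in enumerate(rows):
            if index < start_row:
                continue
            try:
                hyp = recognize(row["audio_filepath"])
            except Exception as exc:
                # Raise on the failing row and tell the caller where to resume.
                raise RuntimeError(
                    f"row {index} failed; re-run with start_row={index}") from exc
            sample = {"audio_filepath": row["audio_filepath"],
                      "ref": row["text"], "hyp": hyp}
            sample.update({name: row.get(name) for name in EXTENSION_FIELDS})
            out.write(json.dumps(sample, ensure_ascii=False) + "\n")
            out.flush()  # keep completed rows on disk if a later row fails
            written += 1
    return written
```

Flushing after every row is what makes resuming safe: rows before the failure are already on disk, and the resumed run appends from the failing index onward.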

3c. Score four metrics

For every row, compute:

| Metric | What it measures | Why we keep it |
| --- | --- | --- |
| WER | Word error rate (Levenshtein on tokens, after normalization) | Industry standard; blunt instrument for clinical |
| CER | Character error rate | Catches near-misses on long compound names |
| KER | Keyword error rate: did the flagged `term` appear in the hypothesis (normalized, contiguous match)? | Headline clinical signal |
| SER | Sentence error rate (1 if any wrong, 0 if perfect) | Sanity bound; what the doctor experiences |

Normalization (apply to both `ref` and `hyp` before all four metrics):
  1. Lowercase.
  2. NFKD-normalize (ligatures, full-width forms, and other compatibility characters → plain equivalents; smart quotes have no NFKD decomposition and are removed as punctuation in step 3).
  3. Strip punctuation except hyphen.
  4. Collapse whitespace runs to a single space.
Inline scoring recipes `normalize` / `edit_distance` / `wer` / `cer` / `ker` / `ser` (pure-Python, no `jiwer` dependency): see `references/scoring-recipes.md`. Aggregate across rows by taking `mean(per-row score)` for each metric.
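The four normalization steps and three of the metrics can be sketched in pure Python as below. This is one possible reading, not the code in `references/scoring-recipes.md`: treating "punctuation" as Unicode categories starting with `P`, deleting it rather than replacing it with a space, and dividing by the reference length (floored at 1) are assumptions.

```python
import unicodedata


def normalize(text):
    """Lowercase, NFKD-normalize, strip punctuation except hyphen, collapse spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    # Assumption: "punctuation" means Unicode general categories starting with "P";
    # only the ASCII hyphen is kept.
    text = "".join(ch for ch in text
                   if ch == "-" or not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (token lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    ref_tokens, hyp_tokens = normalize(ref).split(), normalize(hyp).split()
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)


def cer(ref, hyp):
    ref_chars, hyp_chars = normalize(ref), normalize(hyp)
    return edit_distance(ref_chars, hyp_chars) / max(len(ref_chars), 1)


def ser(ref, hyp):
    return 0.0 if normalize(ref) == normalize(hyp) else 1.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    ref, hyp = "Give cefazolin now.", "give cefa zolin now"
    print(wer(ref, hyp), cer(ref, hyp), ser(ref, hyp))
```

On the example pair, one split word costs two word-level edits out of three reference tokens but only a single inserted character, which is why WER and CER are reported side by side.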
Strict KER: term words must appear in order, adjacent in the normalized hypothesis. This is conservative: `cefazolin → cefa zolin` counts as a miss. That's the right call clinically, because a downstream pharmacy lookup will fail on the misspelled token.
KER does not punish surrounding errors. A row where the term is correct and the rest of the sentence is garbage still scores KER=0; the WER on that row will surface the broader problem separately.
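The strict per-row check amounts to a contiguous token-window match. The sketch below is a minimal illustration, not the `ker` function from `references/scoring-recipes.md`; it repeats a compact `normalize` so the block runs on its own, and scoring an empty term as a miss is an assumption.

```python
import unicodedata


def normalize(text):
    """Lowercase, NFKD-normalize, strip punctuation except hyphen, collapse spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text
                   if ch == "-" or not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def ker(term, hyp):
    """Return 0.0 if the term's tokens appear in order and adjacent in hyp, else 1.0."""
    term_tokens = normalize(term).split()
    hyp_tokens = normalize(hyp).split()
    width = len(term_tokens)
    if width == 0:
        return 1.0  # assumption: a row with no flagged term is scored as a miss
    # Slide a window of the term's length across the hypothesis tokens.
    for start in range(len(hyp_tokens) - width + 1):
        if hyp_tokens[start:start + width] == term_tokens:
            return 0.0
    return 1.0


if __name__ == "__main__":
    print(ker("cefazolin", "give cefa zolin now"))        # split term: miss
    print(ker("cefazolin", "uh cefazolin blah garbage"))  # term intact: hit
```

Matching on whole tokens rather than substrings is what makes the split `cefa zolin` a miss, and it also prevents a short term from matching inside a longer unrelated word.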

3d. Breakdowns + leaderboard

Write a five-section markdown leaderboard, in this order:
  1. Headline: overall WER, CER, KER, SER for the chosen model.
  2. KER by `entity_category`: drug vs procedure vs anatomy vs ... This is what the user actually cares about for deployment.
  3. KER by `ipa_source`: the most informative single number in the leaderboard. The delta between `merriam-webster` and `magpie_g2p` rows is the proof the SSML override pipeline is doing real work. Read this section aloud to the user.
  4. KER by `noise_level`: clinical environments are loud. `snr_5db` rows are closer to reality than `clean`.
  5. Per-term KER (worst first): these are your Stage 4 fine-tune targets.
A representative `ipa_source` split with the `merriam-webster` vs `magpie_g2p` delta interpretation is in `references/scoring-recipes.md` §Representative ipa_source split. The delta tells the deployment story: if the user sees a wide gap and asks "should we fine-tune?", the answer is not yet; route them back to `/digital-health-clinical-asr-build`'s IPA QA pipeline (Step 2d). See the decision tree below.
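The five sections can be assembled from already-scored rows with a few group-by passes. This is a rough sketch under stated assumptions: each input row is a dict carrying `wer`, `cer`, `ker`, `ser` plus the metadata fields, and the helper names (`ker_by`, `render_leaderboard`) and the exact table layout are illustrative, not the skill's required output format.

```python
from collections import defaultdict


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def ker_by(rows, field):
    """Mean per-row KER for each distinct value of a metadata field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row["ker"])
    return {key: mean(scores) for key, scores in groups.items()}


def ker_table(label, pairs):
    lines = [f"| {label} | KER |", "| --- | --- |"]
    lines += [f"| {key} | {score:.3f} |" for key, score in pairs]
    return lines


def render_leaderboard(rows, model_name):
    out = [f"Leaderboard for {model_name}", "", "1. Headline", "",
           "| WER | CER | KER | SER |", "| --- | --- | --- | --- |",
           "| " + " | ".join(f"{mean(r[m] for r in rows):.3f}"
                             for m in ("wer", "cer", "ker", "ser")) + " |", ""]
    fields = ("entity_category", "ipa_source", "noise_level")
    for number, field in enumerate(fields, start=2):
        split = ker_by(rows, field)
        out += [f"{number}. KER by {field}", ""]
        out += ker_table(field, sorted(split.items())) + [""]
        if field == "ipa_source" and {"merriam-webster", "magpie_g2p"} <= set(split):
            delta = split["magpie_g2p"] - split["merriam-webster"]
            out += [f"Delta (magpie_g2p minus merriam-webster): {delta:.3f}", ""]
    # Section 5: worst terms first, ties broken alphabetically.
    worst_first = sorted(ker_by(rows, "term").items(), key=lambda kv: (-kv[1], kv[0]))
    out += ["5. Per-term KER (worst first)", ""] + ker_table("term", worst_first)
    return "\n".join(out)
```

Printing the delta as its own line under section 3 keeps the number the agent is asked to read aloud easy to find.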

Decision tree (after eval)

Read the priority-category KER (drug KER for most clinical workflows, procedure KER for surgical workflows) and route:

| KER on priority category | Recommend |
| --- | --- |
| > 0.3 | `/digital-health-clinical-asr-finetune`. Manifest is already NeMo-format-ready. Note: rows ≥ 100 is the minimum for a believable fine-tune signal; if the manifest is smaller, grow it first via `/digital-health-clinical-asr-build`. |
| 0.1 – 0.3 | Either expand the term list (back to `/digital-health-clinical-asr-build` with new domain terms, which usually surfaces more failures more cheaply than tuning) or fine-tune. On a first eval, expand. On a later eval where you've already grown the manifest, tune. |
| < 0.1 | Strong baseline. Don't tune yet; you'd be optimizing against a saturated metric. Push the eval harder: add voices, noise levels, contexts, adversarial terms. Loop back to `/digital-health-clinical-asr-build`. |

Special case: `merriam-webster` rows score well but `magpie_g2p` rows are bad. That's a pronunciation-hint coverage gap, not a model gap. Route back to `/digital-health-clinical-asr-build` Step 2d (IPA QA review), not to `/digital-health-clinical-asr-finetune`. Fine-tuning over a TTS-pronunciation gap teaches the ASR model to recognize the TTS engine's mispronunciations, which is the wrong fix.
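The routing above can be written down as a small function. This is a sketch under stated assumptions: the function name `route` and its parameters are hypothetical, and the special case is detected by reusing the table's own thresholds (`merriam-webster` KER below 0.1, `magpie_g2p` KER above 0.3), since the text only says "score well" and "bad" without giving numbers.

```python
FINETUNE = "/digital-health-clinical-asr-finetune"
BUILD = "/digital-health-clinical-asr-build"


def route(priority_ker, manifest_rows, first_eval,
          ker_merriam_webster=None, ker_magpie_g2p=None):
    """Return (skill, reason) for the post-eval decision.

    priority_ker: KER on the priority category (e.g. drug KER).
    manifest_rows: number of rows in the scored manifest.
    first_eval: True if the manifest has not been grown since the first eval.
    """
    # Special case first. Thresholds for "good" and "bad" are an assumption.
    if (ker_merriam_webster is not None and ker_magpie_g2p is not None
            and ker_merriam_webster < 0.1 and ker_magpie_g2p > 0.3):
        return BUILD, "pronunciation-coverage gap: Step 2d IPA QA review"
    if priority_ker > 0.3:
        if manifest_rows < 100:
            return BUILD, "grow the manifest to at least 100 rows before tuning"
        return FINETUNE, "real model gap"
    if priority_ker >= 0.1:
        if first_eval:
            return BUILD, "expand the term list with new domain terms"
        return FINETUNE, "manifest already grown; tune"
    return BUILD, "strong baseline: harden the eval (voices, noise, contexts)"


if __name__ == "__main__":
    # A 0.05 vs 0.40 split by ipa_source hits the special case.
    print(route(0.2, 200, True, ker_merriam_webster=0.05, ker_magpie_g2p=0.40))
```

Checking the special case before the thresholds mirrors the rule that fine-tuning is never the first response to a pronunciation-coverage gap, even when the priority-category KER alone would point to Stage 4.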

Examples

Scenario A: first eval on a fresh cycle-1 manifest. User: "I have `manifest.jsonl` with 200 clinical audio rows already, with `term` and `entity_category` fields. How do I score it?"
→ Skip Stage 2 entirely. Run the audio-existence pre-flight. Pick `parakeet-tdt-0.6b-v2` (default) and echo the choice + resolved function-id. Run the inlined Step 3b recipe (`transcribe_manifest(...)`). Score the four metrics. Produce the five-section leaderboard. Read the by-`ipa_source` split to the user. Apply the decision tree against drug KER.
Scenario B: interpreting a mixed result. User: "Eval shows KER 0.05 on rows tagged `merriam-webster` but 0.40 on rows tagged `magpie_g2p`. Should I fine-tune?"
→ No, this is the special case. The model is fine; the pronunciation hints aren't covering the long-tail terms. Route the user back to `/digital-health-clinical-asr-build` Step 2d to audition the `magpie_g2p` rows and append verified IPA to `pronunciation_overrides.csv`. Re-run Stage 3 after the rebuild before reconsidering Stage 4.

Artifacts produced

  • `per_sample.json`: per-row transcription results with all clinical-extension fields preserved (the ASR `hyp` joined to the manifest's `ref` and metadata)
  • `results.csv`: per-row WER/CER/KER/SER scores
  • `leaderboard_cycle<N>.md`: five-section markdown report
(File names are user-chosen; the names above are conventions the rest of this skill assumes.)

Troubleshooting

  • "No manifest found" → user skipped Stage 2. Route to `/digital-health-clinical-asr-build` or confirm `$MANIFEST_PATH`.
  • All rows KER=1 → normalization mismatch between `ref` and `hyp`. Apply the four normalization steps to both sides.
  • All rows KER=0 but WER high → likely misaligned manifest (audio row mismatch). Spot-check a few `(ref, hyp)` pairs by hand.
  • `merriam-webster` KER low, `magpie_g2p` KER high → pronunciation-coverage gap. Route to `/digital-health-clinical-asr-build` Step 2d. Don't fine-tune; the model isn't the problem.
  • Both `merriam-webster` and `magpie_g2p` KER high → real model gap. Stage 4 is the right route (manifest ≥ 100 rows).
  • `clean` rows fine, `snr_5db` KER balloons → robustness gap; expand noise diversity via `/digital-health-clinical-asr-build`.
  • Riva-NIM and offline NeMo results diverge → Riva preprocessing / `riva-build` flags. Route to `/riva-asr-custom`.
  • `RESOURCE_EXHAUSTED` on large manifests → retry after 30 s; slice + re-run dropped rows. Built-in backoff: `/riva-asr`.
  • `Auth.__init__() got 'ssl_cert'` / CUDA illegal-memory-access on Parakeet function ID: see `references/offline-asr-recipe.md` (ssl_root_cert rename + §Whisper fallback).
Anything else: identify the upstream owner. ASR protocol / NIM deploy → `/riva-asr`. Scoring → here.

Limitations

  • English-only by default. Tokenization + normalization assume Latin script and en-US lexicon.
  • Strict-contiguous KER is conservative. A near-miss like `cefa zolin` counts as a miss. That's intentional: pharmacy lookups fail on near-misses. Users wanting "soft" matching can switch to phoneme-level edit distance, which is a methodology extension, not a config tweak.
  • One model per eval run. Comparing two models means running the eval twice and diffing the two `leaderboard_cycle<N>.md` files (or extending the recipe to write multi-model rows yourself).
  • Hosted-only paths assumed. Self-hosted NIMs work but require `/riva-nim-setup` first.

Next steps

  • Forward (KER > 0.3, manifest ≥ 100 rows): `/digital-health-clinical-asr-finetune`.
  • Back to build (KER 0.1–0.3 on first eval, or `magpie_g2p` gap): `/digital-health-clinical-asr-build`.
  • Stop (KER < 0.1): the eval is saturated. Harden it before declaring victory.
  • Lateral for ASR protocol / auth / streaming / self-hosted NIM details: `/riva-asr`.

References

  • `references/offline-asr-recipe.md`: full Step 3b Python recipe (`transcribe_manifest`, `resolve_asr_config`, `build_asr_auth`), function-ID catalog with call-shape notes, Whisper fallback, self-hosted Riva NIM setup
  • `references/scoring-recipes.md`: pure-Python WER/CER/KER/SER scoring functions with the canonical 4-step normalization