digital-health-clinical-asr-eval

Clinical ASR Flywheel — Stage 3 (Eval)

⚠ Agent: read the Critical Workflow Rules section below before answering. This SKILL.md is self-contained: `evals/`, `references/`, and `assets/` are pointers, not load-bearing. Answer methodology questions from this file directly; only invoke tools when the user explicitly asks to execute against a real manifest.

You are the score-and-route stage. The user arrives with a NeMo-format `manifest.jsonl` (either from `/digital-health-clinical-asr-build` or carried in from elsewhere). You transcribe it via the chosen ASR NIM, score four metrics, produce a five-section leaderboard, and read the decision tree to decide whether the user should advance to `/digital-health-clinical-asr-finetune`, loop back to `/digital-health-clinical-asr-build`, or stop and harden the eval.

This skill does not generate audio. If the manifest is missing or empty, send the user back to `/digital-health-clinical-asr-build`.

Audio leaves your environment — disclose this to the user before any clip is sent

This stage transmits each manifest row's WAV file plus its reference text to an external NVIDIA service. Surface this before invoking the first ASR call:

Service	What gets sent	When
NVIDIA NVCF Parakeet/Nemotron ASR ( `grpc.nvcf.nvidia.com` )	Every audio clip referenced by the manifest (raw PCM bytes), plus the reference transcript and the clinical-extension metadata for scoring	Step 3b, one call per manifest row

The clips should be synthetic audio generated by Stage 2 (Magpie TTS over a user-curated term list), not real patient audio. Do not pass real ASR recordings, real patient encounters, or any PHI through this skill. Scoring then runs locally (pure-Python WER/CER/KER/SER, or `jiwer` if installed). The scoring step itself does not transmit anything; only the ASR step does.

Critical workflow rules (apply on every activation)

For methodology questions (leaderboard structure, KER definition, decision tree), answer from this file. Don't invoke tools, call other skills, or run scripts unless the user explicitly asks to execute against a real manifest. Surface these facts in any response:

1. Off-ramp first. If the user is asking about something outside scoring, route and stop without running any workflow:
   - ASR model-catalog selection / comparison / alternative NIMs → `/riva-asr`
   - ASR auth (API keys, bearer tokens, function IDs) → `/riva-asr`
   - ASR gRPC protocol, streaming, batching, chunking, retries → `/riva-asr`
   - NIM deploy / `riva-build` / `riva-deploy` → `/riva-asr-custom`
   - NGC / Docker / NVIDIA Container Toolkit → `/riva-nim-setup`
   - No manifest yet → `/digital-health-clinical-asr-build`
   - Wants to fine-tune now with a known KER → `/digital-health-clinical-asr-finetune`
2. Default ASR NIM is `nvidia/parakeet-tdt-0.6b-v2` (NVCF function-id `d3fe9151-442b-4204-a70d-5fcc597fd610`, offline gRPC). Env-var overrides: `ASR_MODEL_NAME` (leaderboard display name), `ASR_NVCF_FUNCTION_ID` (swap to a different hosted NIM, e.g. Whisper Large v3 `b702f636-…` while the Parakeet backend is faulting, or a fine-tuned NIM), `ASR_ENDPOINT` (self-hosted gRPC; takes precedence). Echo the chosen NIM and the resolved function-id back before spending API credits.
3. ASR transcription is inlined in Step 3b (NVCF gRPC + `riva.client.ASRService.offline_recognize`, same auth pattern as Stage 1). For deeper protocol/auth questions, alternative NIM catalogs, or self-hosted Riva NIM configuration, defer to `/riva-asr`.
4. KER is the headline. Per-row check: the words of the flagged `term` must appear in order and contiguously in the normalized hypothesis. `cefazolin → cefa zolin` is a miss. Aggregate WER hides clinically dangerous failures; both are reported, but KER is the gate.
5. The by-`ipa_source` split is the most informative single number in the leaderboard. The `merriam-webster` vs `magpie_g2p` delta proves the SSML override pipeline is doing real work. Read it aloud to the user.
6. Special-case routing. `merriam-webster` rows good, `magpie_g2p` rows bad → pronunciation-coverage gap, not a model gap. Route back to `/digital-health-clinical-asr-build` Step 2d. Do NOT recommend `/digital-health-clinical-asr-finetune` as a first response.
7. Five-section leaderboard order. Headline (WER/CER/KER/SER) → KER by `entity_category` → KER by `ipa_source` → KER by `noise_level` → Per-term KER worst-first. The by-`ipa_source` section is mandatory; it is the proof the SSML pipeline works.

Purpose

Score a clinical-ASR manifest, produce a five-section KER leaderboard, and route the user via the post-eval decision tree. Methodology details (metric definitions, normalization, leaderboard order, special-case routing) live in Critical Workflow Rules above and Instructions below.

When to use this skill

Activate on user phrases like:

"Score my ASR manifest"
"What's the KER on Parakeet TDT v2?"
"Run the eval on cycle-N"
"Compare two ASR models on the clinical benchmark"
"Generate the leaderboard"
"I have a manifest.jsonl, how do I score it?"
"Why is KER 0.4 when WER is 0.07?"
"Should we fine-tune?" (this is the eval-side question — the post-eval decision tree lives in this skill)

Literal-keyword non-activation check: if the user's message contains any of `authenticate`, `API key`, `bearer`, `function ID`, `gRPC`, `streaming`, `chunking`, `batching`, `transcription retry`, `riva-build`, `riva-deploy`, `NIM deploy`, `NGC`, `Docker`, `Container Toolkit`, or asks "which ASR model is best" / "compare models" / "vendor differences", do NOT activate the scoring workflow. Apply Critical Workflow Rule #1 (Off-ramp first) above to route to the right sibling skill and stop. This applies even if the user mentions "KER" or "eval" alongside the keyword.
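
The literal-keyword check can be pictured as a small lookup, sketched below under stated assumptions. The function name `off_ramp` and the table `OFF_RAMP_KEYWORDS` are hypothetical, the keyword-to-skill mapping is inferred from the off-ramp list in the Critical Workflow Rules, and the free-form questions ("which ASR model is best" and similar) are deliberately left out because they need more than substring matching.

```python
import re
from typing import Optional

# Hypothetical table: literal keyword -> sibling skill, inferred from the off-ramp list.
OFF_RAMP_KEYWORDS = {
    "authenticate": "/riva-asr",
    "api key": "/riva-asr",
    "bearer": "/riva-asr",
    "function id": "/riva-asr",
    "grpc": "/riva-asr",
    "streaming": "/riva-asr",
    "chunking": "/riva-asr",
    "batching": "/riva-asr",
    "transcription retry": "/riva-asr",
    "riva-build": "/riva-asr-custom",
    "riva-deploy": "/riva-asr-custom",
    "nim deploy": "/riva-asr-custom",
    "ngc": "/riva-nim-setup",
    "docker": "/riva-nim-setup",
    "container toolkit": "/riva-nim-setup",
}


def off_ramp(message: str) -> Optional[str]:
    """Return the sibling skill to route to, or None if scoring may activate."""
    text = message.lower()
    for keyword, skill in OFF_RAMP_KEYWORDS.items():
        # Require non-alphanumeric neighbours so "ngc" does not fire inside a longer word.
        pattern = r"(?<![a-z0-9])" + re.escape(keyword) + r"(?![a-z0-9])"
        if re.search(pattern, text):
            return skill
    return None
```

A message such as "How do I set my API key for the KER eval?" routes to `/riva-asr` even though it mentions KER, which matches the rule that the keyword wins.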

Prerequisites

- A NeMo-format manifest with the clinical extension fields (`term`, `entity_category`, `ipa_source`, `voice_id`, `noise_level`, `context_type`). The schema is documented in the build skill's `references/manifest-schema.md`.
- `NVIDIA_API_KEY` exported (Stage 1 prerequisite still applies).
- `nvidia-riva-client` + `soundfile` installed (Stage 1 prerequisite). For self-hosted Riva NIM details, see `/riva-asr` Option B.
- Audio files actually present on disk: run the audio-existence pre-flight from the manifest-schema reference before spending API credits.
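
As an illustration of the last prerequisite, here is a minimal pre-flight sketch. It is not the check from `references/manifest-schema.md`; the function name `preflight`, the `REQUIRED_FIELDS` tuple, and the choice to resolve relative audio paths against the manifest's directory are all assumptions made for this example.

```python
import json
from pathlib import Path

# Assumed field list: the NeMo basics plus the clinical extension fields named above.
REQUIRED_FIELDS = ("audio_filepath", "text", "term", "entity_category",
                   "ipa_source", "voice_id", "noise_level", "context_type")


def preflight(manifest_path):
    """Return (rows, problems) for a JSONL manifest without calling any ASR service."""
    rows, problems = [], []
    manifest = Path(manifest_path)
    with manifest.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = [name for name in REQUIRED_FIELDS if name not in row]
            if missing:
                problems.append(f"line {lineno}: missing fields {missing}")
            audio = Path(row.get("audio_filepath", ""))
            if not audio.is_absolute():
                # Assumption: relative paths are relative to the manifest file.
                audio = manifest.parent / audio
            if not audio.is_file():
                problems.append(f"line {lineno}: audio not found: {audio}")
            rows.append(row)
    if not rows:
        problems.append("manifest is empty")
    return rows, problems
```

If `problems` is non-empty, the sensible move is to stop and report it before any clip is sent.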

Instructions

3a. Pick the ASR NIM

Default: `nvidia/parakeet-tdt-0.6b-v2` via NVCF gRPC (offline), function-id `d3fe9151-442b-4204-a70d-5fcc597fd610`. This is NVIDIA's current English ASR recommendation: fastest/cheapest in the catalog, and supported in NeMo's stock SFT recipe, so the Stage 3 baseline and a Stage 4 fine-tune ride the same model family.

Three runtime env-var override knobs (`ASR_MODEL_NAME` for leaderboard display, `ASR_NVCF_FUNCTION_ID` to swap to a different hosted NIM, `ASR_ENDPOINT` for self-hosted gRPC) plus the full alternate-NIM catalog (Parakeet TDT 1.1B, Parakeet CTC 1.1B, Whisper Large v3, Nemotron streaming) with function IDs and call-shape notes are in `references/offline-asr-recipe.md`.

Echo the chosen NIM, the resolved function-id, and any env-var overrides to the user before spending API credits. A 200-row manifest on hosted Parakeet TDT v2 is cheap; an accidental run against the wrong model on a 1,000-row manifest is not.
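
The override precedence described above can be sketched as follows. This is not the `resolve_asr_config` from the reference recipe, whose signature is not shown here; `resolve_asr_config_sketch`, `describe`, the returned dict shape, and port 443 on the NVCF host are assumptions for illustration.

```python
import os

DEFAULT_MODEL = "nvidia/parakeet-tdt-0.6b-v2"
DEFAULT_FUNCTION_ID = "d3fe9151-442b-4204-a70d-5fcc597fd610"
NVCF_URI = "grpc.nvcf.nvidia.com:443"  # host is from this skill; the port is an assumption


def resolve_asr_config_sketch(env=None):
    """Resolve the ASR target from env vars; ASR_ENDPOINT takes precedence."""
    env = os.environ if env is None else env
    model_name = env.get("ASR_MODEL_NAME") or DEFAULT_MODEL
    endpoint = env.get("ASR_ENDPOINT")
    if endpoint:
        # Self-hosted gRPC: no NVCF function-id is involved.
        return {"model_name": model_name, "uri": endpoint,
                "function_id": None, "hosted": False}
    function_id = env.get("ASR_NVCF_FUNCTION_ID") or DEFAULT_FUNCTION_ID
    return {"model_name": model_name, "uri": NVCF_URI,
            "function_id": function_id, "hosted": True}


def describe(cfg):
    """One-line summary to echo back to the user before any API credits are spent."""
    target = f"function-id {cfg['function_id']}" if cfg["hosted"] else "self-hosted endpoint"
    return f"ASR NIM: {cfg['model_name']} via {cfg['uri']} ({target})"
```

Passing a plain dict as `env` keeps the resolution testable without touching the real environment.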

3b. Transcribe

For each row in `manifest.jsonl`, transcribe `audio_filepath` and write `per_sample.json` (one JSON object per row, as JSONL or a JSON array, caller's choice):

```json
{
  "audio_filepath": "...",
  "ref": "<row.text>",
  "hyp": "<asr output>",
  "term": "<row.term>",
  "entity_category": "<row.entity_category>",
  "ipa_source": "<row.ipa_source>",
  "voice_id": "<row.voice_id>",
  "noise_level": "<row.noise_level>",
  "context_type": "<row.context_type>"
}
```

Recipe (full Python in `references/offline-asr-recipe.md`): `transcribe_manifest(api_key, manifest_path, out_path, language_code="en-US")` opens an offline gRPC connection to NVCF (or to `ASR_ENDPOINT` if set for self-hosted Riva), calls `riva.client.ASRService.offline_recognize` once per row (sentences in a clinical manifest are ≤ 30 s, so no streaming/batching is needed), and writes the JSONL above. It uses the same `auth_for` shape as the Stage 1 setup smoke test. The agent harness passes `api_key` explicitly; the recipe reads the three env-var overrides (`ASR_NVCF_FUNCTION_ID`, `ASR_MODEL_NAME`, `ASR_ENDPOINT`) at the top so auditors see the knobs in one place.

Whisper fallback (when Parakeet's NVCF backend faults with `CUDA illegal-memory-access` from Triton) and self-hosted Riva NIM (`ASR_ENDPOINT=localhost:50051`) env-var patterns: see `references/offline-asr-recipe.md` (§Whisper fallback, §Self-hosted Riva NIM).

Resilience knobs are deferred to the user. If NVCF returns `RESOURCE_EXHAUSTED` mid-batch, the loop raises on that row; re-run from the failing row. Streaming/batching/retry-with-backoff are out of scope; see `/riva-asr`.
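
The join between manifest rows and ASR output, including the "re-run from the failing row" behaviour, might look like the sketch below. It is not the reference `transcribe_manifest`: the ASR call is injected as a hypothetical `recognize(path) -> str` callable (in practice it would wrap `riva.client.ASRService.offline_recognize`), and `build_per_sample`, `write_jsonl`, and the `start` parameter are names invented for this example.

```python
import json

CLINICAL_FIELDS = ("term", "entity_category", "ipa_source",
                   "voice_id", "noise_level", "context_type")


def build_per_sample(rows, recognize, start=0):
    """Yield one per-sample record per manifest row, beginning at index `start`."""
    for index, row in enumerate(rows):
        if index < start:
            continue  # already transcribed in an earlier, interrupted run
        try:
            hyp = recognize(row["audio_filepath"])
        except Exception as exc:
            # Surface the failing row so the caller can resume from it.
            raise RuntimeError(f"row {index} failed; re-run with start={index}") from exc
        record = {"audio_filepath": row["audio_filepath"],
                  "ref": row["text"], "hyp": hyp}
        record.update({name: row.get(name) for name in CLINICAL_FIELDS})
        yield record


def write_jsonl(records, out_path):
    """Append records as JSONL so a resumed run extends the same file."""
    with open(out_path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Injecting `recognize` keeps the joining logic testable offline, which fits the rule that nothing is transmitted until the user has been told.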

3c. Score four metrics

For every row, compute:

Metric	What it measures	Why we keep it
WER	Word error rate (Levenshtein on tokens, after normalization)	Industry standard; blunt instrument for clinical
CER	Character error rate	Catches near-misses on long compound names
KER ★	Keyword error rate — did the flagged `term` appear in the hypothesis (normalized, contiguous match)?	Headline clinical signal
SER	Sentence error rate (1 if the normalized hypothesis differs from the reference at all, 0 if it matches exactly)	Sanity bound; what the doctor experiences

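Normalization (apply to both `ref` and `hyp` before all four metrics):

1. Lowercase.
2. NFKD-normalize (folds compatibility forms such as ligatures and full-width characters; smart quotes have no NFKD mapping to ASCII, so make sure step 3 strips them).
3. Strip punctuation except hyphen.
4. Collapse whitespace runs to a single space.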

Inline scoring recipes (`normalize`, `edit_distance`, `wer`, `cer`, `ker`, `ser`; pure-Python, no `jiwer` dependency): see `references/scoring-recipes.md`. Aggregate across rows by taking `mean(per-row score)` for each metric.
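
For orientation, here is a minimal sketch of the normalization plus WER, CER, and SER. The function names mirror those listed above, but the bodies are an independent illustration, not the code in `references/scoring-recipes.md`; the regex used for punctuation stripping and the handling of an empty reference are assumptions.

```python
import re
import unicodedata
from statistics import mean


def normalize(text):
    """Lowercase, NFKD-normalize, strip punctuation except hyphen, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text.lower())
    # Drop everything that is not a word character, whitespace, or hyphen.
    # Underscores are removed too; combining marks and smart quotes fall out here.
    text = re.sub(r"[^\w\s-]|_", "", text)
    return re.sub(r"\s+", " ", text).strip()


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (token lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    ref_tokens, hyp_tokens = normalize(ref).split(), normalize(hyp).split()
    # Assumption: an empty reference divides by 1 rather than raising.
    return edit_distance(ref_tokens, hyp_tokens) / (len(ref_tokens) or 1)


def cer(ref, hyp):
    ref_chars, hyp_chars = normalize(ref), normalize(hyp)
    return edit_distance(ref_chars, hyp_chars) / (len(ref_chars) or 1)


def ser(ref, hyp):
    return 0.0 if normalize(ref) == normalize(hyp) else 1.0


def aggregate(per_row_scores):
    """Corpus-level score as the mean of per-row scores, as described above."""
    return mean(per_row_scores)
```

Note that averaging per-row WER weights short and long sentences equally, which differs from pooling all edits over all reference tokens; the sketch follows the per-row mean stated above.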

Strict KER: term words must appear in order and adjacent in the normalized hypothesis. This is conservative: `cefazolin → cefa zolin` counts as a miss. That's the right call clinically, because a downstream pharmacy lookup will fail on the misspelled token.

KER does not punish surrounding errors. A row where the term is correct and the rest of the sentence is garbage still scores KER=0; the WER on that row will surface the broader problem separately.
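
The strict contiguous match can be sketched per row as below. This is an illustration rather than the reference `ker`; it repeats a compact `normalize` so the block stands alone, matches on whole tokens (so an inflected form such as a plural would count as a miss), and treats an empty term as a non-error, all of which are assumptions.

```python
import re
import unicodedata


def normalize(text):
    """Same four steps as above: lowercase, NFKD, strip punctuation except hyphen, collapse spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = re.sub(r"[^\w\s-]|_", "", text)
    return re.sub(r"\s+", " ", text).strip()


def ker(term, hyp):
    """Return 0.0 if the term's tokens appear contiguously and in order in hyp, else 1.0."""
    term_tokens = normalize(term).split()
    hyp_tokens = normalize(hyp).split()
    width = len(term_tokens)
    if width == 0:
        return 0.0  # assumption: a row with no flagged term cannot be a keyword error
    # Slide a window of the term's length across the hypothesis tokens.
    for start in range(len(hyp_tokens) - width + 1):
        if hyp_tokens[start:start + width] == term_tokens:
            return 0.0
    return 1.0
```

With this definition, `ker("cefazolin", "give cefa zolin now")` is a miss while a hypothesis that gets the term right but garbles the rest of the sentence still scores 0.0.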

3d. Breakdowns + leaderboard

Write a five-section markdown leaderboard, in this order:

1. Headline: overall WER, CER, KER, SER for the chosen model.
2. KER by `entity_category`: drug vs procedure vs anatomy vs ... This is what the user actually cares about for deployment.
3. KER by `ipa_source`: the most informative single number in the leaderboard. The delta between `merriam-webster` and `magpie_g2p` rows is the proof the SSML override pipeline is doing real work. Read this section aloud to the user.
4. KER by `noise_level`: clinical environments are loud. `snr_5db` rows are closer to reality than `clean`.
5. Per-term KER (worst first): these are your Stage 4 fine-tune targets.

A representative `ipa_source` split with the `merriam-webster` vs `magpie_g2p` delta interpretation is in `references/scoring-recipes.md` §Representative ipa_source split. The delta tells the deployment story: if the user sees a wide gap and asks "should we fine-tune?", the answer is not yet; route them back to `/digital-health-clinical-asr-build`'s IPA QA pipeline (Step 2d). See the decision tree below.
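
The breakdown sections reduce to a group-by over per-row KER. The sketch below assumes each record already carries a numeric `ker` key (0.0 or 1.0) alongside the clinical fields; `ker_by`, `render_section`, and the markdown table layout are hypothetical choices, not the format the reference recipe emits.

```python
from collections import defaultdict


def ker_by(records, field):
    """Mean per-row KER grouped by one metadata field."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[field]].append(record["ker"])
    return {key: sum(values) / len(values) for key, values in buckets.items()}


def render_section(title, scores, worst_first=False):
    """Render one leaderboard section as a small markdown table."""
    if worst_first:
        items = sorted(scores.items(), key=lambda kv: -kv[1])  # highest KER first
    else:
        items = sorted(scores.items(), key=lambda kv: kv[0])   # alphabetical by group
    lines = [f"## {title}", "", "| Group | KER |", "|---|---|"]
    lines += [f"| {key} | {value:.3f} |" for key, value in items]
    return "\n".join(lines)


def breakdown_sections(records):
    """Sections 2 to 5, in the mandated order; the headline is computed separately."""
    return "\n\n".join([
        render_section("KER by entity_category", ker_by(records, "entity_category")),
        render_section("KER by ipa_source", ker_by(records, "ipa_source")),
        render_section("KER by noise_level", ker_by(records, "noise_level")),
        render_section("Per-term KER (worst first)", ker_by(records, "term"), worst_first=True),
    ])
```

Keeping the section order in one function makes it harder to drop the by-`ipa_source` section by accident.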

Decision tree (after eval)

Read the priority-category KER (drug KER for most clinical workflows, procedure KER for surgical workflows) and route:

KER on priority category	Recommend
> 0.3	`/digital-health-clinical-asr-finetune` . Manifest is already NeMo-format-ready. Note: rows ≥ 100 is the minimum for a believable fine-tune signal; if the manifest is smaller, grow it first via `/digital-health-clinical-asr-build` .
0.1 – 0.3	Either expand the term list (back to `/digital-health-clinical-asr-build` with new domain terms — usually surfaces more failures cheaper than tuning) or fine-tune. On a first eval, expand. On a later eval where you've already grown the manifest, tune.
< 0.1	Strong baseline. Don't tune yet — you'd be optimizing against a saturated metric. Push the eval harder: add voices, noise levels, contexts, adversarial terms. Loop back to `/digital-health-clinical-asr-build` .

Special case: `merriam-webster` rows score well but `magpie_g2p` rows are bad. That's a pronunciation-hint coverage gap, not a model gap. Route back to `/digital-health-clinical-asr-build` Step 2d (IPA QA review), not to `/digital-health-clinical-asr-finetune`. Fine-tuning over a TTS-pronunciation gap teaches the ASR model to recognize the TTS engine's mispronunciations, which is the wrong fix.
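
The table and the special case can be condensed into one routing function, sketched here under assumptions. The KER bands and the 100-row floor come from the table above; the function name `route`, the `first_eval` flag, and the numeric test for the special case (`merriam-webster` KER below 0.1 and `magpie_g2p` KER above 0.3) are hypothetical, since this skill does not define "good" and "bad" numerically.

```python
BUILD = "/digital-health-clinical-asr-build"
FINETUNE = "/digital-health-clinical-asr-finetune"


def route(priority_ker, n_rows, first_eval=True, mw_ker=None, g2p_ker=None):
    """Return (skill, reason) for the post-eval decision tree."""
    # Special case first: it overrides the KER bands.
    # The 0.1 / 0.3 cut-offs here are illustrative assumptions.
    if mw_ker is not None and g2p_ker is not None and mw_ker < 0.1 and g2p_ker > 0.3:
        return BUILD, "pronunciation-coverage gap: review IPA in Step 2d"
    if priority_ker > 0.3:
        if n_rows < 100:
            return BUILD, "grow the manifest to at least 100 rows before fine-tuning"
        return FINETUNE, "real model gap on the priority category"
    if priority_ker >= 0.1:
        if first_eval:
            return BUILD, "expand the term list before tuning"
        return FINETUNE, "manifest already grown; tune"
    return BUILD, "saturated metric: harden the eval"
```

For Scenario B style numbers, `route(0.25, 200, mw_ker=0.05, g2p_ker=0.40)` returns the build skill with the Step 2d reason rather than the fine-tune skill.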

Examples

Scenario A: first eval on a fresh cycle-1 manifest. User: "I have `manifest.jsonl` with 200 clinical audio rows already, with `term` and `entity_category` fields. How do I score it?" → Skip Stage 2 entirely. Run the audio-existence pre-flight. Pick `parakeet-tdt-0.6b-v2` (default) and echo the choice + resolved function-id. Run the inlined Step 3b recipe (`transcribe_manifest(...)`). Score the four metrics. Produce the five-section leaderboard. Read the by-`ipa_source` split to the user. Apply the decision tree against drug KER.

Scenario B: interpreting a mixed result. User: "Eval shows KER 0.05 on rows tagged `merriam-webster` but 0.40 on rows tagged `magpie_g2p`. Should I fine-tune?" → No, this is the special case. The model is fine; the pronunciation hints aren't covering the long-tail terms. Route the user back to `/digital-health-clinical-asr-build` Step 2d to audition the `magpie_g2p` rows and append verified IPA to `pronunciation_overrides.csv`. Re-run Stage 3 after the rebuild before reconsidering Stage 4.

Artifacts produced

- `per_sample.json`: per-row transcription results with all clinical-extension fields preserved (the ASR `hyp` joined to the manifest's `ref` and metadata)
- `results.csv`: per-row WER/CER/KER/SER scores
- `leaderboard_cycle<N>.md`: five-section markdown report

(File names are user-chosen; the names above are conventions the rest of this skill assumes.)

Troubleshooting

- "No manifest found" → user skipped Stage 2. Route to `/digital-health-clinical-asr-build` or confirm `$MANIFEST_PATH`.
- All rows KER=1 → normalization mismatch between `ref` and `hyp`. Apply the four normalization steps to both sides.
- All rows KER=0 but WER high → likely misaligned manifest (audio row mismatch). Spot-check a few `(ref, hyp)` pairs by hand.
- `merriam-webster` KER low, `magpie_g2p` KER high → pronunciation-coverage gap. Route to `/digital-health-clinical-asr-build` Step 2d. Don't fine-tune; the model isn't the problem.
- Both `merriam-webster` and `magpie_g2p` KER high → real model gap. Stage 4 is the right route (manifest ≥ 100 rows).
- `clean` rows fine, `snr_5db` KER balloons → robustness gap; expand noise diversity via `/digital-health-clinical-asr-build`.
- Riva-NIM and offline NeMo results diverge → Riva preprocessing / `riva-build` flags. Route to `/riva-asr-custom`.
- `RESOURCE_EXHAUSTED` on large manifests → retry after 30 s; slice + re-run dropped rows. For built-in backoff, see `/riva-asr`.
- `Auth.__init__() got 'ssl_cert'` / CUDA illegal-memory-access on Parakeet function ID: see `references/offline-asr-recipe.md` (ssl_root_cert rename + §Whisper fallback).

Anything else: identify the upstream owner. ASR protocol / NIM deploy → `/riva-asr`. Scoring → here.

Limitations

- English-only by default. Tokenization + normalization assume Latin script and en-US lexicon.
- Strict-contiguous KER is conservative. A near-miss like `cefa zolin` counts as a miss. That's intentional: pharmacy lookups fail on near-misses. Users wanting "soft" matching can switch to phoneme-level edit distance, which is a methodology extension, not a config tweak.
- One model per eval run. Comparing two models means running the eval twice and diffing the two `leaderboard_cycle<N>.md` files (or extending the recipe to write multi-model rows yourself).
- Hosted-only paths assumed. Self-hosted NIMs work but require `/riva-nim-setup` first.

Next steps

- Forward (KER > 0.3, manifest ≥ 100 rows): `/digital-health-clinical-asr-finetune`.
- Back to build (KER 0.1–0.3 on first eval, or `magpie_g2p` gap): `/digital-health-clinical-asr-build`.
- Stop (KER < 0.1): the eval is saturated. Harden it before declaring victory.
- Lateral for ASR protocol / auth / streaming / self-hosted NIM details: `/riva-asr`.

References

- `references/offline-asr-recipe.md`: full Step 3b Python recipe (`transcribe_manifest`, `resolve_asr_config`, `build_asr_auth`), function-ID catalog with call-shape notes, Whisper fallback, self-hosted Riva NIM setup
- `references/scoring-recipes.md`: pure-Python WER/CER/KER/SER scoring functions with the canonical 4-step normalization
