digital-health-clinical-asr-build
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
Clinical ASR Flywheel — Stage 2 (Build the benchmark)
⚠ Agent: read this entire SKILL.md before answering. This stage is conversational and gated. Specifically: ask the user 1–2 specialty-aware clarifying questions before proposing terms (Step 2a), walk them through the two-tier IPA pipeline (override → merriam-webster → magpie_g2p) in Step 2c, hit the explicit QA-mode audition gate in Step 2d before full Cartesian synthesis, and name KER as the headline metric they'll see in Stage 3. Skipping any of these defeats the methodology.
You are the curate-and-synthesize stage. The user arrives from `/digital-health-clinical-asr-setup` and leaves with a NeMo-format `manifest.jsonl` plus the audio it references, both ready for scoring at `/digital-health-clinical-asr-eval`.
Be conversational. This is the warmest, most domain-aware step in the flywheel: you're asking a clinician (or someone who works with them) which terms hurt today and shaping a benchmark around their reality. Ask short, focused questions. Show the user what's being added. Don't lecture.
Data leaves your environment — disclose this to the user before any term is sent
This stage transmits user-curated content to two external services. Surface this to the user before invoking either call:
| Service | What gets sent | When |
|---|---|---|
| Merriam-Webster | One HTTP request per term in the seed list; the term goes in the URL path | Step 2c (see the MW lookup choices there) |
| NVIDIA NVCF Magpie TTS | Each generated clinical sentence (text, plus any SSML IPA wrappers) | Steps 2d and 2e, every synthesis call |
Both endpoints expect non-PHI synthetic content: the term list you curate and the sentences that `/data-designer` (or your fallback templates) generates from it. Do not pass real patient records, real ASR transcripts, or any PHI through this skill. If the term list itself is sensitive (proprietary drug codenames, unreleased product names, customer-confidential indications), confirm with the user that external-API transmission is acceptable under their organization's data-governance policy before proceeding.
If no MW transmission is acceptable: take Path C below (skip MW; the pipeline falls through to Magpie G2P with reduced coverage on long-tail terms).
Purpose
Curate a clinical-specialty term list, generate eval audio for it through Magpie TTS with a two-tier IPA pipeline, and write a NeMo-format manifest tagged with the clinical-extension fields (`term`, `entity_category`, `ipa_source`, `voice_id`, `noise_level`, `context_type`). The output is the input to Stage 3.
By the end the user has:

    $EVAL_DIR/cycle<N>/
    ├── audio/<slug>.wav               synthesized clips
    ├── manifest.jsonl                 NeMo format + clinical extension
    ├── term_seed.csv                  the curated input
    └── pronunciation_overrides.csv    appendable across cycles

(`$EVAL_DIR` is the user's own choice; this skill does not impose a layout. The structure above is a recommendation, not a requirement.)
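To make the target concrete, here is a rough sketch of what one manifest row might look like. The keys `audio_filepath`, `duration`, and `text` are NeMo's usual manifest fields; the six extension keys are the ones named above. The slug, duration, voice identifier, and the helper name `to_manifest_line` are illustrative assumptions, and the authoritative schema is whatever `references/manifest-schema.md` defines.

```python
import json

# Extension fields named in this skill; values below are illustrative only.
CLINICAL_FIELDS = (
    "term", "entity_category", "ipa_source",
    "voice_id", "noise_level", "context_type",
)

row = {
    "audio_filepath": "audio/cefazolin-mia-clean-dictation.wav",  # hypothetical slug
    "duration": 4.2,  # seconds, illustrative
    "text": "the patient received cefazolin two grams prior to incision",
    "term": "cefazolin",
    "entity_category": "drug",
    "ipa_source": "override",
    "voice_id": "mia",  # hypothetical voice identifier
    "noise_level": "clean",
    "context_type": "dictation",
}


def to_manifest_line(row: dict) -> str:
    """Serialize one row as a single JSON line, refusing rows that lack extension fields."""
    missing = [key for key in CLINICAL_FIELDS if key not in row]
    if missing:
        raise ValueError(f"row is missing clinical fields: {missing}")
    # ensure_ascii=False keeps any IPA or accented text readable in the file.
    return json.dumps(row, ensure_ascii=False)


line = to_manifest_line(row)
print(line)
```

Writing one such line per synthesized clip gives the one-JSON-object-per-line layout that the eval stage reads.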
When to use this skill
Activate on user phrases like:
- "Build a clinical ASR benchmark"
- "Curate drug names / procedure names for ASR eval"
- "Generate eval audio for medical terms"
- "Create a NeMo manifest from clinical terms"
- "Add oncology / cardiology / ortho terms to my benchmark"
- "Audition the TTS pronunciation for these drug names"
- "Make me a cycle-N manifest"
Do not activate when (also: if the message mentions `auth`, `API key`, `gRPC`, `streaming`, `riva-build`, `NIM deploy`, `NGC`, or `Docker`, route per the bullets below and stop):
- The user already has a manifest and wants to score it → `/digital-health-clinical-asr-eval`
- The user wants to fine-tune on an existing manifest → `/digital-health-clinical-asr-finetune`
- The user is asking generic TTS / SSML / voice-cloning / voice-catalog questions → `/read-aloud` (or `/riva-tts`)
- TTS/ASR auth / API keys / gRPC / streaming → `/riva-tts` or `/riva-asr`
- NIM deploy or `riva-build` / `riva-deploy` flags → `/riva-asr-custom` or `/riva-tts-custom`
- NGC / Docker / NVIDIA Container Toolkit → `/riva-nim-setup`
- The user is asking generic synthetic-data questions → `/data-designer`
Prerequisites
- `/digital-health-clinical-asr-setup` completed: `NVIDIA_API_KEY` exported, Python deps installed, the six upstream skills confirmed.
- `/read-aloud` (or `/riva-tts`) reachable. Hosted Magpie via NVCF is the default. Self-hosted Magpie NIM works but adds `/riva-nim-setup` to the prerequisite chain.
- `/data-designer` reachable. Template fallback is acceptable for a first cycle if `/data-designer` is unavailable, but tag those rows so future cycles can re-generate.
- A working directory the user owns. The skill recommends `$EVAL_DIR/cycle<N>/` but does not enforce it.
Instructions
2a. Specialty interview → term_seed.csv
Ask one question at a time. The goal is to surface 4–10 candidate terms with the right `entity_category`, not to write a textbook.
Questions, in order:
- What specialty / workflow is this for? (oncology dictation, ICU handoff, psych intake, ortho post-op, …)
- What ASR failure modes have you seen? — drug names, multi-word procedures, abbreviations, compound conditions.
- Which terms come up daily vs which are the hard ones? — daily-common terms become the sanity baseline; daily-hard terms become the signal.
Propose 4–10 candidate terms with `entity_category`. Confirm with the user before writing. Then write `term_seed.csv`:

    term,entity_category
    cefazolin,drug
    acetabular reamer,procedure
    tibial plateau,anatomy
    femoroacetabular impingement,condition
    hemoglobin a1c,lab
    respiratory therapist,role

The category vocabulary is fixed. KER keys off it. Allowed values:

    drug | procedure | anatomy | condition | lab | role

If the user proposes a new category, push back: either it maps to one of the six, or the methodology needs a deliberate extension (which is a future cycle's job, not a one-off ad-hoc add).
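Because the category vocabulary is fixed, it may help to check the seed file before anything downstream reads it. The following is a minimal sketch, assuming the two-column layout shown above; the function name `validate_term_seed` is hypothetical and not part of the skill.

```python
import csv
import io

# The six allowed values listed above.
ALLOWED_CATEGORIES = {"drug", "procedure", "anatomy", "condition", "lab", "role"}

SEED_CSV = """term,entity_category
cefazolin,drug
acetabular reamer,procedure
tibial plateau,anatomy
femoroacetabular impingement,condition
hemoglobin a1c,lab
respiratory therapist,role
"""


def validate_term_seed(csv_text: str) -> list:
    """Return cleaned rows, raising ValueError on an empty term or unknown category."""
    cleaned = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        term = (row.get("term") or "").strip()
        category = (row.get("entity_category") or "").strip()
        if not term:
            raise ValueError(f"line {line_no}: empty term")
        if category not in ALLOWED_CATEGORIES:
            raise ValueError(f"line {line_no}: unknown entity_category {category!r}")
        cleaned.append({"term": term, "entity_category": category})
    return cleaned


rows = validate_term_seed(SEED_CSV)
print(f"{len(rows)} terms validated")
```

A row such as `metoprolol,medication` would raise here, which is the point at which to push back and map it to `drug`.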
2b. Sentence generation via /data-designer
Brief `/data-designer` with:
> For each row in `term_seed.csv`, generate one or more natural English sentences embedding `term` in a way that fits the row's `entity_category`. Output schema: `{term, entity_category, sentence, context_type}`. Generate 3–5 `context_type` variants per term. Initial `context_type` vocabulary: `dictation`, `handoff`, `chart_note`, `history`. Sentence length 10–30 words.

The output of this step is a per-term sentence variants file. Any filename is fine; pick one and use it consistently across the cycle directory.
Template fallback. If `/data-designer` is unavailable, use a 4-template fallback (one per `context_type`) and substitute `term` mechanically. Tag those rows in the manifest (`context_type` is set, the sentence is just less natural) so a future cycle can regenerate.
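The template fallback can be sketched roughly as below. The four template strings, the `sentence_source` tag, and the function name `fallback_sentences` are all assumptions made for illustration; the skill only says to use one template per `context_type` and to tag the rows somehow.

```python
# Hypothetical templates, one per context_type; the wording is illustrative only.
FALLBACK_TEMPLATES = {
    "dictation": "The plan discussed today includes {term} and we will reassess at the next scheduled visit.",
    "handoff": "For the overnight team, please note {term} was documented earlier and needs follow-up before morning rounds.",
    "chart_note": "Chart note for today: {term} was reviewed with the care team and recorded in the record.",
    "history": "The patient's history is notable for {term} which was first documented several years ago.",
}


def fallback_sentences(term: str, entity_category: str) -> list:
    """Build one mechanically substituted sentence per context_type."""
    return [
        {
            "term": term,
            "entity_category": entity_category,
            "sentence": template.format(term=term),
            "context_type": context_type,
            # Hypothetical tag so a later cycle can find and regenerate these rows.
            "sentence_source": "template",
        }
        for context_type, template in FALLBACK_TEMPLATES.items()
    ]


variants = fallback_sentences("cefazolin", "drug")
for variant in variants:
    print(variant["context_type"], "->", variant["sentence"])
```

The templates are deliberately category-neutral, which is why the resulting sentences read less naturally than `/data-designer` output.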
2c. Two-tier IPA tagging (the load-bearing quality lever)
Every term passes through three stages in order (two IPA tiers, then a fall-through):
- Override: `pronunciation_overrides.csv` carries verified IPA the team has audited. If `term` matches a row here, the override wins.
- Merriam-Webster: for un-overridden terms, fetch the MW respelling, convert it to IPA, and validate it against Magpie's en-US phoneme set. If conversion and validation both succeed, the term is tagged `merriam-webster`.
- Magpie G2P (fall-through): if neither the override nor MW produces a valid IPA, the plain text is passed to Magpie's neural G2P at synthesis time. The row is tagged `magpie_g2p`.

Every manifest row carries the `ipa_source` tag (`override | merriam-webster | magpie_g2p`). The delta between `merriam-webster` and `magpie_g2p` rows in the Stage 3 leaderboard is the proof that the pronunciation strategy is working; call it out explicitly when you produce the leaderboard.
Three MW lookup choices (a successful MW lookup tags `merriam-webster`):
- Path A: `dictionaryapi.com` JSON API + `DICTIONARY_API_KEY` (free at dictionaryapi.com). Recommended for standalone use.
- Path B: HTML scrape of `merriam-webster.com`. No key, but brittle to site HTML changes; recipe inlined in `references/pronunciation-pipeline.md`.
- Path C: skip MW and fall through to Magpie G2P, with weaker long-tail coverage.

Both recipes and the full respelling→IPA table live in `references/pronunciation-pipeline.md`. The Path A function takes `api_key` as an argument (it never reads `os.environ`); pass `None` to skip MW.
`pronunciation_overrides.csv`:

    term,ipa,verified_by,verified_at,notes
    cefazolin,sɛfəˈzoʊlɪn,brandoing,2026-05-13,confirmed against MW respelling + ear test

Append-only across cycles. Re-running the build later picks up new entries automatically.
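The ordering of the stages can be sketched as a small resolver. This is only an illustration of the precedence described above: `resolve_ipa`, the injected `mw_lookup` and `is_valid_ipa` callables, and the placeholder string `IPA_FROM_MW` are assumptions, and the real recipes live in `references/pronunciation-pipeline.md`.

```python
from typing import Callable, Optional, Tuple


def resolve_ipa(
    term: str,
    overrides: dict,
    mw_lookup: Optional[Callable[[str], Optional[str]]],
    is_valid_ipa: Callable[[str], bool],
) -> Tuple[Optional[str], str]:
    """Return (ipa, ipa_source) using override, then MW, then the G2P fall-through."""
    key = term.strip().lower()
    # Stage 1: an audited override always wins.
    if key in overrides:
        return overrides[key], "override"
    # Stage 2: MW, only if a lookup is configured (None models Path C, skip MW).
    if mw_lookup is not None:
        candidate = mw_lookup(key)
        if candidate and is_valid_ipa(candidate):
            return candidate, "merriam-webster"
    # Stage 3: no IPA; plain text goes to Magpie's neural G2P at synthesis time.
    return None, "magpie_g2p"


overrides = {"cefazolin": "sɛfəˈzoʊlɪn"}
# Stand-in for a real MW lookup; IPA_FROM_MW is a placeholder, not real IPA.
fake_mw = {"tibial plateau": "IPA_FROM_MW"}.get
accept_all = lambda ipa: True  # stand-in for phoneme-set validation

print(resolve_ipa("cefazolin", overrides, fake_mw, accept_all))
print(resolve_ipa("tibial plateau", overrides, fake_mw, accept_all))
print(resolve_ipa("pembrolizumab", overrides, fake_mw, accept_all))
```

Injecting the lookup and validator keeps the precedence logic testable without touching either external service.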
2d. QA-mode synthesis (do not skip this gate)
Before running the full Cartesian product, synthesize one wav per term with: first voice, clean noise, default context. Audition each clip with the user.
For every term tagged `magpie_g2p`, propose an IPA candidate using clinical suffix patterns and validate against Magpie's en-US phoneme set before suggesting:
| Suffix | Stress pattern (example) |
|---|---|
|  | …ˈmaɪsɪn (vancomycin, gentamicin) |
|  | …ˈpreɪzoʊl (esomeprazole, omeprazole) |
|  | …ˈstætɪn (atorvastatin, rosuvastatin) |
|  | …ˈsɑːrtən (losartan, valsartan) |
|  | …ˈeɪzoʊl (fluconazole, ketoconazole) |
|  | …ˈsɪlɪn (amoxicillin, piperacillin) |
|  | …ˈpɛərɪn (enoxaparin, heparin) |
Phoneme-validation pattern: live-probe Magpie's en-US neural G2P with a candidate IPA. If Magpie accepts the SSML, the IPA is in its inventory. Use the suffix patterns above as a pre-filter (cheap heuristic) and the live probe to confirm before committing to an override. The `magpie_validates_ipa(ipa, api_key, voice_id)` recipe (a minimal NVCF gRPC synthesis call that returns `True` / `False`, fail-closed) is in `references/pronunciation-pipeline.md`.
Call it once per candidate IPA before showing it to the user. On user approval, append the verified IPA to `pronunciation_overrides.csv`. The row's `ipa_source` flips from `magpie_g2p` to `override` on the next manifest generation.
Human-in-the-loop (HITL) audition gate before Step 2e (fail-closed). Do not synthesize the full Cartesian product, do not promote any staged IPA candidate to `pronunciation_overrides.csv`, and do not advance to Stage 3 until one of the following has happened explicitly in conversation:
- The user confirms they have auditioned the QA clips and reports their verdict per clip (or per bucket: "the MW set sounds fine", "fix `pembrolizumab`", etc.). Provide the `afplay` (macOS) or `paplay` / `aplay` (Linux) commands so the user can play them, then halt and wait for their reply after listening. Paper-only approval via an AskUserQuestion prompt (clicking "Promote all" or "Lock in" without auditioning) does not satisfy this gate. Magpie-validating an IPA proves it's in the phoneme inventory; it does not prove it matches the intended pronunciation. Only the user's ears do that.
- The user explicitly opts to skip audition for this cycle, in deliberate language (e.g. "skip audition, accept the risk that mispronunciations may dilute the Stage 3 KER signal; log it as a cycle-N caveat"), not as a side-effect of a single click-through. Record the skip in a cycle-level note (e.g. `eval/cycle<N>/cycle_notes.md`) so a future operator can see the audition was deferred.
Magpie NVCF rate-limits aggressively on >100-row jobs, and a do-over costs both API credits and clock time — but the larger risk is shipping a manifest with mispronounced reference audio that quietly corrupts the Stage 3 KER signal. Time spent auditioning is cheaper than re-running the cycle.
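The promotion rule in this step (append a verified IPA only after both the phoneme probe and the user's audition) could be expressed as a fail-closed helper. This is a sketch under assumptions: `promote_override` and its two boolean flags are hypothetical names, the reviewer value is a placeholder, and the explicit skip-audition path is not modeled here.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path

# Column order taken from the pronunciation_overrides.csv example in Step 2c.
FIELDS = ["term", "ipa", "verified_by", "verified_at", "notes"]


def promote_override(csv_path, term, ipa, verified_by, *,
                     magpie_accepted: bool, user_auditioned: bool, notes: str = "") -> bool:
    """Append one override row; refuse (return False) unless both checks passed."""
    if not (magpie_accepted and user_auditioned):
        return False  # fail-closed: nothing is written
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    # Mode "a" keeps the file append-only across cycles.
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "term": term, "ipa": ipa, "verified_by": verified_by,
            "verified_at": date.today().isoformat(), "notes": notes,
        })
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "pronunciation_overrides.csv"
    blocked = promote_override(target, "cefazolin", "sɛfəˈzoʊlɪn", "reviewer",
                               magpie_accepted=True, user_auditioned=False)
    promoted = promote_override(target, "cefazolin", "sɛfəˈzoʊlɪn", "reviewer",
                                magpie_accepted=True, user_auditioned=True)
    with target.open(newline="", encoding="utf-8") as handle:
        saved_rows = list(csv.DictReader(handle))

print(blocked, promoted, saved_rows[0]["term"])
```

Keeping the two flags separate mirrors the distinction drawn above: Magpie acceptance only shows the IPA is in the inventory, while the audition is what confirms the pronunciation.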
2e. Full benchmark generation
After pronunciations are locked, generate the full Cartesian product `|terms| × |voices| × |noise_levels| × |context_types|`. Defaults: 2–4 Magpie en-US voices (Mia/Jason/Ray), noise levels `[clean, snr_15db, snr_5db]`, context types `[dictation, handoff, chart_note, history]`.
Self-contained synthesis, no `/read-aloud` required. The `synthesize_row(row, all_overrides, out_dir, api_key)` recipe (opens an NVCF gRPC stream, wraps overrides into SSML via `render_sentence_with_overrides`, writes 16-bit mono PCM to `<out_dir>/audio/<slug>.wav`) is in `references/pronunciation-pipeline.md` (§Synthesis call). Key invariant: `all_overrides` carries every entry from `pronunciation_overrides.csv` (including context-word overrides like `intravenously`) so the renderer wraps any override whose verbatim text appears in `row['text']`. Wrapping only `row['term']` silently drops context-word overrides.
Noise-injection (clean → `snr_15db` → `snr_5db`) and the manifest schema (NeMo canonical fields + clinical extension, plus pre-flight schema and audio-existence checks) all live in `references/manifest-schema.md`.
Warn when product > 100 rows. Magpie NVCF rate-limits with ~5–10% `RESOURCE_EXHAUSTED` drops on big runs. Re-run the dropped rows.
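The size of the Cartesian product and the 100-row warning can be worked out before any synthesis call. A minimal sketch, assuming the default vocabularies above; `plan_rows` is a hypothetical helper, and the voice strings are display names rather than confirmed `voice_id` values.

```python
from itertools import product


def plan_rows(terms, voices, noise_levels, context_types, warn_above=100):
    """Enumerate every (term, voice, noise, context) combination and flag large jobs."""
    rows = [
        {"term": t, "voice_id": v, "noise_level": n, "context_type": c}
        for t, v, n, c in product(terms, voices, noise_levels, context_types)
    ]
    warning = None
    if len(rows) > warn_above:
        warning = (f"{len(rows)} rows exceeds {warn_above}; expect some "
                   "RESOURCE_EXHAUSTED drops and plan a re-run pass")
    return rows, warning


rows, warning = plan_rows(
    terms=["cefazolin", "acetabular reamer", "tibial plateau",
           "femoroacetabular impingement", "hemoglobin a1c", "respiratory therapist"],
    voices=["Mia", "Jason"],  # display names; actual voice IDs are an assumption
    noise_levels=["clean", "snr_15db", "snr_5db"],
    context_types=["dictation", "handoff", "chart_note", "history"],
)
print(len(rows), warning)
```

With six terms, two voices, three noise levels, and four context types the plan comes to 6 × 2 × 3 × 4 = 144 rows, which is already past the warning threshold.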
Stage 2 completion checklist
Don't consider Stage 2 done until all five sub-steps have run. Agents commonly stop after 2a or 2b; the goal is a synthesized manifest plus a hand-off:
- 2a: `term_seed.csv`, 4–10 terms, `entity_category ∈ {drug, procedure, anatomy, condition, lab, role}`
- 2b: 3–5 `context_type` sentence variants per term
- 2c: every term tagged `ipa_source ∈ {override, merriam-webster, magpie_g2p}`
- 2d: QA wavs auditioned, IPA overrides locked with explicit user approval
- 2e: `manifest.jsonl` + per-row audio for the Cartesian product
- Hand-off: name `/digital-health-clinical-asr-eval` as the next skill and KER as its headline metric

Writes go only into the user-chosen `$EVAL_DIR/cycle<N>/`. Don't write elsewhere, modify env, or install packages; those belong to `/digital-health-clinical-asr-setup`.
Examples
Scenario A: fresh oncology benchmark. User: "We're seeing chemo drug names mistranscribed. Where do I start?" → Step 2a: confirm specialty is oncology, ask about which drugs (immunotherapy biologics, platinum agents, taxanes). Propose ~10 candidates: `cisplatin`, `paclitaxel`, `pembrolizumab`, `nivolumab`, `carboplatin`, `docetaxel`, `bevacizumab`, `trastuzumab`, `cetuximab`, `pemetrexed`. Write `term_seed.csv` with all `entity_category=drug`. Step 2b: brief `/data-designer` for 4 context variants each = 40 sentences. Step 2c: MW lookup for each; biologics like `pembrolizumab` will likely fall to `magpie_g2p`; platinum agents likely hit MW. Step 2d: synthesize one QA wav per term, walk the user through the `pembrolizumab` etc. clips, propose IPA candidates with `-mab` suffix stress patterns. Step 2e: on approval, run 10 terms × 2 voices × 2 noise levels × 3 contexts = 120 rows.
Scenario B: appending to an existing cycle. User: "I have a cycle-1 manifest and I want to add 5 more procedures." → Re-run only Steps 2a (specialty interview just for the new terms), 2b (sentence gen for the additions), 2c (IPA pipeline for the additions), 2d (audition the new terms), and 2e (synthesize only the new term rows). Append to the existing `manifest.jsonl`. Do not regenerate audio for existing terms; cycle isolation is intentional so leaderboards diff cycle N vs cycle N+1 cleanly.
Artifacts produced
- `term_seed.csv`: curated terms with `entity_category`
- `pronunciation_overrides.csv`: verified IPA, appendable across cycles
- `manifest.jsonl`: NeMo format with clinical extension fields (one JSON object per line)
- `audio/<slug>.wav`: synthesized clips, one per manifest row
Troubleshooting
- TTS rate-limit drops (`RESOURCE_EXHAUSTED`) on >100-row generation → expected on Magpie NVCF. Confirm exponential backoff is active in `/read-aloud`; expect ~5–10% drops on big runs and re-run for the gaps.
- All rows tagged `ipa_source` = `magpie_g2p` → MW lookup is failing across the board, or candidate IPAs are failing phoneme validation. Re-verify whichever MW path you configured (`DICTIONARY_API_KEY` for A; HTTPS reachability + parser for B), then check candidates against Magpie's en-US phoneme inventory.
- Magpie mispronounces a term even with the IPA override → first verify the IPA is in the Magpie en-US phoneme inventory and the SSML wrapping is syntactically valid. If both check out, the underlying TTS bug is owned by `/read-aloud` (`/riva-tts`); route there for diagnosis. This skill provides the override mechanism but does not own the neural G2P or SSML parser.
- Sentence variants from `/data-designer` are bland / template-like → check the brief; the schema-only prompt sometimes produces stereotyped output. Add 1–2 in-context examples to the brief and re-run.
- Audio files exist but `manifest.jsonl` is short → the manifest writer skipped rows whose synthesis returned an NVCF error. Re-run the build with only the missing rows.

For anything not in this list, identify which upstream skill is implicated and route there. The `digital-health-clinical-asr-build` skill owns the methodology, not the TTS or DataDesigner internals.
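For the rate-limit case in the first bullet, the general shape of an exponential-backoff retry is sketched below. This is not the `/read-aloud` implementation: `RateLimited` is a stand-in exception for a `RESOURCE_EXHAUSTED` status, and `synthesize_with_backoff` plus the `flaky` demo function are hypothetical.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for a RESOURCE_EXHAUSTED response; hypothetical, not a real gRPC type."""


def synthesize_with_backoff(synthesize, row, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call synthesize(row), doubling the wait after each rate-limit; None means dropped."""
    for attempt in range(max_attempts):
        try:
            return synthesize(row)
        except RateLimited:
            if attempt == max_attempts - 1:
                break  # out of attempts: report the row as dropped
            # Exponential delay plus jitter so retries do not arrive in lockstep.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return None


calls = {"n": 0}


def flaky(row):
    """Demo synthesizer that is rate-limited twice and then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return f"{row['term']}.wav"


def always_limited(row):
    raise RateLimited()


no_wait = lambda seconds: None  # skip real sleeping in the demo
result = synthesize_with_backoff(flaky, {"term": "cefazolin"}, sleep=no_wait)
gave_up = synthesize_with_backoff(always_limited, {"term": "cefazolin"}, max_attempts=2, sleep=no_wait)
print(result, gave_up)
```

Rows that come back as `None` are the gaps to collect and re-run in a second pass.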
Limitations
- English-only by default. Magpie's en-US phoneme inventory is what the two-tier IPA pipeline validates against. Other locales need a different upstream phoneme set + override CSV format.
- Six fixed entity categories. Extending `entity_category` is a deliberate methodology change, not a one-off tweak: KER breakdowns, leaderboard sections, and downstream finetune scripts all key off the vocabulary.
- Tiny first cycles. Below ~20 terms, the by-`ipa_source` leaderboard split won't have enough rows in each bucket to be statistically meaningful. Build a meaningful cycle even if it costs a session.
- Magpie NVCF rate-limits. ~5–10% drops on large jobs; budget a re-run pass.
Next steps
- Forward: `/digital-health-clinical-asr-eval`, which transcribes the manifest, scores WER/CER/KER/SER, and produces the five-section leaderboard.
- Back to setup (if anything in the env is broken): `/digital-health-clinical-asr-setup`.
- Lateral for TTS-specific debugging: `/read-aloud` or `/riva-tts`.
References
- `references/manifest-schema.md`: NeMo canonical fields + clinical extension; pre-flight schema and audio-existence checks; cross-cycle stability rules