wjs-dubbing-video

Video + target-language SRT → `*_<lang>_dub.mp4` with a time-aligned TTS voice. This skill stops at the dub track. Burn-in and audio-bed mixing belong to the next skill (`/wjs-burning-subtitles/render.py` composites everything in one final encode).

When to use

  • User has a target-language SRT (e.g., `entrevista.zh-CN.srt`) and wants the video to speak that language.
  • User says "中文配音 / 配音 / 帮我做配音 / dub it / voice over".
  • User has multiple speakers on camera and wants different voices per speaker.

When NOT to use

  • No SRT yet → run `/wjs-transcribing-audio`, then `/wjs-translating-subtitles`, first.
  • Source-language-only TTS (rare; usually you translate first) → still use this skill, but pass the source SRT.
  • Burn-in only, no audio change → skip to `/wjs-burning-subtitles`.

Number of speakers — default to one

Default: assume one speaker. Use a single voice for the entire dub. This is the right answer for monologues, vlogs, recorded talks, narrator-only clips, and the overwhelming majority of videos people ask about. Don't run diarization, don't tag the SRT with `[A]`/`[B]`, don't bring up multi-speaker complexity.
Switch to multi-speaker only when the user explicitly says so — phrasings like "two people", "interview", "dialogue", "conversation between", "separate the speakers", "different voice for each", or a direct request to do diarization. When triggered, follow the "Multi-speaker dubbing" section below.
If you're unsure whether a video is one speaker or many, ship the single-voice version first. Adding speaker separation later is cheap (just regenerate the dub); shipping confused multi-speaker output by default wastes the user's time.

Engine routing — by voice ID

`scripts/dub.py` auto-routes by voice-ID prefix:

| Voice ID pattern | Engine | Auth |
| --- | --- | --- |
| `zh_..._bigtts` | Volcano (字节跳动豆包) TTS | `VOLC_TTS_APPID` + `VOLC_TTS_ACCESS_TOKEN` |
| `zh-CN-...Neural` / `en-US-...Neural` / etc. | edge-tts (Microsoft Edge neural) | none (free) |

For Mandarin, Volcano is markedly more natural than edge-tts, especially for emotional/contemplative content. Use edge-tts when Volcano credentials aren't available or as a debugging fallback.
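
The routing rule can be pictured as a small lookup on the voice ID. This is a hypothetical sketch, not the code in `dub.py`: the name `pick_engine` and the exact matching rules (a `_bigtts` suffix for Volcano, a `Neural` suffix for edge-tts) are assumptions inferred from the patterns listed here.

```python
def pick_engine(voice_id: str) -> str:
    """Return "volcano" or "edge-tts" for a voice ID (hypothetical helper)."""
    if voice_id.endswith("_bigtts"):
        return "volcano"    # e.g. zh_female_gaolengyujie_moon_bigtts
    if voice_id.endswith("Neural"):
        return "edge-tts"   # e.g. zh-CN-XiaoxiaoNeural, en-US-AvaMultilingualNeural
    raise ValueError(f"unrecognized voice ID: {voice_id!r}")
```

Raising on an unknown ID, rather than silently falling back to one engine, keeps a typo in the voice name from producing a dub in the wrong voice.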

Volcano TTS (Chinese only)

Endpoint: `https://openspeech.bytedance.com/api/v3/tts/unidirectional` (used for both TTS 1.0 and 2.0; the Resource-Id header picks the backend).

Headers:

```
X-Api-App-Id:       (env: VOLC_TTS_APPID)         # 10-digit speech App ID
X-Api-Access-Key:   (env: VOLC_TTS_ACCESS_TOKEN)  # 32-char token from speech console
X-Api-Resource-Id:  volc.service_type.10029       # see resource ID note below
Content-Type:       application/json
```

Loading credentials: most users keep them in `~/code/.env`. Read them at the top of any session via:

```bash
set -a; source ~/code/.env; set +a
```

Resource ID — important quirk

The doc lists `seed-tts-2.0` as the "TTS 2.0 (recommended)" resource, but a typical TTS-SeedTTS2.0 console instance does not include the popular `*_bigtts` speaker catalog (爽快斯斯, 高冷御姐, 开朗姐姐, etc.). Trying those speakers against `seed-tts-2.0` returns `200 code=55000000 "resource ID is mismatched with speaker related resource"`. The fix is to use `volc.service_type.10029` (the TTS 1.0 V3 endpoint): the audio quality of the bigtts speakers is identical, and they all work against this resource. The bundled `dub.py` defaults to `volc.service_type.10029`; override it with the `VOLC_TTS_RESOURCE` env var if you have a different instance.
Other 401/403 errors:
  • `401 code=45000010 "load grant: requested grant not found in SaaS storage"`: the App ID + key combo is valid against the gateway, but the user has not activated this resource. They must go to 火山引擎 → 语音技术 → 语音合成大模型 → 实例管理 and 开通 (activate) the service. No workaround.
  • `403 code=45000030`: the speaker isn't included in the user's instance bundle.
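
Putting the header list and the resource-ID default together, the request headers could be assembled roughly as follows. This is a minimal sketch: the function name `volcano_headers` is hypothetical, and only the header names, env var names, and default resource ID come from the text above.

```python
import os

def volcano_headers(env=os.environ) -> dict:
    """Build Volcano TTS request headers from environment variables (sketch)."""
    return {
        "X-Api-App-Id": env["VOLC_TTS_APPID"],
        "X-Api-Access-Key": env["VOLC_TTS_ACCESS_TOKEN"],
        # Fall back to the TTS 1.0 V3 resource unless the user overrides it.
        "X-Api-Resource-Id": env.get("VOLC_TTS_RESOURCE", "volc.service_type.10029"),
        "Content-Type": "application/json",
    }
```

A missing `VOLC_TTS_APPID` or `VOLC_TTS_ACCESS_TOKEN` raises `KeyError` immediately, which is usually easier to diagnose than an authentication error returned by the gateway.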

Response format

Despite the doc's casual language, the response is streaming NDJSON, not a single JSON object and not raw audio bytes. Each line is a separate JSON event with a base64-encoded MP3 chunk in `data`. The terminal event has `code: 20000000` (which means OK in this API's success codes, unlike the more common `code: 0`). Concatenate the decoded chunks for the full MP3.

```python
import base64, json, requests

# `url`, `h` (the request headers), and `payload` are built by the caller.
audio = b""
r = requests.post(url, headers=h, json=payload, timeout=60, stream=True)
for line in r.iter_lines():
    if not line:
        continue  # skip blank lines between events
    evt = json.loads(line)
    # Tolerate 0 or a missing code; 20000000 is the terminal event's success code.
    if evt.get("code") not in (0, None, 20000000):
        raise RuntimeError(f"code={evt.get('code')} {evt.get('message')}")
    if evt.get("data"):
        audio += base64.b64decode(evt["data"])  # append this MP3 chunk
```

Speaker catalog (verified working under `volc.service_type.10029`)

Full list at volcengine.com/docs/6561/1257544, but availability depends on your instance bundle. Confirmed-working female voices for the typical SeedTTS-2.0 starter instance:

| Speaker ID | Chinese name | Feel |
| --- | --- | --- |
| `zh_female_gaolengyujie_moon_bigtts` | 高冷御姐 | Best for contemplative/spiritual content. Mature, restrained, calm. |
| `zh_female_kailangjiejie_moon_bigtts` | 开朗姐姐 | Warm older-sister storytelling. |
| `zh_female_shuangkuaisisi_moon_bigtts` | 爽快斯斯 | Versatile, conversational baseline. |
| `zh_female_linjianvhai_moon_bigtts` | 邻家女孩 | Casual, lifestyle-vlog. |
| `zh_female_yuanqinvyou_moon_bigtts` | 元气女友 | Lively, upbeat. |
| `zh_female_meilinvyou_moon_bigtts` | 美丽女友 | Soft, intimate. |
| `zh_female_shuangkuaisisi_emo_v2_mars_bigtts` | 斯斯情感版 | Full emotional range; pair with explicit emotion + scale. |

These voices return 55000000 against the typical instance even though the doc lists them: `vv_uranus_bigtts`, `wenroushunv_moon_bigtts`, `qingxin_moon_bigtts`, `yingmaoxiaoyuan_moon_bigtts`, `tianxinxiaoling_moon_bigtts`, `shaoergushi_moon_bigtts`. Don't promise them without testing.

Audio params

`speech_rate` is Volcano's native scale [-50, +100] where the value is a percentage delta (so `-8` means 8% slower). The script passes `--rate -8%` through as `-8`.
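
The `--rate` to `speech_rate` conversion can be sketched in a few lines. The helper name `to_speech_rate` is hypothetical, and clamping out-of-range values to [-50, +100] is an assumption; the source only states the range and the `-8%` → `-8` mapping.

```python
def to_speech_rate(rate: str) -> int:
    """Convert an edge-tts style rate such as "-8%" to Volcano's integer scale."""
    value = int(rate.strip().rstrip("%"))  # "-8%" -> -8, "+10%" -> 10
    # Assumption: clamp to the documented range instead of rejecting the value.
    return max(-50, min(100, value))
```
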
Useful emotion presets:
  • `emotion="calm"`, `emotion_scale=4`: contemplative, default for this skill's spiritual-content niche.
  • `emotion="gentle"`: softer / more intimate.
  • `emotion="neutral"`: flat / informational.
  • `emotion="sad"`: melancholic. Use sparingly.
Override `dub.py` defaults with the `VOLC_TTS_EMOTION` and `VOLC_TTS_EMOTION_SCALE` env vars without editing code.
No English Volcano voices are wired up in this skill; for English use edge-tts (next section). Volcano does have English speakers (`en_male_*_bigtts`, `en_female_*_bigtts`) but they aren't typically included in TTS-SeedTTS-2.0 starter instances. Add them by extending the voice routing in `dub.py` once verified.

edge-tts (Microsoft Edge neural TTS)

Free, no API key, high-quality but less expressive than Volcano. Install into a project venv; do not call it via `uvx` once per segment. Each `uvx` invocation spawns a fresh Python process and the Bing endpoint will rate-limit or RST the connection after a handful of rapid hits, breaking mid-render.

```bash
uv venv .venv
uv pip install --python .venv/bin/python edge-tts
```

Then drive it from a single long-lived Python process using `edge_tts.Communicate(...)` directly, with retry-on-failure logic. The bundled `scripts/dub.py` does this.
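
A retry wrapper around `edge_tts.Communicate` might look like the sketch below. It is not the code in `dub.py`: the function name `synth_with_retry`, the attempt count, the exponential backoff, and the injectable `synth` parameter (added so the retry logic can be exercised without network access) are all assumptions.

```python
import asyncio

async def synth_with_retry(text, voice, out_path, rate="+0%", pitch="+0Hz",
                           attempts=4, base_delay=1.0, synth=None):
    """Synthesize one cue, retrying with exponential backoff on failure.

    Returns the number of attempts that were needed.
    """
    if synth is None:
        import edge_tts  # imported lazily so the helper can be tested offline

        async def synth(text, voice, out_path, rate, pitch):
            await edge_tts.Communicate(text, voice, rate=rate, pitch=pitch).save(out_path)

    for attempt in range(1, attempts + 1):
        try:
            await synth(text, voice, out_path, rate, pitch)
            return attempt
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

Calling `asyncio.run(...)` once around a loop over all cues, rather than once per cue, keeps everything in the single long-lived process the paragraph above asks for.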

Voice selection — match the original speaker

There is no perfect cross-language match — choose gender, age feel, and tone deliberately, then bend with rate/pitch.

Chinese voices (Volcano preferred, edge-tts fallback)

Volcano's `zh_female_gaolengyujie_moon_bigtts` (高冷御姐, calm, `speech_rate=-8`) is the validated baseline for mature contemplative female speakers, equivalent to or better than any edge-tts option for that profile. See the Volcano speaker table above for the rest.

edge-tts catalog (Chinese):

| Voice | Gender | Default feel |
| --- | --- | --- |
| `zh-CN-XiaoxiaoNeural` | F | Warm, news/novel |
| `zh-CN-XiaoyiNeural` | F | Lively, young |
| `zh-CN-YunjianNeural` | M | Passionate, sports |
| `zh-CN-YunxiNeural` | M | Sunshine, lively |
| `zh-CN-YunyangNeural` | M | Professional newsreader |
| `zh-HK-HiuMaanNeural` | F | Friendly, slightly mature |
| `zh-TW-HsiaoChenNeural` | F | Friendly |

English voices (edge-tts neural, all multilingual)

All voices below speak fluent American/British/Australian English; the `*Multilingual*` ones also handle Spanish names, French/Italian loanwords, etc. without mispronunciation.

| Voice | Gender | Default feel |
| --- | --- | --- |
| `en-US-AvaMultilingualNeural` | F | Best for warm/mature/caring; natural for spiritual or coaching content |
| `en-US-EmmaMultilingualNeural` | F | Cheerful, conversational, younger |
| `en-US-AndrewMultilingualNeural` | M | Warm, confident, sincere |
| `en-US-BrianMultilingualNeural` | M | Approachable, casual |
| `en-US-AriaNeural` | F | Crisp newsreader |
| `en-US-GuyNeural` | M | Steady male newsreader |
| `en-GB-SoniaNeural` | F | British female (RP) |
| `en-GB-RyanNeural` | M | British male (RP) |
| `en-AU-WilliamMultilingualNeural` | M | Australian male |
| `fr-FR-VivienneMultilingualNeural` | F | Mature European female who also reads English |

For matching a mature contemplative Spanish female (this skill's canonical use case), start with `en-US-AvaMultilingualNeural` at `--rate -5% --pitch -3Hz`. Do not use the news-style `Aria` or `Guy` for spiritual content; they sound clinical.

Picking heuristics

  • Mature contemplative female speaker (yoga/spirituality/coaching): `zh-CN-XiaoxiaoNeural` with `--rate=-8% --pitch=-10Hz` (or Volcano `gaolengyujie`).
  • Mature professional male: `zh-CN-YunyangNeural` with `--rate=-5%`. Avoid Yunjian/Yunxi (too energetic).
  • Young casual speaker: Defaults; no pitch shift.
  • Western-mouth feel: one of the `*MultilingualNeural` voices.

Always sample before committing

🛑 Checkpoint: sample before full dub. A full-video dub is the most expensive step (TTS API calls + atempo + ffmpeg mux). Before running `dub.py` over the whole SRT:
  1. Pick the longest-text cue (worst stretch case) and one short/casual cue (timbre check).
  2. Synthesize 3–4 voice/rate/pitch combos at 3–8s each.
  3. Show the user the audio panel and ask: "选哪个 voice?rate/pitch 要调吗?确认后我再跑全片。" ("Which voice? Should rate/pitch be adjusted? I'll run the full video once you confirm.") Wait for an explicit pick.
Skip the checkpoint only if the user named a specific voice up front AND has already heard a sample of that voice on this video.
`scripts/sample_voices.py` (if present) is a thin wrapper for exactly this; otherwise drive the same Python loop the dub script uses.
Mandatory smoke test before promising any Volcano voice on a new account: synth one ~5-word cue with that speaker ID first; only quote it to the user if the smoke test returns a non-empty MP3. If the smoke test 401s with `code=45000010` ("grant not found"), tell the user they need to 开通 (activate) the resource in the 火山引擎 console. Do not pretend it'll work after a retry.

Running dub.py

```bash
.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/dub.py [voice] [rate] [pitch]
```

Mature Chinese contemplative female (Volcano):

```bash
.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/dub.py zh_female_gaolengyujie_moon_bigtts -8% +0Hz
```

Warm English caring female (edge-tts, multilingual):

```bash
.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/dub.py en-US-AvaMultilingualNeural -5% -3Hz
```

Default Chinese fallback (no Volcano creds needed):

```bash
.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/dub.py zh-CN-XiaoxiaoNeural -8% -10Hz
```

The script:

1. Reads the SRT (auto-detects `*.zh-CN.srt`, `*.en.srt`, etc., or pass `--srt`).
2. Synthesizes one MP3 per cue under `dub_work/seg_NN.mp3`.
3. Probes each clip's actual duration with `ffprobe`.
4. For each cue: if TTS is longer than the SRT slot, chains `atempo` filters to speed it up; if shorter, pads with silence after.
5. Inserts silence segments for SRT gaps and any trailing tail so the output audio length exactly matches the source video.
6. Muxes the new audio into `*_zh_dub.mp4` / `*_en_dub.mp4` keeping the original video stream by `-c:v copy`.

Output: `<source-stem>_<lang>_dub.mp4` (e.g., `entrevista_zh_dub.mp4`). This is the input for the next step — `/wjs-burning-subtitles/render.py` — which composites the final video.
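
The speed-up in step 4 (chaining `atempo` when the TTS clip overruns its slot) can be sketched as a filter-string builder. The helper name `atempo_chain` is hypothetical, and the per-instance limit of 0.5 to 2.0 is a conservative assumption that matches older ffmpeg builds; how `dub.py` actually builds its chain is not shown in this document.

```python
def atempo_chain(factor: float) -> str:
    """Build an ffmpeg audio filter string that changes tempo by `factor`.

    factor > 1 speeds the clip up (e.g. tts_dur / slot_dur when TTS overruns).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    parts = []
    while factor > 2.0:   # split large speed-ups into 2.0x stages
        parts.append("atempo=2.0")
        factor /= 2.0
    while factor < 0.5:   # split large slow-downs into 0.5x stages
        parts.append("atempo=0.5")
        factor /= 0.5
    parts.append(f"atempo={factor:.4f}")
    return ",".join(parts)
```

The resulting string would be passed to ffmpeg as the value of `-filter:a`, for example `atempo_chain(6.2 / 5.0)` for a 6.2 s clip that must fit a 5.0 s slot.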

Filling awkward silences

Mandarin takes 60–80% of the time Spanish does to say the same thing. With strict cue-by-cue timing, that leaves awkward 2–4s silences at the end of most cues. English is closer to ~85% of Spanish. Three levers, in increasing impact:
  1. Slow the native TTS rate. Changing `--rate` from `+0%` to somewhere between `-12%` and `-15%` produces clean, natural-sounding slower speech (much better than time-stretching afterward). Try `-12%` first; use `-15%`/`-20%` for very contemplative content.
  2. Mild slow-stretch per cue. When a cue's TTS is still shorter than its slot, run `atempo` between 0.82× and 0.95×. `dub.py` does this automatically: when slack > 0.5s, it sets `atempo = max(0.82, tts_dur / target_dur)` and pads the remainder. Below 0.82× the voice starts sounding drugged; above 0.92× the stretch is essentially imperceptible.
  3. Expand the target-language text in the worst cues. When the slot is so long that even 0.82× stretch leaves >2s of silence, the cleanest fix is to lengthen the translation. Add natural Mandarin particles ("嗯,", "其实", "也就是说", "你知道") or unpack a compressed phrase into its full meaning. This changes the on-screen subtitle, so confirm with the user before doing it. Edit the SRT, then regenerate just those segments by deleting their `dub_work/seg_NN.mp3` and re-running `dub.py`.
Combine the levers: native rate `-12%` + stretch-to-fit handles ~80% of cases. Reserve text expansion for the 2–3 worst outliers.
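
The slow-stretch rule from lever 2 can be worked through with a small function. The name `plan_slow_stretch` is hypothetical; the formula `max(0.82, tts_dur / target_dur)` and the 0.5 s slack threshold come from the text, while padding without stretching when slack is at or below the threshold is an assumption about what the script does in that case.

```python
def plan_slow_stretch(tts_dur: float, target_dur: float,
                      min_tempo: float = 0.82, slack_threshold: float = 0.5):
    """Return (atempo, pad_seconds) for a cue whose TTS is shorter than its slot."""
    slack = target_dur - tts_dur
    if slack <= slack_threshold:
        # Small gap: leave the speech untouched and only pad with silence.
        return 1.0, max(0.0, slack)
    tempo = max(min_tempo, tts_dur / target_dur)
    stretched = tts_dur / tempo  # atempo below 1.0 lengthens the clip
    return tempo, max(0.0, target_dur - stretched)
```

For a 4.0 s clip in a 5.0 s slot the ratio is 0.80, so the tempo is floored at 0.82, the clip stretches to about 4.88 s, and roughly 0.12 s of silence is padded.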

Multi-speaker dubbing (opt-in)

Only invoke this section when the user explicitly says the source has multiple speakers ("interview", "two people", "dialogue", "separate the speakers", "different voice for each", or a direct request to do diarization).
When triggered, generate the dub with a different voice per speaker so the listener can follow who's speaking. Two paths:

Path 1 (recommended for on-camera speakers): visual diarization

`scripts/visual_diarize.py` watches mouth movement per face per frame and tags each cue with the dominant speaker. Self-contained, no API keys, no audio fingerprinting.

```bash
uv pip install --python .venv/bin/python mediapipe opencv-python

.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/visual_diarize.py \
    --video input.mp4 --srt input.en.srt \
    --out input.en.diarized.srt \
    --report diarization_report.json \
    --sample-fps 5 --num-speakers 2
```

How it works:
  1. Samples N frames per second (default 5).
  2. Runs MediaPipe FaceLandmarker (Tasks API) for up to `--num-speakers` faces per frame, 478 landmarks each.
  3. Measures mouth aperture per face as the vertical distance between inner upper lip (idx 13) and inner lower lip (idx 14).
  4. Bins faces by horizontal screen position (x-quantiles) → speakers `A`, `B`, ... left-to-right.
  5. For every cue's [start, end] window, integrates per-speaker frame-to-frame mouth-aperture change. Highest mover wins the tag.
  6. Writes a `[A]`/`[B]`-prefixed SRT plus a JSON report with per-cue scores and a confidence ratio (winner / runner-up).
On first run, downloads the FaceLandmarker model (~3.6 MB) to `/tmp/mp_models/face_landmarker.task`.
Visual is materially better than guessing from text. In one validation, manual text-based labels split 6/50 between speakers; visual diarization showed the actual split was 29/27. Text-based guessing was wildly wrong because both people take similar-shaped turns. Always prefer visual when the speakers are on camera.
Spot-check low-confidence cues. Any cue in the JSON report with `confidence_ratio < 1.5` is borderline, usually overlapping speech or one speaker briefly off-frame. Hand-correct before dubbing.
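
Steps 5 and 6 (scoring each speaker by mouth movement and deriving a confidence ratio) can be sketched in plain Python. This is an illustration only: the function name `tag_cue` and the input layout (a dict mapping each speaker label to time-sorted `(time_sec, aperture)` samples) are assumptions, not the actual data structures in `visual_diarize.py`.

```python
def tag_cue(samples, start, end):
    """Pick the dominant speaker in [start, end] from mouth-aperture samples.

    `samples` maps a speaker label to a time-sorted list of (time_sec, aperture).
    Returns (winner, confidence_ratio).
    """
    scores = {}
    for speaker, series in samples.items():
        window = [a for t, a in series if start <= t <= end]
        # Sum of absolute frame-to-frame changes: a moving mouth scores high,
        # a mouth that is merely held open scores near zero.
        scores[speaker] = sum(abs(b - a) for a, b in zip(window, window[1:]))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    winner, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    ratio = best / runner_up if runner_up > 0 else float("inf")
    return winner, ratio
```

Scoring the change in aperture rather than the aperture itself is what separates a person who is talking from one who is smiling or sitting with their mouth slightly open.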

Path 2 (fallback): manual tagging

For very short clips (1–2 minutes), or when speakers are off-camera, or when visual diarization fails, tag each cue by hand with a speaker prefix:

```text
1
00:00:00,000 --> 00:00:03,400
[A] So what about that AI rewrite thing?

2
00:00:03,400 --> 00:00:08,200
[B] Right — let me explain the workflow.
```

Save as `*.tagged.srt`. Keep the clean SRT (without tags) for downstream burn-in via `/wjs-burning-subtitles`.

Routing voices in dub.py

Pass `--voice-map` with `speaker=voice` pairs. The positional voice arg is the default for cues with no tag.

```bash
.venv/bin/python ~/.claude/skills/wjs-dubbing-video/scripts/dub.py \
    en-US-AndrewMultilingualNeural -3% +0Hz \
    --srt input.en.tagged.srt \
    --voice-map "A=en-US-BrianMultilingualNeural,B=en-US-AndrewMultilingualNeural"
```

Voice-pairing tips:
  • Two of the same gender: pick voices with audibly different timbre. Brian (casual) + Andrew (warm) works for two American males. Ava (warm female) + Emma (cheerful female) for two females.
  • Mixed gender: Ava + Andrew is a clean default.
  • Accent contrast: pair `en-US-` and `en-GB-` voices for distinctness.
  • Chinese: mix Volcano voices like `zh_female_gaolengyujie_moon_bigtts` (mature) + `zh_female_kailangjiejie_moon_bigtts` (warm sister).
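
How a `--voice-map` string and the `[A]`/`[B]` tags might combine per cue is sketched below. The names `parse_voice_map`, `voice_for_cue`, and `TAG_RE` are hypothetical, and stripping the tag from the text before synthesis is an assumption about `dub.py`'s behavior.

```python
import re

# Matches a leading speaker tag such as "[A] " or "[B]".
TAG_RE = re.compile(r"^\[([A-Za-z0-9]+)\]\s*")

def parse_voice_map(spec: str) -> dict:
    """Parse "A=voice1,B=voice2" into {"A": "voice1", "B": "voice2"}."""
    pairs = (item.split("=", 1) for item in spec.split(",") if item.strip())
    return {speaker.strip(): voice.strip() for speaker, voice in pairs}

def voice_for_cue(text: str, voice_map: dict, default_voice: str):
    """Return (voice, text_without_tag) for one SRT cue line."""
    match = TAG_RE.match(text)
    if not match:
        return default_voice, text  # untagged cue: use the positional voice
    speaker = match.group(1)
    return voice_map.get(speaker, default_voice), text[match.end():]
```

A tag with no entry in the map falls back to the default voice here, so a stray `[C]` does not abort a long render.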

Limits

Visual diarization fails when:
  • A speaker is consistently off-camera while talking.
  • Camera cuts or zooms make face position unstable across cues.
  • Three or more speakers sit at similar horizontal positions (x-quantile binning is too coarse — switch to k-means on (x, y) or use audio-based diarization instead).
For audio-only material (podcasts, voice-overs), fall back to `pyannote.audio` or `whisperx --diarize`. This skill does not yet bundle audio-based diarization.

Output

  • `<source-stem>_<lang>_dub.mp4`: video stream-copied from source, audio replaced with the time-aligned dub track. Drop-in input for `/wjs-burning-subtitles/render.py`.
  • `dub_work/seg_NN.mp3`: per-cue TTS clips (kept for resume / per-cue regen).

Downstream

  • `/wjs-burning-subtitles`: to mix the original audio as a low-volume bed, burn the SRT, or both. The final encode happens there in one ffmpeg pass (no cascade). Pass `--video <source.mp4> --dub <source_lang_dub.mp4> [--srt <srt>]` to its `render.py`.
  • The dub-only file (`*_<lang>_dub.mp4`) is technically a finished video and can ship as-is, but it sounds dubbed (because it is). Mixing the original underneath gives the "professional translation" feel; do that in `/wjs-burning-subtitles`.

Anti-patterns

  • Calling `uvx edge-tts` once per cue. Spawns a Python process each time; the Bing endpoint rate-limits or RSTs mid-render. Use the persistent library path in `dub.py`.
  • Trusting `audio_source` without listening. Always sample a 30 s clip before committing.
  • Stretching below 0.82× atempo. Voice starts sounding drugged. Add silence padding or expand text instead.
  • Tagging single-speaker SRTs with `[A]`. Wastes time and the dub sounds the same. Default to one voice.
  • Promising a Volcano voice without smoke-testing it on the user's instance. The doc lists many voices that error with `code=55000000` against typical SeedTTS-2.0 starter bundles. Always synth a 5-word smoke test before quoting.
  • Parsing Volcano response as one JSON document. It's streaming NDJSON; the success terminator is `code=20000000`, not `code=0`. Concatenate base64-decoded `data` chunks for the full MP3.