heygen-translate
HeyGen Video Translation
Translate and dub an existing video into 175+ languages. The system clones the
presenter's voice into the target language, re-syncs their lips to the new
audio, and returns a fully dubbed video. You provide a source video and a target
language — the engine handles transcription, translation, voice cloning,
lip-sync, and (optionally) burned-in captions.
This is not new-video generation. The presenter, performance, framing, and
brand assets in the original video are preserved. Translation rides on top of
what's already there.
User-Facing Behavior
- Be concise. Don't dump translation IDs, raw API payloads, or status JSON in chat. Report the result (video link, language) not the plumbing.
- No internal jargon. Don't say "polling", "video_translation_id", "asset_id", "v3 endpoint". Say "translating", "almost done", "your file".
- Polling is silent. Background long-running translations and only speak on (a) result delivered, (b) >5 min stall (one update), (c) hard failure.
- One result, one message. When the video is done, send the link plus a one-line summary (target language, duration, mode). Not every API field.
- Don't narrate transport choice. MCP vs CLI vs OpenClaw plugin is internal. Pick silently at session start; never mention which is in use.
- Communicate in the user's language. Detect from their first message. Replies, confirmations, and questions in their language. Technical CLI/API directives stay in English.
API Mode Detection
Pick one transport at session start. Never mix, never switch mid-session, never narrate the choice.
Detect in this order:
- OpenClaw plugin mode: If running inside OpenClaw and the `video_generate` tool exposes a HeyGen translation model, prefer that. Currently the plugin generates videos but does not expose translation directly, so fall through to the next tier until HeyGen ships translation through `video_generate`.
- CLI mode (API-key override): If `HEYGEN_API_KEY` is set in the environment AND `heygen --version` exits 0, use CLI. API-key presence is an explicit signal that the user wants direct API access.
- MCP mode: No `HEYGEN_API_KEY` AND HeyGen MCP tools are visible (`mcp__heygen__*`). OAuth auth, runs against the user's plan credits.
- CLI mode (fallback): MCP tools not available AND `heygen --version` exits 0. Auth via `heygen auth login`.
- Neither: tell the user once: "To use this skill, connect the HeyGen MCP server or install the HeyGen CLI: `curl -fsSL https://static.heygen.ai/cli/install.sh | bash` then `heygen auth login`."
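The tier order above can be sketched as a small shell function. This is only an illustration: `pick_transport` is a hypothetical helper, the OpenClaw tier is left out because the plugin does not expose translation yet, and MCP tool visibility is passed in as an argument on the assumption that only the agent harness can see it.

```bash
# pick_transport KEY_SET CLI_OK MCP_VISIBLE   (each "yes" or "no")
# Hypothetical helper: prints cli, mcp, or none, following the tier order above.
pick_transport() {
  if [ "$1" = yes ] && [ "$2" = yes ]; then
    echo cli        # API-key override
  elif [ "$1" = no ] && [ "$3" = yes ]; then
    echo mcp        # OAuth, runs against plan credits
  elif [ "$3" = no ] && [ "$2" = yes ]; then
    echo cli        # fallback, auth via `heygen auth login`
  else
    echo none       # tell the user once how to connect MCP or install the CLI
  fi
}

# The two signals a shell can observe directly.
key_set=no
if [ -n "${HEYGEN_API_KEY:-}" ]; then key_set=yes; fi
cli_ok=no
if heygen --version >/dev/null 2>&1; then cli_ok=yes; fi

# MCP visibility is known only to the harness; "no" is a placeholder here.
transport=$(pick_transport "$key_set" "$cli_ok" no)
```

Keeping the decision in a pure function makes the "pick once, never switch" rule easy to honor: compute `transport` at session start and reuse the value.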
Auth verification (run before any API call)
After mode detection, verify auth actually works before entering Phase 1. This avoids wasting the user's time gathering inputs only to hit an auth error on submit.
- MCP mode: auth is handled by OAuth — no check needed.
- CLI mode: run `heygen auth status` (silent). If it exits 0, proceed. If it exits non-zero (no key, expired, invalid):
  - Ask the user: "I need your HeyGen API key to proceed. You can grab one from https://app.heygen.com/settings?nav=API and paste it here."
  - Once they provide it, persist it: `echo "<key>" | heygen auth login` (writes to `~/.heygen/credentials`, survives across sessions).
  - Verify: `heygen auth status`. If still failing, surface the error and stop.
This is a one-time setup. Once `heygen auth login` persists the key, future sessions pick it up automatically. Don't ask again if `heygen auth status` passes.
Hard rules:
- Never call `curl api.heygen.com/...` directly. Every operation in this skill has a CLI command and (where supported) an MCP tool. Use those.
- MCP mode: only `mcp__heygen__*` tools. If translation isn't exposed via MCP yet, fall through to CLI for translation operations specifically. Do not synthesize raw HTTP calls.
- CLI mode: only `heygen ...` commands. Run `heygen video-translate --help` and `heygen video-translate <subcommand> --help` to discover arguments. Use `--request-schema` to see the full JSON shape of any create command.
- Operations below show MCP and CLI side-by-side; read only the column for your detected mode.
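The CLI-mode auth check described in this section could be wrapped roughly as follows. `ensure_cli_auth` is a hypothetical name, and the return value 3 for "need a key" is an arbitrary choice for this sketch; only `heygen auth status` and `heygen auth login` come from the text above.

```bash
# ensure_cli_auth [KEY]
# Returns 0 when auth works, 3 when the caller should ask the user for a key,
# and the status of the final `heygen auth status` otherwise.
ensure_cli_auth() {
  if heygen auth status >/dev/null 2>&1; then
    return 0                       # already authenticated, ask nothing
  fi
  if [ -z "${1:-}" ]; then
    return 3                       # no key supplied yet (sketch-specific code)
  fi
  # Persist the pasted key, then verify once more.
  printf '%s\n' "$1" | heygen auth login >/dev/null 2>&1
  heygen auth status >/dev/null 2>&1
}
```

A caller would run `ensure_cli_auth` with no argument first, ask for the key only on status 3, then call it again with the pasted key and stop if it still fails.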
MCP tool names (MCP mode only)
MCP tools carry the `mcp__heygen__*` prefix; translation jobs use `create_video_translation`.
CLI command groups (CLI mode only)
```
heygen video-translate
├── languages list                 # supported target languages
├── create                         # submit a translation job (single or batch)
├── get <id>                       # check status / fetch result
├── list                           # list past translations
├── update                         # update job metadata
├── delete                         # delete a job
└── proofreads
    ├── create                     # extract editable subtitles before final render
    ├── get <id>                   # check proofread session status
    ├── srt get <id>               # download the extracted SRT
    ├── srt update <id>            # upload edited SRT
    └── generate <id>              # render the final video from approved SRT
heygen asset create --file <path>  # for local source video uploads (max 32 MB)
```
Every command supports `--help`. Use `--request-schema` on any `create` to see the full JSON body. CLI output: JSON on stdout, `{error:{code,message,hint}}` envelope on stderr, exit codes `0` ok · `1` API · `2` usage · `3` auth · `4` timeout. Add `--wait` on `create` to block until the job completes (default timeout 20m).
📖 Detailed CLI/MCP error → action mapping → references/troubleshooting.md
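As a small illustration of the exit-code convention listed above, a wrapper might translate the numeric status into the category it stands for. `explain_exit` is a hypothetical helper, not part of the CLI.

```bash
# explain_exit CODE
# Maps the documented CLI exit codes to their meaning.
explain_exit() {
  case "$1" in
    0) echo "ok" ;;
    1) echo "API error" ;;
    2) echo "usage error" ;;
    3) echo "auth error" ;;
    4) echo "timeout" ;;
    *) echo "unknown" ;;
  esac
}

# Typical use after any heygen command:
#   heygen video-translate get "$id" >out.json 2>err.json
#   explain_exit "$?"
```

The category decides the next step (re-authenticate on auth, keep polling on timeout); the stderr envelope carries the detail, and neither should be echoed raw to the user.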
Default Workflow
The skill runs four phases. Phase 1 (Discovery) is the only place you ask questions. Phase 2 (Pre-flight) is silent. Phase 3 (Submit + Poll) is silent. Phase 4 (Deliver) is one short message.
Phase 1 — Discovery — gather minimum needed inputs from the user
Phase 2 — Pre-flight — validate language, classify content, set flags
Phase 3 — Submit + Poll — kick off, background poll, surface only on done/fail
Phase 4 — Deliver — post the result with one-line summary
Phase 1 — Discovery
Ask only what you don't already have. Communicate in the user's language. Never run a form. One or two questions per turn, max.
Required inputs (block until you have these):
- Source video. Public URL, local file path, or a HeyGen asset_id from a previous step. If the user hasn't supplied it, ask: "What's the source video — a URL, file path, or an existing HeyGen asset?"
- Target language(s). Ask as an open-ended question in the user's language: "Which language should I translate it into?" Do NOT present a picker or pre-assigned choices — let the user type freely. They may want one language, multiple, or a region-specific variant. Accept whatever they give and validate against the canonical languages list in Phase 2.
Important inputs (ask if not provided, with smart defaults):
- Speaker count. Single speaker is the default and what most users have. Ask once when ambiguous: "How many distinct speakers are in the video?" Wrong speaker count is the #1 quality killer — speaker confusion creates voice swaps mid-translation. Don't skip this for multi-person content.
- Content type. You usually don't need to ask — infer from the video and confirm. The five profiles below cover ~95% of cases. Only ask if genuinely ambiguous.
- Caption preference. Default ON for talking-head and corporate; default OFF for podcast/audio-only. If you flip the default, mention it briefly in Phase 4.
- Duration flexibility. Ask: "Does the translated video need to be exactly the same length as the original, or can it be slightly longer/shorter? Allowing flexibility usually sounds more natural: the translated speech gets enough room to be spoken at a comfortable pace instead of being sped up or compressed." Default recommendation: flexible (`enable_dynamic_duration: true`). Only set to `false` when the user needs frame-exact timing (e.g., syncing to a timeline, ad slot, or external audio track).
Optional (ask only if relevant):
- Glossary / do-not-translate terms. For corporate or technical content, ask: "Any product names, company names, or jargon I should keep in the original language?" HeyGen doesn't currently accept a hard glossary, so this becomes guidance for the proofread step (Phase 3-Proofread) when stakes are high.
- Partial translation. If the user mentions a specific segment ("just the intro", "from 1:30 to 4:00"), capture `start_time` and `end_time` in seconds.
- Proofread before final render? Default OFF (faster, fewer approvals). Default ON for: long videos (>3 min), corporate/branded content, high-stakes legal/medical/educational, languages the user reads natively (so they can verify). Ask: "Want to review and edit the subtitles before final render? Adds about 5 minutes but lets you fix any wrong terms."
📖 Locale-pair gotchas (formality registers, RTL languages, tonal compression, lip-sync ceiling) → references/language-locale-guide.md
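For the partial-translation input, users tend to give clock-style timestamps while `start_time` and `end_time` are captured in seconds. A minimal conversion sketch, assuming the input is `SS`, `MM:SS`, or `HH:MM:SS` (`to_seconds` is a hypothetical helper name):

```bash
# to_seconds TIMESTAMP
# Prints the total number of seconds for "SS", "MM:SS", or "HH:MM:SS".
to_seconds() {
  echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

start_time=$(to_seconds "1:30")   # from "from 1:30 to 4:00"
end_time=$(to_seconds "4:00")
```

Here `start_time` is 90 and `end_time` is 240. Phrases like "just the intro" still need a follow-up question, since they carry no timestamps.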
Phase 2 — Pre-flight
Silent. No user-facing chatter. Three checks, in order.
Check 2a: Language validation.
MCP: `list_video_translation_languages()` (if exposed). Otherwise CLI.
CLI: `heygen video-translate languages list | jq -r '.data.languages[]'`
The list contains exact strings ("Spanish (Spain)", "Chinese (Mandarin, Simplified)", "Arabic (Saudi Arabia)"). Match the user's input case-insensitively against these exact strings. If they say "Spanish", default to "Spanish (Spain)" and confirm in Phase 4. If they say "Chinese", default to "Chinese (Mandarin, Simplified)". If they specify a region ("Mexican Spanish"), map it ("Spanish (Mexico)"). If no match: ask the user to pick from the closest options.
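The matching rule above can be sketched as a filter over the canonical list. This is a minimal version under stated assumptions: `match_language` is a hypothetical helper, it reads one canonical name per line on stdin, and it covers only the two bare-language defaults named in the text. Region phrasing such as "Mexican Spanish" would need its own mapping.

```bash
# match_language USER_INPUT   (canonical names on stdin, one per line)
# Prints the canonical string on a case-insensitive exact match; exits 1 otherwise.
match_language() {
  want=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$want" in
    spanish) want="spanish (spain)" ;;                 # documented default
    chinese) want="chinese (mandarin, simplified)" ;;  # documented default
  esac
  awk -v want="$want" '
    tolower($0) == want { print; found = 1; exit }
    END { exit !found }
  '
}

# Intended use with the CLI listing from Check 2a:
#   heygen video-translate languages list | jq -r '.data.languages[]' \
#     | match_language "$user_input"
```

A non-zero exit is the cue to ask the user to pick from the closest options rather than guessing.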
Check 2b: Source video routing.
| Source the user gave you | Route |
|---|---|
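| Public HTTPS URL (no auth, returns video MIME on `HEAD`) | Pass directly as `video: {"type":"url","url":"..."}` |
| Auth-walled URL, 403, 404, or HTML response | Tell the user, ask for a public URL or local file |
| Local file path | Upload via `heygen asset create --file <path>` (max 32 MB), then pass the resulting asset |
| Existing HeyGen asset_id | Pass directly as the `asset_id` in `video` |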
📖 Asset routing edge cases (very large files, presigned URLs, auth-walled sources) → references/asset-routing.md
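The URL rows of the routing table reduce to a check on the status code and content type of a `HEAD` response. A rough sketch, with `route_source` and its two output labels invented for illustration:

```bash
# route_source HTTP_STATUS CONTENT_TYPE
# Prints "pass-url" when the URL can be handed to the engine as-is,
# "ask-user" when a public URL or local file is needed instead.
route_source() {
  case "$1" in
    200)
      case "$2" in
        video/*) echo "pass-url" ;;
        *)       echo "ask-user" ;;   # HTML page or other non-video response
      esac ;;
    *) echo "ask-user" ;;             # 403, 404, auth wall, and so on
  esac
}

# One way to obtain the two inputs (source URL only, never api.heygen.com):
#   curl -sIL -o /dev/null -w '%{http_code} %{content_type}' "$url"
```

Some hosts answer `HEAD` differently from `GET`, so an "ask-user" result here is a prompt to double-check, not proof that the link is broken.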
Check 2c: Content profile.
Pick one profile based on the source. Don't list all five to the user — propose silently and only ask if the source is genuinely ambiguous (e.g., a music-heavy talking-head where you can't tell if speech enhancement will help).
| Profile | Use when | Flags |
|---|---|---|
| Talking head / presenter (default) | One person speaks to camera; clean audio | |
| Podcast / audio-only | The visual is static, doesn't matter, or doesn't exist | `translate_audio_only: true` |
| Music / high-soundtrack | Background music interferes with speech | `disable_music_track: true` |
| Multi-speaker | Two or more distinct speakers | Talking-head defaults + `speaker_num: <n>` |
| Corporate / branded | Brand voice, glossary discipline, high-stakes | Talking-head defaults + (if user has one) |
Always:
- `mode: "precision"` unless the user explicitly asks for "fast" / "quick" / "speed".
- `enable_dynamic_duration`: set based on the user's answer to the duration flexibility question in Phase 1. Default `true` (recommended), which lets translated speech breathe instead of being crammed into the source's exact timing. Set `false` only when the user explicitly needs fixed-length output. Tonal compression makes flexibility especially important for en→zh, en→ja, en→ko (Asian languages run shorter); de→en, ja→en (run longer); ar/he/ur (RTL + register shifts).
- `keep_the_same_format: true` for visual translations. It preserves the source's resolution and bitrate so the dubbed video matches the original's encoding.
- `enable_watermark: false` (the default).
Phase 3 — Submit + Poll
Silent. Background work. Surface only on (a) per-language completion, (b) per-language hard failure, (c) >5 min progress check.
Branching:
- Standard path (proofread = OFF): submit translations, background-poll, deliver.
- Proofread path (proofread = ON): create proofread session → download SRT → user edits or you assist → upload edited SRT → generate final → background-poll → deliver.
Standard path
Submit one job per target language using batch syntax (`--output-languages` accepts multiple).
MCP (single language only at time of writing):
```
create_video_translation(
  video={type, url|asset_id},
  output_languages=["Spanish (Spain)"],
  mode="precision",
  enable_speech_enhancement=true,
  enable_caption=true,
  enable_dynamic_duration=true,
  keep_the_same_format=true,
  speaker_num=<n>,              # only when known multi-speaker
)
```
CLI:
```bash
heygen video-translate create \
  -d '{"video":{"type":"url","url":"https://..."},"output_languages":["Spanish (Spain)","Japanese (Japan)"]}' \
  --mode precision \
  --enable-speech-enhancement \
  --enable-caption \
  --enable-dynamic-duration \
  --keep-the-same-format \
  --speaker-num 1 \
  --title "<short title>"
```
Response returns one `video_translation_id` per language. Capture all of them.
Polling (silent, backgrounded):
Use `--wait` on `create` to block until completion when running ONE language. For batch, drop `--wait` and poll each ID:
```bash
# CLI mode polling (background)
heygen video-translate get <video-translation-id>
# Returns { data: { status: "pending"|"running"|"succeeded"|"failed", video_url, ... } }
```
Polling cadence: 30s for the first 3 minutes, then 60s. Most translations complete in 5–15 min; some (long videos, batched languages) take 30+ min. Hard timeout: 60 min per translation — beyond that, treat as stuck and surface the issue.
**MCP equivalents:** `get_video_translation(id)` (if exposed). Otherwise fall through to CLI for polling.
📖 **Background polling pattern (don't poll in foreground / harness-specific notes) → [references/troubleshooting.md#polling](references/troubleshooting.md)**
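The cadence above (30s for the first 3 minutes, then 60s, hard stop at 60 minutes) could be expressed as a small schedule function plus a loop. Both function names are hypothetical; the loop assumes `jq` is available and uses only the `get` subcommand and the `.data.status` field shown above.

```bash
# next_delay ELAPSED_SECONDS
# Prints the number of seconds to wait before the next poll;
# returns 1 once the 60-minute hard timeout is reached.
next_delay() {
  if [ "$1" -ge 3600 ]; then
    return 1
  elif [ "$1" -lt 180 ]; then
    echo 30
  else
    echo 60
  fi
}

# poll_translation ID
# Prints "succeeded" or "failed" when the job ends, "stuck" on timeout.
poll_translation() {
  id="$1"
  elapsed=0
  while delay=$(next_delay "$elapsed"); do
    status=$(heygen video-translate get "$id" | jq -r '.data.status')
    case "$status" in
      succeeded|failed) echo "$status"; return 0 ;;
    esac
    sleep "$delay"
    elapsed=$((elapsed + delay))
  done
  echo "stuck"
  return 1
}
```

Run `poll_translation` in the background, once per `video_translation_id`, so a batch delivers each language as it finishes.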
Proofread path
For high-stakes content, run a proofread session first so the user can review/edit the translated subtitles before the engine commits to a final render.
```bash
# 1. Create proofread session; returns proofread_ids (one per language)
heygen video-translate proofreads create \
  -d '{"video":{"type":"url","url":"https://..."}}' \
  --output-languages "Spanish (Spain)" \
  --mode precision \
  --enable-speech-enhancement \
  --keep-the-same-format \
  --speaker-num 1 \
  --title "<short filename-safe title>"
# → status: processing (3–5 min for short videos)

# 2. Poll until completed (or failed + failure_message)
heygen video-translate proofreads get <proofread-id>
# → status: completed

# 3. Fetch presigned URLs for editable + original SRTs
heygen video-translate proofreads srt get <proofread-id> > /tmp/srt-resp.json
SRT_URL=$(jq -r '.data.srt_url' /tmp/srt-resp.json)            # target-lang, edit this
ORIG_URL=$(jq -r '.data.original_srt_url' /tmp/srt-resp.json)  # source-lang transcript
curl -s "$SRT_URL" -o /tmp/proofread.srt

# 4. Edit /tmp/proofread.srt by hand or sed (glossary, register, names)
#    See references/proofreads-workflow.md for the full edit playbook.

# 5. Host the edited SRT at a public URL, then upload by reference.
#    ⚠️ asset_id route is currently BLOCKED for SRTs:
#    `heygen asset create` only accepts png/jpeg/mp4/webm/mp3/wav/pdf.
#    Use the URL route. (gist raw, S3 public-read, presigned ≥2h, etc.)
EDITED_URL="https://example.com/proofread-edited.srt"
heygen video-translate proofreads srt update <proofread-id> \
  -d "{\"srt\":{\"type\":\"url\",\"url\":\"$EDITED_URL\"}}"

# 6. Kick off final render; returns a video_translation_id
heygen video-translate proofreads generate <proofread-id> --captions
# → {"data":{"video_translation_id":"<vid-id>","status":"processing"}}

# 7. Poll the translation to completion (NOT proofreads get; the job graduates here)
heygen video-translate get <vid-id>
# → status: running → succeeded; data.video_url has the final mp4
```
📖 **When to insist on proofread, common SRT edits, glossary discipline → [references/proofreads-workflow.md](references/proofreads-workflow.md)**
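For the SRT editing step in the proofread path, a glossary fix should touch cue text only and leave cue numbers and timestamps alone. A minimal sketch with a hypothetical `fix_term` helper; note that awk treats the first argument as a regular expression, and a cue whose text is purely digits would be skipped.

```bash
# fix_term WRONG PREFERRED   (SRT on stdin, edited SRT on stdout)
# Replaces WRONG with PREFERRED in cue text lines only.
fix_term() {
  awk -v from="$1" -v to="$2" '
    /^[0-9]+$/ || / --> / { print; next }   # cue index or timestamp line
    { gsub(from, to); print }
  '
}

# Example:
#   fix_term "Hey Gen" "HeyGen" < /tmp/proofread.srt > /tmp/proofread-edited.srt
```

Writing to a second file keeps the downloaded SRT intact for comparison before the edited copy is hosted and uploaded.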
Phase 4 — Deliver
One message per completed language. Format:
✅ Spanish (Spain) — <video_url> 1m 47s, precision mode, captions on.
If a language failed: one short line with the cause (from troubleshooting reference). Don't flood the user with retry options unless they ask. If the user batched many languages, deliver each as it completes — don't wait for all to finish before posting any.
Source-quality disclaimer. Translation can't improve on the source. If the source has muffled audio, fast cuts, heavy occlusion of the face, or low resolution, lip-sync and voice quality will degrade. When you detect these conditions in Phase 2 (or the user mentions them), warn upfront. Don't surface this after a bad result.
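The one-line summary in the delivery format can be produced from the job's duration in seconds. A small sketch; `summary_line` is a hypothetical helper, and where the duration value comes from in the API response is not specified here.

```bash
# summary_line DURATION_SECONDS MODE CAPTIONS   (CAPTIONS is "on" or "off")
# Prints the short summary that follows the video link.
summary_line() {
  printf '%dm %ds, %s mode, captions %s.\n' \
    "$(( $1 / 60 ))" "$(( $1 % 60 ))" "$2" "$3"
}

summary_line 107 precision on
```

For a 107-second result this prints `1m 47s, precision mode, captions on.`, matching the format shown above.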
Embedded Expertise
The defaults above cover the common case. The decisions below are what separate this skill from a generic API wrapper. Use them as judgement calls during the workflow, not as a checklist to recite.
Speaker count is the #1 quality killer
For talking-head: 1 speaker. For interviews / podcasts / panels: count exactly, don't guess. The engine separates voices by `speaker_num`; wrong count means voices bleed across speakers in the dubbed output. If the user is unsure, ask them to scrub the video and count.
Source-quality triage (do this before submitting)
A 30-second triage in Phase 2 saves 10–30 minutes of bad translation. Watch/listen to the first ~10 seconds of the source and check:
- Audio: Is the speech clear? Background music dominant? Noise / hiss? → if speech is unclear, default `enable_speech_enhancement: true`. If music dominates, `disable_music_track: true`. If both, warn the user that quality may be lower regardless of flags.
- Face visibility: Is the speaker's face on-camera most of the time, front-facing, well-lit? → heavy occlusion (sunglasses, hands on face), profile-only shots, very fast cuts, or sub-720p faces all cap lip-sync quality.
- Text on screen: Burned-in captions in the source language? → these don't get re-rendered. They'll remain in the source language in the dubbed output. If the user wants new-language captions, they'll have two caption tracks. Propose `enable_caption: true` AND warn about the existing burn-in.
Locale-pair gotchas
- Tonal compression / expansion. en→zh, en→ja, en→ko run ~30% shorter; de→en, ja→en run longer; en→ar/he typically expands. Dynamic duration (the Phase 1 duration flexibility question) is especially important for these pairs. Without it, en→zh sounds artificially slow, because the shorter speech is stretched to fill a timeline about 30% too long. If the user opted for fixed-length output, warn them that quality will degrade on high-compression pairs.
- Formality / register. ja-JP (敬語/keigo), ko-KR (honorifics), de-DE (Sie vs du), th-TH (royal/polite/casual), id-ID (formal vs colloquial) — the engine picks neutral-formal by default. If the source is conversational and the user wants matching register in target, flag it in proofread or pre-warn that it'll sound slightly more formal than the original.
- RTL languages. Arabic, Hebrew, Urdu, Persian — captions render right-to-left. Burned-in captions can collide with the source video's lower-third graphics on the wrong side. If the source has on-screen text or graphics in the lower-third, propose audio-only translation OR proofread with caption styling review.
- Regional variants matter. Spanish (Spain) vs Spanish (Mexico) vs Spanish (Argentina) have distinctly different vocabulary, intonation, and speech rate. Latin American audiences often perceive Castilian Spanish as foreign. Default to the user's stated audience region; if unspecified for Spanish, ask once. Same for Portuguese (Portugal vs Brazil), French (France vs Canada vs Switzerland), Arabic (which has 19 region variants).
- Mandarin specifically. "Chinese (Mandarin, Simplified)" is the standard default for mainland China audiences. "Chinese (Cantonese, Traditional)" for Hong Kong / overseas Cantonese-speaking diaspora. "Chinese (Taiwanese Mandarin, Traditional)" for Taiwan. These are not interchangeable.
📖 Full locale-pair table with register notes and known quirks → references/language-locale-guide.md
Lip-sync ceiling
Lip-sync is best on:
- Stable, front-facing shots
- ≥720p face resolution
- Clean, well-lit faces
- Minimal cuts (long takes work better)
Lip-sync degrades on:
- Profile shots, looking-down shots, partial-face occlusion
- Fast cuts (<2s shots)
- Low light, motion blur, or low-res faces
- Heavy gesturing where the face moves rapidly
If the user's source has these conditions, warn them in Phase 1/2: "Heads up — the source has [X], so lip-sync won't be as tight as it would on a static talking-head. Want me to proceed anyway, or switch to audio-only translation?"
Captions: burned-in vs sidecar
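`enable_caption: true` burns captions into the dubbed video. When a sidecar file is the better fit, the SRT from the proofread path serves that role.
Audio-only translation
Use `translate_audio_only: true` for:
- Podcasts (the "video" is a static image or waveform)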
- Audio you'll re-composite into a different video later
- Any case where lip-sync would be impossible (no face, very poor source face quality)
Output is an audio file (typically MP3). Tell the user how to use it: "This gives you a translated audio track. Composite it back over the original video in your editor, or use it standalone." Do NOT pitch audio-only as a "quality workaround" for bad lip-sync — it's a different deliverable.
Cost & time awareness
Translations bill by source video duration. A 5-minute video translated into 5 languages = 25 billable minutes. Surface time and cost expectations in Phase 1 when the user requests batches: "That's 5 languages × 5 minutes = ~25 min of translation time. Each one will take 10–20 min to render. Sound good?"
Don't quote dollar figures (pricing changes, varies by plan). Quote source minutes × language count, plus an honest render-time range.
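The billing arithmetic is just source minutes times language count. A minimal sketch of it, with hypothetical function names and the 10–20 minute render range taken from the example message above:

```python
def billable_minutes(source_minutes: float, language_count: int) -> float:
    """Billable minutes for translating one source video into several languages."""
    return source_minutes * language_count


def batch_summary(source_minutes: float, language_count: int,
                  render_range_min: tuple[int, int] = (10, 20)) -> str:
    """Build the expectation-setting message; no dollar figures by design."""
    total = billable_minutes(source_minutes, language_count)
    low, high = render_range_min
    return (
        f"That's {language_count} languages × {source_minutes:g} minutes = "
        f"~{total:g} min of translation time. "
        f"Each one will take {low}–{high} min to render. Sound good?"
    )
```

For the case in the text, `billable_minutes(5, 5)` gives 25.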
Failure-mode decoder
Common error responses → human-readable causes → next action:
| Symptom | Likely cause | Fix |
|---|---|---|
| … | URL requires auth, returned HTML, or wrong MIME | Ask for public URL or local file → upload route |
| … | String didn't match canonical languages list | Re-run |
| … | Source has no audible speech, very corrupted audio, or wrong codec | Verify the source has speech; consider re-encoding |
| … | … | Re-submit with correct speaker count or … |
| Stuck >30 min in … | Backend queue / occasional stalls | Check status, give it 60 min total, then surface to user |
| Lip-sync looks bad on output | Source face conditions (see lip-sync ceiling) | Re-frame expectation; offer audio-only as alternative |
| Captions in wrong direction | RTL language with burned-in caption colliding with source layout | Switch to proofread + sidecar SRT |
📖 Full error → action table including auth and asset upload errors → references/troubleshooting.md
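The stalled-job row can be read as a two-threshold rule. A minimal sketch, assuming the caller tracks elapsed minutes since submission; the function name and the returned labels are hypothetical:

```python
def stall_action(elapsed_minutes: float) -> str:
    """Map elapsed time on a job that has not finished to the next step."""
    if elapsed_minutes >= 60:
        return "surface_to_user"   # 60 min total has passed: tell the user
    if elapsed_minutes > 30:
        return "check_status"      # stuck past 30 min: re-check before escalating
    return "keep_waiting"          # still within the normal window
```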
First-Time User Detection
Signals the user is new to HeyGen translation specifically:
- Asks "what languages do you support?"
- Doesn't know about lip-sync vs audio-only
- Hasn't used HeyGen before (no avatars, no past translations on `heygen video-translate list`)
For first-timers, suggest a 30–60 second test clip before committing to a full video. This catches source-quality issues, voice-clone fidelity, and lip-sync ceiling without burning a long-video translation.
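One way to produce such a test clip is to trim the source locally before uploading. The sketch below only builds an ffmpeg argument list; using ffmpeg at all is an assumption (the guide does not prescribe a tool), and `clip_command` is a hypothetical helper.

```python
def clip_command(source: str, output: str, seconds: int = 45) -> list[str]:
    """Build an ffmpeg command that keeps the first `seconds` of the source."""
    if not 30 <= seconds <= 60:
        raise ValueError("test clips should be 30-60 seconds long")
    # -t limits the output duration; -c copy avoids re-encoding, so the clip
    # keeps the source quality (the cut point lands on a nearby keyframe).
    return ["ffmpeg", "-i", source, "-t", str(seconds), "-c", "copy", output]
```

Pass the result to `subprocess.run(..., check=True)` on a machine with ffmpeg installed, then submit the clip instead of the full video.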
Source Quality Disclaimer (verbatim, for delivery)
When the source is borderline:
"Heads up — the source video has [muffled audio / dim lighting / fast cuts / heavy music / etc.]. The translation engine can't improve on the source, so the dub might inherit some of that. Want to proceed, fix the source first, or test with a short clip?"
Don't surface this after a bad result. Surface it in Phase 2.
Phase 2 / Roadmap (mention only if asked)
Currently best-supported via the proofreads workflow but not yet first-class flags:
- Brand glossary / do-not-translate lists. Inject during proofread by reverting auto-translated terms in the SRT. The `brand_voice_id` flag is for voice consistency across translations, not for glossaries.
- Custom SRT input. Supported via the `srt` field (Enterprise plan) on `create`, or via `proofreads srt update` for any plan. Use `srt_role: "input"` to apply YOUR subtitles to the source language; `"output"` to apply them as the target-language captions.
- Partial translation. `start_time` / `end_time` in seconds. Useful for "translate just minute 2:00–4:00".
- Multi-language batch. Already supported: pass multiple values to `--output-languages` (CLI) or the `output_languages` array (MCP). One job ID returned per language.
- Webhook callbacks. `--callback-url` and `--callback-id` skip polling entirely. Use when you have a webhook endpoint and want event-driven completion.
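Because `start_time` and `end_time` are given in seconds, a request like "translate just minute 2:00–4:00" needs a small conversion first. A minimal sketch: the helper names are hypothetical, and returning the two fields as a plain dict is an assumption about how they would be passed along.

```python
def to_seconds(timestamp: str) -> int:
    """Convert "M:SS" or "H:MM:SS" into whole seconds."""
    seconds = 0
    for part in timestamp.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def partial_range(start: str, end: str) -> dict[str, int]:
    """Return start_time/end_time in seconds, rejecting empty or reversed ranges."""
    start_s, end_s = to_seconds(start), to_seconds(end)
    if end_s <= start_s:
        raise ValueError("end must be later than start")
    return {"start_time": start_s, "end_time": end_s}
```

For the example in the list, `partial_range("2:00", "4:00")` yields 120 and 240 seconds.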
Best Practices (the short version)
最佳实践(精简版)
- Always precision mode unless the user explicitly asks for speed.
- Always confirm speaker count for non-obvious cases.
- Validate the language string against the canonical list before submitting.
- Suggest a short test clip for new users.
- Source quality is the ceiling — say it upfront, not after.
- Don't over-configure — the five content profiles cover ~95% of cases.
- Translations take time — set 5–30 min expectations.
- Match register — if the source is conversational and the target is a register-heavy language (ja, ko, th, de), proofread.
- Default proofread ON for: long videos, corporate, languages the user reads natively.
- Never narrate transport. MCP vs CLI is internal.
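The language-validation practice above can be sketched as a lookup with a near-miss suggestion. This assumes the canonical list has already been fetched and is passed in as plain strings; `check_language` is a hypothetical helper, and case-insensitive matching is an assumption about how strict the list is.

```python
import difflib


def check_language(requested: str, canonical: list[str]) -> tuple[bool, list[str]]:
    """Return (is_valid, names): the canonical spelling if valid, else close matches."""
    by_lower = {name.lower(): name for name in canonical}
    key = requested.strip().lower()
    if key in by_lower:
        return True, [by_lower[key]]
    # Suggest up to three near matches so the user can confirm what they meant.
    close = difflib.get_close_matches(key, list(by_lower), n=3, cutoff=0.6)
    return False, [by_lower[name] for name in close]
```

When the check fails, ask the user to pick from the suggestions rather than submitting a guess.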