HeyGen Video Translation
Translate and dub an existing video into 175+ languages. The system clones the
presenter's voice into the target language, re-syncs their lips to the new
audio, and returns a fully dubbed video. You provide a source video and a target
language — the engine handles transcription, translation, voice cloning,
lip-sync, and (optionally) burned-in captions.
This is not new-video generation. The presenter, performance, framing, and
brand assets in the original video are preserved. Translation rides on top of
what's already there.
User-Facing Behavior
- Be concise. Don't dump translation IDs, raw API payloads, or status JSON in chat. Report the result (video link, language) not the plumbing.
- No internal jargon. Don't say "polling", "video_translation_id", "asset_id", "v3 endpoint". Say "translating", "almost done", "your file".
- Polling is silent. Background long-running translations and only speak on (a) result delivered, (b) >5 min stall (one update), (c) hard failure.
- One result, one message. When the video is done, send the link plus a one-line summary (target language, duration, mode). Not every API field.
- Don't narrate transport choice. MCP vs CLI vs OpenClaw plugin is internal. Pick silently at session start; never mention which is in use.
- Communicate in the user's language. Detect from their first message. Replies, confirmations, and questions in their language. Technical CLI/API directives stay in English.
API Mode Detection
Pick one transport at session start. Never mix, never switch mid-session, never narrate the choice.
Detect in this order:
- OpenClaw plugin mode — If running inside OpenClaw and the tool exposes a HeyGen translation model, prefer that. Currently the plugin generates videos but does not expose translation directly — fall through to the next tier until HeyGen ships translation through .
- CLI mode (API-key override) — If is set in the environment AND exits 0, use CLI. API-key presence is an explicit signal that the user wants direct API access.
- MCP mode — No AND HeyGen MCP tools are visible (). OAuth auth, runs against the user's plan credits.
- CLI mode (fallback) — MCP tools not available AND exits 0. Auth via .
- Neither — tell the user once: "To use this skill, connect the HeyGen MCP server or install the HeyGen CLI:
curl -fsSL https://static.heygen.ai/cli/install.sh | bash
then ."
Auth verification (run before any API call)
After mode detection, verify auth actually works before entering Phase 1. This avoids wasting the user's time gathering inputs only to hit an auth error on submit.
- MCP mode: auth is handled by OAuth — no check needed.
- CLI mode: run (silent). If it exits 0, proceed. If it exits non-zero (no key, expired, invalid):
- Ask the user: "I need your HeyGen API key to proceed. You can grab one from https://app.heygen.com/settings?nav=API — paste it here."
- Once they provide it, persist it:
echo "<key>" | heygen auth login
(writes to , survives across sessions).
- Verify: . If still failing, surface the error and stop.
This is a
one-time setup. Once
persists the key, future sessions pick it up automatically. Don't ask again if
passes.
Hard rules:
- Never call Every operation in this skill has a CLI command and (where supported) an MCP tool. Use those.
- MCP mode: only tools. If translation isn't exposed via MCP yet, fall through to CLI for translation operations specifically. Do not synthesize raw HTTP calls.
- CLI mode: only commands. Run
heygen video-translate --help
and heygen video-translate <subcommand> --help
to discover arguments. Use to see the full JSON shape of any create command.
- Operations below show MCP and CLI side-by-side — read only the column for your detected mode.
MCP tool names (MCP mode only)
(single language). Multi-language and proofreads are
not yet exposed via MCP — fall through to CLI for those. Run
tool listing at session start to confirm what's available; tool surface evolves.
CLI command groups (CLI mode only)
heygen video-translate
├── languages list # supported target languages
├── create # submit a translation job (single or batch)
├── get <id> # check status / fetch result
├── list # list past translations
├── update # update job metadata
├── delete # delete a job
└── proofreads
├── create # extract editable subtitles before final render
├── get <id> # check proofread session status
├── srt get <id> # download the extracted SRT
├── srt update <id> # upload edited SRT
└── generate <id> # render the final video from approved SRT
heygen asset create --file <path> # for local source video uploads (max 32 MB)
Every command supports
. Use
on any
to see the full JSON body. CLI output: JSON on stdout,
{error:{code,message,hint}}
envelope on stderr, exit codes
ok ·
API ·
usage ·
auth ·
timeout. Add
on
to block until the job completes (default timeout 20m).
📖 Detailed CLI/MCP error → action mapping → references/troubleshooting.md
Default Workflow
The skill runs four phases. Phase 1 (Discovery) is the only place you ask questions. Phase 2 (Pre-flight) is silent. Phase 3 (Submit + Poll) is silent. Phase 4 (Deliver) is one short message.
Phase 1 — Discovery — gather minimum needed inputs from the user
Phase 2 — Pre-flight — validate language, classify content, set flags
Phase 3 — Submit + Poll — kick off, background poll, surface only on done/fail
Phase 4 — Deliver — post the result with one-line summary
Phase 1 — Discovery
Ask only what you don't already have. Communicate in the user's language. Never run a form. One or two questions per turn, max.
Required inputs (block until you have these):
- Source video. Public URL, local file path, or a HeyGen asset_id from a previous step. If the user hasn't supplied it, ask: "What's the source video — a URL, file path, or an existing HeyGen asset?"
- Target language(s). Ask as an open-ended question in the user's language: "Which language should I translate it into?" Do NOT present a picker or pre-assigned choices — let the user type freely. They may want one language, multiple, or a region-specific variant. Accept whatever they give and validate against the canonical languages list in Phase 2.
Important inputs (ask if not provided, with smart defaults):
- Speaker count. Single speaker is the default and what most users have. Ask once when ambiguous: "How many distinct speakers are in the video?" Wrong speaker count is the #1 quality killer — speaker confusion creates voice swaps mid-translation. Don't skip this for multi-person content.
- Content type. You usually don't need to ask — infer from the video and confirm. The five profiles below cover ~95% of cases. Only ask if genuinely ambiguous.
- Caption preference. Default ON for talking-head and corporate; default OFF for podcast/audio-only. If you flip the default, mention it briefly in Phase 4.
- Duration flexibility. Ask: "Does the translated video need to be exactly the same length as the original, or can it be slightly longer/shorter? Allowing flexibility usually sounds more natural — the translated speech gets enough room to be spoken at a comfortable pace instead of being sped up or compressed." Default recommendation: flexible (
enable_dynamic_duration: true
). Only set to when the user needs frame-exact timing (e.g., syncing to a timeline, ad slot, or external audio track).
Optional (ask only if relevant):
- Glossary / do-not-translate terms. For corporate or technical content, ask: "Any product names, company names, or jargon I should keep in the original language?" HeyGen doesn't currently accept a hard glossary, so this becomes guidance for the proofread step (Phase 3-Proofread) when stakes are high.
- Partial translation. If the user mentions a specific segment ("just the intro", "from 1:30 to 4:00"), capture and in seconds.
- Proofread before final render? Default OFF (faster, fewer approvals). Default ON for: long videos (>3 min), corporate/branded content, high-stakes legal/medical/educational, languages the user reads natively (so they can verify). Ask: "Want to review and edit the subtitles before final render? Adds about 5 minutes but lets you fix any wrong terms."
📖 Locale-pair gotchas (formality registers, RTL languages, tonal compression, lip-sync ceiling) → references/language-locale-guide.md
Phase 2 — Pre-flight
Silent. No user-facing chatter. Three checks, in order.
Check 2a: Language validation.
MCP: list_video_translation_languages()
(if exposed). Otherwise CLI.
CLI: heygen video-translate languages list | jq -r '.data.languages[]'
The list contains exact strings ("Spanish (Spain)", "Chinese (Mandarin, Simplified)", "Arabic (Saudi Arabia)"). Match the user's input case-insensitively against these exact strings. If they say "Spanish", default to "Spanish (Spain)" and confirm in Phase 4. If they say "Chinese", default to "Chinese (Mandarin, Simplified)". If they specify a region ("Mexican Spanish"), map it ("Spanish (Mexico)"). If no match: ask the user to pick from the closest options.
Check 2b: Source video routing.
| Source the user gave you | Route |
|---|
| Public HTTPS URL (no auth, returns video MIME on ) | Pass directly as {type: "url", url: "..."}
|
| Auth-walled URL, 403, 404, or HTML response | Tell the user, ask for a public URL or local file |
| Local file path | Upload via heygen asset create --file <path>
(CLI) or (MCP). Max 32 MB. Use the returned as {type: "asset_id", asset_id: "..."}
|
| Existing HeyGen asset_id | Pass directly as {type: "asset_id", asset_id: "..."}
|
📖 Asset routing edge cases (very large files, presigned URLs, auth-walled sources) → references/asset-routing.md
Check 2c: Content profile.
Pick one profile based on the source. Don't list all five to the user — propose silently and only ask if the source is genuinely ambiguous (e.g., a music-heavy talking-head where you can't tell if speech enhancement will help).
| Profile | Use when | Flags |
|---|
| Talking head / presenter (default) | One person speaks to camera; clean audio | , enable_speech_enhancement: true
, , enable_dynamic_duration: true
, keep_the_same_format: true
|
| Podcast / audio-only | The visual is static, doesn't matter, or doesn't exist | , translate_audio_only: true
, enable_speech_enhancement: true
, |
| Music / high-soundtrack | Background music interferes with speech | , disable_music_track: true
, enable_speech_enhancement: true
, enable_dynamic_duration: true
, keep_the_same_format: true
|
| Multi-speaker | Two or more distinct speakers | Talking-head defaults + . Speaker count is REQUIRED here — don't guess. |
| Corporate / branded | Brand voice, glossary discipline, high-stakes | Talking-head defaults + (if user has one) . Strongly consider proofreads for this profile. |
Always:
- unless the user explicitly asks for "fast" / "quick" / "speed".
- : set based on the user's answer to the duration flexibility question in Phase 1. Default (recommended) — lets translated speech breathe instead of being crammed into the source's exact timing. Set only when the user explicitly needs fixed-length output. Tonal compression makes flexibility especially important for en→zh, en→ja, en→ko (Asian languages run shorter); de→en, ja→en (run longer); ar/he/ur (RTL + register shifts).
keep_the_same_format: true
for visual translations — preserves the source's resolution and bitrate so the dubbed video matches the original's encoding.
- (the default).
Phase 3 — Submit + Poll
Silent. Background work. Surface only on (a) per-language completion, (b) per-language hard failure, (c) >5 min progress check.
Branching:
- Standard path (proofread = OFF): submit translations, background-poll, deliver.
- Proofread path (proofread = ON): create proofread session → download SRT → user edits or you assist → upload edited SRT → generate final → background-poll → deliver.
Standard path
Submit one job per target language using batch syntax (
accepts multiple).
MCP (single language only at time of writing):
create_video_translation(
video={type, url|asset_id},
output_languages=["Spanish (Spain)"],
mode="precision",
enable_speech_enhancement=true,
enable_caption=true,
enable_dynamic_duration=true,
keep_the_same_format=true,
speaker_num=<n>, # only when known multi-speaker
)
CLI:
bash
heygen video-translate create \
-d '{"video":{"type":"url","url":"https://..."},"output_languages":["Spanish (Spain)","Japanese (Japan)"]}' \
--mode precision \
--enable-speech-enhancement \
--enable-caption \
--enable-dynamic-duration \
--keep-the-same-format \
--speaker-num 1 \
--title "<short title>"
Response returns one
per language. Capture all of them.
Polling (silent, backgrounded):
Use
on
to block until completion when running ONE language. For batch, drop
and poll each ID:
bash
# CLI mode polling (background)
heygen video-translate get <video-translation-id>
# Returns { data: { status: "pending"|"running"|"succeeded"|"failed", video_url, ... } }
Polling cadence: 30s for the first 3 minutes, then 60s. Most translations complete in 5–15 min; some (long videos, batched languages) take 30+ min. Hard timeout: 60 min per translation — beyond that, treat as stuck and surface the issue.
MCP equivalents: get_video_translation(id)
(if exposed). Otherwise fall through to CLI for polling.
📖 Background polling pattern (don't poll in foreground / harness-specific notes) → references/troubleshooting.md#polling
Proofread path
For high-stakes content, run a proofread session first so the user can review/edit the translated subtitles before the engine commits to a final render.
bash
# 1. Create proofread session — returns proofread_ids (one per language)
heygen video-translate proofreads create \
-d '{"video":{"type":"url","url":"https://..."}}' \
--output-languages "Spanish (Spain)" \
--mode precision \
--enable-speech-enhancement \
--keep-the-same-format \
--speaker-num 1 \
--title "<short fileNAME-safe title>"
# → status: processing (3–5 min for short videos)
# 2. Poll until completed (or failed + failure_message)
heygen video-translate proofreads get <proofread-id>
# → status: completed
# 3. Fetch presigned URLs for editable + original SRTs
heygen video-translate proofreads srt get <proofread-id> > /tmp/srt-resp.json
SRT_URL=$(jq -r '.data.srt_url' /tmp/srt-resp.json) # target-lang, edit this
ORIG_URL=$(jq -r '.data.original_srt_url' /tmp/srt-resp.json) # source-lang transcript
curl -s "$SRT_URL" -o /tmp/proofread.srt
# 4. Edit /tmp/proofread.srt by hand or sed (glossary, register, names)
# See references/proofreads-workflow.md for the full edit playbook.
# 5. Host the edited SRT at a public URL, then upload by reference.
# ⚠️ asset_id route is currently BLOCKED for SRTs —
# `heygen asset create` only accepts png/jpeg/mp4/webm/mp3/wav/pdf.
# Use the URL route. (gist raw, S3 public-read, presigned ≥2h, etc.)
EDITED_URL="https://example.com/proofread-edited.srt"
heygen video-translate proofreads srt update <proofread-id> \
-d "{\"srt\":{\"type\":\"url\",\"url\":\"$EDITED_URL\"}}"
# 6. Kick off final render — returns a video_translation_id
heygen video-translate proofreads generate <proofread-id> --captions
# → {"data":{"video_translation_id":"<vid-id>","status":"processing"}}
# 7. Poll the translation to completion (NOT proofreads get — graduates here)
heygen video-translate get <vid-id>
# → status: running → succeeded; data.video_url has the final mp4
📖 When to insist on proofread, common SRT edits, glossary discipline → references/proofreads-workflow.md
Phase 4 — Deliver
One message per completed language. Format:
✅ Spanish (Spain) — <video_url>
1m 47s, precision mode, captions on.
If a language failed: one short line with the cause (from troubleshooting reference). Don't flood the user with retry options unless they ask. If the user batched many languages, deliver each as it completes — don't wait for all to finish before posting any.
Source-quality disclaimer. Translation can't improve on the source. If the source has muffled audio, fast cuts, heavy occlusion of the face, or low resolution, lip-sync and voice quality will degrade. When you detect these conditions in Phase 2 (or the user mentions them), warn upfront. Don't surface this after a bad result.
Embedded Expertise
The defaults above cover the common case. The decisions below are what separate this skill from a generic API wrapper. Use them as judgement calls during the workflow, not as a checklist to recite.
Speaker count is the #1 quality killer
For talking-head: 1 speaker. For interviews / podcasts / panels: count exactly, don't guess. The engine separates voices by
; wrong count means voices bleed across speakers in the dubbed output. If the user is unsure, ask them to scrub the video and count.
Source-quality triage (do this before submitting)
A 30-second triage in Phase 2 saves 10–30 minutes of bad translation. Watch/listen to the first ~10 seconds of the source and check:
- Audio: Is the speech clear? Background music dominant? Noise / hiss? → if speech is unclear, default
enable_speech_enhancement: true
. If music dominates, disable_music_track: true
. If both, warn the user that quality may be lower regardless of flags.
- Face visibility: Is the speaker's face on-camera most of the time, front-facing, well-lit? → heavy occlusion (sunglasses, hands on face), profile-only shots, very fast cuts, or sub-720p faces all cap lip-sync quality.
- Text on screen: Burned-in captions in the source language? → these don't get re-rendered. They'll remain in the source language in the dubbed output. If the user wants new-language captions, they'll have two caption tracks — propose AND warn about the existing burn-in.
Locale-pair gotchas
- Tonal compression / expansion. en→zh, en→ja, en→ko run ~30% shorter; de→en, ja→en run longer; en→ar/he typically expands. Dynamic duration (the Phase 1 duration flexibility question) is especially important for these pairs — without it, en→zh sounds artificially slow (speech crammed to fit a 30%-too-long timeline). If the user opted for fixed-length output, warn them that quality will degrade on high-compression pairs.
- Formality / register. ja-JP (敬語/keigo), ko-KR (honorifics), de-DE (Sie vs du), th-TH (royal/polite/casual), id-ID (formal vs colloquial) — the engine picks neutral-formal by default. If the source is conversational and the user wants matching register in target, flag it in proofread or pre-warn that it'll sound slightly more formal than the original.
- RTL languages. Arabic, Hebrew, Urdu, Persian — captions render right-to-left. Burned-in captions can collide with the source video's lower-third graphics on the wrong side. If the source has on-screen text or graphics in the lower-third, propose audio-only translation OR proofread with caption styling review.
- Regional variants matter. Spanish (Spain) vs Spanish (Mexico) vs Spanish (Argentina) have distinctly different vocabulary, intonation, and speech rate. Latin American audiences often perceive Castilian Spanish as foreign. Default to the user's stated audience region; if unspecified for Spanish, ask once. Same for Portuguese (Portugal vs Brazil), French (France vs Canada vs Switzerland), Arabic (which has 19 region variants).
- Mandarin specifically. "Chinese (Mandarin, Simplified)" is the standard default for mainland China audiences. "Chinese (Cantonese, Traditional)" for Hong Kong / overseas Cantonese-speaking diaspora. "Chinese (Taiwanese Mandarin, Traditional)" for Taiwan. These are not interchangeable.
📖 Full locale-pair table with register notes and known quirks → references/language-locale-guide.md
Lip-sync ceiling
Lip-sync is best on:
- Stable, front-facing shots
- ≥720p face resolution
- Clean, well-lit faces
- Minimal cuts (long takes work better)
Lip-sync degrades on:
- Profile shots, looking-down shots, partial-face occlusion
- Fast cuts (<2s shots)
- Low light, motion blur, or low-res faces
- Heavy gesturing where the face moves rapidly
If the user's source has these conditions, warn them in Phase 1/2: "Heads up — the source has [X], so lip-sync won't be as tight as it would on a static talking-head. Want me to proceed anyway, or switch to audio-only translation?"
Captions: burned-in vs sidecar
produces captions burned into the video by default. Pros: no separate file, plays anywhere. Cons: not editable later, can collide with source graphics, fixed font/style. For high-stakes content where the user might want to restyle captions (brand kit, language-specific font), prefer the proofreads workflow — it gives an SRT they can use as a sidecar caption file.
Audio-only translation
translate_audio_only: true
skips lip-sync entirely. Use it for:
- Podcasts (the "video" is a static image or waveform)
- Audio you'll re-composite into a different video later
- Any case where lip-sync would be impossible (no face, very poor source face quality)
Output is an audio file (typically MP3). Tell the user how to use it: "This gives you a translated audio track. Composite it back over the original video in your editor, or use it standalone." Do NOT pitch audio-only as a "quality workaround" for bad lip-sync — it's a different deliverable.
Cost & time awareness
Translations bill by source video duration. A 5-minute video translated into 5 languages = 25 billable minutes. Surface time and cost expectations in Phase 1 when the user requests batches: "That's 5 languages × 5 minutes = ~25 min of translation time. Each one will take 10–20 min to render. Sound good?"
Don't quote dollar figures (pricing changes, varies by plan). Quote source minutes × language count, plus an honest render-time range.
Failure-mode decoder
Common error responses → human-readable causes → next action:
| Symptom | Likely cause | Fix |
|---|
| "video URL not accessible" | URL requires auth, returned HTML, or wrong MIME | Ask for public URL or local file → upload route |
| "language not supported" | String didn't match canonical languages list | Re-run , present closest matches |
| status with "audio extraction" | Source has no audible speech, very corrupted audio, or wrong codec | Verify the source has speech; consider re-encoding |
| with "speaker detection" | mismatched actual speakers, or audio is too noisy | Re-submit with correct speaker count or enable_speech_enhancement: true
|
| Stuck >30 min in | Backend queue / occasional stalls | Check status, give it 60 min total, then surface to user |
| Lip-sync looks bad on output | Source face conditions (see lip-sync ceiling) | Re-frame expectation; offer audio-only as alternative |
| Captions in wrong direction | RTL language with burned-in caption colliding with source layout | Switch to proofread + sidecar SRT |
📖 Full error → action table including auth and asset upload errors → references/troubleshooting.md
First-Time User Detection
Signals the user is new to HeyGen translation specifically:
- Asks "what languages do you support?"
- Doesn't know about lip-sync vs audio-only
- Hasn't used HeyGen before (no avatars, no past translations on
heygen video-translate list
)
For first-timers, suggest a 30–60 second test clip before committing to a full video. This catches source-quality issues, voice-clone fidelity, and lip-sync ceiling without burning a long-video translation.
Source Quality Disclaimer (verbatim, for delivery)
When the source is borderline:
"Heads up — the source video has [muffled audio / dim lighting / fast cuts / heavy music / etc.]. The translation engine can't improve on the source, so the dub might inherit some of that. Want to proceed, fix the source first, or test with a short clip?"
Don't surface this after a bad result. Surface it in Phase 2.
Phase 2 / Roadmap (mention only if asked)
Currently best-supported via the proofreads workflow but not yet first-class flags:
- Brand glossary / do-not-translate lists. Inject during proofread by reverting auto-translated terms in the SRT. The flag is for voice consistency across translations, not for glossaries.
- Custom SRT input. Supported via field (Enterprise plan) on , or via for any plan. Use to apply YOUR subtitles to the source language; to apply them as the target-language captions.
- Partial translation. / in seconds. Useful for "translate just minute 2:00–4:00".
- Multi-language batch. Already supported — pass multiple values to (CLI) or array (MCP). One job ID returned per language.
- Webhook callbacks. and skip polling entirely. Use when you have a webhook endpoint and want event-driven completion.
Best Practices (the short version)
- Always precision mode unless the user explicitly asks for speed.
- Always confirm speaker count for non-obvious cases.
- Validate the language string against the canonical list before submitting.
- Suggest a short test clip for new users.
- Source quality is the ceiling — say it upfront, not after.
- Don't over-configure — the five content profiles cover ~95% of cases.
- Translations take time — set 5–30 min expectations.
- Match register — if the source is conversational and the target is a register-heavy language (ja, ko, th, de), proofread.
- Default proofread ON for: long videos, corporate, languages the user reads natively.
- Never narrate transport. MCP vs CLI is internal.