explainer

/pika:explainer

Generate a ~60–80s URL explainer video: drive a real browser through the URL along a beat-sheet timeline, generate an avatar lipsync of the narration, and composite it all in a 1280×800 macOS Sonoma frame with a 240-pixel inner avatar (246-pixel outer including 3px white stroke ring) at canvas (20, 476) and element-targeted zoom on every mid-section beat. Works on any URL — product pages, docs sites, blog posts, launches. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); all other URLs use a generic page-walkthrough flow.

Usage:

/pika:explainer <url> [--focus "angles"] [--avatar <url>] [--voice <id>] [--lipsync-provider pika|kling] [--preview] [--live-url <url>]


Behavior


Defaults — fire fast, no mid-flow confirmation


- Use identity-store defaults silently for avatar / voice. Never ask "should I use your avatar?" or "which voice?" before firing. Honor explicit overrides (`--avatar`, `--voice`) when supplied; otherwise resolve via `identity_avatar_url` / `identity_voice_id` and proceed. See Step 1 for the full resolution waterfall (including the silent fallback when identity returns null).
- No mid-flow "type yes to proceed" gates by default. Step 5 preview is opt-in via `--preview` (for power users testing new avatar/voice combos before the long-pole render); the default flow runs end-to-end without pausing.
- Do not solicit `--focus` either. Make a confident first attempt from page structure; users re-run with `--focus "X"` if the angle missed.

These defaults match the industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return. Account credit balance and provider failover (Step 9) are the canonical guardrails.


Local avatar images on Claude Desktop


Claude Desktop can't pass inline-pasted images to MCP tools yet (Anthropic-side limitation). If the user pastes a photo inline, or mentions a local file they want as `--avatar`, pause Step 1 and kindly send them this, or something like it:

Heads up: pasted images don't reach MCP tools on Claude Desktop yet (Anthropic limitation). Two easy options for your avatar:

- Paste a URL if it's already hosted (Imgur, S3, your site). This is the fastest option.
- Attach the image file so I can upload it before generation.

When a local file arrives, convert it to a public URL with `upload_asset` and pass the returned `public_url` as `--avatar <url>` before resuming Step 1. Already-hosted `https://...` URLs work as-is and skip this entirely. If no avatar is supplied at all, the identity-store default fires.


Step 0 — Resolve URL (empty-args menu)


Strip flags (`--focus`, `--avatar`, `--voice`, `--live-url`, `--lipsync-provider`, `--no-captions`, `--preview`, `--skip-preview`, `--yes`) and `key=value` parameters from `$ARGUMENTS`. If what remains contains no `https://...` URL (or is empty / whitespace-only), print this menu verbatim as your full response, then stop and wait for the user's next message. Calling a tool here risks recording or explaining the wrong page. If `$ARGUMENTS` already carries a URL, skip this step silently and proceed to Step 1.

Which URL would you like me to walk through? Works on any of:
A GitHub repo — e.g.
https://github.com/anthropics/claude-code
(activates repo-aware mode: README scan + live-demo detection)
A product page / launch page — e.g.
https://pika.art
A docs site — e.g.
https://docs.anthropic.com
A blog post / article URL
Output: 1280×800 macOS Sonoma frame with a bottom-left avatar lipsync and element-targeted zoom on every mid-section beat. Default flow runs end-to-end with no confirmation gates — pass
--preview
if you want a 3-second lipsync sanity check first.
Reply with the URL and I'll start.
Tip: you don't need to type
/pika:explainer
— just say things like "walk me through <url>", "make a demo video of <url>", or "explain this repo: <github-url>" and I'll fire this skill automatically.

When the user replies with a URL, treat it as the resolved input and proceed to Step 1. Do not re-prompt.
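
The Step 0 check can be sketched in Python as follows. This is a minimal illustration, not the skill's actual parser: the helper name `resolve_url` and the split between value-taking and bare flags are assumptions made for the example.

```python
import re
import shlex
from typing import Optional

# Flags named in Step 0. Which ones consume a value is an assumption of this sketch.
VALUE_FLAGS = {"--focus", "--avatar", "--voice", "--live-url", "--lipsync-provider"}
BARE_FLAGS = {"--no-captions", "--preview", "--skip-preview", "--yes"}
URL_RE = re.compile(r"https://\S+")


def resolve_url(arguments: str) -> Optional[str]:
    """Return the first https:// URL left after stripping flags, or None."""
    try:
        tokens = shlex.split(arguments)
    except ValueError:
        # Unbalanced quotes: fall back to plain whitespace splitting.
        tokens = arguments.split()
    remaining = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in VALUE_FLAGS:
            i += 2  # skip the flag and its value
            continue
        is_key_value = "=" in tok and not tok.startswith("https://")
        if tok in BARE_FLAGS or is_key_value:
            i += 1  # bare flag or key=value parameter
            continue
        remaining.append(tok)
        i += 1
    match = URL_RE.search(" ".join(remaining))
    return match.group(0) if match else None
```

A `None` result corresponds to the "print the menu and stop" branch; any other result is the resolved input for Step 1.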


Step 1 — Parse input + detect mode


Required: `url` (must be `https://...`). Optional:

- `--avatar <url>`: overrides the identity-store default.
- `--voice <minimax-voice-id>`
- `--focus "..."`: editorial guidance woven into `vo_text`.
- `--live-url <url>`: force-supply the live demo URL (GitHub mode only).
- `--lipsync-provider <pika|kling>`: defaults to `pika` (parrot a2v, ~2-5 min wall-clock, slightly more dramatic head motion). Pass `kling` for tighter face-centered output at ~5-30 min wall-clock. Kling produces minimal-head-motion presenter shots but is the long-pole stage; reserve it for high-stakes renders.
- `--no-captions`: skip the Step 11 caption burn (captions are on by default).
- `--preview`: opt in to the Step 5 preview gate, a ~3s lipsync of "Hi, I'm your presenter" for testing new avatar/voice combos before the long-pole render (default is no preview).
- `--skip-preview` and `--yes`: accepted as no-ops for backward compatibility.

Mode detection:

- GitHub mode: URL host is `github.com` AND path matches `/{owner}/{repo}` (no further path segments past the repo root). Activates the repo-aware extras: README scan, live-demo detection, GitHub-specific selectors.
- Generic-URL mode: anything else (a product page, docs site, blog post, or a deeper GitHub path like `/blob/HEAD/path`). Skips the GitHub extras; uses generic CSS selectors and walks through the URL itself.
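
As a rough illustration of the mode rule above, the check reduces to a host comparison plus a path-segment count. The function name `detect_mode` and its string return values are assumptions of this sketch.

```python
from urllib.parse import urlparse


def detect_mode(url: str) -> str:
    """Return "github" for a repo-root URL on github.com, else "generic"."""
    parsed = urlparse(url)
    # Non-empty path segments: ["owner", "repo"] for a repo root.
    segments = [s for s in parsed.path.split("/") if s]
    if parsed.hostname == "github.com" and len(segments) == 2:
        return "github"
    return "generic"
```

A deeper GitHub path such as `/owner/repo/blob/HEAD/README.md` has more than two segments, so it falls through to generic mode as the rule requires.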

Avatar resolution (silent — never ask the user):

- If `--avatar <url>` was passed, use it.
- Else call `mcp__pika__identity_avatar_url`. If non-null, use it.
- Else (fresh user, no identity avatar set yet) call `mcp__pika__generate_image` once with prompt `"professional presenter, friendly tech narrator, studio portrait, 1:1, natural lighting"` and use the returned URL. Do not ask the user "should I generate one?"; just generate silently.

Voice resolution (silent — never ask the user):

- If `--voice <id>` was passed, use it.
- Else call `mcp__pika__identity_voice_id`. If non-null, use it.
- Else pick a casual MiniMax `speech-2.8-hd` preset matching the resolved avatar's apparent gender:
  - Female-coded avatar → `English_PlayfulGirl` (warm, casual, clearly female-voiced; verified)
  - Male-coded avatar → `English_Jovialman` (warm, casual male)
  - Unclear / gender-neutral → `English_Jovialman` (default)

  Determine gender from `mcp__pika__identity_persona_read` (look for a gender / pronouns field) when identity exists; otherwise infer from the resolved avatar image. Do not call `analyze_media` for this, since it's not worth the extra ~30s round-trip. Do not ask the user.
- Do NOT use `English_FriendlyPerson`: despite being categorized under "female" in MiniMax's catalog, its display name is "Friendly Guy" and it reads as male in playback. `English_PlayfulGirl` is the canonical casual-female pick. Other verified-female alternates: `English_Upbeat_Woman`, `English_LovelyGirl`, `English_radiant_girl`.
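
The voice waterfall above is a three-level fallback, and the avatar waterfall has the same shape. A minimal sketch, assuming the MCP lookup is passed in as a plain callable: `resolve_voice`, `identity_voice_id` as a Python function, and the `apparent_gender` string values are all hypothetical stand-ins, not the real tool interface.

```python
from typing import Callable, Optional

# Preset names come from the list above; the dict keys are this sketch's own labels.
DEFAULT_VOICES = {"female": "English_PlayfulGirl", "male": "English_Jovialman"}
FALLBACK_VOICE = "English_Jovialman"  # unclear / gender-neutral default


def resolve_voice(
    flag_voice: Optional[str],
    identity_voice_id: Callable[[], Optional[str]],
    apparent_gender: Optional[str],
) -> str:
    """Resolve a voice id silently: flag, then identity store, then preset."""
    if flag_voice:
        return flag_voice  # explicit --voice override wins
    stored = identity_voice_id()  # stand-in for the identity-store lookup
    if stored:
        return stored
    return DEFAULT_VOICES.get(apparent_gender or "", FALLBACK_VOICE)
```

The function never raises a question back to the caller, which mirrors the "never ask the user" rule: every branch ends in a concrete voice id.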

The flow below is annotated per step: GitHub-only, Generic-only, or Both modes.


Step 2 — Read source (no MCP call)


Both modes: use Claude's `WebFetch` on the input URL to pull the page's main content (h1, hero section, headings, primary copy).

GitHub mode additions: also fetch the top-level file tree, (best-effort) `package.json` / `pyproject.toml`, and GitHub API repo metadata via `gh api repos/{owner}/{repo}` for `homepage`, `description`, `language`, `topics`. Detect a candidate `live_url` in this priority:

- User-supplied `--live-url`.
- GitHub API `meta.homepage` field, set when the maintainer configured the repo's homepage in GitHub settings (matches tarball `repo_analyzer.py:66-77`).
- `package.json` `"homepage"` field.
- First match in README of `https?://[^\s)\"'<>]+(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app|render\.com|herokuapp\.com|surge\.sh)[^\s)\"'<>]*`.
- Any other URL in README that the badge area / "Live Demo" / "Project Page" / "Demo" text points at. The allowlist regex above misses arbitrary custom domains (e.g. `<project>-project-page.com`); when the README explicitly designates a project page, prefer that over the github.io fallback.
- GitHub Pages convention `https://{owner}.github.io/{repo}`, but only if the deep tree contains a frontend signal (one of `index.html`, `App.tsx`, `App.jsx`, `App.vue`, `app.py`, `main.py`).

If no candidate resolves, the beat sheet skips beats 6–7.
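
The priority order for `live_url` can be sketched as a first-non-empty scan. This is an illustration under stated assumptions: `pick_live_url` and its parameter names are hypothetical, and `designated_url` stands for whatever project-page link the README explicitly designates (a judgment the agent makes while reading, not something this code derives).

```python
import re
from typing import Iterable, Optional

# Hosted-demo allowlist regex from the priority list above.
HOSTED_DEMO_RE = re.compile(
    r"https?://[^\s)\"'<>]+(?:vercel\.app|netlify\.app|github\.io|fly\.dev"
    r"|railway\.app|render\.com|herokuapp\.com|surge\.sh)[^\s)\"'<>]*"
)
FRONTEND_SIGNALS = {"index.html", "App.tsx", "App.jsx", "App.vue", "app.py", "main.py"}


def pick_live_url(
    owner: str,
    repo: str,
    flag_live_url: Optional[str] = None,
    api_homepage: Optional[str] = None,
    pkg_homepage: Optional[str] = None,
    readme: str = "",
    designated_url: Optional[str] = None,
    tree_files: Iterable[str] = (),
) -> Optional[str]:
    """Return the highest-priority live_url candidate, or None."""
    match = HOSTED_DEMO_RE.search(readme)
    candidates = [
        flag_live_url,
        api_homepage,
        pkg_homepage,
        match.group(0) if match else None,
        designated_url,
    ]
    for candidate in candidates:
        if candidate:  # skips None and empty strings
            return candidate
    # GitHub Pages fallback only when the tree shows a frontend signal.
    names = {path.rsplit("/", 1)[-1] for path in tree_files}
    if names & FRONTEND_SIGNALS:
        return f"https://{owner}.github.io/{repo}"
    return None
```

A `None` result is the case where the beat sheet skips beats 6–7; any other result is only a candidate until reachability has been checked.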

Generic-URL mode: the input URL itself is the only URL the beats walk through, with no `live_url` inference and no extra metadata fetches. Skip Step 2.5 and Step 3.0; run the Step 2.6 pre-flight, then continue with Step 3.


Step 2.5 — Verify `live_url` reachability (GitHub mode only, no MCP call)


If a candidate `live_url` was selected, verify it serves real content before authoring beats 6–7. Use `WebFetch` on the candidate and check the response:

- If the response status is 4xx / 5xx, set `live_url` to None and skip beats 6–7. The github.io fallback in particular is reachable as a hostname but often returns 404 ("There isn't a GitHub Pages site here") for repos that haven't enabled Pages; recording that 404 page wastes ~12s of the explainer on wrong content.
- If the response renders the GitHub Pages 404 template (heuristic: response body contains `"There isn't a GitHub Pages site here"`), set `live_url` to None and skip beats 6–7.
- Otherwise, keep `live_url` for beats 6–7.

This mirrors the original tarball's `requests.head(live_url, timeout=6, allow_redirects=True)` reachability gate.
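
The decision itself is small enough to sketch as a pure function over an already-fetched response. The name `verify_live_url` and the `(status, body)` inputs are assumptions of this example; how the status and body are obtained is left to the fetch tool.

```python
from typing import Optional

PAGES_404_MARKER = "There isn't a GitHub Pages site here"


def verify_live_url(live_url: Optional[str], status: int, body: str) -> Optional[str]:
    """Keep live_url only if the fetched response looks like real content."""
    if live_url is None:
        return None
    if status >= 400:
        return None  # 4xx / 5xx: drop the candidate, skip beats 6-7
    if PAGES_404_MARKER in body:
        return None  # GitHub Pages 404 template served with a non-error status
    return live_url
```

Keeping the body check separate from the status check covers the case where the 404 template is rendered but the status code is not surfaced by the fetch tool.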


Step 2.6 — Generic-URL pre-flight (Generic-URL mode only, no MCP call)


Before authoring beats for a non-GitHub URL, WebFetch the input URL and inspect the response. This step prevents three common Generic-URL failure modes: (a) recording a captcha / bot-block page instead of content, (b) the cookie/consent banner eating the first ~3 seconds of video, (c) generic CSS selectors missing the page's actual hero / sections.

A. Bot-block / captcha detection — abort if matched:

If the response body contains any of:

- `"Verify you are human"` / `"verify you are not a robot"`
- `"captcha"` / `"CAPTCHA"` / `"reCAPTCHA"`
- `"403 Forbidden"` / `"Access Denied"`
- `"Just a moment"` + `cf-chl-bypass` (Cloudflare challenge)
- `"We're sorry, something went wrong"` (Amazon-style bot block)
- A `<title>` or h1 of just "Robot Check" / "Are you a robot?"

→ ABORT with a clear error to the user: "Generic-URL mode can't render this site — the page is showing a bot-detection / captcha challenge under headless Chrome. Try a different URL, or run a real-user version of the page first to verify it loads cleanly."
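
A minimal sketch of the bot-block check, assuming the fetched response is available as an HTML string. The function name `looks_bot_blocked` is hypothetical, and matching every marker case-insensitively is a simplification chosen for this example.

```python
import re

# Markers from the list above, lowercased for case-insensitive matching.
BODY_MARKERS = (
    "verify you are human",
    "verify you are not a robot",
    "captcha",  # also covers "CAPTCHA" and "reCAPTCHA"
    "403 forbidden",
    "access denied",
    "we're sorry, something went wrong",
)
ROBOT_TITLES = {"robot check", "are you a robot?"}


def looks_bot_blocked(body: str) -> bool:
    """Return True when the page looks like a captcha or bot-detection wall."""
    lowered = body.lower()
    if any(marker in lowered for marker in BODY_MARKERS):
        return True
    # The Cloudflare challenge needs both signals together.
    if "just a moment" in lowered and "cf-chl-bypass" in lowered:
        return True
    # A <title> or <h1> whose entire text is a robot-check phrase.
    for tag in ("title", "h1"):
        for m in re.finditer(rf"<{tag}[^>]*>(.*?)</{tag}>", lowered, re.S):
            if m.group(1).strip() in ROBOT_TITLES:
                return True
    return False
```

Plain substring matching is deliberately blunt: a page that merely embeds a reCAPTCHA widget on a form would also trip it, so a stricter variant may be preferable where that matters.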

B. Cookie / consent-banner detection — defuse with `extra_css` + optional click:

Scan the response for these patterns (case-insensitive):

- IDs / classes starting with `onetrust-`, `truste-`, `cookie-banner`, `cookie-consent`, `gdpr-`, `consent-`, `cmp-`
- Buttons matching `(?i)accept (all )?cookies`, `(?i)agree.{0,10}cookies`, `(?i)i (accept|agree)`
- Apple-specific banner: id `ac-gdpr-banner` or class `as-globalfooter-curtain`
- Google consent: `[role="dialog"]` with text "Before you continue"

If detected, set `cookie_banner_present = true`. Defense in depth: the recording uses BOTH:

- CSS injection (`extra_css`) in the `capture_website` call to hide common banners universally. Even if the click below misses, the banner is visually gone.
- A `click` `timed_action` at `at_s: 0.0` against the most likely dismissal selector (extracted from the WebFetch DOM, e.g. `#onetrust-accept-btn-handler`, `[aria-label*="Accept all" i]`, `button[id*="accept"]`).

The `extra_css` payload (use this verbatim; covers ~80% of consent platforms):

#onetrust-banner-sdk, #onetrust-pc-sdk, #onetrust-consent-sdk { display: none !important; }
#truste-consent-track, #truste-consent-content, .truste_box_overlay { display: none !important; }
[id*="gdpr-cookie"], [id*="cookie-consent"], [id*="cookie-banner"] { display: none !important; }
[class*="cookie-banner"], [class*="cookie-consent"], [class*="consent-banner"] { display: none !important; }
[class*="CookieBanner"], [class*="CookieConsent"], [class*="ConsentBanner"] { display: none !important; }
#ac-gdpr-banner, .as-globalfooter-curtain { display: none !important; }  /* Apple */
[role="dialog"][aria-label*="cookie" i], [role="dialog"][aria-label*="consent" i] { display: none !important; }
.cmp-container, .cmp-modal, .cmp-banner { display: none !important; }
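
The detection side of step B can be sketched with regular expressions over the fetched HTML. This is an approximation under stated assumptions: `cookie_banner_present` as a function name, the regex-based attribute and button extraction, and the restriction to `<button>` elements are choices made for the example rather than part of the skill.

```python
import re

PREFIXES = ("onetrust-", "truste-", "cookie-banner", "cookie-consent",
            "gdpr-", "consent-", "cmp-")
APPLE_TOKENS = {"ac-gdpr-banner", "as-globalfooter-curtain"}
BUTTON_PATTERNS = [
    re.compile(r"(?i)accept (all )?cookies"),
    re.compile(r"(?i)agree.{0,10}cookies"),
    re.compile(r"(?i)i (accept|agree)"),
]
# Values of id="..." / class="..." attributes (not data-id and similar).
ATTR_RE = re.compile(r"""(?<![\w-])(?:id|class)\s*=\s*["']([^"']*)["']""", re.I)
BUTTON_RE = re.compile(r"<button[^>]*>(.*?)</button>", re.I | re.S)


def cookie_banner_present(html: str) -> bool:
    """Heuristically detect a cookie / consent banner in fetched HTML."""
    for m in ATTR_RE.finditer(html):
        for token in m.group(1).lower().split():
            if token.startswith(PREFIXES) or token in APPLE_TOKENS:
                return True
    for m in BUTTON_RE.finditer(html):
        if any(p.search(m.group(1)) for p in BUTTON_PATTERNS):
            return True
    lowered = html.lower()
    return 'role="dialog"' in lowered and "before you continue" in lowered
```

Because the CSS payload is injected regardless, a false negative here only costs the optional click; the banner is still hidden visually.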

C. Real-DOM element identification — emit concrete selectors:

Generic CSS selectors (`h1`, `[class*="hero"]`, `section h2`) work on semantic / well-marked-up sites but miss obfuscated class names on big-name corporate sites (apple.com uses `tile-headline`, `as-headline-section-title`, not `hero-*`). For each beat, prefer the actual DOM elements observed in the WebFetch:

- Read the rendered HTML/markdown WebFetch returned. Note the page's actual primary `<h1>` text and class.
- Note the page's section structure (h2 headings + their parent containers).
- Note any prominent CTA / signup / pricing element.
- Emit `zoom_target.selector` using the actual class or id observed, falling back to semantic structure (`main > section:nth-of-type(N) h2`) when class names look auto-generated (Tailwind `_1a2b3c`, CSS modules `module__hero___xYz`).

D. SPA / lazy-render detection — bump initial wait:

If the WebFetch response has fewer than 3 visible headings or minimal text content, the page may be SPA-rendered after `domcontentloaded`. Emit a longer initial `wait` action (`{type: "wait", at_s: 0.0, ms: 2500}`) before any beat fires, instead of the default 600ms settle.
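
A rough sketch of the heading-count heuristic, assuming the response is raw HTML. The function name `initial_wait_action` is hypothetical, and the "minimal text content" half of the rule is left out here because the source gives no threshold for it.

```python
import re
from typing import Optional

SPA_SETTLE_MS = 2500  # longer wait from the rule above
HEADING_RE = re.compile(r"<h[1-6][\s>]", re.I)


def initial_wait_action(html: str, min_headings: int = 3) -> Optional[dict]:
    """Return a longer initial wait action for sparse pages, else None."""
    if len(HEADING_RE.findall(html)) < min_headings:
        return {"type": "wait", "at_s": 0.0, "ms": SPA_SETTLE_MS}
    return None  # keep the default 600ms settle
```

Returning `None` rather than a 600ms action keeps the default path untouched, so only pages that look client-rendered pay the extra delay.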

E. `--focus` is honored when supplied (do not solicit):

Without `--focus`, select beats from generic structure cues and proceed silently with a confident first attempt. Do not ask the user "what should I focus on?" before firing; users iterate by re-running with `--focus "the X feature"` if the first pass misses the angle they wanted. With `--focus` supplied, anchor beat selection on the phrase: use concrete page sections that match it and ignore irrelevant marketing chrome.


Step 3.0 — Required README section scan (GitHub mode only, no MCP call)


Before authoring the beat sheet, scan the README (case-insensitive, full-text) for any of these section names. If a match is found, you must add a dedicated beat for that section in Step 3, replacing one of the generic beats 4–5 if necessary:

README contains...	Required beat
`how it works`	scroll_to that heading; zoom `article h2:has(#user-content-how-it-works)`
`audio layer` / `audio timeline`	scroll_to the audio-layer diagram; zoom on the rendered figure or its surrounding heading
`claude code` / `mcp integration`	scroll_to that section; zoom `article pre` or `.highlight` (terminal screenshot / code block)
`architecture` / `system design`	scroll_to that section; zoom `article h2:has(#user-content-architecture)`
`features` (when prominent at top)	scroll_to that heading; zoom `article h2:has(#user-content-features)`
`getting started` / `quick start` / `installation`	scroll_to that heading; zoom `article h2:has(#user-content-installation)` (or the matching slug) — falls back to `article pre` if you want the install code block instead
`usage` / `examples`	scroll_to that heading; zoom `article h2:has(#user-content-usage)` (or the matching slug) — or the first code block under it

GitHub heading slug rule: lowercase, spaces → dashes, strip non-`[a-z0-9-]` characters. So "How it works" → `#user-content-how-it-works`, "Quick Start" → `#user-content-quick-start`. GitHub injects the `<a id="user-content-{slug}">` anchor inside each rendered `<hN>`, so `hN:has(#user-content-{slug})` reliably grabs the heading element across any GitHub README.
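
The slug rule as stated translates directly into a few lines of Python. The helper names `github_slug` and `heading_selector` are assumptions of this sketch, and it implements only the simplified rule described above, not GitHub's full anchor-generation behavior.

```python
import re


def github_slug(heading: str) -> str:
    """Lowercase, spaces to dashes, strip anything outside [a-z0-9-]."""
    slug = heading.strip().lower().replace(" ", "-")
    return re.sub(r"[^a-z0-9-]", "", slug)


def heading_selector(heading: str, level: int = 2) -> str:
    """Build the vanilla-CSS selector for a README heading."""
    return f"article h{level}:has(#user-content-{github_slug(heading)})"
```

For example, `heading_selector("How it works")` yields `article h2:has(#user-content-how-it-works)`, the same selector used in the required-beat table.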

Selector contract: `bbox_selector` needs to be vanilla CSS that resolves via `document.querySelector` (`capture_website` runs the post-action smooth-scroll JS via `page.evaluate`, which uses the browser's native selector engine). Avoid Playwright extensions like `:has-text("...")`, `text=...`, or `:visible`. Those resolve in Playwright's `page.query_selector`, so the bbox capture finds the element, but they silently fail in the smooth-scroll's `document.querySelector`. The page then never scrolls to the target, and `bbox.y` ends up at document-Y instead of `top - 60 px`, which trips Step 8b's `bbox.y > recording_viewport.h` degenerate filter and falls back to default-position zoom. CSS Level 4 `:has(...)` is vanilla and supported in modern Chromium.

These sections are the highest-information visuals in most explainer-worthy repos. Missing them produces a generic walkthrough; including them gives the explainer a concrete "show, don't tell" beat. The original tarball SKILL.md flagged the first four with `SPECIAL` rules in the Gemini prompt; this Step 3.0 promotes them from incidental guidance to a hard requirement and adds three more high-signal headings common in OSS READMEs.


Step 3 — Author beat sheet (main thread, no MCP call)


Write a JSON array of 8–10 beats, with a hard total duration of 65–80 seconds and a hard total word count of 165–200 words (assuming a speaking rate of 2.5 words/sec). Each beat:

jsonc

{
  "t_start": 0.0,
  "t_end": 7.5,
  "action": { "type": "navigate" | "scroll_to" | "hover", "url": "...", "selector": "..." },
  "zoom_target": { "selector": "...", "description": "..." },
  "vo_text": "exact words to speak — 1 to 2 conversational sentences"
}

Hard constraints (validate before emitting the beat sheet — reject the draft if any fails):

- Every beat needs all five fields: `t_start`, `t_end`, `action` (with `type` and `url`), `zoom_target` (with `selector`), `vo_text`. Missing fields ⇒ reject and re-author. (Mirrors tarball's `github_explainer.py:183-190` validation pass.)
- `t_start` of beat 0 = 0.0; `t_end[i] == t_start[i+1]` (continuity).
- `len(vo_text.split()) / 2.5` ≈ `t_end - t_start` per beat. Aim for ±10% of this estimate; if your draft is denser than 2.5 wps, tighten the `vo_text` until it fits.
- `t_end` of the last beat ≤ 80 seconds. (Reference output is 86.5s including intro; lipsync audio is ~83s. Kling avatar/image2video stalls reliably past ~90s of audio under current load, so going over 80s risks a 20-min Kling timeout.)
- Total spoken word count between 165 and 200 words.
- Every beat's `zoom_target.selector` needs to be a valid CSS selector for the page that beat lands on. GitHub mode prefers GitHub-specific selectors: `h1.f1`, `#readme`, `article h2`, `.blob-code-inner`, `.highlight`, `.octicon-star`, `nav`. Generic-URL mode prefers robust generic selectors: `h1`, `[role="main"]`, `main`, `header`, `nav`, `.hero`, `.feature`, `section h2`, `[class*="cta"]`, `[class*="hero"]`, `button`, `a[href]`. Selectors need to resolve on the rendered page after the beat's action settles; verify against the DOM you can see via WebFetch before emitting.
- `vo_text` is 1-2 conversational sentences. Dev voice. No stage directions. No markdown.
- `action.url` is required and must be a valid `https://...` URL when `action.type == "navigate"`.

Self-check before Step 4: verify `total_words` is in `[165, 200]` AND `total_seconds` (= `beats[-1].t_end`) is in `[65, 80]`. If either misses bounds, re-author the beat sheet; do not proceed to TTS. (No need to "print" anywhere: this is an internal draft validation; just reject the draft and re-author until it passes.)
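
The hard constraints and the self-check can be combined into one draft validator. This is a sketch, not the tarball's validation pass: `validate_beats` is a hypothetical name, the ±10% pacing target is treated as a hard limit through the `tolerance` parameter, and `action.url` is only enforced for `navigate` beats.

```python
REQUIRED = ("t_start", "t_end", "action", "zoom_target", "vo_text")
WORDS_PER_SEC = 2.5


def validate_beats(beats, tolerance=0.10):
    """Return a list of violations; an empty list means the draft passes."""
    if not beats:
        return ["no beats"]
    errors = []
    for i, beat in enumerate(beats):
        missing = [k for k in REQUIRED if k not in beat]
        if missing:
            errors.append(f"beat {i}: missing {missing}")
    if errors:
        return errors  # timing checks need complete beats
    if not 8 <= len(beats) <= 10:
        errors.append(f"expected 8-10 beats, got {len(beats)}")
    if beats[0]["t_start"] != 0.0:
        errors.append("beat 0 must start at 0.0")
    total_words = 0
    for i, beat in enumerate(beats):
        action = beat["action"]
        if "type" not in action:
            errors.append(f"beat {i}: action.type missing")
        url = str(action.get("url", ""))
        if action.get("type") == "navigate" and not url.startswith("https://"):
            errors.append(f"beat {i}: navigate needs an https:// url")
        if not beat["zoom_target"].get("selector"):
            errors.append(f"beat {i}: zoom_target.selector missing")
        if i + 1 < len(beats) and beat["t_end"] != beats[i + 1]["t_start"]:
            errors.append(f"beat {i}: gap or overlap before beat {i + 1}")
        words = len(beat["vo_text"].split())
        total_words += words
        estimate = words / WORDS_PER_SEC
        duration = beat["t_end"] - beat["t_start"]
        if abs(duration - estimate) > tolerance * estimate:
            errors.append(f"beat {i}: pacing {duration:.1f}s vs ~{estimate:.1f}s spoken")
    if not 165 <= total_words <= 200:
        errors.append(f"total word count {total_words} outside 165-200")
    if not 65 <= beats[-1]["t_end"] <= 80:
        errors.append(f"total duration {beats[-1]['t_end']}s outside 65-80")
    return errors
```

A non-empty result means the draft is rejected and re-authored; nothing needs to be shown to the user.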

Structural skeleton — GitHub mode (load-bearing for the visual contract — match origin, but Step 3.0 overrides if applicable):

- Beat 1: `navigate` repo root, zoom `h1.f1` (repo title), hook sentence.
- Beats 2–3: `navigate` to specific source files (`https://github.com/{owner}/{repo}/blob/HEAD/<path>`), zoom `.blob-code-inner` or `.highlight`. Pick files that match the narration's claim; don't navigate to a file you won't talk about.
- Beats 4–5: `scroll_to` README sections, zoom `article h2` or `#readme`. If Step 3.0 surfaced required sections, replace these slots with the required ones.
- Beats 6–7 (only if `live_url` survived Step 2.5): `navigate` to `live_url`, zoom `nav` / `h1` / `.hero` / `main` / `button` / `.feature`.
- Beat 8: back to repo root, zoom `.octicon-star`, outro.

Structural skeleton — Generic-URL mode:

- Beat 1: `navigate` to the input URL, zoom `h1` or `[class*="hero"] h1` (the page's primary headline), hook sentence.
- Beats 2–3: `scroll_to` the page's hero / value-prop / first feature section. Zoom `.hero`, `[class*="hero"]`, `[class*="feature"]`, or `section:nth-of-type(1) h2`. Pick visible elements the narration references.
- Beats 4–5: `scroll_to` deeper sections: feature lists, screenshots, pricing, social proof. Zoom `section h2`, `[class*="feature"] img`, `[class*="testimonial"]`, `[class*="pricing"]`, or any prominent semantic element on the page.
- Beats 6–7: `scroll_to` CTA / signup / demo embed. Zoom `[class*="cta"]`, `button`, `a[class*="button"]`, or `[id*="signup"]`. (No live-demo navigation in generic mode; the input URL IS the demo.)
- Beat 8: `scroll_to` footer / closing element, zoom `footer h2`, `footer`, or back to top with `h1`. Outro sentence.

If `--focus` is supplied, weave its angles into `vo_text` without mutating the structural skeleton. Prefer CSS selectors over `text_content` in `zoom_target.selector`, because bbox capture is selector-only (see Known gaps).

编写包含8-10个节拍的JSON数组，总时长严格控制在65-80秒，总单词数严格控制在165-200词（假设语速为2.5词/秒）。每个节拍格式如下：

jsonc

{
  "t_start": 0.0,
  "t_end": 7.5,
  "action": { "type": "navigate" | "scroll_to" | "hover", "url": "...", "selector": "..." },
  "zoom_target": { "selector": "...", "description": "..." },
  "vo_text": "要朗读的精确内容——1-2句口语化句子"
}

硬性约束（输出节拍表前验证——若违反则拒绝草稿并重写）：

每个节拍必须包含所有五个字段：
```
t_start
```
、
```
t_end
```
、
```
action
```
（含
```
type
```
和
```
url
```
）、
```
zoom_target
```
（含
```
selector
```
）、
```
vo_text
```
。字段缺失→拒绝并重写。（与tarball文件
```
github_explainer.py:183-190
```
的验证逻辑一致。）
节拍0的
```
t_start
```
=0.0；
```
t_end[i] == t_start[i+1]
```
（连续性）。
```
len(vo_text.split()) / 2.5
```
≈
```
t_end - t_start
```
（每个节拍）。目标为±10%的误差；如果草稿密度超过2.5词/秒，则精简
```
vo_text
```
直到符合要求。
最后一个节拍的
t_end
≤80秒。（参考输出包括 intro 为86.5秒；唇形同步音频约83秒。在当前负载下，Kling头像/image2video在音频时长超过约90秒时会可靠地停滞——超过80秒会导致Kling超时20分钟。）
总朗读单词数在165-200词之间。
每个节拍的
```
zoom_target.selector
```
必须是该节拍所在页面的有效CSS选择器。GitHub模式优先使用GitHub特定选择器：
```
h1.f1
```
、
```
#readme
```
、
```
article h2
```
、
```
.blob-code-inner
```
、
```
.highlight
```
、
```
.octicon-star
```
、
```
nav
```
。通用URL模式优先使用健壮的通用选择器：
```
h1
```
、
```
[role="main"]
```
、
```
main
```
、
```
header
```
、
```
nav
```
、
```
.hero
```
、
```
.feature
```
、
```
section h2
```
、
```
[class*="cta"]
```
、
```
[class*="hero"]
```
、
```
button
```
、
```
a[href]
```
。选择器必须在节拍操作完成后的渲染页面上可解析——输出前需通过WebFetch查看的DOM进行验证。
```
vo_text
```
为1-2句口语化句子。使用开发者语气。无舞台提示。无Markdown格式。

当

action.type == "navigate"

时，

action.url

必须是有效的

https://...

URL；必填。

步骤4前的自检： 验证

total_words

在

[165, 200]

范围内且

total_seconds

（=

beats[-1].t_end

）在

[65, 80]

范围内。如果任一条件不满足，则重写节拍表——不要进入TTS步骤。（无需“打印”任何内容——这是内部草稿验证；只需拒绝草稿并重写直到通过。）

结构框架——GitHub模式（视觉约定的核心——与原始逻辑一致，但步骤3.0适用时覆盖）：

节拍1：
```
navigate
```
到仓库根目录，缩放
```
h1.f1
```
（仓库标题），开场句子。
节拍2-3：
```
navigate
```
到特定源文件（
```
https://github.com/{owner}/{repo}/blob/HEAD/<path>
```
），缩放
```
.blob-code-inner
```
或
```
.highlight
```
。选择与旁白内容匹配的文件——不要导航到未提及的文件。
节拍4-5：
```
scroll_to
```
到README章节，缩放
```
article h2
```
或
```
#readme
```
。如果步骤3.0发现必填章节，则用必填章节替换这些位置。
节拍6-7（仅当
live_url
通过步骤2.5验证时）：
```
navigate
```
到
```
live_url
```
，缩放
```
nav
```
/
```
h1
```
/
```
.hero
```
/
```
main
```
/
```
button
```
/
```
.feature
```
。
节拍8： 返回仓库根目录，缩放
```
.octicon-star
```
，结尾句子。

结构框架——通用URL模式：

节拍1：
```
navigate
```
到输入URL，缩放
```
h1
```
或
```
[class*="hero"] h1
```
（页面主标题），开场句子。
节拍2-3：
```
scroll_to
```
到页面的hero/价值主张/第一个功能区域。缩放
```
.hero
```
、
```
[class*="hero"]
```
、
```
[class*="feature"]
```
或
```
section:nth-of-type(1) h2
```
。选择旁白提及的可见元素。
节拍4-5：
```
scroll_to
```
到更深层区域——功能列表、截图、定价、社交证明。缩放
```
section h2
```
、
```
[class*="feature"] img
```
、
```
[class*="testimonial"]
```
、
```
[class*="pricing"]
```
或页面上任何突出的语义元素。
节拍6-7：
```
scroll_to
```
到CTA/注册/演示嵌入区域。缩放
```
[class*="cta"]
```
、
```
button
```
、
```
a[class*="button"]
```
或
```
[id*="signup"]
```
。（通用模式无实时演示导航——输入URL即为演示页面。）
节拍8：
```
scroll_to
```
到页脚/结尾元素，缩放
```
footer h2
```
、
```
footer
```
或返回顶部的
```
h1
```
。结尾句子。

如果提供了

--focus

，则将其方向融入

vo_text

，但不改变结构框架。在

zoom_target.selector

中优先使用CSS选择器而非
text_content
——bbox捕获仅支持选择器（参见已知缺陷）。

Step 4 — TTS

步骤4——文本转语音（TTS）

Call `mcp__pika__generate_speech` with `provider: "minimax-tts"`, `text: <full vo_text join>`, and an optional `voice_id`. Capture `result.audio_url` (the dispatcher returns audio under `audio_url`, not `url`) and `result.duration_seconds`. Voice defaults to identity-store injection in plugin mode.

Stale-voice fallback detection (AGNT-231): the dispatcher retries once with the default

Calm_Woman

voice on Minimax

status_code:2054

(voice id not found — typically a per-agent workspace pointer that Minimax auto-deleted after 7 days of inactivity). On retry success the response carries two extra fields beyond the documented schema (passthrough):

voice_id_requested

(the planted-but-stale id the worker tried first) and

fallback_reason: "invalid_minimax_voice_id"

. If you see
fallback_reason == "invalid_minimax_voice_id"
in the response, surface a one-line note to the user along the lines of: "your registered voice expired on Minimax (auto-GC'd after 7 days of inactivity); we used the system default. Re-clone via

clone_voice

if you want personalization back." The render does NOT fail — it just uses the default voice — so this is informational, not a retry trigger.

Cookie-banner audio padding (Generic-URL mode with
cookie_banner_present == true
from Step 2.6 §B): prepend MiniMax's pause marker

<#1.5#>

to the

text:

argument before calling

generate_speech

. MiniMax's

speech-2.8-hd

honors

<#N#>

as N-second silence; the returned

audio_url

and

duration_seconds

include the 1.5s lead-in natively. This aligns the audio with the screen recording's cookie-dismissal +1.5s offset applied in Step 4.5.

Fallback (only if smoke-test shows the marker is ignored on this voice): call `generate_speech` normally, then `mcp__pika__edit_audio_mix` to overlay the result onto a 1.5s silent base at offset 1.5s. Then call `mcp__pika__analyze_media(url=<padded_audio_url>)` to probe the padded duration and rebind `duration_seconds = result.duration_seconds` before Step 4.5 consumes it.

`analyze_media` is the single authoritative duration probe. Do not rely on the return payload of `edit_audio_mix` (its duration field is not contractually guaranteed).
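A minimal sketch of the text assembly and the stale-voice check described in this step. The helper names are hypothetical, joining beats with a single space is an assumption (the step only says "join"), and the response is assumed to arrive as a plain dict.

```python
COOKIE_PAD_SECONDS = 1.5


def build_tts_text(beats, cookie_banner_present):
    """Join every beat's vo_text and prepend the pause marker when needed."""
    # Assumption: beats are joined with one space between them.
    narration = " ".join(beat["vo_text"].strip() for beat in beats)
    if cookie_banner_present:
        # <#N#> is the pause marker described above (N seconds of silence).
        return f"<#{COOKIE_PAD_SECONDS}#>{narration}"
    return narration


def fallback_note(result):
    """Return the one-line user note if the stale-voice fallback fired, else None."""
    if result.get("fallback_reason") == "invalid_minimax_voice_id":
        return ("Your registered voice expired on Minimax; the system default "
                "was used. Re-clone via clone_voice to restore personalization.")
    return None
```

The note is informational only; nothing in this sketch retries or aborts the render.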

调用

mcp__pika__generate_speech

，参数为

provider: "minimax-tts"

、

text: <拼接后的完整vo_text>

、可选

voice_id

。捕获

result.audio_url

（调度器返回的音频地址为

audio_url

，而非

url

）和

result.duration_seconds

。插件模式下语音默认使用身份存储注入的值。

过期语音 fallback 检测（AGNT-231）： 当Minimax返回

status_code:2054

（语音ID不存在——通常是代理工作区指针，Minimax在7天未使用后自动删除）时，调度器会自动重试一次，使用默认的

Calm_Woman

语音。重试成功后，响应会包含文档 schema 之外的两个额外字段（透传）：

voice_id_requested

（工作器首次尝试的已过期语音ID）和

fallback_reason: "invalid_minimax_voice_id"

。如果响应中
fallback_reason == "invalid_minimax_voice_id"
，则向用户显示一行提示：“你注册的语音在Minimax上已过期（7天未使用后自动清理）；我们使用了系统默认语音。若需恢复个性化语音，请通过

clone_voice

重新克隆。”渲染不会失败——只是使用默认语音——因此这是信息提示，而非重试触发条件。

Cookie弹窗音频填充（通用URL模式且步骤2.6§B中
cookie_banner_present == true
）：在调用

generate_speech

前，将MiniMax的暂停标记

<#1.5#>

添加到

text:

参数前。MiniMax的

speech-2.8-hd

支持

<#N#>

作为N秒静音；返回的

audio_url

和

duration_seconds

会原生包含1.5秒的前置静音。这会使音频与步骤4.5中应用的屏幕录制cookie关闭+1.5秒偏移对齐。

Fallback方案（仅当冒烟测试显示该标记被当前语音忽略时使用）：正常调用

generate_speech

，然后调用

mcp__pika__edit_audio_mix

将结果叠加到1.5秒的静音基础上，偏移量为1.5秒。然后调用
mcp__pika__analyze_media(url=<填充后的音频URL>)
探测填充后的时长，并重新绑定
duration_seconds = result.duration_seconds
，供步骤4.5使用。

analyze_media

是唯一权威的时长探测工具——不要依赖

edit_audio_mix

的返回 payload（其时长字段无契约保证）。

Step 4.5 — Audio length verification + beat-sheet rescale

步骤4.5——音频长度验证 + 节拍表缩放

Applied to `audio_duration_seconds` post Step 4 (which includes any cookie lead-in pad). End state: `beats[].t_start` / `t_end` are absolute wall-clock seconds matching the audio playback timeline. All `beats[]` mutations happen here; Steps 6 and 8 are read-only consumers.

Gate 1 — Kling stall ceiling (provider cap, raw audio_duration_seconds): If

audio_duration_seconds > 90

, abort and re-author the beat sheet with a tighter word budget. Kling avatar/image2video stalls past ~90s.

Gate 2 — Degenerate TTS (spoken-content length): Compute

narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)

. If

narration_duration < 30s

, retry Step 4 once (and recompute

narration_duration

from the retry's audio). If the retry also returns

narration_duration < 30s

, abort and investigate — likely failure modes: truncated MiniMax response, silent audio, vo_text not joined correctly.

Gate 3 — Rescale:

narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)

scale = narration_duration / beats[-1].t_end

If
```
scale < 0.5
```
or
```
scale > 1.5
```
, abort and re-author. Structurally broken TTS (or wildly off word budget); rescaling won't save it.

For each beat: `beat.t_start *= scale; beat.t_end *= scale`

If `cookie_banner_present`: for each beat, `beat.t_start += 1.5; beat.t_end += 1.5`

Final clamp:
```
beats[-1].t_end = audio_duration_seconds
```
(exact). Guarantees float equality of the invariant regardless of cookie mode or accumulated float drift.

After Gate 3 passes, emit a one-line operator log to surface the scale value for post-run diagnosis:

Rescaled beats by scale=X.XX (audio=Y.YYs, narration_duration=Z.ZZs, cookie_pad=W.Ws)

Advisory (not a gate): scale near 1.0 is ideal.

scale > 1.2

means audio is meaningfully slower than predicted — visuals feel "stretched" but stay in-sync.

scale < 0.85

means audio is faster — visuals feel "rushed" but in-sync. Both pass the gates; if the user reports "feels off-pace" rather than "out of sync," re-author with a tighter / looser word budget.
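The three gates and the rescale can be sketched in one function. This is an illustration under stated assumptions: the exception names `ReauthorBeatSheet` and `RetryTTS` are hypothetical, beats are plain dicts, and the caller is responsible for allowing only one Step 4 retry.

```python
COOKIE_PAD = 1.5


class ReauthorBeatSheet(Exception):
    """Signals that the beat sheet must be rewritten (hypothetical name)."""


class RetryTTS(Exception):
    """Signals that Step 4 should be retried (hypothetical name)."""


def rescale_beats(beats, audio_duration_seconds, cookie_banner_present):
    """Apply Gates 1-3, rescale beats in place, and return the scale used."""
    # Gate 1: provider cap on the raw audio length.
    if audio_duration_seconds > 90:
        raise ReauthorBeatSheet("audio exceeds the ~90s Kling stall ceiling")
    pad = COOKIE_PAD if cookie_banner_present else 0.0
    narration_duration = audio_duration_seconds - pad
    # Gate 2: degenerate TTS output.
    if narration_duration < 30:
        raise RetryTTS("narration shorter than 30s")
    # Gate 3: map the drafted timeline onto the real audio timeline.
    scale = narration_duration / beats[-1]["t_end"]
    if scale < 0.5 or scale > 1.5:
        raise ReauthorBeatSheet(f"scale={scale:.2f} outside [0.5, 1.5]")
    for beat in beats:
        beat["t_start"] = beat["t_start"] * scale + pad
        beat["t_end"] = beat["t_end"] * scale + pad
    # Final clamp removes any accumulated float drift.
    beats[-1]["t_end"] = audio_duration_seconds
    print(f"Rescaled beats by scale={scale:.2f} "
          f"(audio={audio_duration_seconds:.2f}s, "
          f"narration_duration={narration_duration:.2f}s, cookie_pad={pad:.1f}s)")
    return scale
```

Because the pad is added after scaling, the silent lead-in stays exactly 1.5s regardless of how much the narration stretched or shrank.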

应用于步骤4后的

audio_duration_seconds

（包含任何cookie前置填充）。最终状态：

beats[].t_start

t_end

为与音频播放时间线匹配的绝对挂钟秒数。所有
beats[]
修改都在此步骤进行；步骤6和8为只读消费。

关卡1——Kling停滞上限（服务商限制，原始audio_duration_seconds）： 如果

audio_duration_seconds > 90

，则中止并重写节拍表，减少单词数量。Kling头像/image2video在超过约90秒时会停滞。

关卡2——退化TTS（朗读内容长度）： 计算

narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)

。如果

narration_duration < 30s

，则重试步骤4一次（并从重试的音频中重新计算

narration_duration

）。如果重试后

narration_duration

仍<30s，则中止并排查问题——可能的失败模式：MiniMax响应截断、静音音频、vo_text拼接错误。

关卡3——缩放：

narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)

scale = narration_duration / beats[-1].t_end

如果
```
scale < 0.5
```
或
```
scale > 1.5
```
，则中止并重写。TTS结构损坏（或单词数量严重偏离预算）；缩放无法修复。

对每个节拍：

beat.t_start *= scale; beat.t_end *= scale

如果

cookie_banner_present

：对每个节拍，

beat.t_start += 1.5; beat.t_end += 1.5

最终钳位：
```
beats[-1].t_end = audio_duration_seconds
```
（精确值）。保证无论cookie模式如何或累积浮点漂移，不变量的浮点相等性。

关卡3通过后，输出一行操作日志以显示缩放值，供运行后诊断：

Rescaled beats by scale=X.XX (audio=Y.YYs, narration_duration=Z.ZZs, cookie_pad=W.Ws)

建议（非强制关卡）： scale接近1.0为理想状态。

scale > 1.2

意味着音频比预测慢很多——视觉效果会“拉伸”但保持同步。

scale < 0.85

意味着音频比预测快——视觉效果会“仓促”但保持同步。两种情况都通过关卡；如果用户反馈“节奏不对”而非“不同步”，则重写节拍表调整单词数量。

Step 5 — Preview gate (opt-in via

--preview

)

步骤5——预览环节（通过

--preview

主动开启）

Skip Step 5 entirely by default. Proceed directly to Step 6 unless the user explicitly passed

--preview

— do not generate a preview, do not ask for confirmation. This matches industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return; account credit balance + provider failover are the canonical guardrails.

--skip-preview

and

--yes

are accepted as no-ops for backward compatibility — they were the old opt-out flags.

If `--preview` was supplied:

Call `mcp__pika__generate_speech` with `text: "Hi, I'm your presenter. Let's explore this repo together."` → `preview_audio_url`

```
mcp__pika__generate_lipsync
```
with
```
provider: <resolved_lipsync_provider>
```
(defaults to
```
pika
```
; honor
```
--lipsync-provider kling
```
if supplied),
```
image: <avatar>
```
,
```
audio: preview_audio_url
```
→
```
preview_lipsync_url
```
(bare lipsync, ~3s). Use the same provider here as Step 9 will use for the full audio — the preview's job is to confirm the avatar+voice+provider combo before the long-pole render.
Present to the user verbatim:
Preview ready:
```
<preview_lipsync_url>
```
This confirms the avatar + voice combo. The full render is a long pole (~5–30 min Kling lipsync on the full audio). Reply
```
yes
```
to proceed, or anything else to cancel.
Match
```
^(yes|go|proceed|confirm|y)$
```
(case-insensitive). Anything else → STOP, no further MCP calls.
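The confirmation match is small enough to show directly. A minimal sketch, assuming the reply arrives as a string; stripping surrounding whitespace before matching is an assumption and not part of the pattern above.

```python
import re

CONFIRM_PATTERN = re.compile(r"^(yes|go|proceed|confirm|y)$", re.IGNORECASE)


def is_confirmed(reply):
    """True only when the whole reply is one of the accepted words."""
    # Assumption: leading/trailing whitespace is ignored.
    return CONFIRM_PATTERN.match(reply.strip()) is not None
```

Anything that returns `False` here means STOP, with no further MCP calls.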

默认完全跳过步骤5。直接进入步骤6，除非用户显式传递

--preview

——不生成预览，不请求确认。这符合媒体生成工具的行业标准（如Midjourney / Sora / Runway / HeyGen / Pika.art）：提交→渲染→返回；账户余额 + 服务商故障切换是核心保障机制。

--skip-preview

和

--yes

作为向后兼容的无操作参数被接受——它们是旧版的退出标志。

如果提供了

--preview

：

调用

mcp__pika__generate_speech

，参数

text: "Hi, I'm your presenter. Let's explore this repo together."

→ 获取

preview_audio_url

。

调用
```
mcp__pika__generate_lipsync
```
，参数
```
provider: <解析后的lipsync_provider>
```
（默认
```
pika
```
；如果提供了
```
--lipsync-provider kling
```
则优先使用）、
```
image: <avatar>
```
、
```
audio: preview_audio_url
```
→ 获取
```
preview_lipsync_url
```
（纯唇形同步，约3秒）。此处使用与步骤9完整音频相同的服务商——预览的作用是在耗时较长的渲染前确认头像+语音+服务商组合。
向用户原样显示：
预览已准备好：
```
<preview_lipsync_url>
```
这确认了头像+语音组合。完整渲染耗时较长（Kling唇形同步约5-30分钟）。回复
```
yes
```
继续，回复其他内容则取消。
匹配
```
^(yes|go|proceed|confirm|y)$
```
（不区分大小写）。回复其他内容→停止，不再调用MCP。

Step 6 — Build

timed_actions

and record

步骤6——构建

timed_actions

并录制

Translate the beat sheet into

capture_website

timed_actions

. One
timed_action
per beat — set

bbox_selector

to the beat's

zoom_target.selector

and

capture_website

captures the post-action bbox of that element internally (legacy 600 ms settle → smooth-scroll-to-

top - 60 px

→ 1300 ms post-anim → measure, all server-side).

For each beat in order, emit one entry:

navigate
beats:
```
{type: "navigate", at_s: <t_start>, url: <action.url>, bbox_selector: <zoom_target.selector>}
```
. The worker navigates, waits to absolute
```
at_s + 0.6 s
```
, scrolls
```
bbox_selector
```
into view, and measures the bbox — all without the caller scheduling a follow-up step.
scroll_to
/
hover
beats:
```
{type: "scroll", at_s: <t_start>, selector: <action.selector or zoom_target.selector>, bbox_selector: <zoom_target.selector>}
```
. The action's own
```
selector
```
drives the page scroll;
```
bbox_selector
```
drives the bbox measurement (it can be the same selector or different — usually the same). (
```
capture_website
```
has no
```
hover
```
; scroll-into-view is the analog.)

Do NOT prepend the eight-step intro scroll-through that the tarball ran. The lipsync audio is timed from

t=0

of the beat sheet; a prepended intro shifts the screen recording forward by ~3 s while leaving the audio un-shifted, causing audio/video desync. The capture_website recording begins at

t=0

with beat 0's URL already loaded — that's the orientation the tarball's intro scroll provided, minus the desync.

Call

mcp__pika__capture_website

```
url: <beat 0's action.url>
```

timed_actions: <the N-element list built above>

(one entry per beat)

```
duration_s: ceil(audio_duration_seconds)
```
—
```
beats[].t_start
```
and
```
t_end
```
have already been rescaled to the TTS audio timeline by Step 4.5, so
```
duration_s
```
is simply the audio length. The old
```
max(...)
```
defense against TTS overrun is no longer needed.

Generic-URL mode additions (per Step 2.6 pre-flight):

```
extra_css: <the cookie-banner-hiding CSS payload from Step 2.6 §B>
```
— defensive: hides common consent platforms via
```
display: none !important;
```
so even if the optional click misses, the banner is invisible in the recording.
Prepend a
wait
action
```
{type: "wait", at_s: 0.0, ms: 2500}
```
for SPA / lazy-render pages (per Step 2.6 §D); use 1500ms for "normal" pages. This gives time for hero images to lazy-load, fonts to swap, and scroll-triggered animations to be ready before the first beat fires.
If
cookie_banner_present
from Step 2.6 §B, also prepend a
```
click
```
action
```
{type: "click", at_s: 0.5, selector: <detected dismissal selector from WebFetch DOM>}
```
. The
beats[]
array has already been shifted by
+1.5s
in Step 4.5 to account for the cookie-dismissal lag, and the TTS audio has already been padded with 1.5s of silence (Step 4); no further shifting is required here. Beat 1's
timed_action.at_s
reads
beats[0].t_start
directly, which is 1.5 in cookie mode.
No cookie banner action needed if
```
cookie_banner_present == false
```
; just the prepended wait action.

Capture `video_url`, `recording_viewport`, and `action_bboxes`. The result returns `recording_viewport: {w, h}` and `action_bboxes: [{idx, selector, found, bbox: {x,y,w,h}}]` alongside `video_url`.

`action_bboxes[].idx` semantics: the `idx` field is the position in the input `timed_actions` array.

GitHub mode: with one timed_action per beat,
```
idx
```
maps 1:1 to beat index — Step 8 uses
```
entry.idx
```
directly as
```
beat_idx
```
.
Generic-URL mode: the prepended
```
wait
```
(and optional cookie-dismissal
```
click
```
) shift the array by 1 or 2. Compute
```
beat_idx = entry.idx - prepend_count
```
where
```
prepend_count
```
is 1 (wait only) or 2 (wait + click). Skip entries where
```
beat_idx < 0
```
(those are the prepended setup actions, not beats).

The `selector` field on each entry reports `bbox_selector` (i.e. `zoom_target.selector`), not the action's own `selector`.
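A sketch of the translation from beats to `timed_actions`, including the index mapping back from `action_bboxes`. The function names and the `generic_mode`, `cookie_click_selector`, and `settle_ms` parameters are hypothetical; the dict shapes follow the entries described in this step.

```python
def build_timed_actions(beats, generic_mode=False, cookie_click_selector=None,
                        settle_ms=1500):
    """Return (timed_actions, prepend_count) for capture_website."""
    actions = []
    if generic_mode:
        # Settle wait: 2500 for SPA / lazy-render pages, 1500 otherwise.
        actions.append({"type": "wait", "at_s": 0.0, "ms": settle_ms})
        if cookie_click_selector:
            actions.append({"type": "click", "at_s": 0.5,
                            "selector": cookie_click_selector})
    prepend_count = len(actions)
    for beat in beats:
        zoom_selector = beat["zoom_target"]["selector"]
        action = beat["action"]
        if action["type"] == "navigate":
            actions.append({"type": "navigate", "at_s": beat["t_start"],
                            "url": action["url"],
                            "bbox_selector": zoom_selector})
        else:
            # scroll_to and hover both become a scroll; there is no hover action.
            actions.append({"type": "scroll", "at_s": beat["t_start"],
                            "selector": action.get("selector") or zoom_selector,
                            "bbox_selector": zoom_selector})
    return actions, prepend_count


def beat_index(entry_idx, prepend_count):
    """Map an action_bboxes idx to a beat index; None for setup actions."""
    idx = entry_idx - prepend_count
    return idx if idx >= 0 else None
```

In GitHub mode `prepend_count` is 0, so `beat_index` reduces to the 1:1 mapping.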

将节拍表转换为

capture_website

的

timed_actions

。每个节拍对应一个
timed_action
——将

bbox_selector

设为节拍的

zoom_target.selector

，

capture_website

会在内部捕获操作后的元素bbox（传统流程：600ms等待→平滑滚动到

top - 60 px

→1300ms动画后→测量，均在服务器端完成）。

按顺序为每个节拍输出一个条目：

navigate
节拍：
```
{type: "navigate", at_s: <t_start>, url: <action.url>, bbox_selector: <zoom_target.selector>}
```
。工作器会导航到目标URL，等待到绝对时间
```
at_s + 0.6 s
```
，滚动
```
bbox_selector
```
到视图中，并测量bbox——无需调用者调度后续步骤。
scroll_to
/
hover
节拍：
```
{type: "scroll", at_s: <t_start>, selector: <action.selector或zoom_target.selector>, bbox_selector: <zoom_target.selector>}
```
。操作自身的
```
selector
```
驱动页面滚动；
```
bbox_selector
```
驱动bbox测量（可以是相同或不同的选择器——通常相同）。（
```
capture_website
```
无
```
hover
```
操作；滚动到视图中是等效操作。）

不要添加tarball文件中使用的八步intro滚动流程。唇形同步音频从节拍表的

t=0

开始计时；添加intro会使屏幕录制向前偏移约3秒，而音频保持不变，导致音视频不同步。

capture_website

录制从

t=0

开始，此时节拍0的URL已加载——这正是tarball文件intro滚动提供的初始状态，但不会导致不同步。

调用

mcp__pika__capture_website

：

```
url: <节拍0的action.url>
```

timed_actions: <上面构建的N元素列表>

（每个节拍对应一个条目）

```
duration_s: ceil(audio_duration_seconds)
```
——步骤4.5已将
```
beats[].t_start
```
和
```
t_end
```
缩放到TTS音频时间线，因此
```
duration_s
```
即为音频长度。不再需要旧版的
```
max(...)
```
防御TTS超时。

通用URL模式额外操作（根据步骤2.6预检查）：

```
extra_css: <步骤2.6§B中的cookie弹窗隐藏CSS payload>
```
——防御性措施：通过
```
display: none !important;
```
隐藏常见同意平台，即使可选点击未命中，弹窗在录制中也会不可见。
添加前置
wait
操作：对于SPA/懒加载页面（步骤2.6§D），添加
```
{type: "wait", at_s: 0.0, ms: 2500}
```
；“正常”页面使用1500ms。这为hero图片懒加载、字体替换、滚动触发动画提供时间，确保第一个节拍触发时已准备就绪。
如果步骤2.6§B中
cookie_banner_present == true
，还需添加前置
```
click
```
操作
```
{type: "click", at_s: 0.5, selector: <从WebFetch DOM中检测到的关闭选择器>}
```
。步骤4.5已将
beats[]
数组偏移
+1.5s
以适应cookie关闭延迟，且TTS音频已填充1.5秒静音（步骤4）；此处无需进一步偏移。节拍1的
timed_action.at_s
直接读取
beats[0].t_start
，在cookie模式下为1.5秒。
如果
cookie_banner_present == false
，无需cookie弹窗操作；仅添加前置wait操作。

捕获

video_url

、

recording_viewport

、

action_bboxes

。结果返回

recording_viewport: {w, h}

和

action_bboxes: [{idx, selector, found, bbox: {x,y,w,h}}]

以及

video_url

。

action_bboxes[].idx
语义：

idx

字段对应输入

timed_actions

数组中的位置。

GitHub模式： 每个节拍对应一个timed_action，
```
idx
```
与节拍索引1:1映射——步骤8直接使用
```
entry.idx
```
作为
```
beat_idx
```
。
通用URL模式： 前置的
```
wait
```
（和可选的cookie关闭
```
click
```
）会使数组偏移1或2位。计算
```
beat_idx = entry.idx - prepend_count
```
，其中
```
prepend_count
```
为1（仅wait）或2（wait + click）。跳过
```
beat_idx < 0
```
的条目（这些是前置设置操作，非节拍）。

每个条目的

selector

字段报告的是

bbox_selector

（即

zoom_target.selector

），而非操作自身的

selector

。

Step 7 — Browser chrome

步骤7——浏览器框架

Call `mcp__pika__edit_browser_frame`:

```
video_url: <Step 6 video_url>
```

url: (live_url if GitHub-mode and survived Step 2.5 else input_url, truncated to 65 chars)

```
tab_title: <30-char title>
```
— GitHub mode:
```
(meta.description or repo_name or "")[:30]
```
. Generic-URL mode: the page's
```
<title>
```
(from WebFetch in Step 2) or the URL's hostname, truncated to 30 chars. Guard against
```
None
```
/empty.

Returns

framed_url

(1280×800 Sonoma + chrome).
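The truncation rules for the address bar and tab title can be sketched as follows. The helper name `browser_frame_labels` is hypothetical; in GitHub mode the caller would pass `meta.description` or the repo name as `page_title`, and `live_url` only when it survived Step 2.5.

```python
from urllib.parse import urlparse


def browser_frame_labels(input_url, page_title=None, live_url=None):
    """Return (url, tab_title) truncated to 65 and 30 characters."""
    shown_url = (live_url or input_url)[:65]
    # Guard against None/empty titles by falling back to the hostname.
    hostname = urlparse(input_url).hostname or ""
    tab_title = (page_title or hostname)[:30]
    return shown_url, tab_title
```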

调用

mcp__pika__edit_browser_frame

：

```
video_url: <步骤6的video_url>
```

url: (如果是GitHub模式且通过步骤2.5验证则为live_url，否则为input_url，截断为65字符)

```
tab_title: <30字符标题>
```
——GitHub模式：
```
(meta.description或repo_name或"")[:30]
```
。通用URL模式：页面的
```
<title>
```
（步骤2中WebFetch获取）或URL主机名，截断为30字符。处理
```
None
```
/空值情况。

framed_url

（1280×800 Sonoma框架+浏览器控件）。

Step 8 — Build

zoom_keyframes

and apply

步骤8——构建

zoom_keyframes

并应用

Constants:

```
INTRO_BEATS = 2
```
— gates by beat-sheet index. Skips zoom on beat indices 0 and 1 ("Beat 1" and "Beat 2" in the structural skeleton above).
```
HOLD_GAP = 0.6
```
— seconds of 1.0× before each zoom-in and after each zoom-out.
```
MIN_BEAT_DUR = 1.5
```
— beats shorter than this are skipped (no room for a meaningful zoom).
```
SCALE = 1.35
```
(precise element-targeted zoom).
```
FALLBACK_SCALE = 1.25
```
(default-position fallback when no usable bbox).
```
FALLBACK_RAMP = 0.4
```
.

Note: `beats[].t_start` / `t_end` were rescaled (and cookie-shifted if applicable) to the audio timeline by Step 4.5. HOLD_GAP (0.6s), MIN_BEAT_DUR (1.5s), and the 1.0s interior-interval check all operate on those final values; they are real visual seconds on the rendered video.

`edit_browser_frame`'s inner-content offsets: `CONTENT_X=56, CONTENT_Y=108, CONTENT_W=1168, CONTENT_H=637` (verified against the worker's `edit_browser_frame/main.py`).

Coord transform (recording px → framed px):

cx_framed = 56  + (bbox.x + bbox.w/2) * (1168 / recording_viewport.w)
cy_framed = 108 + (bbox.y + bbox.h/2) * (637  / recording_viewport.h)

Build the zoom list with a per-beat default + bbox override pattern. The legacy rig followed an "every non-intro beat gets a zoom — bbox-derived if available, default-position otherwise" rule. Reproduce that here:

Step 8a — Pre-fill default-position keyframes for every non-intro, long-enough beat.

Constants for the default position:

```
DEFAULT_CX = 56 + 1168 // 2
```
(screen center of the framed canvas)
```
DEFAULT_CY = 108 + 637 // 3
```
(upper-third of the content area, where most GitHub UI prominence lives)

Walk the beat sheet from index

INTRO_BEATS

(= 2) to the end. For each beat:

If
```
t_end - t_start < MIN_BEAT_DUR
```
(1.5s), skip — too short for a meaningful zoom.
Compute the keyframe's interior interval as
```
[t_start + HOLD_GAP, t_end - HOLD_GAP]
```
. If that interval is shorter than 1.0s, skip.

Otherwise pre-fill that beat's slot in a per-beat map (call it `zoom_keyframes_by_beat[beat_idx]`) with `{cx: DEFAULT_CX, cy: DEFAULT_CY, scale: FALLBACK_SCALE (1.25), ramp_s: FALLBACK_RAMP (0.4)}` plus the trimmed `t_start` / `t_end`.

Step 8b — Override with bbox-derived precise zoom where
action_bboxes
provided a usable measurement.

For each entry in `action_bboxes`:

```
beat_idx = entry.idx
```
(GitHub mode, since Step 6 emits one timed_action per beat; in Generic-URL mode use `entry.idx - prepend_count`, as defined in Step 6). If
```
beat_idx < INTRO_BEATS
```
, skip.
If
```
entry.found
```
is false, skip.
If the beat isn't already in
```
zoom_keyframes_by_beat
```
(was filtered out in Step 8a by
```
MIN_BEAT_DUR
```
/
```
1.0s
```
rules), skip.
Filter degenerate bboxes: skip if
```
bbox.y > recording_viewport.h
```
(offscreen capture — page didn't scroll the element into view in time) or
```
bbox.h > recording_viewport.h * 1.5
```
(full-page
```
<main>
```
element — yields a meaningless zoom center).
Compute
```
cx_framed
```
/
```
cy_framed
```
from the bbox center using the recording-px → framed-px transform shown above. Override the beat's slot with
```
{cx: cx_framed, cy: cy_framed, scale: SCALE (1.35), ramp_s: min(0.5, (t_end - t_start) * 0.15)}
```
.

Final list: sort the values of `zoom_keyframes_by_beat` by `t_start` to produce the `zoom_keyframes` array.

This guarantees every non-intro, long-enough beat gets a zoom — precise when bbox capture worked, default-positioned otherwise. Avoids the "flat video for the whole runtime" failure mode.

If `len(zoom_keyframes) > 0`, call `mcp__pika__edit_animate_zoom` with `video_url: framed_url, zoom_keyframes`. Returns `zoomed_url`. Otherwise (no qualifying beats, which should be rare given Step 3's 65-80s constraint) skip and use `framed_url` as `zoomed_url`.
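Steps 8a and 8b can be sketched together as one pass that pre-fills defaults and then overrides them. This is an illustration, not the worker's code: the function name and the `prepend_count` parameter are hypothetical, the keyframe field names follow the slots described above, and `ramp_s` is computed from the full beat duration (the text does not say whether the trimmed interval is meant).

```python
CONTENT_X, CONTENT_Y, CONTENT_W, CONTENT_H = 56, 108, 1168, 637
INTRO_BEATS, HOLD_GAP, MIN_BEAT_DUR = 2, 0.6, 1.5
SCALE, FALLBACK_SCALE, FALLBACK_RAMP = 1.35, 1.25, 0.4
DEFAULT_CX = CONTENT_X + CONTENT_W // 2
DEFAULT_CY = CONTENT_Y + CONTENT_H // 3


def build_zoom_keyframes(beats, action_bboxes, viewport, prepend_count=0):
    """Step 8a defaults plus Step 8b bbox overrides, sorted by t_start."""
    by_beat = {}
    # Step 8a: default-position keyframe for each qualifying non-intro beat.
    for idx in range(INTRO_BEATS, len(beats)):
        beat = beats[idx]
        if beat["t_end"] - beat["t_start"] < MIN_BEAT_DUR:
            continue
        start, end = beat["t_start"] + HOLD_GAP, beat["t_end"] - HOLD_GAP
        if end - start < 1.0:
            continue
        by_beat[idx] = {"t_start": start, "t_end": end, "cx": DEFAULT_CX,
                        "cy": DEFAULT_CY, "scale": FALLBACK_SCALE,
                        "ramp_s": FALLBACK_RAMP}
    # Step 8b: override with the measured bbox centre where it is usable.
    for entry in action_bboxes:
        idx = entry["idx"] - prepend_count
        if idx < INTRO_BEATS or not entry["found"] or idx not in by_beat:
            continue
        box = entry["bbox"]
        if box["y"] > viewport["h"] or box["h"] > viewport["h"] * 1.5:
            continue  # offscreen capture or full-page element
        beat = beats[idx]
        by_beat[idx].update({
            "cx": CONTENT_X + (box["x"] + box["w"] / 2) * (CONTENT_W / viewport["w"]),
            "cy": CONTENT_Y + (box["y"] + box["h"] / 2) * (CONTENT_H / viewport["h"]),
            "scale": SCALE,
            "ramp_s": min(0.5, (beat["t_end"] - beat["t_start"]) * 0.15),
        })
    return sorted(by_beat.values(), key=lambda k: k["t_start"])
```

An empty return value corresponds to the "skip and use `framed_url`" branch.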

常量：

```
INTRO_BEATS = 2
```
——按节拍表索引过滤。跳过节拍索引0和1（上述结构框架中的“节拍1”和“节拍2”）的缩放。
```
HOLD_GAP = 0.6
```
——每次放大前和缩小后的1.0×保持时间（秒）。
```
MIN_BEAT_DUR = 1.5
```
——短于此时长的节拍跳过（无足够空间进行有意义的缩放）。
```
SCALE = 1.35
```
（精确的元素目标缩放）。
```
FALLBACK_SCALE = 1.25
```
（无可用bbox时的默认位置 fallback）。
```
FALLBACK_RAMP = 0.4
```
。

注意： 步骤4.5已将

beats[].t_start

t_end

缩放（并在cookie模式下偏移）到音频时间线。HOLD_GAP（0.6秒）、MIN_BEAT_DUR（1.5秒）和1.0秒内部间隔检查均基于这些最终值——它们是渲染视频上的真实视觉秒数。

edit_browser_frame

的内部内容偏移：

CONTENT_X=56, CONTENT_Y=108, CONTENT_W=1168, CONTENT_H=637

（已与工作器的

edit_browser_frame/main.py

验证一致）。

坐标转换（录制像素→框架像素）：

cx_framed = 56  + (bbox.x + bbox.w/2) * (1168 / recording_viewport.w)
cy_framed = 108 + (bbox.y + bbox.h/2) * (637  / recording_viewport.h)

构建缩放列表：每个节拍默认值 + bbox覆盖模式。旧版遵循“每个非intro节拍都进行缩放——可用bbox则基于bbox，否则使用默认位置”的规则。此处重现该逻辑：

步骤8a——为每个非intro、时长足够的节拍预填充默认位置关键帧。

默认位置常量：

```
DEFAULT_CX = 56 + 1168 // 2
```
（框架画布的屏幕中心）
```
DEFAULT_CY = 108 + 637 // 3
```
（内容区域上三分之一，GitHub UI最突出的位置）

从索引

INTRO_BEATS

（=2）到末尾遍历节拍表。对于每个节拍：

如果
```
t_end - t_start < MIN_BEAT_DUR
```
（1.5秒），则跳过——时长太短无法进行有意义的缩放。
计算关键帧的内部间隔为
```
[t_start + HOLD_GAP, t_end - HOLD_GAP]
```
。如果该间隔短于1.0秒，则跳过。

否则在每个节拍的映射（称为

zoom_keyframes_by_beat[beat_idx]

）中预填充该节拍的条目，包含

{cx: DEFAULT_CX, cy: DEFAULT_CY, scale: FALLBACK_SCALE (1.25), ramp_s: FALLBACK_RAMP (0.4)}

以及修剪后的

t_start

t_end

。

步骤8b——在
action_bboxes
提供可用测量值的情况下，用bbox派生的精确缩放覆盖默认值。

遍历

action_bboxes

中的每个条目：

```
beat_idx = entry.idx
```
（步骤6为每个节拍输出一个timed_action）。如果
```
beat_idx < INTRO_BEATS
```
，则跳过。
如果
```
entry.found
```
为false，则跳过。
如果该节拍未在
```
zoom_keyframes_by_beat
```
中（步骤8a中被
```
MIN_BEAT_DUR
```
/
```
1.0s
```
规则过滤），则跳过。
过滤退化bbox： 如果
```
bbox.y > recording_viewport.h
```
（屏幕外捕获——页面未及时将元素滚动到视图中）或
```
bbox.h > recording_viewport.h * 1.5
```
（全页
```
<main>
```
元素——缩放中心无意义），则跳过。

使用录制像素→框架像素转换公式从bbox中心计算

cx_framed

cy_framed

。用

{cx: cx_framed, cy: cy_framed, scale: SCALE (1.35), ramp_s: min(0.5, (t_end - t_start) * 0.15)}

覆盖该节拍的条目。

最终列表： 按

t_start

对

zoom_keyframes_by_beat

的值排序，生成

zoom_keyframes

数组。

这保证了每个非intro、时长足够的节拍都有缩放——bbox捕获成功则使用精确缩放，否则使用默认位置缩放。避免了“整个视频全程无缩放”的失败模式。

如果

len(zoom_keyframes) > 0

，调用

mcp__pika__edit_animate_zoom

，参数为

video_url: framed_url, zoom_keyframes

。返回

zoomed_url

。否则（无符合条件的节拍——步骤3的65-80秒约束下应很少见）跳过并使用

framed_url

作为

zoomed_url

。

Step 9 — Lipsync the full audio

步骤9——完整音频唇形同步

Call `mcp__pika__generate_lipsync`:

`provider: <resolved_lipsync_provider>` (default: `pika`, parrot a2v). Honor `--lipsync-provider kling` if explicitly passed.

```
image: <avatar>
```
```
audio: <Step 4 audio_url>
```

kling-only knobs (since

pika-mcp-server

BACK-339, 2026-05-10): when

provider == "kling"

, add

mode: "pro"

and

prompt: "talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"

for the polished-presenter feel. Both are silently ignored on

pika

(parrot has its own driver).

Provider tradeoffs:

Provider	Wall-clock	Head motion	When to use
`pika` (default)	~2–5 min	Slightly more dramatic, naturalistic	Default for most runs — fast iteration, watchable output, ~10× faster than kling
`kling` (opt-in)	~5–30 min	Minimal, face-centered, presenter-style	High-stakes renders where the avatar must read like a polished presenter; tolerate the long pole

Server-side-await covers the call inline; if the response shape is `{task_id, status: "queued"}`, poll `mcp__pika__task_status` in a tight loop (no sleep) until the status reaches a terminal state (`completed`, `failed`, or `cancelled`). On `completed`, capture `lipsync_url`. On `failed` or `cancelled`, fall back to the other provider (kling ↔ pika) per the failover note below.

Failover:

If
```
pika
```
fails (rare — parrot a2v is robust at typical explainer audio lengths) → retry once with
```
provider: "kling"
```
.
If
```
kling
```
stalls past the worker's 1200s ceiling (visible as repeated
```
processing
```
status with no completion) → fall back to
```
provider: "pika"
```
. Step 4.5's audio-length gate should catch the long-audio case before it gets here, but the failover handles the residual risk.
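The failover rule can be sketched with an injected callable so the control flow is visible without a live MCP connection. Everything here is an assumption for illustration: `generate` is a hypothetical wrapper that calls `generate_lipsync`, polls to a terminal state, and returns a dict with `status` and `lipsync_url`.

```python
KLING_KNOBS = {
    "mode": "pro",
    "prompt": ("talking head, face centered, mouth syncs to audio, "
               "minimal head movement, professional presenter"),
}


def lipsync_with_failover(generate, provider="pika"):
    """Try the resolved provider, then the other one once."""
    other = {"pika": "kling", "kling": "pika"}[provider]
    for candidate in (provider, other):
        # The kling-only knobs are sent only when kling is the candidate.
        extra = dict(KLING_KNOBS) if candidate == "kling" else {}
        result = generate(provider=candidate, **extra)
        if result.get("status") == "completed":
            return result["lipsync_url"], candidate
    raise RuntimeError("both lipsync providers failed")
```

Returning the provider alongside the URL lets the caller report which one produced the final lipsync.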

Why pika is the default:

Speed — typical explainer wall-clock drops from ~10–15 min to ~5–7 min total because lipsync is the long pole.
Quality is good enough — parrot a2v is naturalistic; the slight extra head motion reads as engaging rather than distracting in a 60-80s clip with avatar circle PiP.
Kling-mode-pro polish is mostly invisible inside the 246-pixel circle anyway — face area is too small for the minimal-head-motion difference to register on most viewers.

For the canonical "polished presenter" feel of the original tarball reference output, pass

--lipsync-provider kling

explicitly.

调用

mcp__pika__generate_lipsync

：

```
provider: <解析后的lipsync_provider>
```
——默认：
pika
（parrot a2v）。如果显式传递
```
--lipsync-provider kling
```
则优先使用。
```
image: <avatar>
```
```
audio: <步骤4的audio_url>
```
仅kling可用的参数（自
```
pika-mcp-server
```
BACK-339，2026-05-10起）：当
```
provider == "kling"
```
时，添加
```
mode: "pro"
```
和
```
prompt: "talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"
```
以获得 polished 演示者效果。这两个参数在
```
pika
```
模式下会被静默忽略（parrot有自己的驱动逻辑）。

服务商权衡：

服务商	耗时	头部动作	使用场景
`pika` （默认）	~2–5分钟	稍显生动、自然	大多数运行的默认选择——迭代速度快，输出可观看，比kling快约10倍
`kling` （主动开启）	~5–30分钟	极小、面部居中、演示者风格	高优先级渲染，要求头像看起来像专业演示者；可容忍较长耗时

服务器端会等待调用完成；如果响应格式为

{task_id, status: "queued"}

，则循环调用

mcp__pika__task_status

（无睡眠）直到状态达到终端状态（

completed

、

failed

或

cancelled

）。状态为

completed

时，捕获

lipsync_url

。状态为

failed

cancelled

时，按以下故障切换逻辑回退到另一个服务商（kling↔pika）。

故障切换：

如果
```
pika
```
失败（罕见——parrot a2v在典型讲解音频长度下很健壮）→ 重试一次，使用
```
provider: "kling"
```
。
如果
```
kling
```
在工作器1200秒上限后仍停滞（表现为重复
```
processing
```
状态但未完成）→ 回退到
```
provider: "pika"
```
。步骤4.5的音频长度关卡应在此之前捕获长音频情况，但故障切换可处理剩余风险。

为什么pika是默认选择：

速度——典型讲解视频的总耗时从约10-15分钟降至约5-7分钟，因为唇形同步是耗时最长的环节。
质量足够——parrot a2v效果自然；在60-80秒的圆形头像PiP视频中，稍多的头部动作会显得更有吸引力而非分散注意力。
Kling模式的 polished 效果在246像素的圆形中大多不可见——面部区域太小，大多数观众无法察觉头部动作极小的差异。

若要获得原始tarball参考输出的标准“polished演示者”效果，请显式传递

--lipsync-provider kling

。

Step 10 — PiP composite

步骤10——画中画（PiP）合成

Call `mcp__pika__edit_pip`:

```
main_video_url: <zoomed_url>
```
```
overlay_video_url: <lipsync_url>
```
```
shape: "circle"
```
```
size_px: 246
```
← pixel-pinned 246px outer diameter (240 inner avatar + 3+3 stroke ring); matches tarball's
```
CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2
```
```
stroke_width_px: 3
```
```
stroke_color: "white"
```

`position_px: {x: 20, y: 476}` ← `800 − 246 − 78` for dock clearance (matches tarball's `H − CIRCLE_OUT − 78`)

Pass `size_px`, not `size`; the fields are mutually exclusive. Returns `final_url`.

Master-duration / audio-source contract (matching tarball `github_explainer.py:418-419, 531-533, 578-582`): `edit_pip` uses `shortest=1` semantics by default, which means the composite's duration is the shorter of (zoomed screen recording) and (lipsync video). Step 6's `duration_s = ceil(audio_duration_seconds)` keeps the screen recording at least as long as the audio and less than one second longer (Step 4.5 rescaled beats to the audio timeline), so in practice the lipsync sets the composite duration. Audio comes from the lipsync video's audio track (the lipsync embeds the original TTS audio); the standalone `audio_url` is not re-mixed. If the lipsync video is shorter than the screen recording (Kling sometimes trims trailing silence), the screen will get cut off at the lipsync end. Accept this; the alternative (looping the screen) is worse for explainer content.
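The pixel-pinned geometry follows from three numbers, shown here as a short sketch. The constant names mirror the tarball expressions quoted above; `FRAME_H`, `DOCK_CLEARANCE`, and the `pip_args` helper are illustrative names.

```python
FRAME_H = 800         # framed canvas height
CIRCLE_SIZE = 240     # inner avatar diameter
STROKE = 3            # white ring width
DOCK_CLEARANCE = 78   # gap kept between the circle and the bottom edge

CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2


def pip_args(zoomed_url, lipsync_url):
    """Assemble the edit_pip arguments listed in this step."""
    return {
        "main_video_url": zoomed_url,
        "overlay_video_url": lipsync_url,
        "shape": "circle",
        "size_px": CIRCLE_OUT,
        "stroke_width_px": STROKE,
        "stroke_color": "white",
        "position_px": {"x": 20, "y": FRAME_H - CIRCLE_OUT - DOCK_CLEARANCE},
    }
```

Deriving `size_px` and `position_px` from the same constants keeps the 246px outer diameter and the y offset of 476 consistent if the stroke width ever changes.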

调用

mcp__pika__edit_pip

：

```
main_video_url: <zoomed_url>
```
```
overlay_video_url: <lipsync_url>
```
```
shape: "circle"
```
```
size_px: 246
```
← 固定像素的246px外径（240px内部头像 + 3+3px描边环）；与tarball文件的
```
CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2
```
一致
```
stroke_width_px: 3
```
```
stroke_color: "white"
```

position_px: {x: 20, y: 476}

←

800 − 246 − 78

以避开dock（与tarball文件的

H − CIRCLE_OUT − 78

一致）

传递

size_px

，而非

size

；这两个字段互斥。返回

final_url

。

主时长/音频源约定（与tarball文件

github_explainer.py:418-419, 531-533, 578-582

一致）：

edit_pip

默认使用

shortest=1

语义，即合成视频的时长为（缩放后的屏幕录制）和（唇形同步视频）中较短的一个。步骤6的

duration_s = ceil(audio_duration_seconds)

确保屏幕录制长度与唇形同步完全匹配（步骤4.5已将节拍缩放到音频时间线）。合成时长由唇形同步视频通过

edit_pip

的

shortest=1

语义决定。音频来自唇形同步视频的音轨（唇形同步嵌入了原始TTS音频）；无需重新混合独立的

audio_url

。如果唇形同步视频比屏幕录制短（Kling有时会修剪末尾静音），则屏幕录制会在唇形同步结束处被截断——接受此情况；循环屏幕录制对讲解内容来说更糟。

Step 11 — Burn captions

步骤11——添加字幕

Call `mcp__pika__add_captions(video_url=<final_url>, style="classic")`. The `classic` style renders a bottom subtitle bar, the right register for an explainer video (use `tiktok`, `hormozi`, or `karaoke` only when the user explicitly asks for word-level highlight). The audio is extracted server-side from the PiP composite's lipsync track, so transcription matches the narration verbatim. Capture the result as `captioned_url`.

Skip this step only if the user passed `--no-captions` (parsed in Step 1); the default is captions on. (Note: `/pika:podcast` does not burn captions; narration in an explainer is more transcription-friendly than fast two-host dialogue.)
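
The skip-and-style rule above can be sketched as a small helper. This is an assumption-laden illustration: `caption_call_args` and `requested_style` are hypothetical names, and the returned dict simply mirrors the `video_url` and `style` arguments shown in the call above.

```python
from typing import Optional

# Styles reserved for explicit word-level-highlight requests.
WORD_LEVEL_STYLES = {"tiktok", "hormozi", "karaoke"}


def caption_call_args(final_url: str, no_captions: bool = False,
                      requested_style: Optional[str] = None) -> Optional[dict]:
    """Return add_captions arguments, or None when Step 11 is skipped."""
    if no_captions:
        return None  # --no-captions: Step 12 emits final_url instead
    # Anything other than an explicit word-level style falls back to classic.
    style = requested_style if requested_style in WORD_LEVEL_STYLES else "classic"
    return {"video_url": final_url, "style": style}
```

Returning `None` for the skip case lets the caller decide between `captioned_url` and `final_url` in one place.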

调用

mcp__pika__add_captions(video_url=<final_url>, style="classic")

。

classic

样式渲染底部字幕栏——这是讲解视频的合适样式（仅当用户显式要求逐词高亮时才使用

tiktok

hormozi

karaoke

样式）。音频从PiP合成视频的唇形同步音轨中提取，因此转录内容与旁白完全一致。捕获结果为

captioned_url

。

仅当用户传递

--no-captions

（步骤1中解析）时跳过此步骤——默认开启字幕。（注意：

/pika:podcast

不添加字幕——讲解视频的旁白比快速的双人对话更适合转录。）

Step 12 — Return

步骤12——返回结果

Emit `captioned_url` (or `final_url` if Step 11 was skipped) on one line: `Done: <url>`

在一行中输出

captioned_url

（如果跳过步骤11则输出

final_url

）：

Done: <url>

。

Load-bearing phrases

核心短语

These anchors preserve the visual contract across page types:

Phrase	Where	Why load-bearing
`vanilla CSS that resolves via document.querySelector`	Selector contract	Keeps scroll, bbox capture, and zoom targeting aligned inside `capture_website` .
`GitHub URLs activate repo-aware mode`	Mode detection	Prevents generic product-page beats from replacing README/code walkthrough beats.
`8-10 beats` , `65-80 seconds` , `165-200 words`	Beat-sheet authoring	Keeps narration, screen recording, lipsync, and captions within the reliable duration envelope.
`all beats[] mutations happen here`	Audio rescale step	Ensures later capture/zoom/composite steps consume one stable timeline.
`extra_css` cookie-banner hiding payload	Generic URL pre-flight	Reduces first-frame banner occlusion when a banner click misses.

这些锚点确保跨页面类型的视觉约定一致：

短语	位置	核心原因
`vanilla CSS that resolves via document.querySelector`	选择器约定	保持 `capture_website` 内部的滚动、bbox捕获和缩放目标对齐。
`GitHub URLs activate repo-aware mode`	模式检测	防止通用产品页面节拍替换README/代码导览节拍。
`8-10 beats` , `65-80 seconds` , `165-200 words`	节拍表编写	保持旁白、屏幕录制、唇形同步和字幕在可靠的时长范围内。
`all beats[] mutations happen here`	音频缩放步骤	确保后续捕获/缩放/合成步骤使用稳定的时间线。
`extra_css` cookie-banner hiding payload	通用URL预检查	当弹窗点击未命中时，减少首帧弹窗遮挡。

Engine choice: Pika lipsync default, Kling opt-in

引擎选择：默认Pika唇形同步，Kling主动开启

Default to Pika/parrot lipsync because it is faster and keeps most explainers in a short iteration loop. Use Kling only when the user explicitly requests `--lipsync-provider kling`, or when a high-stakes render needs a more centered presenter look and can tolerate a much longer lipsync stage. Screen capture, browser frame, zoom, PiP, and captions remain deterministic edit/composite steps around that lipsync choice.
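
A minimal sketch of how the provider choice might shape the lipsync arguments, assuming the `mode` and `prompt` fields described under Known gaps. The function name and the `provider`, `avatar_url`, and `audio_url` keys are hypothetical; the actual `generate_lipsync` parameter names are not confirmed here.

```python
from typing import Optional

PRESENTER_PROMPT = (
    "talking head, face centered, mouth syncs to audio, "
    "minimal head movement, professional presenter"
)


def lipsync_args(avatar_url: str, audio_url: str,
                 provider: Optional[str] = None) -> dict:
    """Build illustrative lipsync arguments; key names other than mode/prompt are assumed."""
    provider = provider or "pika"  # Pika is the default engine
    if provider not in ("pika", "kling"):
        raise ValueError(f"unknown lipsync provider: {provider}")
    args = {"provider": provider, "avatar_url": avatar_url, "audio_url": audio_url}
    if provider == "kling":
        # Polished-presenter settings apply only on the opt-in Kling path.
        args["mode"] = "pro"
        args["prompt"] = PRESENTER_PROMPT
    return args
```

Keeping the Kling-only fields behind the provider check means the default Pika path never sends arguments it does not use.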

默认使用Pika/parrot唇形同步，因为速度更快，大多数讲解视频可在短迭代周期内完成。仅当用户显式请求

--lipsync-provider kling

或高优先级渲染需要更居中的演示者外观且可容忍更长耗时环节时，才使用Kling。屏幕捕获、浏览器框架、缩放、PiP和字幕围绕唇形同步选择保持确定性编辑/合成步骤。

Runtime expectations

运行时预期

Typical wall-clock is 5-10 minutes with Pika lipsync, or 10-30+ minutes with Kling lipsync:

Step	Wall clock	Notes
URL read + pre-flight	10-60s	GitHub README scan or generic URL DOM/cookie checks
TTS + audio rescale	30-90s	Beat timing is normalized after actual audio length
Screen recording	60-180s	Depends on page load and navigation count
Browser frame + zooms	1-3 min	Deterministic edit/composite stages
Lipsync	2-5 min Pika / 5-30 min Kling	Kling is opt-in because it is the long pole
PiP + captions	1-3 min	Captions skipped when `--no-captions` is set
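
Summing the per-stage ranges in the table gives outer bounds to compare against the typical figures. The sketch below assumes the stages run strictly one after another, which the table does not state; the dictionary keys and `total_minutes` are hypothetical names.

```python
# (low, high) wall-clock bounds in seconds, copied from the table above.
STAGES = {
    "url_read_preflight": (10, 60),
    "tts_rescale": (30, 90),
    "screen_recording": (60, 180),
    "frame_and_zooms": (60, 180),
    "pip_and_captions": (60, 180),
}
LIPSYNC = {"pika": (120, 300), "kling": (300, 1800)}


def total_minutes(provider: str) -> tuple:
    """Sum the per-stage bounds for one provider, assuming sequential stages."""
    low = sum(lo for lo, _ in STAGES.values()) + LIPSYNC[provider][0]
    high = sum(hi for _, hi in STAGES.values()) + LIPSYNC[provider][1]
    return round(low / 60, 1), round(high / 60, 1)


print(total_minutes("pika"))   # (5.7, 16.5)
print(total_minutes("kling"))  # (8.7, 41.5)
```

The upper sums sit above the typical ranges because they assume every stage hits its slowest case in the same run.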

使用Pika唇形同步的典型耗时为5-10分钟，使用Kling唇形同步为10-30+分钟：

步骤	耗时	说明
URL读取 + 预检查	10-60秒	GitHub README扫描或通用URL DOM/cookie检查
TTS + 音频缩放	30-90秒	节拍时间根据实际音频长度归一化
屏幕录制	60-180秒	取决于页面加载和导航次数
浏览器框架 + 缩放	1-3分钟	确定性编辑/合成环节
唇形同步	Pika 2-5分钟 / Kling 5-30分钟	Kling需主动开启，因为耗时最长
PiP + 字幕	1-3分钟	当 `--no-captions` 设置时跳过字幕

Known gaps (carried as follow-up server-side work)

已知缺陷（作为后续服务器端工作）

Kling avatar `mode: "pro"` and `prompt` not exposed (resolved). `pika-mcp-server` BACK-339 (PR #186, shipped 2026-05-10) closed this gap: `generate_lipsync` now wires both `mode` (e.g. `"pro"`) and `prompt` end-to-end for the kling provider. To enable polished-presenter mode here, pass `--lipsync-provider kling` and have the Step 9 call add `mode: "pro"` plus a prompt like `"talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"`. This is a real quality lever for reducing dramatic head motion in the lipsync, and it is no longer a server-side gap.
No caller-controlled white-frame trim on the screen recording. `capture_website` has internal trim heuristics but doesn't expose them to the caller. The result is a brief white flash at the start of the explainer when the page is still loading. The 800ms `wait` action at `at_s: 0.0` mitigates this somewhat by giving the page time to paint, but it doesn't trim white frames that were already recorded. Worker enhancement.
No `networkidle` wait on per-beat navigation. The tarball uses `page.goto(url, wait_until="networkidle", timeout=20000)` plus `wait_for_timeout(600)` after every navigate. `capture_website` waits only for `domcontentloaded` plus the bbox-capture branch's 600 ms post-action settle (server-side, when `bbox_selector` is set), so SPA blob pages whose final render happens after `domcontentloaded` can still have bboxes captured against unmounted code blocks. Worker enhancement: expose a `wait_until` knob on `timed_actions[].navigate`.
No per-step output-size verification gates. The tarball's `verify()` helper at `github_explainer.py:35-39` checked TTS ≥ 50KB, preview ≥ 100KB, screen ≥ 200KB, lipsync ≥ 500KB, and final ≥ 1MB after each step. The MCP path returns URLs only; verifying file size would require an extra `mcp__pika__analyze_media` call per step (~30s overhead each). Worth adding once the user-side latency budget allows it. For now, a downstream-failure cascade (e.g. zero-byte TTS → silent lipsync → blank composite) only surfaces at Step 11.
`text_content` bbox capture not implemented. `capture_website` v1 returns `action_bboxes` only for steps with a CSS `selector`; `text_content`-only steps produce no entry. Prefer CSS selectors in `zoom_target` for guaranteed zoom coverage.
Beat-sheet wording is non-deterministic. Running the same input twice produces different `vo_text` and different zoom positions. The contract is the kind of visual each beat shows, not pixel-exact reproduction.
Generic-URL mode quality varies by site. Modern indie / SaaS landing pages with semantic markup (`<h1>` + clear `<section>` + named class hooks) work well. Big-name corporate sites (apple.com, microsoft.com, amazon.com) hit several known limits: (a) bot detection: the page may serve a degraded version under headless Chrome, or a captcha; Step 2.6 §A aborts on these, but the heuristics aren't exhaustive; (b) obfuscated class names: `tile-headline` instead of `hero-title` defeats generic selectors; Step 2.6 §C's WebFetch DOM scan helps but isn't perfect; (c) scroll-triggered animations don't play: IntersectionObserver-driven hero reveals fire on real user scrolls, not Playwright's `scrollIntoView`, so the recorded frame may be a static placeholder; (d) lazy-loaded images: picture/source elements with `loading="lazy"` may not have resolved by the 600ms-or-2500ms settle window, so the bbox lands on a transparent placeholder. Workarounds: prefer simpler / smaller marketing pages for launch demos, always pass `--focus "the X feature"` to anchor beat selection, and accept that big-name sites need a follow-up server PR (cookie-banner click retry + `wait_until=networkidle` + animation-trigger via `IntersectionObserver` polyfill).
Cookie-banner click is single-attempt. Step 2.6 §B emits one `click` against the dismissal selector extracted from the WebFetch DOM. If the WebFetch HTML doesn't include the banner (rendered post-JS) or the selector is wrong, the click silently misses, and the `extra_css` payload is the load-bearing defense. Worker enhancement: support a list of fallback selectors per `click` action so the worker tries each in order.
Step 8b ↔ Step 6 `idx`-mapping mismatch in Generic-URL mode. Step 6 maps `beat_idx = entry.idx - prepend_count`, but Step 8b naively uses `beat_idx = entry.idx`. This is a pre-existing bug, independent of rescaling.
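
For the last gap, one way to align Step 8b with Step 6 is to share a single mapping helper. This is a sketch of a possible alignment, not the shipped fix: `beat_index` is a hypothetical name, and it assumes `prepend_count` is the number of actions inserted ahead of the first beat.

```python
from typing import Optional


def beat_index(entry_idx: int, prepend_count: int) -> Optional[int]:
    """Map an action_bboxes entry index to a beat index, as Step 6 does."""
    beat_idx = entry_idx - prepend_count
    # Entries that come from prepended actions have no corresponding beat.
    return beat_idx if beat_idx >= 0 else None


# With two prepended actions, entry 2 is the first beat, not the third.
print(beat_index(2, prepend_count=2))  # 0
print(beat_index(1, prepend_count=2))  # None
```

When `prepend_count` is zero the helper reduces to the naive mapping, which would explain why the mismatch only shows up in Generic-URL mode.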

Kling头像
mode:"pro"
和
prompt
未暴露。 已解决：
```
pika-mcp-server
```
BACK-339（PR #186，2026-05-10发布）：
```
generate_lipsync
```
现在为kling服务商端到端传递
```
mode
```
（例如
```
"pro"
```
）和
```
prompt
```
。要在此处启用polished演示者模式，传递
```
--lipsync-provider kling
```
，步骤9调用应添加
```
mode: "pro"
```
和类似
```
"talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"
```
的prompt。这是减少唇形同步中头部动作的有效质量控制手段——不再是服务器端缺陷。
无调用者可控的屏幕录制白帧修剪。
```
capture_website
```
有内部修剪启发式逻辑，但未暴露给调用者。表现为页面仍在加载时，讲解视频开头出现短暂白闪。在
```
at_s: 0.0
```
添加800ms的
```
wait
```
操作可部分缓解此问题，给页面绘制时间，但无法修剪已录制的白帧。需工作器增强。
每个节拍导航无
networkidle
等待。 Tarball文件在每次导航后使用
```
page.goto(url, wait_until="networkidle", timeout=20000)
```
+
```
wait_for_timeout(600)
```
。
```
capture_website
```
会等待到
```
domcontentloaded
```
加上bbox捕获分支的600ms操作后等待（服务器端，当设置
```
bbox_selector
```
时），但
```
domcontentloaded
```
后渲染的SPA blob页面仍可能针对未挂载的代码块进行bbox捕获。需工作器增强：在
```
timed_actions[].navigate
```
上暴露
```
wait_until
```
参数。
无每步输出大小验证关卡。 Tarball文件的
```
verify()
```
助手（
```
github_explainer.py:35-39
```
）在每步后检查TTS≥50KB、预览≥100KB、屏幕录制≥200KB、唇形同步≥500KB、最终视频≥1MB。MCP路径仅返回URL；验证文件大小需每步额外调用
```
mcp__pika__analyze_media
```
（每步约30秒开销）。当用户端延迟预算允许时值得添加。目前，下游故障级联（例如零字节TTS→静音唇形同步→空白合成）仅在步骤11才会显现。
未实现
text_content
bbox捕获。
```
capture_website
```
v1仅为带有CSS
```
selector
```
的步骤返回
```
action_bboxes
```
。仅含
```
text_content
```
的步骤无条目。为保证缩放覆盖，
```
zoom_target
```
中优先使用CSS选择器。
节拍表措辞非确定性。 相同输入运行两次会产生不同的vo_text和不同的缩放位置。视觉类型是约定，而非像素级精确复制。
通用URL模式质量因站点而异。 具有语义标记（
```
<h1>
```
+清晰
```
<section>
```
+命名类钩子）的现代独立/SaaS落地页效果良好。大型企业站点（apple.com、microsoft.com、amazon.com）存在多个已知限制：(a) 机器人检测——页面可能在无头Chrome下提供降级版本或验证码；步骤2.6§A会在这些情况中止，但启发式逻辑并非 exhaustive；(b) 混淆类名——
```
tile-headline
```
而非
```
hero-title
```
会使通用选择器失效；步骤2.6§C的WebFetch DOM扫描有帮助但并非完美；(c) 滚动触发动画不播放——IntersectionObserver驱动的hero揭示在真实用户滚动时触发，而非Playwright的
```
scrollIntoView
```
；录制帧可能是静态占位符；(d) 懒加载图片——带有
```
loading="lazy"
```
的picture/source元素可能在600ms或2500ms等待窗口内未解析；bbox会落在透明占位符上。解决方法：选择更简单/更小的营销页面进行发布演示，始终传递
```
--focus "X功能"
```
以锚定节拍选择，接受大型站点需要后续服务器PR（cookie弹窗点击重试 +
```
wait_until=networkidle
```
+ IntersectionObserver polyfill触发动画）。
Cookie弹窗点击仅尝试一次。 步骤2.6§B输出一个针对WebFetch DOM中提取的关闭选择器的
```
click
```
操作。如果WebFetch的HTML不包含弹窗（JS后渲染）或选择器错误，点击会静默失败——
```
extra_css
```
payload是核心防御措施。需工作器增强：支持每个
```
click
```
操作的 fallback 选择器列表，工作器会按顺序尝试。
步骤8b与步骤6在通用URL模式下的
idx
映射不匹配。步骤6计算
```
beat_idx = entry.idx - prepend_count
```
，但步骤8b天真地使用
```
beat_idx = entry.idx
```
。这是与缩放无关的预存bug。

Auth

认证

If any call returns 401, the user's OAuth token has expired or hasn't been issued. The next authenticated MCP call triggers OAuth automatically (a browser opens for `@pika.art` Google login). For non-interactive environments, set `MCP_AUTH_TOKEN`.

如果任何调用返回401：用户的OAuth令牌已过期或未颁发。下一次认证MCP调用会自动触发OAuth（浏览器打开

@pika.art

谷歌登录）。对于非交互式环境，设置

MCP_AUTH_TOKEN

。

Examples

示例

GitHub-mode (repo-aware: README scan + live-demo detection):

/pika:explainer https://github.com/leigest519/OpenGame

/pika:explainer https://github.com/anthropics/claude-cookbooks --focus "Claude Code MCP integration"

/pika:explainer https://github.com/openai/whisper --preview

(opt-in to the preview gate when testing a new avatar)

Generic-URL mode (any non-GitHub URL — drives through the page directly):

```
/pika:explainer https://pika.art
```

/pika:explainer https://vercel.com --focus "the deployment workflow"

/pika:explainer https://docs.anthropic.com/en/docs/claude-code/plugins

/pika:explainer https://your-product-page.com --avatar https://cdn.example.com/me.png --preview

GitHub模式（仓库感知：扫描README + 检测实时演示）：

/pika:explainer https://github.com/leigest519/OpenGame

/pika:explainer https://github.com/anthropics/claude-cookbooks --focus "Claude Code MCP integration"

/pika:explainer https://github.com/openai/whisper --preview

（测试新头像时主动开启预览环节）

通用URL模式（任意非GitHub URL——直接导览页面）：

```
/pika:explainer https://pika.art
```

/pika:explainer https://vercel.com --focus "the deployment workflow"

/pika:explainer https://docs.anthropic.com/en/docs/claude-code/plugins

/pika:explainer https://your-product-page.com --avatar https://cdn.example.com/me.png --preview
