explainer

/pika:explainer

Generate a ~60–80s URL explainer video: drive a real browser through the URL along a beat-sheet timeline, generate an avatar lipsync of the narration, and composite it all in a 1280×800 macOS Sonoma frame with a 240-pixel inner avatar (246-pixel outer including 3px white stroke ring) at canvas (20, 476) and element-targeted zoom on every mid-section beat. Works on any URL — product pages, docs sites, blog posts, launches. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); all other URLs use a generic page-walkthrough flow.
Usage:
/pika:explainer <url> [--focus "angles"] [--avatar <url>] [--voice <id>] [--lipsync-provider pika|kling] [--preview] [--live-url <url>]

Behavior

Defaults — fire fast, no mid-flow confirmation

  • Use identity-store defaults silently for avatar / voice. Never ask "should I use your avatar?" or "which voice?" before firing. Honor explicit overrides (--avatar, --voice) when supplied; otherwise resolve via identity_avatar_url / identity_voice_id and proceed. See Step 1 for the full resolution waterfall (including the silent fallback when identity returns null).
  • No mid-flow "type yes to proceed" gates by default. Step 5 preview is opt-in via --preview (for power users testing new avatar/voice combos before the long-pole render); the default flow runs end-to-end without pausing.
  • Do not solicit --focus either. Make a confident first attempt from page structure; users re-run with --focus "X" if the angle missed.
These defaults match industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return. Account credit balance + provider failover (Step 9) are the canonical guardrails.

Local avatar images on Claude Desktop

Claude Desktop can't pass inline-pasted images to MCP tools yet (Anthropic-side limitation). If the user pastes a photo inline, or mentions a local file they want as --avatar, pause Step 1 and send them a friendly note along these lines:
Heads up: pasted images don't reach MCP tools on Claude Desktop yet (Anthropic limitation). Two easy options for your avatar:
  • Paste a URL if it's already hosted (Imgur, S3, your site). This is the fastest route.
  • Attach the image file so I can upload it before generation.
When a local file arrives, convert it to a public URL with upload_asset and use the returned public_url as --avatar <url> before Step 1. Already-hosted https://... URLs work as-is and skip this entirely. If no avatar is supplied at all, the identity-store default fires.

Step 0 — Resolve URL (empty-args menu)

Strip flags (--focus, --avatar, --voice, --live-url, --lipsync-provider, --no-captions, --preview, --skip-preview, --yes) and key=value parameters from $ARGUMENTS. If what remains contains no https://... URL (or is empty / whitespace-only), print this menu verbatim as your full response, then stop and wait for the user's next message. Calling a tool here risks recording or explaining the wrong page. If $ARGUMENTS already carries a URL, skip this step silently and proceed to Step 1.
Which URL would you like me to walk through? Works on any of:
  • A GitHub repo, e.g. https://github.com/anthropics/claude-code (activates repo-aware mode: README scan + live-demo detection)
  • A product page / launch page, e.g. https://pika.art
  • A docs site, e.g. https://docs.anthropic.com
  • A blog post / article URL
Output: 1280×800 macOS Sonoma frame with a bottom-left avatar lipsync and element-targeted zoom on every mid-section beat. Default flow runs end-to-end with no confirmation gates; pass --preview if you want a 3-second lipsync sanity check first.
Reply with the URL and I'll start.
Tip: you don't need to type /pika:explainer. Just say things like "walk me through <url>", "make a demo video of <url>", or "explain this repo: <github-url>" and I'll fire this skill automatically.
When the user replies with a URL, treat it as the resolved input and proceed to Step 1. Do not re-prompt.
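As a rough illustration of the flag-stripping rule in this step, the following Python sketch drops the listed flags and any key=value tokens, then returns the first remaining https:// URL. The function name resolve_url, the split into value-taking and bare flags, and the use of shlex for quote handling are assumptions made for this example; the skill itself does not prescribe an implementation.

```python
import re
import shlex
from typing import Optional

# Assumption: these flags consume the next token as their value.
VALUE_FLAGS = {"--focus", "--avatar", "--voice", "--live-url", "--lipsync-provider"}
# Assumption: these flags are bare switches with no value.
BARE_FLAGS = {"--no-captions", "--preview", "--skip-preview", "--yes"}
KEY_VALUE_RE = re.compile(r"[A-Za-z_]\w*=.*")


def resolve_url(arguments: str) -> Optional[str]:
    """Return the first https:// URL left after stripping flags, else None."""
    tokens = shlex.split(arguments)  # raises ValueError on unbalanced quotes
    remaining = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in VALUE_FLAGS:
            i += 2  # skip the flag and its value
            continue
        if token in BARE_FLAGS or KEY_VALUE_RE.fullmatch(token):
            i += 1  # skip bare switches and key=value parameters
            continue
        remaining.append(token)
        i += 1
    for token in remaining:
        if token.startswith("https://"):
            return token
    return None  # caller should print the menu and wait
```

A None result corresponds to the "print the menu and stop" branch; any other result is the resolved input for Step 1.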

Step 1 — Parse input + detect mode

Required: url (must be https://...).
Optional:
  • --avatar <url>: overrides the identity-store default.
  • --voice <minimax-voice-id>
  • --focus "...": editorial guidance woven into vo_text.
  • --live-url <url>: force-supplies the live demo URL (GitHub mode only).
  • --lipsync-provider <pika|kling>: defaults to pika (parrot a2v, ~2-5 min wall-clock, slightly more dramatic head motion). Pass kling for tighter face-centered output at ~5-30 min wall-clock; Kling produces minimal-head-motion presenter shots but is the long-pole stage, so reserve it for high-stakes renders.
  • --no-captions: skips the Step 11 caption burn (captions are on by default).
  • --preview: opts in to the Step 5 preview gate, a ~3s lipsync of "Hi, I'm your presenter" for testing new avatar/voice combos before the long-pole render (default is no preview).
--skip-preview and --yes are accepted as no-ops for backward compatibility.
Mode detection:
  • GitHub mode: URL host is github.com AND path matches /{owner}/{repo} (no further path segments past the repo root). Activates the repo-aware extras: README scan, live-demo detection, GitHub-specific selectors.
  • Generic-URL mode: anything else (a product page, docs site, blog post, deeper GitHub path like /blob/HEAD/path). Skips the GitHub extras; uses generic CSS selectors and walks through the URL itself.
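The mode rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name detect_mode and the string labels "github" / "generic" are invented here, and the host comparison is exact (so www.github.com would fall through to generic mode).

```python
from urllib.parse import urlparse


def detect_mode(url: str) -> str:
    """Return "github" for a repo-root URL on github.com, else "generic"."""
    parsed = urlparse(url)
    # Non-empty path segments, so a trailing slash does not count as a segment.
    segments = [part for part in parsed.path.split("/") if part]
    if parsed.hostname == "github.com" and len(segments) == 2:
        return "github"
    return "generic"
```

A deeper GitHub path such as /{owner}/{repo}/blob/HEAD/path has more than two segments and therefore lands in generic mode, as the text describes.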
Avatar resolution (silent; never ask the user):
  1. If --avatar <url> was passed, use it.
  2. Else call mcp__pika__identity_avatar_url. If non-null, use it.
  3. Else (fresh user, no identity avatar set yet) call mcp__pika__generate_image once with prompt "professional presenter, friendly tech narrator, studio portrait, 1:1, natural lighting" and use the returned URL. Do not ask the user "should I generate one?"; just generate silently.
Voice resolution (silent; never ask the user):
  1. If --voice <id> was passed, use it.
  2. Else call mcp__pika__identity_voice_id. If non-null, use it.
  3. Else pick a casual MiniMax speech-2.8-hd preset matching the resolved avatar's apparent gender:
    • Female-coded avatar → English_PlayfulGirl (warm, casual, clearly female-voiced; verified)
    • Male-coded avatar → English_Jovialman (warm, casual male)
    • Unclear / gender-neutral → English_Jovialman (default)
    Determine gender from mcp__pika__identity_persona_read (look for a gender / pronouns field) when identity exists; otherwise infer from the resolved avatar image. Do not call analyze_media for this; it's not worth the extra ~30s round-trip. Do not ask the user.
    Do NOT use English_FriendlyPerson: despite being categorized under "female" in MiniMax's catalog, its display name is "Friendly Guy" and it reads as male in playback. English_PlayfulGirl is the canonical casual-female pick. Other verified-female alternates: English_Upbeat_Woman, English_LovelyGirl, English_radiant_girl.
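The voice waterfall above might look roughly like this in Python. The function resolve_voice and its parameters are hypothetical: identity_lookup stands in for a call to mcp__pika__identity_voice_id, and apparent_gender is assumed to have been determined already ("female", "male", or None). Only the preset names come from the text.

```python
from typing import Callable, Optional

# Preset names are taken from the list above; the dict layout is illustrative.
FALLBACK_VOICES = {"female": "English_PlayfulGirl", "male": "English_Jovialman"}
DEFAULT_VOICE = "English_Jovialman"  # used for unclear / gender-neutral avatars


def resolve_voice(
    flag_voice: Optional[str],
    identity_lookup: Callable[[], Optional[str]],
    apparent_gender: Optional[str],
) -> str:
    """Resolve a voice id without ever prompting the user."""
    if flag_voice:  # 1. explicit --voice override
        return flag_voice
    identity_voice = identity_lookup()  # 2. identity-store default
    if identity_voice:
        return identity_voice
    # 3. silent fallback keyed on the avatar's apparent gender
    return FALLBACK_VOICES.get(apparent_gender or "", DEFAULT_VOICE)
```

The avatar waterfall has the same three-tier shape, with a one-off image generation call as its last tier.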
The flow below is annotated per step: GitHub-only, Generic-only, or Both modes.

Step 2 — Read source (no MCP call)

Both modes: use Claude's WebFetch on the input URL to pull the page's main content (h1, hero section, headings, primary copy).
GitHub mode additions: also fetch top-level file tree, (best-effort) package.json / pyproject.toml, and GitHub API repo metadata via gh api repos/{owner}/{repo} for homepage, description, language, topics. Detect a candidate live_url in this priority:
  1. User-supplied --live-url.
  2. GitHub API meta.homepage field, set when the maintainer configured the repo's homepage in GitHub settings (matches tarball repo_analyzer.py:66-77).
  3. package.json "homepage" field.
  4. First match in README of https?://[^\s)\"'<>]+(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app|render\.com|herokuapp\.com|surge\.sh)[^\s)\"'<>]*.
  5. Any other URL in README that the badge area / "Live Demo" / "Project Page" / "Demo" text points at. The allowlist regex above misses arbitrary custom domains (e.g. <project>-project-page.com); when the README explicitly designates a project page, prefer that over the github.io fallback.
  6. GitHub Pages convention https://{owner}.github.io/{repo}, but only if the deep tree contains a frontend signal (one of index.html, App.tsx, App.jsx, App.vue, app.py, main.py).
If no candidate resolves, the beat sheet skips beats 6–7.
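The first four tiers of this priority list can be sketched as follows. The regex is copied from tier 4; the function name pick_live_url and its parameter names are assumptions for illustration. Tiers 5 and 6 are left out because they need README structure and file-tree inspection that this sketch does not model.

```python
import re
from typing import Optional

# Allowlist regex from tier 4, split across lines for readability only.
LIVE_DEMO_RE = re.compile(
    r"https?://[^\s)\"'<>]+"
    r"(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app"
    r"|render\.com|herokuapp\.com|surge\.sh)"
    r"[^\s)\"'<>]*"
)


def pick_live_url(
    flag_url: Optional[str],
    api_homepage: Optional[str],
    package_homepage: Optional[str],
    readme_text: Optional[str],
) -> Optional[str]:
    """Return the highest-priority live_url candidate (tiers 1-4 only)."""
    for candidate in (flag_url, api_homepage, package_homepage):
        if candidate:  # treats None and "" as "not set"
            return candidate
    match = LIVE_DEMO_RE.search(readme_text or "")
    return match.group(0) if match else None
```

Whatever this returns is still only a candidate; it has to pass the reachability check in Step 2.5 before beats 6–7 are authored.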
Generic-URL mode: the input URL itself is the only URL the beats walk through, with no live_url inference and no extra metadata fetches. Skip Step 2.5 and Step 3.0; run the Step 2.6 pre-flight, then continue to Step 3.

Step 2.5 — Verify live_url reachability (GitHub mode only, no MCP call)

If a candidate live_url was selected, verify it serves real content before authoring beats 6–7. Use WebFetch on the candidate and check the response:
  • If the response status is 4xx / 5xx, drop live_url to None and skip beats 6–7. The github.io fallback in particular is reachable as a hostname but often returns 404 ("There isn't a GitHub Pages site here") for repos that haven't enabled Pages; recording that 404 page wastes ~12s of the explainer on wrong content.
  • If the response renders the GitHub Pages 404 template (heuristic: response body contains "There isn't a GitHub Pages site here"), drop live_url and skip beats 6–7.
  • Otherwise, keep live_url for beats 6–7.
This mirrors the original tarball's requests.head(live_url, timeout=6, allow_redirects=True) reachability gate.
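Expressed as a pure function, the gate comes down to two checks. The name keep_live_url and the (status, body) inputs are assumptions for this sketch; how the status code and body are obtained from the fetch is left to the caller.

```python
GITHUB_PAGES_404_MARKER = "There isn't a GitHub Pages site here"


def keep_live_url(status: int, body: str) -> bool:
    """Return True if the candidate should be kept for beats 6-7."""
    if 400 <= status < 600:  # any 4xx / 5xx response drops the candidate
        return False
    if GITHUB_PAGES_404_MARKER in body:  # Pages 404 template heuristic
        return False
    return True
```

When this returns False, live_url becomes None and the beat sheet skips beats 6–7.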

Step 2.6 — Generic-URL pre-flight (Generic-URL mode only, no MCP call)

Before authoring beats for a non-GitHub URL, WebFetch the input URL and inspect the response. This step prevents three common Generic-URL failure modes: (a) recording a captcha / bot-block page instead of content, (b) the cookie/consent banner eating the first ~3 seconds of video, (c) generic CSS selectors missing the page's actual hero / sections.
A. Bot-block / captcha detection (abort if matched):
If the response body contains any of:
  • "Verify you are human" / "verify you are not a robot"
  • "captcha" / "CAPTCHA" / "reCAPTCHA"
  • "403 Forbidden" / "Access Denied"
  • "Just a moment" + cf-chl-bypass (Cloudflare challenge)
  • "We're sorry, something went wrong" (Amazon-style bot block)
  • A <title> or h1 of just "Robot Check" / "Are you a robot?"
ABORT with a clear error to the user: "Generic-URL mode can't render this site: the page is showing a bot-detection / captcha challenge under headless Chrome. Try a different URL, or run a real-user version of the page first to verify it loads cleanly."
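A minimal sketch of the bot-block scan, assuming the fetched body is available as an HTML string. The function name looks_bot_blocked is invented here, and matching is done case-insensitively on a lowercased copy, which is slightly broader than the literal strings listed. A plain substring test on "captcha" can also fire on pages that merely discuss captchas, so treat this as a heuristic.

```python
import re

# Phrases from the list above, lowercased for case-insensitive matching.
BLOCK_PHRASES = (
    "verify you are human",
    "verify you are not a robot",
    "captcha",  # also covers "CAPTCHA" and "reCAPTCHA" after lowercasing
    "403 forbidden",
    "access denied",
    "we're sorry, something went wrong",
)
# A <title> or <h1> whose entire text is one of the robot-check phrases.
ROBOT_HEADING_RE = re.compile(
    r"<(title|h1)[^>]*>\s*(robot check|are you a robot\?)\s*</\1>"
)


def looks_bot_blocked(body: str) -> bool:
    """Return True if the page looks like a bot-detection or captcha wall."""
    lowered = body.lower()
    if any(phrase in lowered for phrase in BLOCK_PHRASES):
        return True
    # Cloudflare challenge: both markers must be present.
    if "just a moment" in lowered and "cf-chl-bypass" in lowered:
        return True
    return bool(ROBOT_HEADING_RE.search(lowered))
```

A True result maps to the abort path: surface the error message to the user and author no beats.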
B. Cookie / consent-banner detection (defuse with extra_css + optional click):
Scan the response for these patterns (case-insensitive):
  • IDs / classes starting with onetrust-, truste-, cookie-banner, cookie-consent, gdpr-, consent-, cmp-
  • Buttons matching (?i)accept (all )?cookies / (?i)agree.{0,10}cookies / (?i)i (accept|agree)
  • Apple-specific banner: id ac-gdpr-banner or class as-globalfooter-curtain
  • Google consent: [role="dialog"] with text "Before you continue"
If detected, set cookie_banner_present = true. Defense in depth: the recording uses BOTH:
  1. CSS injection (extra_css) in the capture_website call to hide common banners universally; even if the click below misses, the banner is visually gone.
  2. A click timed_action at at_s: 0.0 against the most likely dismissal selector (extracted from the WebFetch DOM, e.g. #onetrust-accept-btn-handler, [aria-label*="Accept all" i], button[id*="accept"]).
The extra_css payload (use this verbatim; covers ~80% of consent platforms):
#onetrust-banner-sdk, #onetrust-pc-sdk, #onetrust-consent-sdk { display: none !important; }
#truste-consent-track, #truste-consent-content, .truste_box_overlay { display: none !important; }
[id*="gdpr-cookie"], [id*="cookie-consent"], [id*="cookie-banner"] { display: none !important; }
[class*="cookie-banner"], [class*="cookie-consent"], [class*="consent-banner"] { display: none !important; }
[class*="CookieBanner"], [class*="CookieConsent"], [class*="ConsentBanner"] { display: none !important; }
#ac-gdpr-banner, .as-globalfooter-curtain { display: none !important; }  /* Apple */
[role="dialog"][aria-label*="cookie" i], [role="dialog"][aria-label*="consent" i] { display: none !important; }
.cmp-container, .cmp-modal, .cmp-banner { display: none !important; }
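The detection half of section B could be approximated with the Python sketch below. It covers only the first two pattern groups (id/class prefixes and button text); the Apple and Google checks are omitted for brevity. The function name cookie_banner_present mirrors the flag in the text, but the regex-based HTML scanning is an assumption of this example and is much cruder than a real DOM query.

```python
import re

PREFIXES = (
    "onetrust-", "truste-", "cookie-banner", "cookie-consent",
    "gdpr-", "consent-", "cmp-",
)
# Captures the value of any id="..." or class="..." attribute.
ATTR_RE = re.compile(r'(?:id|class)\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)
# Captures the inner text of <button> elements.
BUTTON_RE = re.compile(r"<button\b[^>]*>(.*?)</button>", re.IGNORECASE | re.DOTALL)
BUTTON_TEXT_RES = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"accept (all )?cookies", r"agree.{0,10}cookies", r"i (accept|agree)")
]


def cookie_banner_present(html: str) -> bool:
    """Return True if the HTML shows signs of a cookie / consent banner."""
    for attr in ATTR_RE.finditer(html):
        # A class attribute may hold several space-separated names.
        for name in attr.group(1).lower().split():
            if name.startswith(PREFIXES):
                return True
    for button in BUTTON_RE.finditer(html):
        if any(rx.search(button.group(1)) for rx in BUTTON_TEXT_RES):
            return True
    return False
```

Even when this returns False, injecting the extra_css payload is harmless, since selectors that match nothing have no effect.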
C. Real-DOM element identification (emit concrete selectors):
Generic CSS selectors (h1, [class*="hero"], section h2) work on semantic / well-marked-up sites but miss obfuscated class names on big-name corporate sites (apple.com uses tile-headline / as-headline-section-title, not hero-*). For each beat, prefer the actual DOM elements observed in the WebFetch:
  • Read the rendered HTML/markdown WebFetch returned. Note the page's actual primary <h1> text and class.
  • Note the page's section structure (h2 headings + their parent containers).
  • Note any prominent CTA / signup / pricing element.
  • Emit zoom_target.selector using the actual class or id observed, falling back to semantic structure (main > section:nth-of-type(N) h2) when class names look auto-generated (Tailwind _1a2b3c, CSS modules module__hero___xYz).
D. SPA / lazy-render detection (bump initial wait):
If the WebFetch response has fewer than 3 visible headings / minimal text content, the page may be SPA-rendered post-domcontentloaded. Emit a longer initial wait action ({type: "wait", at_s: 0.0, ms: 2500}) before any beat fires, instead of the default 600ms settle.
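One way to sketch the heading-count half of this heuristic is shown below. The function name initial_wait_action is hypothetical, the "minimal text content" signal is not modelled, and returning None is this example's way of saying "leave the default 600ms settle in place".

```python
import re
from typing import Optional

# Matches opening tags <h1> through <h6>, but not <header> or <hr>.
HEADING_RE = re.compile(r"<h[1-6]\b", re.IGNORECASE)


def initial_wait_action(html: str, min_headings: int = 3) -> Optional[dict]:
    """Return a longer wait action for sparse pages, or None otherwise."""
    heading_count = len(HEADING_RE.findall(html))
    if heading_count < min_headings:
        # Likely rendered client-side: give the SPA time to paint.
        return {"type": "wait", "at_s": 0.0, "ms": 2500}
    return None
```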
E. --focus is honored when supplied (do not solicit):
Without --focus, select beats from generic structure cues and proceed silently with a confident first attempt. Do not ask the user "what should I focus on?" before firing; users iterate by re-running with --focus "the X feature" if the first pass misses the angle they wanted. With --focus supplied, anchor beat selection on the phrase: use concrete page sections that match it and ignore irrelevant marketing chrome.

Step 3.0 — Required README section scan (GitHub mode only, no MCP call)

Before authoring the beat sheet, scan the README (case-insensitive, full-text) for any of these section names. If a match is found, you must add a dedicated beat for that section in Step 3, replacing one of the generic beats 4–5 if necessary:
| README contains... | Required beat |
|---|---|
| how it works | scroll_to that heading; zoom article h2:has(#user-content-how-it-works) |
| audio layer / audio timeline | scroll_to the audio-layer diagram; zoom on the rendered figure or its surrounding heading |
| claude code / mcp integration | scroll_to that section; zoom article pre or .highlight (terminal screenshot / code block) |
| architecture / system design | scroll_to that section; zoom article h2:has(#user-content-architecture) |
| features (when prominent at top) | scroll_to that heading; zoom article h2:has(#user-content-features) |
| getting started / quick start / installation | scroll_to that heading; zoom article h2:has(#user-content-installation) (or the matching slug); falls back to article pre if you want the install code block instead |
| usage / examples | scroll_to that heading; zoom article h2:has(#user-content-usage) (or the matching slug), or the first code block under it |
GitHub heading slug rule: lowercase, spaces → dashes, strip non-[a-z0-9-] characters. So "How it works" → #user-content-how-it-works, "Quick Start" → #user-content-quick-start. GitHub injects the <a id="user-content-{slug}"> anchor inside each rendered <hN>, so hN:has(#user-content-{slug}) reliably grabs the heading element across any GitHub README.
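The slug rule as stated translates directly into a few lines of Python. The helper name heading_selector and its level parameter are assumptions for this example, and it implements only the three steps given here (lowercase, spaces to dashes, strip other characters), not any additional handling GitHub may apply to duplicate or non-ASCII headings.

```python
import re


def heading_selector(heading: str, level: int = 2) -> str:
    """Build the article hN:has(#user-content-{slug}) selector for a heading."""
    slug = heading.lower().replace(" ", "-")  # lowercase, spaces -> dashes
    slug = re.sub(r"[^a-z0-9-]", "", slug)    # strip everything else
    return f"article h{level}:has(#user-content-{slug})"
```

For example, heading_selector("How it works") yields article h2:has(#user-content-how-it-works), matching the first row of the table.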
Selector contract: bbox_selector needs to be vanilla CSS that resolves via document.querySelector (capture_website runs the post-action smooth-scroll JS via page.evaluate, which uses the browser's native selector engine). Avoid Playwright extensions like :has-text("..."), text=..., or :visible. Those resolve in Playwright's page.query_selector, so the bbox capture finds the element, but they silently fail in the smooth-scroll's document.querySelector. The page then never scrolls to the target, and bbox.y ends up at document-Y instead of top - 60 px, which trips Step 8b's bbox.y > recording_viewport.h degenerate filter and falls back to default-position zoom. CSS Level 4 :has(...) is vanilla and supported in modern Chromium.
These sections are the highest-information visuals in most explainer-worthy repos. Missing them produces a generic walkthrough; including them gives the explainer a concrete "show, don't tell" beat. The original tarball SKILL.md flagged the first four with SPECIAL rules in the Gemini prompt; this Step 3.0 promotes them from incidental guidance to a hard requirement and adds three more high-signal headings common in OSS READMEs.

Step 3 — Author beat sheet (main thread, no MCP call)

Write a JSON array of 8–10 beats, with a hard total duration of 65–80 seconds and a hard total word count of 165–200 words (assuming a speaking rate of 2.5 words/sec). Each beat:
jsonc
{
  "t_start": 0.0,
  "t_end": 7.5,
  // action.type is one of "navigate" | "scroll_to" | "hover"
  "action": { "type": "navigate", "url": "...", "selector": "..." },
  "zoom_target": { "selector": "...", "description": "..." },
  "vo_text": "exact words to speak (1 to 2 conversational sentences)"
}
Hard constraints (validate before emitting the beat sheet; reject the draft if any fails):
  1. Every beat needs all five fields: t_start, t_end, action (with type and url), zoom_target (with selector), vo_text. Missing fields ⇒ reject and re-author. (Mirrors tarball's github_explainer.py:183-190 validation pass.)
  2. t_start of beat 0 = 0.0; t_end[i] == t_start[i+1] (continuity).
  3. len(vo_text.split()) / 2.5 ≈ t_end - t_start per beat. Aim for ±10% of this estimate; if your draft is denser than 2.5 wps, tighten the vo_text until it fits.
  4. Total t_end of last beat ≤ 80 seconds. (Reference output is 86.5s including intro; lipsync audio is ~83s. Kling avatar/image2video stalls reliably past ~90s of audio under current load; going over 80s risks a 20-min Kling timeout.)
  5. Total spoken word count between 165 and 200 words.
  6. Every beat's zoom_target.selector needs to be a valid CSS selector for the page that beat lands on. GitHub mode prefers GitHub-specific selectors: h1.f1, #readme, article h2, .blob-code-inner, .highlight, .octicon-star, nav. Generic-URL mode prefers robust generic selectors: h1, [role="main"], main, header, nav, .hero, .feature, section h2, [class*="cta"], [class*="hero"], button, a[href]. Selectors need to resolve on the rendered page after the beat's action settles; verify against the DOM you can see via WebFetch before emitting.
  7. vo_text is 1-2 conversational sentences. Dev voice. No stage directions. No markdown.
  8. action.url is required, and must be a valid https://... URL, when action.type == "navigate".
Self-check before Step 4: verify total_words is in [165, 200] AND total_seconds (= beats[-1].t_end) is in [65, 80]. If either misses bounds, re-author the beat sheet; do not proceed to TTS. (No need to "print" anywhere: this is an internal draft validation; just reject the draft and re-author until it passes.)
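A draft validator along these lines might look like the sketch below. It checks field presence, continuity, per-beat narration density, navigate URLs, and the two totals. The function name validate_beat_sheet and the error strings are invented here, and treating the ±10% density target as a hard failure is an assumption of this example (the text calls it something to aim for). Selector validity is not checked, since that requires the rendered DOM.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("t_start", "t_end", "action", "zoom_target", "vo_text")
WORDS_PER_SECOND = 2.5


def validate_beat_sheet(beats: List[Dict[str, Any]]) -> List[str]:
    """Return a list of violations; an empty list means the draft passes."""
    if not beats:
        return ["beat sheet is empty"]
    errors = []
    for i, beat in enumerate(beats):
        missing = [name for name in REQUIRED_FIELDS if name not in beat]
        if missing:
            errors.append(f"beat {i}: missing fields {missing}")
    if errors:
        return errors  # the remaining checks need every field present

    if not 8 <= len(beats) <= 10:
        errors.append(f"expected 8-10 beats, got {len(beats)}")
    if beats[0]["t_start"] != 0.0:
        errors.append("beat 0 must start at 0.0")
    for i in range(len(beats) - 1):
        if beats[i]["t_end"] != beats[i + 1]["t_start"]:
            errors.append(f"gap or overlap between beats {i} and {i + 1}")

    for i, beat in enumerate(beats):
        duration = beat["t_end"] - beat["t_start"]
        estimate = len(beat["vo_text"].split()) / WORDS_PER_SECOND
        # Assumption: the +/-10% target is enforced as a hard limit here.
        if abs(estimate - duration) > 0.10 * duration:
            errors.append(f"beat {i}: narration does not fit its time slot")
        action = beat["action"]
        url = str(action.get("url", ""))
        if action.get("type") == "navigate" and not url.startswith("https://"):
            errors.append(f"beat {i}: navigate action needs an https:// url")

    total_words = sum(len(beat["vo_text"].split()) for beat in beats)
    total_seconds = beats[-1]["t_end"]
    if not 165 <= total_words <= 200:
        errors.append(f"total words {total_words} outside [165, 200]")
    if not 65 <= total_seconds <= 80:
        errors.append(f"total seconds {total_seconds} outside [65, 80]")
    return errors
```

The two totals are consistent with each other: at 2.5 words per second, 165 to 200 words corresponds to 66 to 80 seconds of narration.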
Structural skeleton, GitHub mode (load-bearing for the visual contract; match origin, but Step 3.0 overrides if applicable):
  • Beat 1: navigate repo root, zoom h1.f1 (repo title), hook sentence.
  • Beats 2–3: navigate to specific source files (https://github.com/{owner}/{repo}/blob/HEAD/<path>), zoom .blob-code-inner or .highlight. Pick files that match the narration's claim; don't navigate to a file you won't talk about.
  • Beats 4–5: scroll_to README sections, zoom article h2 or #readme. If Step 3.0 surfaced required sections, replace these slots with the required ones.
  • Beats 6–7 (only if live_url survived Step 2.5): navigate to live_url, zoom nav / h1 / .hero / main / button / .feature.
  • Beat 8: back to repo root, zoom .octicon-star, outro.
Structural skeleton, Generic-URL mode:
  • Beat 1: navigate to the input URL, zoom h1 or [class*="hero"] h1 (the page's primary headline), hook sentence.
  • Beats 2–3: scroll_to the page's hero / value-prop / first feature section. Zoom .hero, [class*="hero"], [class*="feature"], or section:nth-of-type(1) h2. Pick visible elements the narration references.
  • Beats 4–5: scroll_to deeper sections: feature lists, screenshots, pricing, social proof. Zoom section h2, [class*="feature"] img, [class*="testimonial"], [class*="pricing"], or any prominent semantic element on the page.
  • Beats 6–7: scroll_to CTA / signup / demo embed. Zoom [class*="cta"], button, a[class*="button"], or [id*="signup"]. (No live-demo navigation in generic mode; the input URL IS the demo.)
  • Beat 8: scroll_to footer / closing element, zoom footer h2, footer, or back to top with h1. Outro sentence.
If
--focus
is supplied, weave its angles into
vo_text
without mutating the structural skeleton. Prefer CSS selectors over
text_content
in
zoom_target.selector
— bbox capture is selector-only (see Known gaps).
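Author a JSON array of 8-10 beats, with total duration strictly within 65-80 seconds and total spoken words strictly within 165-200 (assuming a speaking rate of 2.5 words/second). Each beat has this shape:
jsonc
{
  "t_start": 0.0,
  "t_end": 7.5,
  "action": { "type": "navigate" | "scroll_to" | "hover", "url": "...", "selector": "..." },
  "zoom_target": { "selector": "...", "description": "..." },
  "vo_text": "exact words to speak, 1-2 conversational sentences"
}
Hard constraints (verify before emitting the beat sheet; if any is violated, reject the draft and re-author):
  1. Every beat must contain all five fields: t_start, t_end, action (with type, url), zoom_target (with selector), vo_text. Missing field → reject and re-author. (Matches the validation in the tarball's github_explainer.py:183-190.)
  2. Beat 0 has t_start = 0.0; t_end[i] == t_start[i+1] (contiguity).
  3. len(vo_text.split()) / 2.5 ≈ t_end - t_start (per beat). Target ±10% tolerance; if a draft is denser than 2.5 words/second, trim vo_text until it fits.
  4. The last beat's t_end ≤ 80 seconds. (The reference output was 86.5s including the intro; the lipsync audio was ~83s. Under current load, Kling avatar/image2video stalls reliably once audio exceeds ~90s; going past 80 seconds invites the 20-minute Kling timeout.)
  5. Total spoken words are within 165-200.
  6. Each beat's zoom_target.selector must be a valid CSS selector on that beat's page. GitHub mode prefers GitHub-specific selectors: h1.f1, #readme, article h2, .blob-code-inner, .highlight, .octicon-star, nav. Generic-URL mode prefers robust generic selectors: h1, [role="main"], main, header, nav, .hero, .feature, section h2, [class*="cta"], [class*="hero"], button, a[href]. The selector must resolve on the rendered page after the beat's action completes; verify it against the DOM seen via WebFetch before emitting.
  7. vo_text is 1-2 conversational sentences. Developer voice. No stage directions. No Markdown.
  8. When action.type == "navigate", action.url is required and must be a valid https://... URL.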
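Constraints 1, 2, 3, and 8 are mechanical enough to check in code. The following is a minimal sketch, assuming beats are plain dicts; beat_sheet_errors is a hypothetical helper, treating the ±10% density target as a hard limit is an assumption, and the url requirement is only enforced for navigate beats. Selector validity (constraint 6) needs the page DOM and is not covered here.

```python
WORDS_PER_SECOND = 2.5
REQUIRED_FIELDS = ("t_start", "t_end", "action", "zoom_target", "vo_text")


def beat_sheet_errors(beats, tolerance=0.10):
    """Return a list of human-readable violations; an empty list means the draft passes."""
    errors = []
    for i, beat in enumerate(beats):
        missing = [name for name in REQUIRED_FIELDS if name not in beat]
        if missing:
            errors.append(f"beat {i}: missing fields {missing}")
            continue
        action, zoom = beat["action"], beat["zoom_target"]
        if "type" not in action or "selector" not in zoom:
            errors.append(f"beat {i}: action.type and zoom_target.selector are required")
        if action.get("type") == "navigate" and not str(action.get("url", "")).startswith("https://"):
            errors.append(f"beat {i}: navigate needs an https:// action.url")
        # Narration density: words / 2.5 should land within tolerance of the slot length.
        duration = beat["t_end"] - beat["t_start"]
        predicted = len(beat["vo_text"].split()) / WORDS_PER_SECOND
        if duration <= 0 or abs(predicted - duration) > tolerance * duration:
            errors.append(f"beat {i}: {predicted:.1f}s of narration in a {duration:.1f}s slot")
    if beats and beats[0].get("t_start") != 0.0:
        errors.append("beat 0: t_start must be 0.0")
    for i in range(len(beats) - 1):
        if beats[i].get("t_end") != beats[i + 1].get("t_start"):
            errors.append(f"beats {i}/{i + 1}: not contiguous")
    return errors
```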

Step 4 — TTS

Call
mcp__pika__generate_speech
with
provider: "minimax-tts"
,
text: <full vo_text join>
, optional
voice_id
. Capture
result.audio_url
(the dispatcher returns audio under
audio_url
, not
url
) and
result.duration_seconds
. Voice defaults to identity-store injection in plugin mode.
Stale-voice fallback detection (AGNT-231): the dispatcher retries once with the default
Calm_Woman
voice on Minimax
status_code:2054
(voice id not found — typically a per-agent workspace pointer that Minimax auto-deleted after 7 days of inactivity). On retry success the response carries two extra fields beyond the documented schema (passthrough):
voice_id_requested
(the planted-but-stale id the worker tried first) and
fallback_reason: "invalid_minimax_voice_id"
. If you see
fallback_reason == "invalid_minimax_voice_id"
in the response, surface a one-line note to the user along the lines of:
"your registered voice expired on Minimax (auto-GC'd after 7 days of inactivity); we used the system default. Re-clone via
clone_voice
if you want personalization back." The render does NOT fail — it just uses the default voice — so this is informational, not a retry trigger.
Cookie-banner audio padding (Generic-URL mode with
cookie_banner_present == true
from Step 2.6 §B):
prepend MiniMax's pause marker
<#1.5#>
to the
text:
argument before calling
generate_speech
. MiniMax's
speech-2.8-hd
honors
<#N#>
as N-second silence; the returned
audio_url
and
duration_seconds
include the 1.5s lead-in natively. This aligns the audio with the screen recording's cookie-dismissal +1.5s offset applied in Step 4.5.
Fallback (only if smoke-test shows the marker is ignored on this voice): call
generate_speech
normally, then
mcp__pika__edit_audio_mix
to overlay the result onto a 1.5s silent base at offset 1.5s. Then call
mcp__pika__analyze_media(url=<padded_audio_url>)
to probe the padded duration and rebind
duration_seconds = result.duration_seconds
before Step 4.5 consumes it.
analyze_media
is the single authoritative duration probe — do not rely on
edit_audio_mix
's return payload (its duration field is not contractually guaranteed).
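The request text and the stale-voice note can be sketched as two small helpers. This is a sketch under stated assumptions: build_tts_text and stale_voice_note are hypothetical names, joining the beats' vo_text with single spaces is an assumption about what "full vo_text join" means, and the result is treated as a plain dict.

```python
COOKIE_PAUSE_MARKER = "<#1.5#>"  # pause marker described above, 1.5 s of silence


def build_tts_text(beats, cookie_banner_present):
    """Join every beat's narration and prepend the pause marker in cookie mode."""
    text = " ".join(beat["vo_text"].strip() for beat in beats)
    return COOKIE_PAUSE_MARKER + text if cookie_banner_present else text


def stale_voice_note(result):
    """Return the one-line user note when the dispatcher fell back to the default voice."""
    if result.get("fallback_reason") == "invalid_minimax_voice_id":
        return ("Your registered voice expired on Minimax (auto-GC'd after 7 days of "
                "inactivity); we used the system default. Re-clone via clone_voice "
                "if you want personalization back.")
    return None  # no fallback happened, nothing to surface
```

The note is informational only: the caller keeps using result["audio_url"] and does not retry.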

Step 4.5 — Audio length verification + beat-sheet rescale

Applied to
audio_duration_seconds
post Step 4 (which includes any cookie lead-in pad). End state:
beats[].t_start
/
t_end
are absolute wall-clock seconds matching the audio playback timeline. All
beats[]
mutations happen here
; Steps 6 and 8 are read-only consumers.
Gate 1 — Kling stall ceiling (provider cap, raw audio_duration_seconds): If
audio_duration_seconds > 90
, abort and re-author the beat sheet with a tighter word budget. Kling avatar/image2video stalls past ~90s.
Gate 2 — Degenerate TTS (spoken-content length): Compute
narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)
. If
narration_duration < 30s
, retry Step 4 once (and recompute
narration_duration
from the retry's audio). If the retry also returns
narration_duration < 30s
, abort and investigate — likely failure modes: truncated MiniMax response, silent audio, vo_text not joined correctly.
Gate 3 — Rescale:
  • narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)
  • scale = narration_duration / beats[-1].t_end
  • If
    scale < 0.5
    or
    scale > 1.5
    , abort and re-author. Structurally broken TTS (or wildly off word budget); rescaling won't save it.
  • For each beat:
    beat.t_start *= scale; beat.t_end *= scale
  • If
    cookie_banner_present
    : for each beat,
    beat.t_start += 1.5; beat.t_end += 1.5
  • Final clamp:
    beats[-1].t_end = audio_duration_seconds
    (exact). Guarantees float equality of the invariant regardless of cookie mode or accumulated float drift.
After Gate 3 passes, emit a one-line operator log to surface the scale value for post-run diagnosis:
Rescaled beats by scale=X.XX (audio=Y.YYs, narration_duration=Z.ZZs, cookie_pad=W.Ws)
Advisory (not a gate): scale near 1.0 is ideal.
scale > 1.2
means audio is meaningfully slower than predicted — visuals feel "stretched" but stay in-sync.
scale < 0.85
means audio is faster — visuals feel "rushed" but in-sync. Both pass the gates; if the user reports "feels off-pace" rather than "out of sync," re-author with a tighter / looser word budget.
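The three gates and the rescale can be sketched as one function. This is a minimal sketch, assuming beats are mutable dicts; rescale_beats and the two exception classes are hypothetical names, and the single retry that Gate 2 asks for is left to the caller that catches RetryTTS.

```python
COOKIE_PAD_S = 1.5


class ReauthorBeatSheet(Exception):
    """Gate 1 or Gate 3 failed: the beat sheet needs a new word budget."""


class RetryTTS(Exception):
    """Gate 2 failed: the narration is implausibly short."""


def rescale_beats(beats, audio_duration_seconds, cookie_banner_present):
    """Rescale beats in place onto the audio timeline and return the scale used."""
    pad = COOKIE_PAD_S if cookie_banner_present else 0.0
    if audio_duration_seconds > 90:  # Gate 1: provider stall ceiling on raw audio
        raise ReauthorBeatSheet("audio exceeds 90s")
    narration_duration = audio_duration_seconds - pad
    if narration_duration < 30:  # Gate 2: degenerate TTS
        raise RetryTTS("narration shorter than 30s")
    scale = narration_duration / beats[-1]["t_end"]
    if scale < 0.5 or scale > 1.5:  # Gate 3: rescaling cannot save this draft
        raise ReauthorBeatSheet(f"scale {scale:.2f} outside [0.5, 1.5]")
    for beat in beats:
        beat["t_start"] = beat["t_start"] * scale + pad
        beat["t_end"] = beat["t_end"] * scale + pad
    beats[-1]["t_end"] = audio_duration_seconds  # final clamp, exact
    print(f"Rescaled beats by scale={scale:.2f} (audio={audio_duration_seconds:.2f}s, "
          f"narration_duration={narration_duration:.2f}s, cookie_pad={pad:.1f}s)")
    return scale
```

Scaling and shifting in one expression keeps adjacent beats contiguous, because beat i's t_end and beat i+1's t_start go through the identical computation.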

Step 5 — Preview gate (opt-in via
--preview
)

Skip Step 5 entirely by default. Proceed directly to Step 6 unless the user explicitly passed
--preview
; by default, do not generate a preview and do not ask for confirmation. This matches the industry standard for media-generation tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return; account credit balance + provider failover are the canonical guardrails.
--skip-preview
and
--yes
are accepted as no-ops for backward compatibility — they were the old opt-out flags.
If
--preview
was supplied:
  1. mcp__pika__generate_speech
    with
    text: "Hi, I'm your presenter. Let's explore this repo together."
    → preview_audio_url
    .
  2. mcp__pika__generate_lipsync
    with
    provider: <resolved_lipsync_provider>
    (defaults to
    pika
    ; honor
    --lipsync-provider kling
    if supplied),
    image: <avatar>
    ,
    audio: preview_audio_url
    → preview_lipsync_url
    (bare lipsync, ~3s). Use the same provider here as Step 9 will use for the full audio — the preview's job is to confirm the avatar+voice+provider combo before the long-pole render.
  3. Present to the user verbatim:
    Preview ready:
    <preview_lipsync_url>
    This confirms the avatar + voice combo. The full render is a long pole (~5–30 min Kling lipsync on the full audio). Reply
    yes
    to proceed, or anything else to cancel.
  4. Match
    ^(yes|go|proceed|confirm|y)$
    (case-insensitive). Anything else → STOP, no further MCP calls.
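The confirmation match in item 4 can be sketched with Python's re module. The pattern is the one given above; is_confirmed is a hypothetical helper, and stripping surrounding whitespace before matching is an assumption rather than something the step specifies.

```python
import re

CONFIRM_RE = re.compile(r"^(yes|go|proceed|confirm|y)$", re.IGNORECASE)


def is_confirmed(reply):
    """Return True only when the whole reply is one of the accepted confirmation words."""
    return CONFIRM_RE.fullmatch(reply.strip()) is not None
```

Any reply for which is_confirmed returns False means STOP, with no further MCP calls.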

Step 6 — Build
timed_actions
and record

Translate the beat sheet into
capture_website
timed_actions
. One
timed_action
per beat
— set
bbox_selector
to the beat's
zoom_target.selector
and
capture_website
captures the post-action bbox of that element internally (legacy 600 ms settle → smooth-scroll-to-
top - 60 px
→ 1300 ms post-anim → measure, all server-side).
For each beat in order, emit one entry:
  • navigate
    beats
    :
    {type: "navigate", at_s: <t_start>, url: <action.url>, bbox_selector: <zoom_target.selector>}
    . The worker navigates, waits to absolute
    at_s + 0.6 s
    , scrolls
    bbox_selector
    into view, and measures the bbox — all without the caller scheduling a follow-up step.
  • scroll_to
    /
    hover
    beats
    :
    {type: "scroll", at_s: <t_start>, selector: <action.selector or zoom_target.selector>, bbox_selector: <zoom_target.selector>}
    . The action's own
    selector
    drives the page scroll;
    bbox_selector
    drives the bbox measurement (it can be the same selector or different — usually the same). (
    capture_website
    has no
    hover
    ; scroll-into-view is the analog.)
Do NOT prepend the eight-step intro scroll-through that the tarball ran. The lipsync audio is timed from
t=0
of the beat sheet; a prepended intro shifts the screen recording forward by ~3 s while leaving the audio un-shifted, causing audio/video desync. The capture_website recording begins at
t=0
with beat 0's URL already loaded — that's the orientation the tarball's intro scroll provided, minus the desync.
Call
mcp__pika__capture_website
:
  • url: <beat 0's action.url>
  • timed_actions: <the N-element list built above>
    (one entry per beat)
  • duration_s: ceil(audio_duration_seconds)
    . Rationale:
    beats[].t_start
    and
    t_end
    have already been rescaled to the TTS audio timeline by Step 4.5, so
    duration_s
    is simply the audio length. The old
    max(...)
    defense against TTS overrun is no longer needed.
Generic-URL mode additions (per Step 2.6 pre-flight):
  • extra_css: <the cookie-banner-hiding CSS payload from Step 2.6 §B>
    — defensive: hides common consent platforms via
    display: none !important;
    so even if the optional click misses, the banner is invisible in the recording.
  • Prepend a
    wait
    action
    {type: "wait", at_s: 0.0, ms: 2500}
    for SPA / lazy-render pages (per Step 2.6 §D); use 1500ms for "normal" pages. This gives time for hero images to lazy-load, fonts to swap, and scroll-triggered animations to be ready before the first beat fires.
  • If
    cookie_banner_present
    from Step 2.6 §B
    , also prepend a
    click
    action
    {type: "click", at_s: 0.5, selector: <detected dismissal selector from WebFetch DOM>}
    . The
    beats[]
    array has already been shifted by
    +1.5s
    in Step 4.5 to account for the cookie-dismissal lag, and the TTS audio has already been padded with 1.5s of silence (Step 4); no further shifting is required here. Beat 1's
    timed_action.at_s
    reads
    beats[0].t_start
    directly, which is 1.5 in cookie mode.
  • No cookie banner action needed if
    cookie_banner_present == false
    ; just the prepended wait action.
Capture
video_url
,
recording_viewport
,
action_bboxes
. The result returns
recording_viewport: {w, h}
and
action_bboxes: [{idx, selector, found, bbox: {x,y,w,h}}]
alongside
video_url
.
action_bboxes[].idx
semantics:
the
idx
field is the position in the input
timed_actions
array.
  • GitHub mode: with one timed_action per beat,
    idx
    maps 1:1 to beat index — Step 8 uses
    entry.idx
    directly as
    beat_idx
    .
  • Generic-URL mode: the prepended
    wait
    (and optional cookie-dismissal
    click
    ) shift the array by 1 or 2. Compute
    beat_idx = entry.idx - prepend_count
    where
    prepend_count
    is 1 (wait only) or 2 (wait + click). Skip entries where
    beat_idx < 0
    (those are the prepended setup actions, not beats).
The
selector
field on each entry reports
bbox_selector
(i.e.
zoom_target.selector
), not the action's own
selector
.
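The beat-to-action translation and the idx bookkeeping can be sketched as follows. This is a sketch under stated assumptions: build_timed_actions and beat_index are hypothetical helpers, beats are plain dicts already rescaled by Step 4.5, and the slow_page flag stands in for the SPA / lazy-render verdict from Step 2.6 §D.

```python
def build_timed_actions(beats, generic_mode=False, slow_page=False, cookie_selector=None):
    """Return (timed_actions, prepend_count) for capture_website."""
    prepended = []
    if generic_mode:
        # Generic-URL mode always prepends a settle wait; 2500 ms for SPA / lazy pages.
        prepended.append({"type": "wait", "at_s": 0.0, "ms": 2500 if slow_page else 1500})
        if cookie_selector:
            prepended.append({"type": "click", "at_s": 0.5, "selector": cookie_selector})
    actions = list(prepended)
    for beat in beats:
        zoom_selector = beat["zoom_target"]["selector"]
        if beat["action"]["type"] == "navigate":
            actions.append({"type": "navigate", "at_s": beat["t_start"],
                            "url": beat["action"]["url"], "bbox_selector": zoom_selector})
        else:
            # scroll_to and hover both become a scroll; the zoom selector is the fallback target.
            actions.append({"type": "scroll", "at_s": beat["t_start"],
                            "selector": beat["action"].get("selector") or zoom_selector,
                            "bbox_selector": zoom_selector})
    return actions, len(prepended)


def beat_index(entry_idx, prepend_count):
    """Map an action_bboxes idx back to a beat index; None marks a prepended setup action."""
    idx = entry_idx - prepend_count
    return idx if idx >= 0 else None
```

Returning prepend_count alongside the list lets Step 8 reuse the same offset instead of re-deriving it from the mode.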

Step 7 — Browser chrome

mcp__pika__edit_browser_frame
:
  • video_url: <Step 6 video_url>
  • url: (live_url if GitHub-mode and survived Step 2.5 else input_url, truncated to 65 chars)
  • tab_title: <30-char title>
    — GitHub mode:
    (meta.description or repo_name or "")[:30]
    . Generic-URL mode: the page's
    <title>
    (from WebFetch in Step 2) or the URL's hostname, truncated to 30 chars. Guard against
    None
    /empty.
Returns
framed_url
(1280×800 Sonoma + chrome).
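The url and tab_title rules can be sketched as a small argument builder. This is a minimal sketch; browser_frame_args is a hypothetical helper, github_meta is assumed to be a dict with optional description and repo_name keys (None outside GitHub mode), and live_url is assumed to be None when it did not survive Step 2.5.

```python
from urllib.parse import urlparse


def browser_frame_args(video_url, input_url, live_url=None, github_meta=None, page_title=None):
    """Build the edit_browser_frame arguments with the 65 / 30 character truncations."""
    github_mode = github_meta is not None
    shown_url = live_url if (github_mode and live_url) else input_url
    if github_mode:
        title = github_meta.get("description") or github_meta.get("repo_name") or ""
    else:
        # Fall back to the hostname when the page has no usable <title>.
        title = page_title or urlparse(input_url).hostname or ""
    return {"video_url": video_url, "url": shown_url[:65], "tab_title": title[:30]}
```

The chained `or` expressions are what guard against None and empty values before slicing.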

Step 8 — Build
zoom_keyframes
and apply

Constants:
  • INTRO_BEATS = 2
    — gates by beat-sheet index. Skips zoom on beat indices 0 and 1 ("Beat 1" and "Beat 2" in the structural skeleton above).
  • HOLD_GAP = 0.6
    — seconds of 1.0× before each zoom-in and after each zoom-out.
  • MIN_BEAT_DUR = 1.5
    — beats shorter than this are skipped (no room for a meaningful zoom).
  • SCALE = 1.35
    (precise element-targeted zoom).
  • FALLBACK_SCALE = 1.25
    (default-position fallback when no usable bbox).
  • FALLBACK_RAMP = 0.4
    .
Note:
beats[].t_start
/
t_end
were rescaled (and cookie-shifted if applicable) to the audio timeline by Step 4.5. HOLD_GAP (0.6s), MIN_BEAT_DUR (1.5s), and the 1.0s interior-interval check all operate on those final values — they are real visual seconds on the rendered video.
edit_browser_frame
's inner-content offsets:
CONTENT_X=56, CONTENT_Y=108, CONTENT_W=1168, CONTENT_H=637
(verified against the worker's
edit_browser_frame/main.py
).
Coord transform (recording px → framed px):
cx_framed = 56  + (bbox.x + bbox.w/2) * (1168 / recording_viewport.w)
cy_framed = 108 + (bbox.y + bbox.h/2) * (637  / recording_viewport.h)
Build the zoom list with a per-beat default + bbox override pattern. The legacy rig followed an "every non-intro beat gets a zoom — bbox-derived if available, default-position otherwise" rule. Reproduce that here:
Step 8a — Pre-fill default-position keyframes for every non-intro, long-enough beat.
Constants for the default position:
  • DEFAULT_CX = 56 + 1168 // 2
    (screen center of the framed canvas)
  • DEFAULT_CY = 108 + 637 // 3
    (upper-third of the content area, where most GitHub UI prominence lives)
Walk the beat sheet from index
INTRO_BEATS
(= 2) to the end. For each beat:
  • If
    t_end - t_start < MIN_BEAT_DUR
    (1.5s), skip — too short for a meaningful zoom.
  • Compute the keyframe's interior interval as
    [t_start + HOLD_GAP, t_end - HOLD_GAP]
    . If that interval is shorter than 1.0s, skip.
  • Otherwise pre-fill that beat's slot in a per-beat map (call it
    zoom_keyframes_by_beat[beat_idx]
    ) with
    {cx: DEFAULT_CX, cy: DEFAULT_CY, scale: FALLBACK_SCALE (1.25), ramp_s: FALLBACK_RAMP (0.4)}
    plus the trimmed
    t_start
    /
    t_end
    .
Step 8b — Override with bbox-derived precise zoom where
action_bboxes
provided a usable measurement.
For each entry in
action_bboxes
:
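  • beat_idx = entry.idx - prepend_count
    (Step 6 emits one timed_action per beat;
    prepend_count
    is 0 in GitHub mode and 1 or 2 in Generic-URL mode, per the idx semantics in Step 6). If
    beat_idx < INTRO_BEATS
    , skip.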
  • If
    entry.found
    is false, skip.
  • If the beat isn't already in
    zoom_keyframes_by_beat
    (was filtered out in Step 8a by
    MIN_BEAT_DUR
    /
    1.0s
    rules), skip.
  • Filter degenerate bboxes: skip if
    bbox.y > recording_viewport.h
    (offscreen capture — page didn't scroll the element into view in time) or
    bbox.h > recording_viewport.h * 1.5
    (full-page
    <main>
    element — yields a meaningless zoom center).
  • Compute
    cx_framed
    /
    cy_framed
    from the bbox center using the recording-px → framed-px transform shown above. Override the beat's slot with
    {cx: cx_framed, cy: cy_framed, scale: SCALE (1.35), ramp_s: min(0.5, (t_end - t_start) * 0.15)}
    .
Final list: sort the values of
zoom_keyframes_by_beat
by
t_start
to produce the
zoom_keyframes
array.
This guarantees every non-intro, long-enough beat gets a zoom — precise when bbox capture worked, default-positioned otherwise. Avoids the "flat video for the whole runtime" failure mode.
If
len(zoom_keyframes) > 0
, call
mcp__pika__edit_animate_zoom
with
video_url: framed_url, zoom_keyframes
. Returns
zoomed_url
. Otherwise (no qualifying beats — should be rare given Step 3's 65-80s constraint) skip and use
framed_url
as
zoomed_url
.
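Steps 8a and 8b can be sketched as a single builder. This is a sketch under stated assumptions: build_zoom_keyframes is a hypothetical helper, beats and action_bboxes are plain dicts shaped as described in Steps 4.5 and 6, prepend_count is the Step 6 idx offset (0 in GitHub mode), and ramp_s is computed from the beat's untrimmed duration, which is one reading of the formula above.

```python
INTRO_BEATS, HOLD_GAP, MIN_BEAT_DUR = 2, 0.6, 1.5
SCALE, FALLBACK_SCALE, FALLBACK_RAMP = 1.35, 1.25, 0.4
CONTENT_X, CONTENT_Y, CONTENT_W, CONTENT_H = 56, 108, 1168, 637
DEFAULT_CX = CONTENT_X + CONTENT_W // 2  # 640
DEFAULT_CY = CONTENT_Y + CONTENT_H // 3  # 320


def build_zoom_keyframes(beats, action_bboxes, viewport, prepend_count=0):
    """Return zoom keyframes sorted by t_start: default slots first, bbox overrides second."""
    by_beat = {}
    # Step 8a: default-position keyframe for every non-intro, long-enough beat.
    for i, beat in enumerate(beats):
        if i < INTRO_BEATS or beat["t_end"] - beat["t_start"] < MIN_BEAT_DUR:
            continue
        start, end = beat["t_start"] + HOLD_GAP, beat["t_end"] - HOLD_GAP
        if end - start < 1.0:
            continue
        by_beat[i] = {"t_start": start, "t_end": end, "cx": DEFAULT_CX, "cy": DEFAULT_CY,
                      "scale": FALLBACK_SCALE, "ramp_s": FALLBACK_RAMP}
    # Step 8b: override with the measured bbox center where the capture was usable.
    for entry in action_bboxes:
        i = entry["idx"] - prepend_count
        if i < INTRO_BEATS or not entry.get("found") or i not in by_beat:
            continue
        box = entry["bbox"]
        if box["y"] > viewport["h"] or box["h"] > viewport["h"] * 1.5:
            continue  # offscreen capture or full-page element
        beat = beats[i]
        by_beat[i].update(
            cx=CONTENT_X + (box["x"] + box["w"] / 2) * (CONTENT_W / viewport["w"]),
            cy=CONTENT_Y + (box["y"] + box["h"] / 2) * (CONTENT_H / viewport["h"]),
            scale=SCALE,
            ramp_s=min(0.5, (beat["t_end"] - beat["t_start"]) * 0.15),
        )
    return sorted(by_beat.values(), key=lambda keyframe: keyframe["t_start"])
```

An empty return value corresponds to the "skip and use framed_url as zoomed_url" branch.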

Step 9 — Lipsync the full audio

mcp__pika__generate_lipsync
:
  • provider: <resolved_lipsync_provider>
    . Default:
    pika
    (parrot a2v). Honor
    --lipsync-provider kling
    if explicitly passed.
  • image: <avatar>
  • audio: <Step 4 audio_url>
  • kling-only knobs (since
    pika-mcp-server
    BACK-339, 2026-05-10): when
    provider == "kling"
    , add
    mode: "pro"
    and
    prompt: "talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"
    for the polished-presenter feel. Both are silently ignored on
    pika
    (parrot has its own driver).
Provider tradeoffs:
Provider | Wall-clock | Head motion | When to use
pika (default) | ~2–5 min | Slightly more dramatic, naturalistic | Default for most runs: fast iteration, watchable output, ~10× faster than kling
kling (opt-in) | ~5–30 min | Minimal, face-centered, presenter-style | High-stakes renders where the avatar must read like a polished presenter; tolerate the long pole
Server-side-await covers the call inline; if the response shape is
{task_id, status: "queued"}
, poll
mcp__pika__task_status
in a tight loop (no sleep) until the status reaches a terminal state (
completed
,
failed
, or
cancelled
). On
completed
, capture
lipsync_url
. On
failed
/
cancelled
, fall back to the other provider (kling ↔ pika) per the failover note below.
Failover:
  • If
    pika
    fails (rare — parrot a2v is robust at typical explainer audio lengths) → retry once with
    provider: "kling"
    .
  • If
    kling
    stalls past the worker's 1200s ceiling (visible as repeated
    processing
    status with no completion) → fall back to
    provider: "pika"
    . Step 4.5's audio-length gate should catch the long-audio case before it gets here, but the failover handles the residual risk.
Why pika is the default:
  • Speed — typical explainer wall-clock drops from ~10–15 min to ~5–7 min total because lipsync is the long pole.
  • Quality is good enough — parrot a2v is naturalistic; the slight extra head motion reads as engaging rather than distracting in a 60-80s clip with avatar circle PiP.
  • Kling-mode-pro polish is mostly invisible inside the 246-pixel circle anyway — face area is too small for the minimal-head-motion difference to register on most viewers.
For the canonical "polished presenter" feel of the original tarball reference output, pass
--lipsync-provider kling
explicitly.
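The provider-specific arguments and the failover swap can be sketched as two helpers. This is a minimal sketch; lipsync_args and failover_provider are hypothetical names, and the arguments are modelled as a plain dict to be passed to mcp__pika__generate_lipsync.

```python
KLING_PROMPT = ("talking head, face centered, mouth syncs to audio, "
                "minimal head movement, professional presenter")


def lipsync_args(provider, avatar, audio_url):
    """Build the generate_lipsync arguments, adding the kling-only knobs when relevant."""
    args = {"provider": provider, "image": avatar, "audio": audio_url}
    if provider == "kling":
        args["mode"] = "pro"
        args["prompt"] = KLING_PROMPT
    return args


def failover_provider(provider):
    """Return the other provider for the single failover attempt (kling <-> pika)."""
    return "pika" if provider == "kling" else "kling"
```

Omitting the kling-only knobs for pika is a tidiness choice; the step notes they would be silently ignored there anyway.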

Step 10 — PiP composite

mcp__pika__edit_pip
:
  • main_video_url: <zoomed_url>
  • overlay_video_url: <lipsync_url>
  • shape: "circle"
  • size_px: 246
    ← pixel-pinned 246px outer diameter (240 inner avatar + 3+3 stroke ring); matches tarball's
    CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2
  • stroke_width_px: 3
  • stroke_color: "white"
  • position_px: {x: 20, y: 476}
    ← 800 − 246 − 78
    for dock clearance (matches tarball's
    H − CIRCLE_OUT − 78
    )
Pass
size_px
, not
size
; the fields are mutually exclusive. Returns
final_url
.
Master-duration / audio-source contract (matching tarball
github_explainer.py:418-419, 531-533, 578-582
):
edit_pip
uses
shortest=1
semantics by default, which means the composite's duration is the shorter of (zoomed screen recording) and (lipsync video). Step 6's
duration_s = ceil(audio_duration_seconds)
makes the screen recording at least as long as the audio (Step 4.5 rescaled beats to the audio timeline), so in practice the lipsync video sets the composite duration. Audio comes from the lipsync video's audio track (the lipsync embeds the original TTS audio); the standalone
audio_url
is not re-mixed. If the lipsync video is shorter than the screen recording (Kling sometimes trims trailing silence), the screen will get cut off at the lipsync end — accept this; the alternative (looping the screen) is worse for explainer content.
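The pixel-pinned geometry can be derived rather than hard-coded, which makes the 246 and 476 values easy to audit. This is a minimal sketch; pip_args is a hypothetical helper, CIRCLE_OUT, CIRCLE_SIZE, and STROKE follow the tarball names quoted above, and FRAME_H and DOCK_CLEARANCE are illustrative names for the 800 and 78 in the position formula.

```python
FRAME_H = 800          # framed canvas height
CIRCLE_SIZE = 240      # inner avatar diameter
STROKE = 3             # white ring width
DOCK_CLEARANCE = 78    # gap kept below the circle for the dock
CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2  # 246 outer diameter


def pip_args(main_video_url, overlay_video_url):
    """Build the edit_pip arguments with the pixel-pinned avatar circle."""
    return {
        "main_video_url": main_video_url,
        "overlay_video_url": overlay_video_url,
        "shape": "circle",
        "size_px": CIRCLE_OUT,  # size_px only; size is mutually exclusive with it
        "stroke_width_px": STROKE,
        "stroke_color": "white",
        "position_px": {"x": 20, "y": FRAME_H - CIRCLE_OUT - DOCK_CLEARANCE},
    }
```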

Step 11 — Burn captions

Call `mcp__pika__add_captions(video_url=<final_url>, style="classic")`. `classic` renders a bottom subtitle bar, the right register for an explainer video (use `tiktok` / `hormozi` / `karaoke` only when the user explicitly asks for word-level highlight). The audio is extracted server-side from the PiP composite's lipsync track, so transcription matches the narration verbatim. Capture the result as `captioned_url`.
Skip this step only if the user passed `--no-captions` (parsed in Step 1); captions are on by default. (Note: `/pika:podcast` does not burn captions, because narration in an explainer is more transcription-friendly than fast two-host dialogue.)
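The caption decision described above reduces to a small rule. The sketch below is illustrative: `caption_plan` and `WORD_LEVEL_STYLES` are hypothetical names, and the function only decides the `style` argument; it does not call `mcp__pika__add_captions`.

```python
from typing import Optional

# Styles the text reserves for explicit word-level highlight requests.
WORD_LEVEL_STYLES = {"tiktok", "hormozi", "karaoke"}


def caption_plan(no_captions: bool = False,
                 requested_style: Optional[str] = None) -> Optional[dict]:
    """Return arguments for the captions call, or None when Step 11 is skipped."""
    if no_captions:
        # --no-captions is the only documented way to skip the step.
        return None
    # Anything other than an explicit word-level style falls back to classic.
    if requested_style in WORD_LEVEL_STYLES:
        return {"style": requested_style}
    return {"style": "classic"}


print(caption_plan())
print(caption_plan(requested_style="karaoke"))
print(caption_plan(no_captions=True))
```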

Step 12 — Return

Emit `captioned_url` (or `final_url` if Step 11 was skipped) on one line: `Done: <url>`.
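As a small illustration of the fallback, the final line can be produced like this. `done_line` is a hypothetical helper name, and the URLs are placeholders.

```python
from typing import Optional


def done_line(final_url: str, captioned_url: Optional[str] = None) -> str:
    """Prefer the captioned URL; fall back to the PiP composite when Step 11 was skipped."""
    return f"Done: {captioned_url or final_url}"


print(done_line("https://example.com/final.mp4",
                "https://example.com/captioned.mp4"))
print(done_line("https://example.com/final.mp4"))
```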

Load-bearing phrases

These anchors preserve the visual contract across page types:

| Phrase | Where | Why load-bearing |
|---|---|---|
| `vanilla CSS that resolves via document.querySelector` | Selector contract | Keeps scroll, bbox capture, and zoom targeting aligned inside `capture_website`. |
| `GitHub URLs activate repo-aware mode` | Mode detection | Prevents generic product-page beats from replacing README/code walkthrough beats. |
| `8-10 beats`, `65-80 seconds`, `165-200 words` | Beat-sheet authoring | Keeps narration, screen recording, lipsync, and captions within the reliable duration envelope. |
| `all beats[] mutations happen here` | Audio rescale step | Ensures later capture/zoom/composite steps consume one stable timeline. |
| `extra_css` cookie-banner hiding payload | Generic URL pre-flight | Reduces first-frame banner occlusion when a banner click misses. |

Engine choice: Pika lipsync default, Kling opt-in

Default to Pika/parrot lipsync because it is faster and keeps most explainers in a short iteration loop. Use Kling only when the user explicitly requests `--lipsync-provider kling`, or when a high-stakes render needs a more centered presenter look and can tolerate a much longer lipsync stage (the long pole of the pipeline). Screen capture, browser frame, zoom, PiP, and captions remain deterministic edit/composite steps around that lipsync choice.
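The provider choice can be sketched as a small options builder. The `mode` and `prompt` fields and the prompt text come from the Known gaps section of this document. The `provider` key, the `polished` switch, and the function name are assumptions for illustration, not the confirmed `generate_lipsync` schema.

```python
from typing import Optional

# Prompt text quoted in this document's Known gaps section.
PRESENTER_PROMPT = (
    "talking head, face centered, mouth syncs to audio, "
    "minimal head movement, professional presenter"
)


def lipsync_options(provider_flag: Optional[str] = None,
                    polished: bool = False) -> dict:
    """Build hypothetical lipsync options: Pika by default, Kling only on request."""
    provider = provider_flag or "pika"
    if provider not in ("pika", "kling"):
        raise ValueError(f"unsupported lipsync provider: {provider}")
    options = {"provider": provider}  # key name is an assumption
    if provider == "kling" and polished:
        # Polished-presenter mode applies to the kling provider only.
        options["mode"] = "pro"
        options["prompt"] = PRESENTER_PROMPT
    return options


print(lipsync_options())
print(lipsync_options("kling", polished=True))
```

Keeping Pika as the fallthrough means an omitted flag never lands on the slower engine by accident.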

Runtime expectations

Typical wall-clock is 5-10 minutes with Pika lipsync, or 10-30+ minutes with Kling lipsync:

| Step | Wall clock | Notes |
|---|---|---|
| URL read + pre-flight | 10-60s | GitHub README scan or generic URL DOM/cookie checks |
| TTS + audio rescale | 30-90s | Beat timing is normalized after actual audio length |
| Screen recording | 60-180s | Depends on page load and navigation count |
| Browser frame + zooms | 1-3 min | Deterministic edit/composite stages |
| Lipsync | 2-5 min Pika / 5-30 min Kling | Kling is opt-in because it is the long pole |
| PiP + captions | 1-3 min | Captions skipped when `--no-captions` is set |

Known gaps (carried as follow-up server-side work)

  • Kling avatar `mode:"pro"` and `prompt` not exposed. Resolved by `pika-mcp-server` BACK-339 (PR #186, shipped 2026-05-10): `generate_lipsync` now wires both `mode` (e.g. `"pro"`) and `prompt` end-to-end for the kling provider. To enable polished-presenter mode here, pass `--lipsync-provider kling` and have the Step 9 call add `mode: "pro"` plus a prompt like `"talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"`. This is a real quality lever for reducing dramatic head motion in the lipsync, and it is no longer a server-side gap.
  • No caller-controlled white-frame trim on the screen recording. `capture_website` has internal trim heuristics but doesn't expose them to the caller. The gap is visible as a brief white flash at the start of the explainer when the page is still loading. The 800ms `wait` action at `at_s: 0.0` mitigates this somewhat by giving the page time to paint, but doesn't trim already-recorded white frames. Worker enhancement.
  • No `networkidle` wait on per-beat navigation. Tarball uses `page.goto(url, wait_until="networkidle", timeout=20000)` plus `wait_for_timeout(600)` after every navigate. `capture_website` settles to `domcontentloaded` plus the bbox-capture branch's 600 ms post-action settle (server-side, when `bbox_selector` is set), but SPA blob pages whose final render happens after `domcontentloaded` can still get bbox'd against unmounted code blocks. Worker enhancement: expose a `wait_until` knob on `timed_actions[].navigate`.
  • No per-step output-size verification gates. Tarball's `verify()` helper at `github_explainer.py:35-39` checked TTS ≥ 50KB, preview ≥ 100KB, screen ≥ 200KB, lipsync ≥ 500KB, final ≥ 1MB after each step. The MCP path returns URLs only; verifying file size would require an extra `mcp__pika__analyze_media` call per step (~30s overhead each). Worth adding once user-side latency budget allows it. For now, a downstream-failure cascade (e.g. zero-byte TTS → silent lipsync → blank composite) only surfaces at Step 11.
  • `text_content` bbox capture not implemented. `capture_website` v1 returns `action_bboxes` only for steps with a CSS `selector`. `text_content`-only steps produce no entry. Prefer CSS selectors in `zoom_target` for guaranteed zoom coverage.
  • Beat-sheet wording is non-deterministic. Running the same input twice produces different vo_text and different zoom positions. Visual kind is the contract, not pixel-exact reproduction.
  • Generic-URL mode quality varies by site. Modern indie / SaaS landing pages with semantic markup (`<h1>` + clear `<section>` + named class hooks) work well. Big-name corporate sites (apple.com, microsoft.com, amazon.com) hit several known limits: (a) bot detection: the page may serve a degraded version under headless Chrome, or a captcha; Step 2.6 §A aborts on these but the heuristics aren't exhaustive; (b) obfuscated class names: `tile-headline` instead of `hero-title` defeats generic selectors; Step 2.6 §C's WebFetch DOM scan helps but isn't perfect; (c) scroll-triggered animations don't play: IntersectionObserver-driven hero reveals fire on real user scrolls, not Playwright's `scrollIntoView`, so the recorded frame may be a static placeholder; (d) lazy-loaded images: picture/source elements with `loading="lazy"` may not have resolved by the 600ms-or-2500ms settle window, so the bbox lands on a transparent placeholder. Workarounds: prefer simpler / smaller marketing pages for launch demos, always pass `--focus "the X feature"` to anchor beat selection, and accept that big-name sites need a follow-up server PR (cookie-banner click retry + `wait_until=networkidle` + animation-trigger via `IntersectionObserver` polyfill).
  • Cookie-banner click is single-attempt. Step 2.6 §B emits one `click` against the dismissal selector extracted from the WebFetch DOM. If the WebFetch's HTML doesn't include the banner (rendered post-JS) or the selector is wrong, the click silently misses, which is why the `extra_css` payload is the load-bearing defense. Worker enhancement: support a list of fallback selectors per `click` action so the worker tries each in order.
  • Step 8b ↔ Step 6 `idx`-mapping mismatch in Generic-URL mode. Step 6 maps `beat_idx = entry.idx - prepend_count`, but Step 8b uses `beat_idx = entry.idx` naively. Pre-existing bug independent of rescaling.
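The per-step size gate mentioned in the list above could look roughly like the following. This is a sketch built from the thresholds quoted in this section, not the tarball's actual `verify()` code. It assumes KB means 1024 bytes (the source does not say) and that the caller already knows each output's size in bytes, which the MCP path does not currently provide.

```python
KB = 1024  # assumption: the source does not say whether KB means 1000 or 1024 bytes

# Minimum output sizes per step, as quoted in the Known gaps list.
MIN_BYTES = {
    "tts": 50 * KB,
    "preview": 100 * KB,
    "screen": 200 * KB,
    "lipsync": 500 * KB,
    "final": 1024 * KB,
}


def verify(step: str, size_bytes: int) -> None:
    """Raise early when a step's output is too small to be a real render."""
    minimum = MIN_BYTES[step]
    if size_bytes < minimum:
        raise ValueError(
            f"{step} output is {size_bytes} bytes, expected at least {minimum}"
        )


verify("tts", 60 * KB)  # passes silently
try:
    verify("lipsync", 0)
except ValueError as err:
    print(err)
```

Failing at the step that produced the bad file is what stops a zero-byte TTS from cascading into a silent lipsync and a blank composite.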

Auth

If any call returns 401: the user's OAuth token has expired or hasn't been issued. The next authenticated MCP call triggers OAuth automatically (browser opens for `@pika.art` Google login). For non-interactive environments, set `MCP_AUTH_TOKEN`.

Examples

GitHub-mode (repo-aware: README scan + live-demo detection):
  • `/pika:explainer https://github.com/leigest519/OpenGame`
  • `/pika:explainer https://github.com/anthropics/claude-cookbooks --focus "Claude Code MCP integration"`
  • `/pika:explainer https://github.com/openai/whisper --preview` (opt in to the preview gate when testing a new avatar)
Generic-URL mode (any non-GitHub URL; drives through the page directly):
  • `/pika:explainer https://pika.art`
  • `/pika:explainer https://vercel.com --focus "the deployment workflow"`
  • `/pika:explainer https://docs.anthropic.com/en/docs/claude-code/plugins`
  • `/pika:explainer https://your-product-page.com --avatar https://cdn.example.com/me.png --preview`