/pika:explainer
Generate a ~60–80s URL explainer video: drive a real browser through the URL along a beat-sheet timeline, generate an avatar lipsync of the narration, and composite it all in a 1280×800 macOS Sonoma frame with a 240-pixel inner avatar (246-pixel outer including 3px white stroke ring) at canvas (20, 476) and element-targeted zoom on every mid-section beat. Works on any URL — product pages, docs sites, blog posts, launches. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); all other URLs use a generic page-walkthrough flow.
Usage:
/pika:explainer <url> [--focus "angles"] [--avatar <url>] [--voice <id>] [--lipsync-provider pika|kling] [--preview] [--live-url <url>]
Behavior
Defaults — fire fast, no mid-flow confirmation
- Use identity-store defaults silently for avatar / voice. Never ask "should I use your avatar?" or "which voice?" before firing. Honor explicit overrides (`--avatar`, `--voice`) when supplied; otherwise resolve via `identity_avatar_url` / `identity_voice_id` and proceed. See Step 1 for the full resolution waterfall (including the silent fallback when identity returns null).
- No mid-flow "type yes to proceed" gates by default. Step 5 preview is opt-in via `--preview` (for power users testing new avatar/voice combos before the long-pole render); the default flow runs end-to-end without pausing.
- Do not solicit `--focus` either. Make a confident first attempt from page structure; users re-run with `--focus "X"` if the angle missed.
These defaults match the industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return. Account credit balance + provider failover (Step 9) are the canonical guardrails.
Local avatar images on Claude Desktop
Claude Desktop can't pass inline-pasted images to MCP tools yet (Anthropic-side limitation). If the user pastes a photo inline, or mentions a local file they want as `--avatar`, pause Step 1 and kindly send them this — something like:
Heads up — pasted images don't reach MCP tools on Claude Desktop yet (Anthropic limitation). Two easy options for your avatar:
- Paste a URL if it's already hosted (Imgur, S3, your site) — fastest
- Attach the image file so I can upload it before generation.
When a local file arrives, convert it to a public URL with `upload_asset` and use the returned `public_url` as `--avatar <url>` before Step 1. Already-hosted `https://...` URLs work as-is and skip this entirely. If no avatar is supplied at all, the identity-store default fires.
Step 0 — Resolve URL (empty-args menu)
Strip flags (`--focus`, `--avatar`, `--voice`, `--live-url`, `--lipsync-provider`, `--no-captions`, `--preview`, `--skip-preview`, `--yes`) and `key=value` parameters from `$ARGUMENTS`. If what remains contains no `https://...` URL (or is empty / whitespace-only), print this menu verbatim as your full response, then stop and wait for the user's next message. Calling a tool here risks recording or explaining the wrong page. If `$ARGUMENTS` already carries a URL, skip this step silently and proceed to Step 1.
**Which URL would you like me to walk through?** Works on any of:
- A GitHub repo — e.g. `https://github.com/anthropics/claude-code` (activates repo-aware mode: README scan + live-demo detection)
- A product page / launch page — e.g. `https://pika.art`
- A docs site — e.g. `https://docs.anthropic.com`
- A blog post / article URL
Output: 1280×800 macOS Sonoma frame with a bottom-left avatar lipsync and element-targeted zoom on every mid-section beat. Default flow runs end-to-end with no confirmation gates — pass `--preview` if you want a 3-second lipsync sanity check first.
Reply with the URL and I'll start.
Tip: you don't need to type `/pika:explainer` — just say things like "walk me through <url>", "make a demo video of <url>", or "explain this repo: <github-url>" and I'll fire this skill automatically.
When the user replies with a URL, treat it as the resolved input and proceed to Step 1. Do not re-prompt.
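The Step 0 check can be sketched in a few lines. This is a minimal illustration, assuming shell-style quoting in `$ARGUMENTS`; the helper name `resolve_url` and the split into value-taking versus bare flags (`VALUE_FLAGS`) are assumptions inferred from the Usage line, not part of the skill itself.

```python
import re
import shlex
from typing import Optional

# Assumption: these flags consume the following token as their value.
VALUE_FLAGS = {"--focus", "--avatar", "--voice", "--live-url", "--lipsync-provider"}
URL_RE = re.compile(r"https?://\S+")
KEY_VALUE_RE = re.compile(r"[A-Za-z_][\w-]*=.*")


def resolve_url(arguments: str) -> Optional[str]:
    """Return the first URL left after stripping flags and key=value pairs."""
    tokens = shlex.split(arguments)
    remaining = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            # Skip the flag, plus its value when the flag takes one.
            i += 2 if token in VALUE_FLAGS else 1
            continue
        if KEY_VALUE_RE.fullmatch(token):
            i += 1  # drop key=value parameters
            continue
        remaining.append(token)
        i += 1
    match = URL_RE.search(" ".join(remaining))
    return match.group(0) if match else None
```

A `None` result corresponds to the empty-args case where the menu is printed. Note that a URL passed as a flag value (for example after `--avatar`) is deliberately not treated as the target URL.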
Step 1 — Parse input + detect mode
Required: `url` (must be `https://...`).
Optional:
- `--avatar <url>`: overrides the identity-store default.
- `--voice <minimax-voice-id>`.
- `--focus "..."`: editorial guidance woven into `vo_text`.
- `--live-url <url>`: force-supply the live demo URL (GitHub mode only).
- `--lipsync-provider <pika|kling>`: defaults to `pika` (parrot a2v, ~2-5 min wall-clock, slightly more dramatic head motion). Pass `kling` for tighter face-centered output at ~5-30 min wall-clock; Kling produces minimal-head-motion presenter shots but is the long-pole stage, so reserve it for high-stakes renders.
- `--no-captions`: skip the Step 11 caption burn (captions are on by default).
- `--preview`: opt in to the Step 5 preview gate, a ~3s lipsync of "Hi, I'm your presenter" for testing new avatar/voice combos before the long-pole render (no preview by default).
- `--skip-preview` and `--yes`: accepted as no-ops for backward compatibility.
Mode detection:
- GitHub mode — URL host is `github.com` AND path matches `/{owner}/{repo}` (no further path segments past the repo root). Activates the repo-aware extras: README scan, live-demo detection, GitHub-specific selectors.
- Generic-URL mode — anything else (a product page, docs site, blog post, deeper GitHub path like `/blob/HEAD/path`). Skips the GitHub extras; uses generic CSS selectors and walks through the URL itself.
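The mode-detection rule above reduces to a host check plus a path-segment count. A minimal sketch, where the function name `detect_mode` and the returned strings are illustrative assumptions:

```python
from urllib.parse import urlparse


def detect_mode(url: str) -> str:
    """Return "github" for a repo-root URL on github.com, else "generic"."""
    parsed = urlparse(url)
    # Non-empty path segments; a repo root has exactly {owner} and {repo}.
    segments = [part for part in parsed.path.split("/") if part]
    if parsed.hostname == "github.com" and len(segments) == 2:
        return "github"
    return "generic"
```

Deeper GitHub paths such as `/blob/HEAD/path` have more than two segments, so they fall through to the generic flow as described.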
Avatar resolution (silent — never ask the user):
- If `--avatar <url>` was passed, use it.
- Else call `mcp__pika__identity_avatar_url`. If non-null, use it.
- Else (fresh user, no identity avatar set yet) call `mcp__pika__generate_image` once with prompt `"professional presenter, friendly tech narrator, studio portrait, 1:1, natural lighting"` and use the returned URL. Do not ask the user "should I generate one?" — just generate silently.
Voice resolution (silent — never ask the user):
- If `--voice <id>` was passed, use it.
- Else call `mcp__pika__identity_voice_id`. If non-null, use it.
- Else pick a casual MiniMax `speech-2.8-hd` preset matching the resolved avatar's apparent gender:
  - Female-coded avatar → `English_PlayfulGirl` (warm, casual, clearly female-voiced — verified)
  - Male-coded avatar → `English_Jovialman` (warm, casual male)
  - Unclear / gender-neutral → `English_Jovialman` (default)
  Determine gender from `mcp__pika__identity_persona_read` (look for a gender / pronouns field) when identity exists; otherwise infer from the resolved avatar image. Do not call `analyze_media` for this — it's not worth the extra ~30s round-trip. Do not ask the user.
  Do NOT use `English_FriendlyPerson` — despite being categorized under "female" in MiniMax's catalog, its display name is "Friendly Guy" and it reads as male in playback. `English_PlayfulGirl` is the canonical casual-female pick. Other verified-female alternates: `English_Upbeat_Woman`, `English_LovelyGirl`, `English_radiant_girl`.
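The voice waterfall can be sketched as a pure function. This is an illustration only: `resolve_voice` is a hypothetical helper, and its `identity_voice` and `apparent_gender` arguments stand in for values the skill would obtain from the identity store or from inspecting the avatar.

```python
from typing import Optional

# Preset names are taken from the list above; the dict itself is illustrative.
FALLBACK_VOICES = {"female": "English_PlayfulGirl", "male": "English_Jovialman"}
DEFAULT_VOICE = "English_Jovialman"  # used when gender is unclear or neutral


def resolve_voice(
    flag_voice: Optional[str],
    identity_voice: Optional[str],
    apparent_gender: Optional[str],
) -> str:
    """Pick a voice id: explicit flag, then identity store, then preset."""
    if flag_voice:
        return flag_voice
    if identity_voice:
        return identity_voice
    return FALLBACK_VOICES.get((apparent_gender or "").lower(), DEFAULT_VOICE)
```

Every branch returns a value, which matches the "never ask the user" rule: there is no path that needs a prompt.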
The flow below is annotated per step: GitHub-only, Generic-only, or Both modes.
Step 2 — Read source (no MCP call)
Both modes: use Claude's `WebFetch` on the input URL to pull the page's main content (h1, hero section, headings, primary copy).
GitHub mode additions: also fetch top-level file tree, `package.json` (best-effort) / `pyproject.toml`, and GitHub API repo metadata via `gh api repos/{owner}/{repo}` for `homepage`, `description`, `language`, `topics`. Detect a `live_url` candidate in this priority:
- User-supplied `--live-url`.
- GitHub API `meta.homepage` field — set when the maintainer configured the repo's homepage in GitHub settings (matches tarball `repo_analyzer.py:66-77`).
- `package.json` `"homepage"` field.
- First match in README of `https?://[^\s)\"'<>]+(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app|render\.com|herokuapp\.com|surge\.sh)[^\s)\"'<>]*`.
- Any other URL in README that the badge area / "Live Demo" / "Project Page" / "Demo" text points at. The allowlist regex above misses arbitrary custom domains (e.g. `<project>-project-page.com`); when the README explicitly designates a project page, prefer that over the github.io fallback.
- GitHub Pages convention `https://{owner}.github.io/{repo}` — but only if the deep tree contains a frontend signal (one of `index.html`, `App.tsx`, `App.jsx`, `App.vue`, `app.py`, `main.py`).
If no `live_url` candidate resolves, the beat sheet skips beats 6–7.
Generic-URL mode: the input URL itself is the only URL the beats walk through — no inference, no extra metadata fetches. Skip Step 2.5 and Step 3.0; jump straight to Step 3.
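The first four priorities of the `live_url` search can be sketched as below. The regex is the allowlist pattern from the list; the helper name `pick_live_url` and its argument names are assumptions for illustration, and the two README-context priorities (designated project page, GitHub Pages convention) are left out because they need judgment about the page rather than a pattern match.

```python
import re
from typing import Optional

# Hosting-domain allowlist, copied from the priority list.
LIVE_URL_RE = re.compile(
    r"https?://[^\s)\"'<>]+"
    r"(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app"
    r"|render\.com|herokuapp\.com|surge\.sh)"
    r"[^\s)\"'<>]*"
)


def pick_live_url(
    flag_url: Optional[str],
    api_homepage: Optional[str],
    package_homepage: Optional[str],
    readme_text: str,
) -> Optional[str]:
    """Return the highest-priority live_url candidate, or None."""
    # Priorities 1-3: explicit values win in order; empty strings count as missing.
    for candidate in (flag_url, api_homepage, package_homepage):
        if candidate:
            return candidate
    # Priority 4: first allowlisted hosting URL found in the README.
    match = LIVE_URL_RE.search(readme_text)
    return match.group(0) if match else None
```

A `None` result means beats 6–7 would be skipped unless one of the remaining README-based priorities yields a candidate.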
Step 2.5 — Verify `live_url` reachability (GitHub mode only, no MCP call)
If a `live_url` candidate was selected, verify it serves real content before authoring beats 6–7. Use `WebFetch` on the candidate and check the response:
- If the response status is 4xx / 5xx, drop `live_url` to None and skip beats 6–7. The github.io fallback in particular is reachable as a hostname but often returns 404 ("There isn't a GitHub Pages site here") for repos that haven't enabled Pages — recording that 404 page wastes ~12s of the explainer on wrong content.
- If the response renders the GitHub Pages "404 — There isn't a GitHub Pages site here." template (heuristic: response body contains `"There isn't a GitHub Pages site here"`), drop `live_url` and skip beats 6–7.
- Otherwise, keep `live_url` for beats 6–7.
This mirrors the original tarball's `requests.head(live_url, timeout=6, allow_redirects=True)` reachability gate.
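The keep-or-drop decision is small enough to state as code. This sketch covers only the decision; how the status code and body are obtained (`WebFetch` here, `requests.head` in the tarball) is outside it, and `keep_live_url` is an illustrative name.

```python
GH_PAGES_404_MARKER = "There isn't a GitHub Pages site here"


def keep_live_url(status_code: int, body: str) -> bool:
    """Return True when the candidate live_url should be kept for beats 6-7."""
    if status_code >= 400:
        return False  # 4xx / 5xx: drop the candidate
    if GH_PAGES_404_MARKER in body:
        return False  # GitHub Pages placeholder page
    return True
```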
Step 2.6 — Generic-URL pre-flight (Generic-URL mode only, no MCP call)
Before authoring beats for a non-GitHub URL, WebFetch the input URL and inspect the response. This step prevents three common Generic-URL failure modes: (a) recording a captcha / bot-block page instead of content, (b) the cookie/consent banner eating the first ~3 seconds of video, (c) generic CSS selectors missing the page's actual hero / sections.
A. Bot-block / captcha detection — abort if matched:
If the response body contains any of:
- `"Verify you are human"` / `"verify you are not a robot"`
- `"captcha"` / `"CAPTCHA"` / `"reCAPTCHA"`
- `"403 Forbidden"` / `"Access Denied"`
- `"Just a moment"` + `cf-chl-bypass` (Cloudflare challenge)
- `"We're sorry, something went wrong"` (Amazon-style bot block)
- A `<title>` or h1 of just "Robot Check" / "Are you a robot?"
→ ABORT with a clear error to the user: "Generic-URL mode can't render this site — the page is showing a bot-detection / captcha challenge under headless Chrome. Try a different URL, or run a real-user version of the page first to verify it loads cleanly."
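The bot-block markers in §A can be checked with plain string and regex matching. A minimal sketch, assuming case-insensitive matching throughout (the list only shows a few casings) and using an illustrative helper name, `looks_bot_blocked`:

```python
import re

# Body markers from the list above, lower-cased for case-insensitive matching.
BODY_MARKERS = (
    "verify you are human",
    "verify you are not a robot",
    "captcha",  # also covers "CAPTCHA" and "reCAPTCHA"
    "403 forbidden",
    "access denied",
    "we're sorry, something went wrong",
)
BLOCK_TITLES = {"robot check", "are you a robot?"}
HEADING_RE = re.compile(r"<(?:title|h1)[^>]*>(.*?)</(?:title|h1)>", re.I | re.S)


def looks_bot_blocked(html: str) -> bool:
    """Return True when the fetched page matches any bot-block marker."""
    lowered = html.lower()
    if any(marker in lowered for marker in BODY_MARKERS):
        return True
    # Cloudflare challenge needs both signals together.
    if "just a moment" in lowered and "cf-chl-bypass" in lowered:
        return True
    # A title or h1 consisting solely of a robot-check phrase.
    return any(text.strip().lower() in BLOCK_TITLES for text in HEADING_RE.findall(html))
```

Plain substring matching is coarse: a page that merely embeds a reCAPTCHA widget on a form would also match, which is consistent with the list as written but worth keeping in mind.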
B. Cookie / consent-banner detection — defuse with `extra_css` + optional click:
Scan the response for these patterns (case-insensitive):
- IDs / classes starting with `onetrust-`, `truste-`, `cookie-banner`, `cookie-consent`, `gdpr-`, `consent-`, `cmp-`
- Buttons matching `(?i)accept (all )?cookies` / `(?i)agree.{0,10}cookies` / `(?i)i (accept|agree)`
- Apple-specific banner: id `ac-gdpr-banner` or class `as-globalfooter-curtain`
- Google consent: `[role="dialog"]` with text "Before you continue"
If detected, set `cookie_banner_present = true`. Defense in depth — the recording uses BOTH:
- CSS injection (`extra_css`) in the `capture_website` call to hide common banners universally — even if the click below misses, the banner is visually gone.
- A `click` `timed_action` at `at_s: 0.0` against the most likely dismissal selector (extracted from the WebFetch DOM, e.g. `#onetrust-accept-btn-handler`, `[aria-label*="Accept all" i]`, `button[id*="accept"]`).
The `extra_css` payload (use this verbatim — covers ~80% of consent platforms):
#onetrust-banner-sdk, #onetrust-pc-sdk, #onetrust-consent-sdk { display: none !important; }
#truste-consent-track, #truste-consent-content, .truste_box_overlay { display: none !important; }
[id*="gdpr-cookie"], [id*="cookie-consent"], [id*="cookie-banner"] { display: none !important; }
[class*="cookie-banner"], [class*="cookie-consent"], [class*="consent-banner"] { display: none !important; }
[class*="CookieBanner"], [class*="CookieConsent"], [class*="ConsentBanner"] { display: none !important; }
#ac-gdpr-banner, .as-globalfooter-curtain { display: none !important; } /* Apple */
[role="dialog"][aria-label*="cookie" i], [role="dialog"][aria-label*="consent" i] { display: none !important; }
.cmp-container, .cmp-modal, .cmp-banner { display: none !important; }
C. Real-DOM element identification — emit concrete selectors:
Generic CSS selectors (`h1`, `[class*="hero"]`, `section h2`) work on semantic / well-marked-up sites but miss obfuscated class names on big-name corporate sites (apple.com uses `tile-headline` / `as-headline-section-title`, not `hero-*`). For each beat, prefer the actual DOM elements observed in the WebFetch:
- Read the rendered HTML/markdown WebFetch returned. Note the page's actual primary `<h1>` text and class.
- Note the page's section structure (h2 headings + their parent containers).
- Note any prominent CTA / signup / pricing element.
- Emit `zoom_target.selector` using the actual class or id observed, falling back to semantic structure (`main > section:nth-of-type(N) h2`) when class names look auto-generated (Tailwind `_1a2b3c`, CSS modules `module__hero___xYz`).
D. SPA / lazy-render detection — bump initial wait:
If the WebFetch response has fewer than 3 visible headings / minimal text content, the page may be SPA-rendered post-`domcontentloaded`. Emit a longer initial `wait` action (`{type: "wait", at_s: 0.0, ms: 2500}`) before any beat fires, instead of the default 600ms settle.
E. `--focus` is honored when supplied (do not solicit):
Without `--focus`, select beats from generic structure cues — proceed silently with a confident first attempt. Do not ask the user "what should I focus on?" before firing; users iterate by re-running with `--focus "the X feature"` if the first pass misses the angle they wanted. With `--focus` supplied, anchor beat selection on the phrase: uses concrete page sections that match it, ignores irrelevant marketing chrome.
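The consent-banner scan from §B can be approximated with a few regexes over the fetched HTML. This is a rough sketch under stated assumptions: it parses `id` / `class` attributes with a regex rather than a real HTML parser, and the function name `detect_cookie_banner` is hypothetical.

```python
import re

ID_CLASS_PREFIXES = (
    "onetrust-", "truste-", "cookie-banner", "cookie-consent",
    "gdpr-", "consent-", "cmp-",
)
APPLE_NAMES = {"ac-gdpr-banner", "as-globalfooter-curtain"}
BUTTON_PATTERNS = (
    r"accept (all )?cookies",
    r"agree.{0,10}cookies",
    r"i (accept|agree)",
)
ATTR_RE = re.compile(r"""(?:id|class)\s*=\s*["']([^"']*)["']""", re.I)
BUTTON_RE = re.compile(r"<button[^>]*>(.*?)</button>", re.I | re.S)


def detect_cookie_banner(html: str) -> bool:
    """Return True when the page shows signs of a cookie / consent banner."""
    # 1. id / class names with a known consent-platform prefix, or Apple's names.
    for attr_value in ATTR_RE.findall(html):
        for name in attr_value.lower().split():
            if name.startswith(ID_CLASS_PREFIXES) or name in APPLE_NAMES:
                return True
    # 2. Button text that reads like a consent action.
    for text in BUTTON_RE.findall(html):
        if any(re.search(pattern, text, re.I) for pattern in BUTTON_PATTERNS):
            return True
    # 3. Google-style consent dialog (simplified, case-sensitive check).
    return 'role="dialog"' in html and "Before you continue" in html
```

A `True` result corresponds to setting `cookie_banner_present = true`; the CSS payload and the dismissal click are applied separately.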
Step 3.0 — Required README section scan (GitHub mode only, no MCP call)
Before authoring the beat sheet, scan the README (case-insensitive, full-text) for any of these section names. If a match is found, you must add a dedicated beat for that section in Step 3, replacing one of the generic beats 4–5 if necessary:
| README contains... | Required beat |
|---|---|
| scroll_to that heading; zoom |
| scroll_to the audio-layer diagram; zoom on the rendered figure or its surrounding heading |
| scroll_to that section; zoom |
| scroll_to that section; zoom |
| scroll_to that heading; zoom |
| scroll_to that heading; zoom |
| scroll_to that heading; zoom |
GitHub heading slug rule: lowercase, spaces → dashes, strip non-`[a-z0-9-]` characters. So "How it works" → `#user-content-how-it-works`, "Quick Start" → `#user-content-quick-start`. GitHub injects the `<a id="user-content-{slug}">` anchor inside each rendered `<hN>`, so `hN:has(#user-content-{slug})` reliably grabs the heading element across any GitHub README.
Selector contract: `bbox_selector` needs to be vanilla CSS that resolves via `document.querySelector` (`capture_website` runs the post-action smooth-scroll JS via `page.evaluate`, which uses the browser's native selector engine). Avoid Playwright extensions like `:has-text("...")`, `text=...`, or `:visible`: those resolve in Playwright's `page.query_selector` (so the bbox capture finds the element) but silently fail in the smooth-scroll's `document.querySelector` (so the page never scrolls to the target, and `bbox.y` ends up at document-Y instead of `top - 60 px`, which trips Step 8b's `bbox.y > recording_viewport.h` degenerate filter and falls back to default-position zoom). CSS Level 4 `:has(...)` is vanilla and supported in modern Chromium.
These sections are the highest-information visuals in most explainer-worthy repos. Missing them produces a generic walkthrough; including them gives the explainer a concrete "show, don't tell" beat. The original tarball SKILL.md flagged the first four with `SPECIAL` rules in the Gemini prompt; this Step 3.0 promotes them from incidental guidance to a hard requirement and adds three more high-signal headings common in OSS READMEs.
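The slug rule and the selector contract can be sketched together. This follows the simplified rule as stated above (lowercase, spaces to dashes, strip everything outside `[a-z0-9-]`); the helper names and the explicit `level` parameter are assumptions for illustration.

```python
import re


def heading_selector(heading_text: str, level: int = 2) -> str:
    """Build a vanilla-CSS selector for a rendered GitHub README heading."""
    slug = heading_text.strip().lower().replace(" ", "-")
    slug = re.sub(r"[^a-z0-9-]", "", slug)
    return f"h{level}:has(#user-content-{slug})"


def is_vanilla_css(selector: str) -> bool:
    """Reject the Playwright-only selector extensions named above (coarse check)."""
    if selector.startswith("text="):
        return False
    return ":has-text(" not in selector and ":visible" not in selector
```

For example, `heading_selector("Quick Start")` yields a `:has(...)` selector that `document.querySelector` can resolve, whereas a `:has-text(...)` selector would be flagged by `is_vanilla_css`.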
Step 3 — Author beat sheet (main thread, no MCP call)
Write a JSON array of 8–10 beats, with a hard total duration of 65–80 seconds and a hard total word count of 165–200 words (assuming a speaking rate of 2.5 words/sec). Each beat:
jsonc
{
"t_start": 0.0,
"t_end": 7.5,
"action": { "type": "navigate" | "scroll_to" | "hover", "url": "...", "selector": "..." },
"zoom_target": { "selector": "...", "description": "..." },
"vo_text": "exact words to speak — 1 to 2 conversational sentences"
}
Hard constraints (validate before emitting the beat sheet — reject the draft if any fails):
- Every beat needs all five fields: `t_start`, `t_end`, `action` (with `type` and `url`), `zoom_target` (with `selector`), `vo_text`. Missing fields ⇒ reject and re-author. (Mirrors tarball's `github_explainer.py:183-190` validation pass.)
- `t_start` of beat 0 = 0.0; `t_end[i] == t_start[i+1]` (continuity).
- `t_end - t_start` ≈ `len(vo_text.split()) / 2.5` per beat. Aim for ±10% of this estimate; if your draft is denser than 2.5 wps, tighten the `vo_text` until it fits.
- Total `t_end` of last beat ≤ 80 seconds. (Reference output is 86.5s including intro; lipsync audio is ~83s. Kling avatar/image2video stalls reliably past ~90s of audio under current load — going over 80s risks a 20-min Kling timeout.)
- Total spoken word count between 165 and 200 words.
- Every beat's `zoom_target.selector` needs to be a valid CSS selector for the page that beat lands on. GitHub mode prefers GitHub-specific selectors: `h1.f1`, `#readme`, `article h2`, `.blob-code-inner`, `.highlight`, `.octicon-star`, `nav`. Generic-URL mode prefers robust generic selectors: `h1`, `[role="main"]`, `main`, `header`, `nav`, `.hero`, `.feature`, `section h2`, `[class*="cta"]`, `[class*="hero"]`, `button`, `a[href]`. Selectors need to resolve on the rendered page after the beat's action settles — verify against the DOM you can see via WebFetch before emitting.
- `vo_text` is 1-2 conversational sentences. Dev voice. No stage directions. No markdown.
- `action.url` is a valid `https://...` URL when `action.type == "navigate"`; required.
Self-check before Step 4: verify `total_words` is in `[165, 200]` AND `total_seconds` (= `beats[-1].t_end`) is in `[65, 80]`. If either misses bounds, re-author the beat sheet — do not proceed to TTS. (No need to "print" anywhere — this is an internal draft validation; just reject the draft and re-author until it passes.)
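The hard constraints and the self-check lend themselves to a small validator. The sketch below is one possible reading, not the tarball's validation pass: `validate_beats` is a hypothetical name, it checks structure, continuity, navigate URLs, and the total bounds, and it leaves out the per-beat ±10% pacing target and selector resolution, which need the rendered page.

```python
REQUIRED_FIELDS = ("t_start", "t_end", "action", "zoom_target", "vo_text")


def validate_beats(beats: list) -> list:
    """Return a list of violations; an empty list means the draft passes."""
    errors = []
    for i, beat in enumerate(beats):
        missing = [field for field in REQUIRED_FIELDS if field not in beat]
        if missing:
            errors.append(f"beat {i}: missing {missing}")
    if errors or not beats:
        return errors or ["beat sheet is empty"]

    if beats[0]["t_start"] != 0.0:
        errors.append("beat 0 must start at 0.0")
    for i in range(len(beats) - 1):
        if beats[i]["t_end"] != beats[i + 1]["t_start"]:
            errors.append(f"gap or overlap between beat {i} and beat {i + 1}")
    for i, beat in enumerate(beats):
        action = beat["action"]
        url = str(action.get("url", ""))
        if action.get("type") == "navigate" and not url.startswith("https://"):
            errors.append(f"beat {i}: navigate needs an https:// url")

    total_words = sum(len(beat["vo_text"].split()) for beat in beats)
    total_seconds = beats[-1]["t_end"]
    if not 8 <= len(beats) <= 10:
        errors.append(f"expected 8-10 beats, got {len(beats)}")
    if not 165 <= total_words <= 200:
        errors.append(f"total_words {total_words} outside [165, 200]")
    if not 65 <= total_seconds <= 80:
        errors.append(f"total_seconds {total_seconds} outside [65, 80]")
    return errors
```

Returning the full list of violations, rather than stopping at the first, gives the re-authoring pass everything it needs to fix in one go.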
Structural skeleton — GitHub mode (load-bearing for the visual contract — match origin, but Step 3.0 overrides if applicable):
- Beat 1: `navigate` repo root, zoom `h1.f1` (repo title), hook sentence.
- Beats 2–3: `navigate` to specific source files (`https://github.com/{owner}/{repo}/blob/HEAD/<path>`), zoom `.blob-code-inner` or `.highlight`. Pick files that match the narration's claim — don't navigate to a file you won't talk about.
- Beats 4–5: `scroll_to` README sections, zoom `article h2` or `#readme`. If Step 3.0 surfaced required sections, replace these slots with the required ones.
- Beats 6–7 (only if `live_url` survived Step 2.5): `navigate` to `live_url`, zoom `nav` / `h1` / `.hero` / `main` / `button` / `.feature`.
- Beat 8: back to repo root, zoom `.octicon-star`, outro.
Structural skeleton — Generic-URL mode:
- Beat 1: `navigate` to the input URL, zoom `h1` or `[class*="hero"] h1` (the page's primary headline), hook sentence.
- Beats 2–3: `scroll_to` the page's hero / value-prop / first feature section. Zoom `.hero`, `[class*="hero"]`, `[class*="feature"]`, or `section:nth-of-type(1) h2`. Pick visible elements the narration references.
- Beats 4–5: `scroll_to` deeper sections — feature lists, screenshots, pricing, social proof. Zoom `section h2`, `[class*="feature"] img`, `[class*="testimonial"]`, `[class*="pricing"]`, or any prominent semantic element on the page.
- Beats 6–7: `scroll_to` CTA / signup / demo embed. Zoom `[class*="cta"]`, `button`, `a[class*="button"]`, or `[id*="signup"]`. (No live-demo navigation in generic mode — the input URL IS the demo.)
- Beat 8: `scroll_to` footer / closing element, zoom `footer h2`, `footer`, or back to top with `h1`. Outro sentence.
If `--focus` is supplied, weave its angles into `vo_text` without mutating the structural skeleton. Prefer CSS selectors over `text_content` in `zoom_target.selector` — bbox capture is selector-only (see Known gaps).
Step 4 — TTS
Call `mcp__pika__generate_speech` with `provider: "minimax-tts"`, `text: <full vo_text join>`, optional `voice_id`. Capture `result.audio_url` (the dispatcher returns audio under `audio_url`, not `url`) and `result.duration_seconds`. Voice defaults to identity-store injection in plugin mode.

Stale-voice fallback detection (AGNT-231): the dispatcher retries once with the default `Calm_Woman` voice when Minimax returns `status_code:2054` (voice id not found, typically a per-agent workspace pointer that Minimax auto-deleted after 7 days of inactivity). On retry success the response carries two extra fields beyond the documented schema (passthrough): `voice_id_requested` (the planted-but-stale id the worker tried first) and `fallback_reason: "invalid_minimax_voice_id"`. If you see `fallback_reason == "invalid_minimax_voice_id"` in the response, surface a one-line note to the user along the lines of: "your registered voice expired on Minimax (auto-GC'd after 7 days of inactivity); we used the system default. Re-clone via `clone_voice` if you want personalization back." The render does NOT fail; it just uses the default voice, so this is informational, not a retry trigger.

Cookie-banner audio padding (Generic-URL mode with `cookie_banner_present == true` from Step 2.6 §B): prepend MiniMax's `<#1.5#>` pause marker to the `text:` argument before calling `generate_speech`. MiniMax's `speech-2.8-hd` honors `<#N#>` as N-second silence; the returned `audio_url` and `duration_seconds` include the 1.5s lead-in natively. This aligns the audio with the screen recording's cookie-dismissal +1.5s offset applied in Step 4.5.

Fallback (only if a smoke test shows the marker is ignored on this voice): call `generate_speech` normally, then `mcp__pika__edit_audio_mix` to overlay the result onto a 1.5s silent base at offset 1.5s. Then call `mcp__pika__analyze_media(url=<padded_audio_url>)` to probe the padded duration and rebind `duration_seconds = result.duration_seconds` before Step 4.5 consumes it. `analyze_media` is the single authoritative duration probe; do not rely on `edit_audio_mix`'s return payload (its duration field is not contractually guaranteed).
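The text preparation and the stale-voice check described in Step 4 can be sketched as two small helpers. This is a minimal illustration, not the plugin's implementation: the function names are hypothetical, joining beats with a single space is an assumption about how `vo_text` is joined, and the response is modeled as a plain dict.

```python
COOKIE_PAD_SECONDS = 1.5
STALE_VOICE_REASON = "invalid_minimax_voice_id"


def build_tts_text(vo_texts, cookie_banner_present):
    """Join per-beat narration; prepend the MiniMax pause marker in cookie mode."""
    # Assumption: beats are joined with a single space.
    text = " ".join(t.strip() for t in vo_texts if t.strip())
    if cookie_banner_present:
        # "<#1.5#>" asks MiniMax for 1.5 seconds of leading silence.
        text = f"<#{COOKIE_PAD_SECONDS}#>" + text
    return text


def stale_voice_note(result):
    """Return the one-line user note if the default voice was used, else None."""
    if result.get("fallback_reason") == STALE_VOICE_REASON:
        return (
            "your registered voice expired on Minimax (auto-GC'd after 7 days "
            "of inactivity); we used the system default. Re-clone via "
            "clone_voice if you want personalization back."
        )
    return None
```

The note is informational only: a non-`None` return should be shown to the user, and the render continues with the returned audio.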
Step 4.5 — Audio length verification + beat-sheet rescale
Applied to `audio_duration_seconds` post Step 4 (which includes any cookie lead-in pad). End state: `beats[].t_start` / `t_end` are absolute wall-clock seconds matching the audio playback timeline. All `beats[]` mutations happen here; Steps 6 and 8 are read-only consumers.

Gate 1: Kling stall ceiling (provider cap, raw `audio_duration_seconds`):
If `audio_duration_seconds > 90`, abort and re-author the beat sheet with a tighter word budget. Kling avatar/image2video stalls past ~90s.

Gate 2: Degenerate TTS (spoken-content length):
Compute `narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)`. If `narration_duration < 30s`, retry Step 4 once (and recompute `narration_duration` from the retry's audio). If the retry also returns `narration_duration < 30s`, abort and investigate. Likely failure modes: truncated MiniMax response, silent audio, `vo_text` not joined correctly.

Gate 3: Rescale:
- `narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)`
- `scale = narration_duration / beats[-1].t_end`
- If `scale < 0.5` or `scale > 1.5`, abort and re-author. The TTS is structurally broken (or wildly off the word budget); rescaling won't save it.
- For each beat: `beat.t_start *= scale; beat.t_end *= scale`
- If `cookie_banner_present`: for each beat, `beat.t_start += 1.5; beat.t_end += 1.5`
- Final clamp: `beats[-1].t_end = audio_duration_seconds` (exact). This guarantees float equality of the invariant regardless of cookie mode or accumulated float drift.

After Gate 3 passes, emit a one-line operator log to surface the scale value for post-run diagnosis:
`Rescaled beats by scale=X.XX (audio=Y.YYs, narration_duration=Z.ZZs, cookie_pad=W.Ws)`

Advisory (not a gate): scale near 1.0 is ideal. `scale > 1.2` means audio is meaningfully slower than predicted; visuals feel "stretched" but stay in sync. `scale < 0.85` means audio is faster; visuals feel "rushed" but stay in sync. Both pass the gates; if the user reports "feels off-pace" rather than "out of sync," re-author with a tighter or looser word budget.
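The three gates and the rescale can be sketched as one pure function. This is an illustration under stated assumptions: beats are modeled as dicts with `t_start` / `t_end` keys, `rescale_beats` and `BeatSheetError` are hypothetical names, and the Gate 2 "retry Step 4 once" is left to the caller, which would catch the error, regenerate the audio, and call again.

```python
import copy

COOKIE_PAD = 1.5  # seconds of lead-in silence in cookie mode


class BeatSheetError(Exception):
    """Raised when a gate fails and the beat sheet or TTS must be redone."""


def rescale_beats(beats, audio_duration_seconds, cookie_banner_present):
    """Return (rescaled_beats, scale) on the audio timeline; input is not mutated."""
    # Gate 1: provider cap on the raw audio length.
    if audio_duration_seconds > 90:
        raise BeatSheetError("Gate 1: audio exceeds 90s, re-author the beat sheet")

    pad = COOKIE_PAD if cookie_banner_present else 0.0
    narration_duration = audio_duration_seconds - pad

    # Gate 2: degenerate TTS (caller retries Step 4 once before giving up).
    if narration_duration < 30:
        raise BeatSheetError("Gate 2: narration shorter than 30s")

    # Gate 3: rescale the authored timeline onto the real narration length.
    scale = narration_duration / beats[-1]["t_end"]
    if scale < 0.5 or scale > 1.5:
        raise BeatSheetError(f"Gate 3: scale {scale:.2f} outside [0.5, 1.5]")

    out = copy.deepcopy(beats)
    for beat in out:
        beat["t_start"] = beat["t_start"] * scale + pad
        beat["t_end"] = beat["t_end"] * scale + pad

    # Final clamp so the last beat ends exactly at the audio length.
    out[-1]["t_end"] = audio_duration_seconds
    return out, scale
```

For a 60s authored sheet and 67.5s of cookie-padded audio, the narration is 66s, so the scale is 1.1 and every beat is then shifted by 1.5s.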
Step 5 — Preview gate (opt-in via `--preview`)
Skip Step 5 entirely by default. Proceed directly to Step 6 unless the user explicitly passed `--preview`: do not generate a preview, do not ask for confirmation, and no `--skip-preview` or `--yes` flag is needed. This matches the industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return; account credit balance + provider failover are the canonical guardrails.

If `--preview` was supplied:
- `mcp__pika__generate_speech` with `text: "Hi, I'm your presenter. Let's explore this repo together."` → `preview_audio_url`.
- `mcp__pika__generate_lipsync` with `provider: <resolved_lipsync_provider>` (defaults to `pika`; honor `--lipsync-provider kling` if supplied), `image: <avatar>`, `audio: preview_audio_url` → `preview_lipsync_url` (bare lipsync, ~3s). Use the same provider here as Step 9 will use for the full audio; the preview's job is to confirm the avatar + voice + provider combo before the long-pole render.
- Present to the user verbatim:
  Preview ready: `<preview_lipsync_url>`
  This confirms the avatar + voice combo. The full render is a long pole (~5–30 min Kling lipsync on the full audio). Reply `yes` to proceed, or anything else to cancel.
- Match `^(yes|go|proceed|confirm|y)$` (case-insensitive). Anything else → STOP, no further MCP calls.
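The confirmation match in the last bullet can be sketched with the standard `re` module. The helper name `is_confirmed` is hypothetical, and trimming surrounding whitespace before matching is an assumption; the source only specifies the pattern and case-insensitivity.

```python
import re

# Pattern taken verbatim from Step 5; IGNORECASE implements "case-insensitive".
CONFIRM_RE = re.compile(r"^(yes|go|proceed|confirm|y)$", re.IGNORECASE)


def is_confirmed(reply):
    """True only when the whole reply is one of the accepted confirmation words."""
    # Assumption: leading/trailing whitespace in the reply is ignored.
    return CONFIRM_RE.match(reply.strip()) is not None
```

Anything that returns `False` here, including replies such as "yes please", maps to the STOP branch with no further MCP calls.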
Step 6 — Build `timed_actions` and record
Translate the beat sheet into `capture_website` `timed_actions`. One `timed_action` per beat: set `bbox_selector` to the beat's `zoom_target.selector`, and `capture_website` captures the post-action bbox of that element internally (legacy 600 ms settle → smooth-scroll-to-`top - 60 px` → 1300 ms post-anim → measure, all server-side).

For each beat in order, emit one entry:
- `navigate` beats: `{type: "navigate", at_s: <t_start>, url: <action.url>, bbox_selector: <zoom_target.selector>}`. The worker navigates, waits to absolute `at_s + 0.6 s`, scrolls `bbox_selector` into view, and measures the bbox, all without the caller scheduling a follow-up step.
- `scroll_to` / `hover` beats: `{type: "scroll", at_s: <t_start>, selector: <action.selector or zoom_target.selector>, bbox_selector: <zoom_target.selector>}`. The action's own `selector` drives the page scroll; `bbox_selector` drives the bbox measurement (it can be the same selector or different, usually the same). (`capture_website` has no `hover`; scroll-into-view is the analog.)

Do NOT prepend the eight-step intro scroll-through that the tarball ran. The lipsync audio is timed from `t=0` of the beat sheet; a prepended intro shifts the screen recording forward by ~3 s while leaving the audio un-shifted, causing audio/video desync. The `capture_website` recording begins at `t=0` with beat 0's URL already loaded. That is the orientation the tarball's intro scroll provided, minus the desync.

Call `mcp__pika__capture_website`:
- `url: <beat 0's action.url>`
- `timed_actions: <the N-element list built above>` (one entry per beat)
- `duration_s: ceil(audio_duration_seconds)`. `beats[].t_start` and `t_end` have already been rescaled to the TTS audio timeline by Step 4.5, so `duration_s` is simply the audio length. The old `max(...)` defense against TTS overrun is no longer needed.

Generic-URL mode additions (per Step 2.6 pre-flight):
- `extra_css: <the cookie-banner-hiding CSS payload from Step 2.6 §B>`. Defensive: hides common consent platforms via `display: none !important;` so even if the optional click misses, the banner is invisible in the recording.
- Prepend a `wait` action `{type: "wait", at_s: 0.0, ms: 2500}` for SPA / lazy-render pages (per Step 2.6 §D); use 1500ms for "normal" pages. This gives time for hero images to lazy-load, fonts to swap, and scroll-triggered animations to be ready before the first beat fires.
- If `cookie_banner_present == true` from Step 2.6 §B, also prepend a `click` action `{type: "click", at_s: 0.5, selector: <detected dismissal selector from WebFetch DOM>}`. The `beats[]` array has already been shifted by `+1.5s` in Step 4.5 to account for the cookie-dismissal lag, and the TTS audio has already been padded with 1.5s of silence (Step 4); no further shifting is required here. Beat 1's `timed_action.at_s` reads `beats[0].t_start` directly, which is 1.5 in cookie mode.
- No cookie banner action is needed if `cookie_banner_present == false`; just the prepended wait action.

Capture `video_url`, `recording_viewport`, `action_bboxes`. The result returns `recording_viewport: {w, h}` and `action_bboxes: [{idx, selector, found, bbox: {x,y,w,h}}]` alongside `video_url`.

`action_bboxes[].idx` is the index (`idx`) into `timed_actions`:
- GitHub mode: with one timed_action per beat, `idx` maps 1:1 to beat index; Step 8 uses `entry.idx` directly as `beat_idx`.
- Generic-URL mode: the prepended `wait` (and optional cookie-dismissal `click`) shift the array by 1 or 2. Compute `beat_idx = entry.idx - prepend_count` where `prepend_count` is 1 (wait only) or 2 (wait + click). Skip entries where `beat_idx < 0` (those are the prepended setup actions, not beats).

The `selector` field on each entry reports `bbox_selector` (i.e. `zoom_target.selector`), not the action's own `selector`.
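The per-beat translation and the Generic-URL prepends in Step 6 can be sketched as follows. This is a hedged illustration: the beat dict shape (`action`, `zoom_target`, `t_start`), the `slow_page` flag, and both function names are assumptions made for the example; only the emitted action fields come from the text above.

```python
def build_timed_actions(beats, generic_mode=False,
                        cookie_click_selector=None, slow_page=False):
    """Return (timed_actions, prepend_count) for a capture_website call."""
    actions = []
    if generic_mode:
        # Settle wait: 2500 ms for SPA / lazy-render pages, 1500 ms otherwise.
        actions.append({"type": "wait", "at_s": 0.0,
                        "ms": 2500 if slow_page else 1500})
        if cookie_click_selector:
            actions.append({"type": "click", "at_s": 0.5,
                            "selector": cookie_click_selector})
    prepend_count = len(actions)

    for beat in beats:
        action = beat["action"]
        zoom_selector = beat["zoom_target"]["selector"]
        if action["type"] == "navigate":
            actions.append({"type": "navigate", "at_s": beat["t_start"],
                            "url": action["url"],
                            "bbox_selector": zoom_selector})
        else:
            # scroll_to and hover both become a scroll action.
            actions.append({"type": "scroll", "at_s": beat["t_start"],
                            "selector": action.get("selector") or zoom_selector,
                            "bbox_selector": zoom_selector})
    return actions, prepend_count


def beat_index(entry_idx, prepend_count):
    """Map an action_bboxes idx to a beat index; None for prepended setup actions."""
    idx = entry_idx - prepend_count
    return idx if idx >= 0 else None
```

Returning `prepend_count` alongside the list keeps the `idx` to beat mapping in one place, so a consumer such as Step 8 does not have to re-derive how many setup actions were prepended.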
Step 7 — Browser chrome
Call `mcp__pika__edit_browser_frame`:
- `video_url: <Step 6 video_url>`
- `url: (live_url if GitHub-mode and survived Step 2.5 else input_url, truncated to 65 chars)`
- `tab_title: <30-char title>`. GitHub mode: `(meta.description or repo_name or "")[:30]`. Generic-URL mode: the page's `<title>` (from WebFetch in Step 2) or the URL's hostname, truncated to 30 chars. Guard against `None`/empty.

Returns `framed_url` (1280×800 Sonoma + chrome).

Step 8 — Build `zoom_keyframes` and apply
Constants:
- `INTRO_BEATS = 2`: gates by beat-sheet index. Skips zoom on beat indices 0 and 1 ("Beat 1" and "Beat 2" in the structural skeleton above).
- `HOLD_GAP = 0.6`: seconds of 1.0× before each zoom-in and after each zoom-out.
- `MIN_BEAT_DUR = 1.5`: beats shorter than this are skipped (no room for a meaningful zoom).
- `SCALE = 1.35` (precise element-targeted zoom).
- `FALLBACK_SCALE = 1.25` (default-position fallback when no usable bbox).
- `FALLBACK_RAMP = 0.4`.

Note: `beats[].t_start` / `t_end` were rescaled (and cookie-shifted if applicable) to the audio timeline by Step 4.5. HOLD_GAP (0.6s), MIN_BEAT_DUR (1.5s), and the 1.0s interior-interval check all operate on those final values; they are real visual seconds on the rendered video.

`edit_browser_frame` content area: `CONTENT_X=56, CONTENT_Y=108, CONTENT_W=1168, CONTENT_H=637` (`edit_browser_frame/main.py`).

Coord transform (recording px → framed px):
cx_framed = 56 + (bbox.x + bbox.w/2) * (1168 / recording_viewport.w)
cy_framed = 108 + (bbox.y + bbox.h/2) * (637 / recording_viewport.h)

Build the zoom list with a per-beat default + bbox override pattern. The legacy rig followed an "every non-intro beat gets a zoom, bbox-derived if available, default-position otherwise" rule. Reproduce that here:

Step 8a: Pre-fill default-position keyframes for every non-intro, long-enough beat.
Constants for the default position:
- `DEFAULT_CX = 56 + 1168 // 2` (screen center of the framed canvas)
- `DEFAULT_CY = 108 + 637 // 3` (upper third of the content area, where most GitHub UI prominence lives)

Walk the beat sheet from index `INTRO_BEATS` (= 2) to the end. For each beat:
- If `t_end - t_start < MIN_BEAT_DUR` (1.5s), skip: too short for a meaningful zoom.
- Compute the keyframe's interior interval as `[t_start + HOLD_GAP, t_end - HOLD_GAP]`. If that interval is shorter than 1.0s, skip.
- Otherwise pre-fill that beat's slot in a per-beat map (call it `zoom_keyframes_by_beat[beat_idx]`) with `{cx: DEFAULT_CX, cy: DEFAULT_CY, scale: FALLBACK_SCALE (1.25), ramp_s: FALLBACK_RAMP (0.4)}` plus the trimmed `t_start` / `t_end`.

Step 8b: Override with bbox-derived precise zoom where `action_bboxes` provided a usable measurement.
For each entry in `action_bboxes`:
- `beat_idx = entry.idx` (since Step 6 emits one timed_action per beat). If `beat_idx < INTRO_BEATS`, skip.
- If `entry.found` is false, skip.
- If the beat isn't already in `zoom_keyframes_by_beat` (it was filtered out in Step 8a by the `MIN_BEAT_DUR` / `1.0s` rules), skip.
- Filter degenerate bboxes: skip if `bbox.y > recording_viewport.h` (offscreen capture: the page didn't scroll the element into view in time) or `bbox.h > recording_viewport.h * 1.5` (full-page `<main>` element, which yields a meaningless zoom center).
- Compute `cx_framed` / `cy_framed` from the bbox center using the recording-px → framed-px transform shown above. Override the beat's slot with `{cx: cx_framed, cy: cy_framed, scale: SCALE (1.35), ramp_s: min(0.5, (t_end - t_start) * 0.15)}`.

Final list: sort the values of `zoom_keyframes_by_beat` by `t_start` to produce the `zoom_keyframes` array.

This guarantees every non-intro, long-enough beat gets a zoom: precise when bbox capture worked, default-positioned otherwise. It avoids the "flat video for the whole runtime" failure mode.

If `len(zoom_keyframes) > 0`, call `mcp__pika__edit_animate_zoom` with `video_url: framed_url, zoom_keyframes`. Returns `zoomed_url`. Otherwise (no qualifying beats, which should be rare given Step 3's 65-80s constraint) skip and use `framed_url` as `zoomed_url`.
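Steps 8a and 8b can be sketched together as one function. This is a sketch under assumptions, not the plugin's code: beats and bbox entries are modeled as dicts, "trimmed `t_start` / `t_end`" is read as the interior interval `[t_start + HOLD_GAP, t_end - HOLD_GAP]`, the keyframe field names are illustrative, and the optional `prepend_count` parameter is a hypothetical addition for the Generic-URL index offset that Step 6 describes.

```python
CONTENT_X, CONTENT_Y, CONTENT_W, CONTENT_H = 56, 108, 1168, 637
INTRO_BEATS, HOLD_GAP, MIN_BEAT_DUR = 2, 0.6, 1.5
SCALE, FALLBACK_SCALE, FALLBACK_RAMP = 1.35, 1.25, 0.4
DEFAULT_CX = CONTENT_X + CONTENT_W // 2  # 640, horizontal center of the frame
DEFAULT_CY = CONTENT_Y + CONTENT_H // 3  # 320, upper third of the content area


def to_framed(bbox, viewport):
    """Map a recording-pixel bbox center into framed-canvas pixels."""
    cx = CONTENT_X + (bbox["x"] + bbox["w"] / 2) * (CONTENT_W / viewport["w"])
    cy = CONTENT_Y + (bbox["y"] + bbox["h"] / 2) * (CONTENT_H / viewport["h"])
    return cx, cy


def build_zoom_keyframes(beats, action_bboxes, viewport, prepend_count=0):
    by_beat = {}
    # Step 8a: default-position keyframe for each non-intro, long-enough beat.
    for idx in range(INTRO_BEATS, len(beats)):
        t0, t1 = beats[idx]["t_start"], beats[idx]["t_end"]
        if t1 - t0 < MIN_BEAT_DUR:
            continue
        z0, z1 = t0 + HOLD_GAP, t1 - HOLD_GAP
        if z1 - z0 < 1.0:
            continue
        by_beat[idx] = {"t_start": z0, "t_end": z1,
                        "cx": DEFAULT_CX, "cy": DEFAULT_CY,
                        "scale": FALLBACK_SCALE, "ramp_s": FALLBACK_RAMP}

    # Step 8b: override with a precise zoom where the bbox is usable.
    for entry in action_bboxes:
        idx = entry["idx"] - prepend_count
        if idx < INTRO_BEATS or not entry["found"] or idx not in by_beat:
            continue
        bbox = entry["bbox"]
        if bbox["y"] > viewport["h"] or bbox["h"] > viewport["h"] * 1.5:
            continue  # offscreen or full-page element
        cx, cy = to_framed(bbox, viewport)
        duration = beats[idx]["t_end"] - beats[idx]["t_start"]
        by_beat[idx].update({"cx": cx, "cy": cy, "scale": SCALE,
                             "ramp_s": min(0.5, duration * 0.15)})

    return sorted(by_beat.values(), key=lambda k: k["t_start"])
```

With `HOLD_GAP = 0.6`, the 1.0s interior check only passes for beats of at least 2.2s, so in this reading it is the stricter of the two duration filters.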
Step 9 — Lipsync the full audio
Call `mcp__pika__generate_lipsync`:
- `provider: <resolved_lipsync_provider>`. Default: `pika` (parrot a2v). Honor `--lipsync-provider kling` if explicitly passed.
- `image: <avatar>`
- `audio: <Step 4 audio_url>`
- kling-only knobs (since `pika-mcp-server` BACK-339, 2026-05-10): when `provider == "kling"`, add `mode: "pro"` and `prompt: "talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"` for the polished-presenter feel. Both are silently ignored on `pika` (parrot has its own driver).

Provider tradeoffs:
| Provider | Wall-clock | Head motion | When to use |
|---|---|---|---|
| `pika` | ~2–5 min | Slightly more dramatic, naturalistic | Default for most runs: fast iteration, watchable output, ~10× faster than kling |
| `kling` | ~5–30 min | Minimal, face-centered, presenter-style | High-stakes renders where the avatar must read like a polished presenter; tolerate the long pole |

Server-side-await covers the call inline; if the response shape is `{task_id, status: "queued"}`, poll `mcp__pika__task_status` in a tight loop (no sleep) until the status reaches a terminal state (`completed`, `failed`, or `cancelled`). On `completed`, capture `lipsync_url`. On `failed` / `cancelled`, fall back to the other provider (kling ↔ pika) per the failover note below.

Failover:
- If `pika` fails (rare: parrot a2v is robust at typical explainer audio lengths) → retry once with `provider: "kling"`.
- If `kling` stalls past the worker's 1200s ceiling (visible as repeated `processing` status with no completion) → fall back to `provider: "pika"`. Step 4.5's audio-length gate should catch the long-audio case before it gets here, but the failover handles the residual risk.

Why pika is the default:
- Speed: typical explainer wall-clock drops from ~10–15 min to ~5–7 min total because lipsync is the long pole.
- Quality is good enough: parrot a2v is naturalistic; the slight extra head motion reads as engaging rather than distracting in a 60-80s clip with avatar circle PiP.
- Kling-mode-pro polish is mostly invisible inside the 246-pixel circle anyway: the face area is too small for the minimal-head-motion difference to register on most viewers.

For the canonical "polished presenter" feel of the original tarball reference output, pass `--lipsync-provider kling` explicitly.

Step 10 — PiP composite
Call `mcp__pika__edit_pip`:
- `main_video_url: <zoomed_url>`
- `overlay_video_url: <lipsync_url>`
- `shape: "circle"`
- `size_px: 246` ← pixel-pinned 246px outer diameter (240 inner avatar + 3+3 stroke ring); matches tarball's `CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2`
- `stroke_width_px: 3`
- `stroke_color: "white"`
- `position_px: {x: 20, y: 476}` ← `800 − 246 − 78` for dock clearance (matches tarball's `H − CIRCLE_OUT − 78`)

Pass `size_px`, not `size`; the fields are mutually exclusive. Returns `final_url`.

Master-duration / audio-source contract (matching tarball `github_explainer.py:418-419, 531-533, 578-582`): `edit_pip` uses `shortest=1` semantics by default, which means the composite's duration is the shorter of the main video (zoomed screen recording) and the overlay (lipsync video). Step 6's `duration_s = ceil(audio_duration_seconds)` keeps the screen recording length matched to the audio, rounded up to the next second (Step 4.5 rescaled beats to the audio timeline). The composite duration is therefore set by the lipsync via `edit_pip`'s `shortest=1` semantics. Audio comes from the lipsync video's audio track (the lipsync embeds the original TTS audio); the standalone `audio_url` is not re-mixed. If the lipsync video is shorter than the screen recording (Kling sometimes trims trailing silence), the screen will get cut off at the lipsync end. Accept this; the alternative (looping the screen) is worse for explainer content.
Step 11 — Burn captions
Call `mcp__pika__add_captions(video_url=<final_url>, style="classic")`. `classic` renders a bottom subtitle bar, the right register for an explainer video (use `tiktok` / `hormozi` / `karaoke` only when the user explicitly asks for word-level highlight). The audio is extracted server-side from the PiP composite's lipsync track, so transcription matches the narration verbatim. Capture the result as `captioned_url`.

Skip this step only if the user passed `--no-captions` (parsed in Step 1); the default is captions on. (Note: `/pika:podcast` does not burn captions; narration in an explainer is more transcription-friendly than fast two-host dialogue.)

Step 12 — Return

Emit `captioned_url` (or `final_url` if Step 11 was skipped) on one line: `Done: <url>`.

Load-bearing phrases
These anchors preserve the visual contract across page types:
| Phrase | Where | Why load-bearing |
|---|---|---|
| | Selector contract | Keeps scroll, bbox capture, and zoom targeting aligned inside … |
| | Mode detection | Prevents generic product-page beats from replacing README/code walkthrough beats. |
| | Beat-sheet authoring | Keeps narration, screen recording, lipsync, and captions within the reliable duration envelope. |
| | Audio rescale step | Ensures later capture/zoom/composite steps consume one stable timeline. |
| | Generic URL pre-flight | Reduces first-frame banner occlusion when a banner click misses. |
Engine choice: Pika lipsync default, Kling opt-in

Default to Pika/parrot lipsync because it is faster and keeps most explainers in a short iteration loop. Use Kling only when the user explicitly requests `--lipsync-provider kling` or when a high-stakes render needs a more centered presenter look and can tolerate a much longer long-pole stage. Screen capture, browser frame, zoom, PiP, and captions remain deterministic edit/composite steps around that lipsync choice.

Runtime expectations
Typical wall-clock is 5-10 minutes with Pika lipsync, or 10-30+ minutes with Kling lipsync:
| Step | Wall clock | Notes |
|---|---|---|
| URL read + pre-flight | 10-60s | GitHub README scan or generic URL DOM/cookie checks |
| TTS + audio rescale | 30-90s | Beat timing is normalized after actual audio length |
| Screen recording | 60-180s | Depends on page load and navigation count |
| Browser frame + zooms | 1-3 min | Deterministic edit/composite stages |
| Lipsync | 2-5 min Pika / 5-30 min Kling | Kling is opt-in because it is the long pole |
| PiP + captions | 1-3 min | Captions skipped when `--no-captions` is passed |
Known gaps (carried as follow-up server-side work)
已知缺陷(作为后续服务器端工作)
Kling avatarResolved byandmode:"pro"not exposed.promptBACK-339 (PR #186, shipped 2026-05-10):pika-mcp-servernow wires bothgenerate_lipsync(e.g.mode) and"pro"end-to-end for the kling provider. To enable polished-presenter mode here, passpromptand the Step 9 call should add--lipsync-provider klingplus a prompt likemode: "pro". Real quality lever for reducing dramatic head motion in the lipsync — no longer a server-side gap."talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"- No caller-controlled white-frame trim on the screen recording. has internal trim heuristics but doesn't expose them to the caller. Visible as a brief white flash at the start of the explainer when the page is still loading. The 800ms
capture_websiteaction atwaitmitigates this somewhat by giving the page time to paint, but doesn't trim already-recorded white frames. Worker enhancement.at_s: 0.0 - No wait on per-beat navigation. Tarball uses
networkidlepluspage.goto(url, wait_until="networkidle", timeout=20000)after every navigate.wait_for_timeout(600)settles tocapture_websiteplus the bbox-capture branch's 600 ms post-action settle (server-side, whendomcontentloadedis set), but SPA blob pages whose final render happens afterbbox_selectorcan still get bbox'd against unmounted code blocks. Worker enhancement: expose adomcontentloadedknob onwait_until.timed_actions[].navigate - No per-step output-size verification gates. Tarball's helper at
verify() (github_explainer.py:35-39) checked TTS ≥ 50KB, preview ≥ 100KB, screen ≥ 200KB, lipsync ≥ 500KB, final ≥ 1MB after each step. The MCP path returns URLs only; verifying file size would require an extra mcp__pika__analyze_media call per step (~30s overhead each). Worth adding once the user-side latency budget allows it. For now, a downstream-failure cascade (e.g. zero-byte TTS → silent lipsync → blank composite) only surfaces at Step 11.
- text_content bbox capture not implemented. capture_website v1 returns action_bboxes only for steps with a CSS selector. text_content-only steps produce no entry. Prefer CSS selectors in zoom_target for guaranteed zoom coverage.
- Beat-sheet wording is non-deterministic. Running the same input twice produces different vo_text and different zoom positions. Visual kind is the contract, not pixel-exact reproduction.
- Generic-URL mode quality varies by site. Modern indie / SaaS landing pages with semantic markup (<h1> + clear <section> + named class hooks) work well. Big-name corporate sites (apple.com, microsoft.com, amazon.com) hit several known limits: (a) bot detection: the page may serve a degraded version under headless Chrome, or a captcha; Step 2.6 §A aborts on these but the heuristics aren't exhaustive; (b) obfuscated class names: tile-headline instead of hero-title defeats generic selectors; Step 2.6 §C's WebFetch DOM scan helps but isn't perfect; (c) scroll-triggered animations don't play: IntersectionObserver-driven hero reveals fire on real user scrolls, not Playwright's scrollIntoView; the recorded frame may be a static placeholder; (d) lazy-loaded images: picture/source elements with loading="lazy" may not have resolved by the 600ms-or-2500ms settle window; the bbox lands on a transparent placeholder. Workarounds: prefer simpler / smaller marketing pages for launch demos, always pass --focus "the X feature" to anchor beat selection, and accept that big-name sites need a follow-up server PR (cookie-banner click retry + wait_until=networkidle + animation-trigger via IntersectionObserver polyfill).
- Cookie-banner click is single-attempt. Step 2.6 §B emits one click against the dismissal selector extracted from the WebFetch DOM. If the WebFetch's HTML doesn't include the banner (rendered post-JS) or the selector is wrong, the click silently misses, so the extra_css payload is the load-bearing defense. Worker enhancement: support a list of fallback selectors per click action so the worker tries each in order.
- Step 8b ↔ Step 6 idx-mapping mismatch in Generic-URL mode. Step 6 maps beat_idx = entry.idx - prepend_count, but Step 8b uses beat_idx = entry.idx naively. Pre-existing bug independent of rescaling.
- Kling avatar mode:"pro" and prompt not exposed: resolved. BACK-339 (PR #186, shipped 2026-05-10): pika-mcp-server's generate_lipsync now threads mode (e.g. "pro") and prompt end-to-end for the kling provider. To enable polished-presenter mode here, pass --lipsync-provider kling; the Step 9 call should then add mode: "pro" and a prompt like "talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter". This is a real quality lever for reducing head motion in lipsync, no longer a server-side gap.
- No caller-controllable white-frame trim for the screen recording. capture_website has an internal trim heuristic but does not expose it to callers. The symptom is a brief white flash at the start of the explainer while the page is still loading. Adding an 800ms wait action at at_s: 0.0 partially mitigates this by giving the page time to paint, but it cannot trim white frames that were already recorded. Worker enhancement needed.
- No per-beat-navigate networkidle wait. The tarball used page.goto(url, wait_until="networkidle", timeout=20000) + wait_for_timeout(600) after every navigate. capture_website waits for domcontentloaded plus the bbox-capture branch's 600ms post-action wait (server-side, when bbox_selector is set), but SPA blob pages that render after domcontentloaded can still bbox-capture against an unmounted code block. Worker enhancement needed: expose a wait_until param on timed_actions[].navigate.
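The size gate and the Step 6 index mapping described in the list above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the tarball's or the server's code: the thresholds and the beat_idx formula come from the text, while the names MIN_BYTES, undersized, and beat_index are hypothetical, 1 KB is assumed to mean 1024 bytes, rejecting negative results is an assumption about how prepended actions should be treated, and how the byte counts are obtained (for example from an mcp__pika__analyze_media result) is left out.

```python
# Hypothetical sketch: names and structure are illustrative, not the real helper.
KB = 1024  # assumption: the thresholds in the text use 1 KB = 1024 bytes

# Minimum plausible output size per step, from the thresholds quoted in the text.
MIN_BYTES = {
    "tts": 50 * KB,
    "preview": 100 * KB,
    "screen": 200 * KB,
    "lipsync": 500 * KB,
    "final": 1024 * KB,
}


def undersized(step: str, size_bytes: int) -> bool:
    """Return True when a step's output is too small to be a real render."""
    return size_bytes < MIN_BYTES[step]


def beat_index(entry_idx: int, prepend_count: int) -> int:
    """Map an action-entry index to its beat, skipping prepended actions."""
    beat_idx = entry_idx - prepend_count  # the Step 6 mapping
    if beat_idx < 0:
        # Assumption: entries before prepend_count belong to no beat.
        raise ValueError("entry belongs to a prepended action, not a beat")
    return beat_idx
```

With a gate like this, a zero-byte TTS file would be flagged right after its own step instead of surfacing at Step 11. With two prepended actions, the entry at idx 3 maps to beat 1, whereas the naive Step 8b mapping would read it as beat 3.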
Auth
If any call returns 401, the user's OAuth token has expired or hasn't been issued. The next authenticated MCP call triggers OAuth automatically (a browser opens for Google @pika.art login). For non-interactive environments, set MCP_AUTH_TOKEN.
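For scripted runs, a preflight check along these lines could fail early rather than waiting for a 401. It is a sketch of something a wrapper script might do, not a description of how the MCP client chooses an auth path: only the MCP_AUTH_TOKEN variable name comes from the text, while the auth_mode function, its return values, and the idea that a set token takes priority are assumptions.

```python
from typing import Mapping


def auth_mode(env: Mapping[str, str], interactive: bool) -> str:
    """Hypothetical preflight: decide which auth path a run can rely on."""
    if env.get("MCP_AUTH_TOKEN"):
        return "token"  # assumption: a set token is preferred over OAuth
    if interactive:
        return "oauth"  # a browser can open for the login flow
    raise RuntimeError("set MCP_AUTH_TOKEN for non-interactive environments")
```

A CI job could call auth_mode(os.environ, interactive=False) before the first step, so that a missing token stops the run before any render work starts.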
Examples
GitHub-mode (repo-aware: README scan + live-demo detection):
- /pika:explainer https://github.com/leigest519/OpenGame
- /pika:explainer https://github.com/anthropics/claude-cookbooks --focus "Claude Code MCP integration"
- /pika:explainer https://github.com/openai/whisper --preview (opt-in to the preview gate when testing a new avatar)
Generic-URL mode (any non-GitHub URL; drives through the page directly):
- /pika:explainer https://pika.art
- /pika:explainer https://vercel.com --focus "the deployment workflow"
- /pika:explainer https://docs.anthropic.com/en/docs/claude-code/plugins
- /pika:explainer https://your-product-page.com --avatar https://cdn.example.com/me.png --preview