# stardust:extract

Crawl an existing website, parse each page, extract the brand surface, and produce a stardust-formatted snapshot of the current state under `stardust/current/`. The output describes what the site is; later sub-commands consume it to decide what it should be.

This skill is descriptive: it does not invent direction, it does not critique, and it does not modify the live site. It writes only under `stardust/current/` and updates `stardust/state.json`.

## Inputs

- `<url>` — required. The origin to crawl. Examples: `https://example.com`, `https://example.com/shop`. A path narrows the same-origin crawl to that subtree.
- `--cap <N>` — optional. Override the default 5-page cap. The cap is intentionally small — a 5-page sample (home + four IA pillars/templates) is enough for cross-page brand aggregation, system-component detection, and the brand-review HTML to do useful work. Lift the cap with `--cap 25` (the previous default) or higher when a deeper crawl is genuinely needed.
- `--all` — optional. Lift the cap entirely; extract every discovered page after junk filtering. Equivalent to `--cap 0`. Use when the user spontaneously asks for a full crawl.
- `--pages <slug,slug,...>` — optional. Restrict the crawl to specific paths (slugs derived per `reference/ia-extraction.md`). Bypasses the cap.
- `--refresh <slug>` — optional. Re-extract one page that already exists in `state.json`.
- `--single` — optional. Equivalent to `--cap 1`. Useful for testing.
- `--wait <fast|medium|spec|auto>` — optional. Wait strategy per page. Default `medium`. See `reference/playwright-recipe.md` § Wait modes.
- `--no-junk-filter` — optional. Disable the default junk-page filter in discovery (see `reference/ia-extraction.md` § Filtering).
- `--no-consent-dismiss` — optional. Skip the pre-flight consent / cookie banner dismissal (see `reference/playwright-recipe.md` § Pre-flight: consent dismissal). Use when the redesign scope explicitly includes the consent surface, or when the dismissal's side effects (script activation that would not otherwise run) need to be avoided. Default behaviour is to dismiss; the contract keeps screenshots, voice aggregation, and per-section style from being polluted by the banner.
- `--prep` — optional. Run in migrate-prep mode: lift the cap, type each page, detect module candidates, capture typed content slots, emit the prep summary. See § Prep mode below. Typically invoked via the `prepare-migration` orchestrator skill rather than directly.

## Setup

Run the master skill's setup procedure first (`skills/stardust/SKILL.md` § Setup): impeccable dep check, context loader, state read.

Additional checks for this sub-command:

1. Playwright availability. The extraction step needs a real browser. Detect Playwright in this order: a Playwright MCP server, then `npx playwright`. If neither is available, stop and tell the user how to install Playwright.
2. Origin collision. If `stardust/state.json` already records `site.originUrl` and the new `<url>` is a different origin, stop and ask before clobbering. Stardust does not silently mix two sites in one project.
3. Browser context. Open a fresh `BrowserContext` for the run. Run the consent-dismissal pre-flight per `reference/playwright-recipe.md` § Pre-flight: consent dismissal unless `--no-consent-dismiss` is set. Cookies persist across the per-page loop within the same context, so one dismissal covers the whole crawl. Record the resolved method in `_crawl-log.json#consent.method`.
4. Bot-management probe. When the first navigation in the run returns `ERR_HTTP2_PROTOCOL_ERROR` or `ERR_QUIC_PROTOCOL_ERROR`, or hangs through the entire hard cap on what should be a fast origin, do not retry headless. Switch to `headless: false, channel: 'chrome'` per `reference/playwright-recipe.md` § Bot-management fallback and record the switch in `_crawl-log.json#discovery.fetchTechnique` so re-runs start in headed mode without rediscovering the issue.
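Check 2's origin-collision guard is mechanical enough to sketch. A minimal illustration, assuming `state.json` has already been parsed into an object; the function name `checkOriginCollision` is hypothetical, not part of the stardust contract:

```javascript
// Hypothetical sketch of the Setup check 2 origin-collision guard.
// The state shape (site.originUrl) follows the schema referenced above.
function checkOriginCollision(state, newUrl) {
  const recorded = state?.site?.originUrl;
  if (!recorded) return { ok: true }; // fresh project, nothing to collide with
  const a = new URL(recorded);
  const b = new URL(newUrl);
  // Same origin = same scheme + host + port. A differing path is fine:
  // it only narrows the crawl subtree.
  if (a.origin === b.origin) return { ok: true };
  return {
    ok: false,
    reason: `state.json records ${a.origin}; refusing to clobber with ${b.origin} without confirmation`,
  };
}
```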

## Procedure

### Phase 1 — Discovery

Discover the page inventory before crawling. Procedure in `reference/ia-extraction.md`. In summary:

1. Fetch `<origin>/sitemap.xml`, then `<origin>/sitemap_index.xml`, then check `robots.txt` for `Sitemap:` directives.
2. If no sitemap is reachable, run a same-origin BFS crawl from `<url>`, depth-limited to 3, extracting links from rendered HTML.
3. Filter the discovered URL list: same origin only; exclude `mailto:`, `tel:`, anchor-only links, query-only variations, and common asset paths (`.css`, `.js`, `.pdf`, image extensions).
4. De-duplicate trailing-slash variations.
5. Apply the junk-page filter (`reference/ia-extraction.md` § Junk-page filter) unless `--no-junk-filter` is set. Surface the filtered list to the user as overridable.
6. Apply the cap (default 5, or `--cap`, or `--all` for no cap) and proceed silently. Print an informational summary of what was kept and what was cut — but do not gate on user confirmation. The default cap is small enough that the common case is "extract 5 pages and move on"; pausing for a yes/no reply on every run is friction without value. Users who want a different scope set it spontaneously at command time:

   ```
   $stardust extract https://example.com              # default 5 pages
   $stardust extract https://example.com --cap 25     # bump to 25
   $stardust extract https://example.com --all        # lift the cap
   $stardust extract https://example.com --pages home,about,pricing
   $stardust extract https://example.com --single     # just the entry URL
   ```

   The agent reads spontaneous scope intent from the user's prompt (e.g. "extract all pages", "look at just the home and pricing", "do a full crawl") and applies the equivalent flag. No re-confirmation is needed once intent is clear.

   Informational output (not a prompt — proceed immediately):

   ```
   Discovered 38 pages on https://example.com (sitemap.xml).
   Filtered as likely junk (5): /test/, /sample-page/, /holiday1/, ...
   Selecting 5 highest-priority pages:
     - / (home)
     - /about
     - /pricing
     - /products
     - /contact

   Cut (28 pages, --all to lift): /blog/post-1, /blog/post-2, ...

   Extracting...
   ```

   Selection heuristic: page-type checklist first, then score-based ranking (home + IA-pillar keywords + sitemap priority − archive / version markers). See `reference/ia-extraction.md` § Page selection and § Priority for the cap. The English-only keyword list is a known limitation for localized sites.
7. Write the discovered list to `stardust/current/_crawl-log.json` (created if absent) with `_provenance` and the full discovery reasoning, including `filteredAsJunk[]` and `userChoice`. This is an audit trail, not a state file.
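Steps 3–4 and the cap in step 6 reduce to plain URL bookkeeping. A minimal sketch, assuming discovered pages arrive as URL strings; `filterDiscovered` and `applyCap` are hypothetical names, and the junk filter and priority scoring defined in `reference/ia-extraction.md` are deliberately left out:

```javascript
// Common asset paths to exclude (step 3). Extension list is illustrative.
const ASSET_RE = /\.(css|js|pdf|png|jpe?g|gif|webp|svg|avif)$/i;

function filterDiscovered(urls, origin) {
  const originKey = new URL(origin).origin;
  const seen = new Set();
  const kept = [];
  for (const raw of urls) {
    let u;
    try { u = new URL(raw, origin); } catch { continue; } // unparseable: drop
    if (u.protocol === "mailto:" || u.protocol === "tel:") continue;
    if (u.origin !== originKey) continue;                 // same origin only
    if (ASSET_RE.test(u.pathname)) continue;              // asset paths
    // Keying on pathname alone drops anchor-only and query-only variations;
    // stripping the trailing slash de-duplicates /about vs /about/ (step 4).
    const key = u.pathname.replace(/\/+$/, "") || "/";
    if (seen.has(key)) continue;
    seen.add(key);
    kept.push(u.origin + u.pathname);
  }
  return kept;
}

// Step 6: cap = 0 means --all (no cap); otherwise keep the first N
// pages after the priority ranking has ordered them.
const applyCap = (pages, cap) => (cap === 0 ? pages : pages.slice(0, cap));
```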

### Phase 2 — Per-page extraction

For each page in the cap-respecting list, render with Playwright following `reference/playwright-recipe.md`. The recipe is mandatory — in particular, do not skip the wait, scroll, or capture-list steps:

- Viewport 1440 × 900 @ 2× DPR
- Wait per the configured wait mode (default `medium`; see § Wait modes in `reference/playwright-recipe.md`)
- Disable animations via `prefers-reduced-motion: reduce`
- After the wait resolves, scroll to bottom in 4 viewport-height steps with 300 ms pauses, then return to top — this is required to trigger lazy-load and IntersectionObserver-driven content
- Record `waitMs` and `waitMode` in the per-page `_provenance`

Capture per page (full schema in `reference/current-state-schema.md`):

- Page metadata (title, meta description, OG tags, theme-color)
- Semantic structure: heading outline, landmark roles, sections
- Content: visible text per section (full innerText, no truncation per `reference/playwright-recipe.md` § Capture list 7), structured paragraphs (`body[]`), lists, FAQ Q/A pairs, and review/testimonial quotes per § Capture list 7-bis. Without these structured fields, every body region under a heading falls back to a placeholder signature at migrate time.
- CTA labels and href targets, link inventory (internal vs external)
- Per-section computed-style summary: dominant colors, font families in use, spacing rhythm, border-radius, shadows
- Media inventory: img/srcset with original URLs and intrinsic dimensions, inline SVG count, video/iframe presence, `cssBackgrounds[]` (including pseudo-element `::before`/`::after` walks per § Capture list 11) so `background-image` heroes and motifs do not silently disappear from extract.
- Font files captured via network intercept (per § Capture list 16): every `woff2`/`woff`/`ttf`/`otf` response saved under `assets/fonts/` and recorded in `_brand-extraction.json#type.files[]` with a licensing flag.
- Icon-font detection (per § Capture list 17): when the page uses `[class^="icon-"]` with a non-default `::before` font-family + codepoint, capture the family, save the file, and record the `iconClass → codepoint` table in `_brand-extraction.json#iconFont`.
- Interactive elements: forms (with field types), buttons, modals detected by ARIA roles

Save to `stardust/current/pages/<slug>.json` with `_provenance` as the first key. Save referenced media to `stardust/current/assets/media/`, preserving the basename plus a short content hash.

Live-render evidence (synthesis is forbidden). Refuse to mark a page `extracted` in `state.json` unless its `_provenance` contains `renderedBy: "playwright"`, an ISO-8601 `fetchedAt`, a positive integer `waitMs`, a `waitMode` from the recipe, and a final `httpStatus` in the 2xx/3xx range. These five fields are the contract enforced by `reference/current-state-schema.md` § Live-render evidence and read back by every downstream phase via `validateProvenance()` per `skills/stardust/reference/state-machine.md` § Provenance validation. Synthesizing a page record from `_brand-extraction.json` plus URL patterns plus captured photos — the 2026-04-30 lovesac shortcut — is the failure mode this guard exists to prevent. When the agent (or a delegated sub-agent) cannot satisfy the contract for a page, treat the page as a Phase 2 failure: record it under `_crawl-log.json#crawl.failures[]` with `errorClass: "ProvenanceMissing"` and continue.

Mark the page `extracted` in `state.json` immediately after each successful page write. If a page fails, record the error in `_crawl-log.json` and continue — extraction is best-effort per page.
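The five-field contract is mechanical, so a write-time guard can be sketched directly. A minimal illustration, assuming the per-page `_provenance` object is in hand; the real `validateProvenance()` lives in `skills/stardust/reference/state-machine.md` and may differ in detail:

```javascript
// Sketch of the live-render evidence contract described above.
// The five required fields come from reference/current-state-schema.md.
function validateProvenance(prov) {
  const errors = [];
  if (prov?.renderedBy !== "playwright")
    errors.push('renderedBy must be "playwright"');
  if (!prov?.fetchedAt || Number.isNaN(Date.parse(prov.fetchedAt)))
    errors.push("fetchedAt must be ISO-8601");
  if (!Number.isInteger(prov?.waitMs) || prov.waitMs <= 0)
    errors.push("waitMs must be a positive integer");
  if (!prov?.waitMode) errors.push("waitMode missing");
  if (!(prov?.httpStatus >= 200 && prov?.httpStatus < 400))
    errors.push("httpStatus must be 2xx/3xx");
  // A failed check means the page is a Phase 2 failure
  // (errorClass: "ProvenanceMissing"), never a silent success.
  return { ok: errors.length === 0, errors };
}
```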

### Phase 3 — Brand-surface extraction

Run once, after Phase 2 has finished, so cross-page aggregation has data to work with. Produces `stardust/current/_brand-extraction.json` per `reference/brand-surface.md`. Some fields are home-only (logo, voice samples, register heuristic); the visual tokens that drive DESIGN.md (palette, radius, shadow, type) are aggregated across all extracted pages to avoid the home-page bias documented in `brand-surface.md` § Aggregation scope. Captures:

- Logo by the v1 priority chain: inline SVG → `<img>` with logo-ish class/id → `apple-touch-icon` → `og:image` → favicon → synthesized placeholder. Save to `stardust/current/assets/logo.<ext>`.
- Palette — aggregate computed colors across all extracted pages (background, text, accents, borders, hovers). Frequency-sort, cluster near-duplicates, emit a role-named list (background, surface, text, primary, secondary, accent).
- Type — font families in use with their weights, sizes, and computed line-heights. Identify the heading family vs body family. Run the modular-scale audit (`brand-surface.md` § Modular-scale audit) and emit `scaleAudit.kind = "modular" | "ad-hoc"`.
- Motifs — signature border-radius (cross-page mode of non-zero values, weighted by element count), shadow stack (top 3 distinct, cross-page), gradient inventory, common patterns (chip, badge, card, hero-with-image). When the home-only mode disagrees with the cross-page mode, surface the divergence in `_provenance.notes`.
- Voice samples — first paragraph of body copy, the hero headline, 3 representative CTA labels, a representative link list. Used by `direct` later but extracted now so the network round-trip is over.
- Hero image — elevate the home page's primary visual asset to `voice.heroImage` (per `reference/brand-surface.md` § heroImage resolution). Without this elevation, the downstream prototype reasons over a 16-image list and frequently picks the `og:image` instead of the live hero.
- Icon font — when detected per `reference/playwright-recipe.md` § Capture list 17, populate `_brand-extraction.json#iconFont` with the family, file path, and the `iconClass → codepoint` table so prototypes can render the brand's actual icons.
- System components — cross-page repeated DOM blocks (site header, site footer, cross-promo strips, persistent CTAs, breadcrumbs). Detected by heading-sequence + CTA-label fingerprint per `reference/brand-surface.md` § System components. Required — these are usually the most load-bearing surfaces and must not silently disappear from the redesign target.

Do not invent values. Every captured value cites a source selector or URL in `_brand-extraction.json` for traceability.
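The palette step's frequency-sort and near-duplicate clustering can be sketched as follows. The RGB-distance metric and the threshold of 24 are illustrative assumptions; `brand-surface.md` owns the real algorithm, and role naming happens downstream:

```javascript
// Illustrative sketch of the Palette step: frequency-sort computed colors
// collected across pages, then fold near-duplicates by RGB distance.
function hexToRgb(hex) {
  const n = parseInt(hex.slice(1), 16);
  return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
}

function clusterPalette(samples, threshold = 24) {
  // samples: hex strings, one entry per observed element, across all pages
  const counts = new Map();
  for (const c of samples) counts.set(c, (counts.get(c) || 0) + 1);
  const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  const clusters = [];
  for (const [hex, count] of sorted) {
    const rgb = hexToRgb(hex);
    const near = clusters.find((cl) =>
      Math.hypot(...cl.rgb.map((v, i) => v - rgb[i])) < threshold);
    if (near) near.count += count;   // fold near-duplicate into its cluster
    else clusters.push({ hex, rgb, count });
  }
  return clusters.map((c) => c.hex); // frequency-ordered representatives
}
```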

### Phase 4 — Seed `stardust/current/PRODUCT.md` and `DESIGN.md`

The current-state PRODUCT.md and DESIGN.md are descriptive, not authored — there is no interview to run because the user is not defining intent here; the agent is describing the existing site. Write them directly using impeccable's format specs:

- For PRODUCT.md, follow the section structure in impeccable's `reference/teach.md`. Populate `Register` from the brand surface (sites that read as marketing/landing → `brand`; tools/dashboards → `product`; ambiguous → `brand` with a note). Populate `Users`, `Product Purpose`, `Brand Personality`, `Anti-references`, and `Design Principles` from the captured copy and the brand surface. Where the agent must infer, mark the section with `_provenance: inferred` and a one-line basis sentence.
- For DESIGN.md and DESIGN.json, follow the format spec in impeccable's `reference/document.md`. Populate frontmatter (`colors`, `typography`, `rounded`, `spacing`, `components`) from the captured tokens. The `extensions` block of DESIGN.json carries v1's `componentStyle`, `motifs`, and `voice` arrays so nothing is lost.

Stardust does not invoke `$impeccable teach` or `$impeccable document` for the current-state files: those commands write to the project root (the target) and run an interview. Stardust authors the descriptive snapshot directly. The format spec from impeccable is the contract; the runtime command is not.

The target-state PRODUCT.md and DESIGN.md at the project root are written by `$stardust direct` in Phase 2 of the pipeline, not here.

### Phase 5 — Render `stardust/current/brand-review.html`

After Phase 4 writes the descriptive PRODUCT.md and DESIGN.md, emit the current-state brand review per `reference/brand-review-template.md`.

The brand-review HTML is the first surface a human can eyeball to verify the extraction before committing to a redesign direction. Misreads in the JSON (a wrong dominant radius, a missing system component, a single-page palette bias) are obvious to the eye in five seconds and invisible in JSON until someone notices. Putting the review at the end of `extract` catches misreads while they are still cheap to fix — re-extract is fast; re-direct + re-prototype is not.

The template is mandatory. In particular:

1. Run the Tensions detectors listed in `reference/brand-review-template.md` § Detectors. Each rule is mechanical; emit a tension card whenever the trigger condition matches. The review may ship with zero tensions if the data is too thin to evaluate, but the detectors must always be run.
2. Render in the brand's own captured colors and fonts, not a stardust shell.
3. Embed all CSS; do not load external JavaScript or fonts unless the live site already does.
4. Cite the source artifact for every section (e.g. `_brand-extraction.json § type` under Typography).

If the data for a section is missing, omit the section — do not fabricate placeholders. The coverage callout at the top reflects what is missing.

### Phase 6 — Update state and report

After all Phase 2-5 writes succeed:

1. Update `stardust/state.json` (schema in `skills/stardust/reference/state-machine.md`):
   - `site.originUrl`, `site.extractedAt`, `site.pageCap`, `site.totalDiscovered`, `site.crawled`
   - `pages[]` — one entry per crawled page with `status: "extracted"`, a filled `currentStatePath`, and empty `prototypePath` and `migratedPath`
2. Print a one-screen summary:

   ```
   Extracted https://example.com (5/38 pages, sitemap.xml)

   stardust/current/
     PRODUCT.md            (register: brand, inferred from landing)
     DESIGN.md             (5 colors, 2 type families, 3 motifs)
     brand-review.html     (4 tensions surfaced)
     pages/                (5 files)
     assets/logo.svg       (extracted from inline SVG)
     _brand-extraction.json
     _crawl-log.json

   Per-page evidence:
     slug         live  waitMode               waitMs   status
     /            yes   medium                 2380     200
     /about       yes   medium                 2110     200
     /pricing     yes   medium                 1940     200
     /products    yes   medium                 2640     200
     /contact     yes   domcontentloaded(fb)   8000     200

   Wait summary: 4 resolved at medium (avg 2.4s), 1 fallback (timed out at 8s)
     → /contact may be under-captured; consider --refresh

   Open stardust/current/brand-review.html to verify the extraction
   before running $stardust direct.

   Coverage note: extracted 5 of 38 discovered pages. The brand
   surface and brand-review use cross-page aggregation, so 5 pages
   covering distinct templates is usually sufficient. To extract
   more, re-run with --cap <N> (e.g. --cap 25) or list specific
   slugs with --pages.

   Next: $stardust direct  (resolve a redesign direction)
   ```

   The per-page evidence table is mandatory. The `live` column is `yes` when `_provenance.renderedBy === "playwright"` AND `waitMs > 0`, else `no`. A `no` row means the page record was not produced by a live Playwright render — this should never happen given the write-time guard, but the visible column is the defense-in-depth signal that catches the failure mode when it does (the 2026-04-30 lovesac synthesis bug went four phases deep before being caught because no report column surfaced the missing provenance). A maintainer scanning the summary should see `yes` on every row.

   Compute the wait summary by grouping each page's `_provenance.waitMode` and averaging `waitMs`. List slugs whose `waitMode` ends in `(fallback)` (rendered as `(fb)` in the table for width) as candidates for `--refresh`.
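The wait-summary computation is simple grouping. A sketch, assuming pages carry the `_provenance` fields described in Phase 2; `waitSummary` is a hypothetical name:

```javascript
// Group pages by _provenance.waitMode, average waitMs per group, and
// flag fallback pages as --refresh candidates.
function waitSummary(pages) {
  const groups = new Map();
  const refreshCandidates = [];
  for (const p of pages) {
    const { waitMode, waitMs } = p._provenance;
    const g = groups.get(waitMode) || { count: 0, totalMs: 0 };
    g.count += 1;
    g.totalMs += waitMs;
    groups.set(waitMode, g);
    if (waitMode.endsWith("(fallback)")) refreshCandidates.push(p.slug);
  }
  const lines = [...groups.entries()].map(
    ([mode, g]) =>
      `${g.count} at ${mode} (avg ${(g.totalMs / g.count / 1000).toFixed(1)}s)`);
  return { lines, refreshCandidates };
}
```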

## Outputs

| Path | Purpose |
| --- | --- |
| `stardust/current/PRODUCT.md` | Descriptive strategy of the existing site (impeccable format) |
| `stardust/current/DESIGN.md` | Descriptive visual system (Stitch format) |
| `stardust/current/DESIGN.json` | Sidecar with extensions for motifs, voice, components |
| `stardust/current/brand-review.html` | Self-contained visual review of the extraction (first eyeball-able artifact) |
| `stardust/current/pages/<slug>.json` | Per-page parsed structure + content |
| `stardust/current/assets/logo.<ext>` | Extracted logo |
| `stardust/current/assets/media/` | Extracted media referenced by pages |
| `stardust/current/assets/screenshots/` | Per-page viewport screenshots (used by brand-review) |
| `stardust/current/_brand-extraction.json` | Consolidated brand surface (palette, type, motifs, voice, system components) |
| `stardust/current/_crawl-log.json` | Discovery + crawl audit trail |
| `stardust/state.json` | Updated with site + per-page status |

Concurrency

并发处理

Per
state-machine.md
: stardust does not lock. Two concurrent extracts on the same project are last-write-wins. Document this in the user report; do not engineer around it.
按照
state-machine.md
:Stardust不锁定资源。同一项目上的两个并发提取操作采用最后写入获胜原则。在用户报告中记录此规则,无需进行额外工程处理。

Failure modes

失败模式

  • Network failure mid-crawl. Continue, record in
    _crawl-log.json
    , end with a partial state. State.json reflects only successfully extracted pages. User can re-run; already-extracted pages are skipped unless
    --refresh <slug>
    .
  • HTTP 4xx/5xx, non-HTML content, soft-404s. Validated explicitly per
    reference/playwright-recipe.md
    § Response validation. Each produces a distinct error class (
    HTTPError
    ,
    ContentTypeError
    ,
    EmptyPageError
    ) recorded in
    _crawl-log.json#crawl.failures[]
    . Failed pages do not appear in
    state.json
    as
    extracted
    — they appear only in the failure log. Without this validation a 5xx page silently lands as an empty success and propagates wrong data to
    direct
    and
    prototype
    .
  • Login wall. Do not attempt to authenticate. If the home page redirects to a login screen, capture that one page, mark the rest as unreachable, and ask the user how to proceed (provide cookies via Playwright config, change the entry URL, or scope to public pages).
  • Bot-management block (Akamai / Cloudflare / F5 / Imperva). When the first navigation returns
    ERR_HTTP2_PROTOCOL_ERROR
    ,
    ERR_QUIC_PROTOCOL_ERROR
    , or hangs through the hard-cap on a TLS/H2 fingerprint check, the issue is JA3/H2 fingerprinting on bundled-chromium-default headless mode — not auth, not network. Switch to
    headless: false, channel: 'chrome'
    per
    reference/playwright-recipe.md
    § Bot-management fallback. Do not retry headless: it will fail identically. The headed fallback works against most enterprise / commerce origins;
    playwright-extra
    + stealth plugin is a non-standard escape hatch for the residual cases. The headed window pops visibly, which is acceptable for interactive runs and unacceptable for unattended pipelines — surface this to the user when first triggered.
  • JavaScript-only content. Playwright already handles this. If the configured wait condition never fires within the mode's hard cap (
    reference/playwright-recipe.md
    § Wait modes), fall back to
    domcontentloaded
    and capture what is rendered. Record the fallback in the per-page
    _provenance.waitMode
    and surface in the wait-summary line of the final report.
  • Synthesis attempt (forbidden). When the agent (or a delegated sub-agent) cannot run a real Playwright render for a page — whether due to time pressure, token budget, or a tool/network failure — the only correct outcome is to record the page as a Phase 2 failure (
    errorClass: "ProvenanceMissing"
    in
    _crawl-log.json#crawl.failures[]
    ) and continue. Synthesizing a page record from
    _brand-extraction.json
    plus URL patterns plus captured photos at "semantically matching" template positions is forbidden.
    This was the 2026-04-30 lovesac.com failure — 20 of 25 pages synthesized this way and the cascade ran four phases on the synthesized data before the gap was caught by a meta-question. The synthesis shortcut produces output indistinguishable from a successful run and propagates fabricated content through every downstream phase.
  • 爬取中途网络故障。继续执行,在
    _crawl-log.json
    中记录,最终生成部分状态。state.json仅反映成功提取的页面。用户可重新运行;已提取的页面会被跳过,除非使用
    --refresh <slug>
  • HTTP 4xx/5xx、非HTML内容、软404。按照
    reference/playwright-recipe.md
    中的「响应验证」章节进行显式验证。每种情况都会生成不同的错误类(
    HTTPError
    ContentTypeError
    EmptyPageError
    ),记录在
    _crawl-log.json#crawl.failures[]
    中。失败页面不会在
    state.json
    中标记为
    extracted
    ——仅会出现在失败日志中。若没有此验证,5xx页面会静默作为空成功记录,并将错误数据传播至
    direct
    和
    prototype
    命令。
  • 登录墙。不尝试认证。若首页重定向至登录界面,捕获该页面,标记其余页面为不可访问,并询问用户如何处理(通过Playwright配置提供cookie、更改入口URL或限定为公开页面)。
  • 机器人管理拦截(Akamai / Cloudflare / F5 / Imperva)。当首次导航返回
    ERR_HTTP2_PROTOCOL_ERROR
    ERR_QUIC_PROTOCOL_ERROR
    ,或在TLS/H2指纹检查中超时达到硬限制时,问题出在捆绑Chromium默认无头模式的JA3/H2指纹——而非认证或网络问题。按照
    reference/playwright-recipe.md
    中的「机器人管理回退方案」切换为
    headless: false, channel: 'chrome'
    不要重试无头模式:会再次失败。有头回退方案对大多数企业/电商源地址有效;
    playwright-extra
    + stealth插件是剩余情况的非标准解决方案。有头窗口会可见弹出,这在交互式运行中可接受,但在无人值守管道中不可接受——首次触发时需告知用户。
  • 纯JavaScript内容。Playwright已处理此情况。若配置的等待条件在模式的硬限制内始终未触发(详见
    reference/playwright-recipe.md
    中的「等待模式」章节),回退至
    domcontentloaded
    并捕获已渲染的内容。在单页的
    _provenance.waitMode
    中记录回退操作,并在最终报告的等待摘要行中显示。
  • 合成尝试(禁止)。当智能体(或委托的子智能体)无法为页面运行真实的Playwright渲染时——无论是由于时间压力、令牌预算还是工具/网络故障——唯一正确的结果是将页面标记为阶段2失败(在
    _crawl-log.json#crawl.failures[]
    中记录
    errorClass: "ProvenanceMissing"
    )并继续执行。**禁止从
    _brand-extraction.json
    、URL模式和捕获的照片合成页面记录,即使是在「语义匹配」的模板位置。**这正是2026-04-30 lovesac.com的失败原因——25个页面中有20个以此方式合成,且错误数据在4个阶段后才被元问题发现。合成捷径生成的输出与成功运行无法区分,并会将伪造内容传播至每个下游阶段。

Prep mode (--prep)

准备模式(--prep)

When invoked with
--prep
, extract runs an extended pass that prepares the inventory for migration. Discovery-mode runs (without
--prep
) are unchanged: small cap, no typing, no module detection, presales-friendly.
--prep
is the gesture that says "the user is committing to migrate; build the data structure migrate consumes."
--prep
adds five things on top of the standard procedure:
使用
--prep
调用时,extract命令会运行扩展流程,为迁移准备页面清单。未使用
--prep
的发现模式运行保持不变:小页数限制、无页面分类、无模块检测、适合售前场景。
--prep
表示「用户已承诺进行迁移;构建迁移所需的数据结构」。
--prep
在标准流程基础上添加了5项内容:

1. Lift the cap

1. 解除页数限制

--prep
implies
--all
. Migration coverage requires the full inventory — the small discovery cap (5 pages) is insufficient. The cap-respecting selection logic from
reference/ia-extraction.md
§ Page selection still applies for ordering and junk-filtering; it just doesn't truncate.
--prep
隐含
--all
。迁移覆盖需要完整的页面清单——发现模式的小页数限制(5页)并不足够。
reference/ia-extraction.md
中的「页面选择」章节的页数限制选择逻辑仍适用于排序和垃圾过滤;只是不再截断。
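The truncation decision can be sketched as follows (illustrative names; the ordering and junk-filtering from reference/ia-extraction.md § Page selection are assumed to have already produced `ordered`):

```typescript
interface CrawlOptions {
  cap: number;    // --cap <N>; 0 means no cap
  all?: boolean;  // --all, equivalent to --cap 0
  prep?: boolean; // --prep implies --all
}

// Only the truncation step is shown; selection order is preserved.
function applyCap(ordered: string[], opts: CrawlOptions): string[] {
  const unlimited = opts.prep || opts.all || opts.cap === 0;
  return unlimited ? ordered : ordered.slice(0, opts.cap);
}
```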

Sub-agent prompt requirements (when delegating)

子智能体提示要求(委托时)

When
--prep
is heavy enough that the agent delegates extraction to a sub-agent (a presales-shaped pattern when the inventory is large), the sub-agent prompt must:
  1. Forbid synthesis by name. The literal sentence *"do not synthesize a page record from
    _brand-extraction.json
    + URL patterns + captured photos; every page must be a live Playwright render"* must appear in the prompt. The earlier wording *"must actually invoke Playwright per page"* was satisfiable in spirit by synthesis-with-photo-reuse and produced the lovesac failure. Naming the shortcut explicitly closes that loophole.
  2. Require a per-page evidence table in the return. Columns:
    slug | waitMode | waitMs | fetchedAt | httpStatus
    . The parent agent reads this table on completion and aborts if any row is missing or shows
    waitMs: 0
    .
  3. Require the wait-summary line in the return, formatted identically to Phase 6's wait summary, so the parent can surface it in the user-facing report without reformatting.
These three are mandatory; missing any of them in the sub-agent prompt is itself a recipe violation. The cascade-level guard in
prepare-migration
validates the resulting per-page JSONs via
validateProvenance()
regardless — but a well-formed sub-agent return makes the failure cheaper to diagnose.
当
--prep
任务过重,智能体将提取委托给子智能体时(页面清单较大时的售前典型模式),子智能体的提示必须包含:
  1. 明确禁止合成。必须包含字面句子:"不得从
    _brand-extraction.json
    + URL模式 + 捕获的照片合成页面记录;每个页面都必须是Playwright实时渲染的结果"
    。之前的措辞*"必须为每个页面实际调用Playwright"*可被合成+照片复用的方式满足,导致了lovesac失败。明确命名该捷径可关闭此漏洞。
  2. 要求返回单页证据表。列:
    slug | waitMode | waitMs | fetchedAt | httpStatus
    。父智能体在完成时读取此表,若任何行缺失或
    waitMs: 0
    则中止操作。
  3. 要求返回等待摘要行,格式与阶段6的等待摘要完全相同,以便父智能体无需重新格式化即可在面向用户的报告中显示。
这三项为强制要求;子智能体提示中缺少任何一项均违反规则。
prepare-migration
中的级联防护会通过
validateProvenance()
验证最终的单页JSON——但格式良好的子智能体返回可降低故障诊断成本。
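The parent-side evidence-table check (requirement 2 above) might look like the following sketch; `checkEvidence` is an illustrative helper, not a named API in this skill:

```typescript
interface EvidenceRow {
  slug: string;
  waitMode: string;
  waitMs: number;
  fetchedAt: string;
  httpStatus: number;
}

// Returns the slugs that should trigger an abort: any expected slug missing
// from the sub-agent's evidence table, or present with waitMs: 0 (no real
// render). A non-empty result means ProvenanceMissing territory.
function checkEvidence(expected: string[], rows: EvidenceRow[]): string[] {
  const bySlug = new Map(rows.map(r => [r.slug, r]));
  return expected.filter(slug => {
    const row = bySlug.get(slug);
    return !row || row.waitMs === 0;
  });
}
```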

2. Page typing

2. 页面分类

For each extracted page, infer the
type
field from URL pattern and content shape (LLM judgment). Catalog from
skills/stardust/reference/state-machine.md
§ Page types:
landing | article | listing | program | form | static | unique
.
Write the inferred type to
state.json.pages[].type
. The user confirms or refines during
direct --prep
. Discovery-mode runs leave
type
as
null
.
为每个提取的页面,从URL模式和内容形态推断
type
字段(LLM判断)。分类来自
skills/stardust/reference/state-machine.md
中的「页面类型」章节:
landing | article | listing | program | form | static | unique
将推断的类型写入
state.json.pages[].type
。用户可在
direct --prep
期间确认或细化。发现模式运行时
type
null
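The catalog check before writing `state.json.pages[].type` can be sketched mechanically (the inference itself is LLM judgment and is not modeled here; `normalizeType` is an illustrative name):

```typescript
const PAGE_TYPES = [
  "landing", "article", "listing", "program", "form", "static", "unique",
] as const;
type PageType = (typeof PAGE_TYPES)[number];

// Discovery-mode runs keep type as null; prep-mode values outside the
// state-machine.md catalog also fall back to null for direct --prep to fix.
function normalizeType(inferred: string | null, prep: boolean): PageType | null {
  if (!prep || inferred === null) return null;
  return (PAGE_TYPES as readonly string[]).includes(inferred)
    ? (inferred as PageType)
    : null;
}
```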

3. Module candidate detection

3. 模块候选检测

After Phase 3 (brand-surface extraction), scan extracted pages for recurring structural patterns. A pattern that appears in N+ pages with similar shape (same sequence of elements, same
data-section
/
data-purpose
, similar text shape) is surfaced as a module candidate.
阶段3(品牌资产提取)完成后,扫描提取的页面以查找重复的结构模式。在N+个页面中出现的、形态相似(元素序列相同、
data-section
/
data-purpose
相同、文本形态相似)的模式会被标记为模块候选。

Signal-source priority

信号源优先级

Detection consumes per-page captured fields in this priority order. Each higher signal is weighted more heavily in the match-score; lower signals are tie-breakers and corroboration, not primary evidence. The priority exists because higher-up fields are explicitly extracted and structured (no parsing ambiguity), while the bottom of the list (
landmarks[].innerText
substring search) is fragile against capture variations and was the source of the 2026-04-29 sliccy.com under-detection (0 hits for
pre-footer-shell
, 1 of 2 hits for
install-tile
— both modules genuinely present on every page, both invisible because the substrings being searched lived past the truncation boundary that has since been removed).
  1. pages/<slug>.json#headings[]
    — cross-page repeats of the same heading text in the same level. Highest signal: structured, explicit, captured in full regardless of body length.
  2. pages/<slug>.json#ctas[]
    labels
    — cross-page repeats of the same CTA label appearing on similar surfaces.
  3. pages/<slug>.json#media.cssBackgrounds[]
    URLs
    — same asset URL on multiple pages is a strong system-component signal (already specced as a system-component candidate in
    reference/brand-surface.md
    § Cross-page CSS-background reuse; module detection consumes the same signal at finer granularity).
  4. pages/<slug>.json#forms[]
    actions
    — cross-page repeats of the same form
    action
    URL. Newsletter / contact / search forms are the typical hits.
  5. pages/<slug>.json#components.componentsByLandmark
    when present (per future
    current-state-schema.md
    extension): per-landmark counts of cards / grids / etc.
  6. Substring search in
    landmarks[].innerText
    — lowest signal. Use only as corroboration once a candidate has already passed the higher-signal checks; never as the primary detector.
A candidate that fires on signals 1 + 2 above the threshold is high-confidence; a candidate that fires only on 6 should be treated as speculative and surfaced as such for the user to confirm in
direct --prep
.
Candidate output is a draft entry under
DESIGN.json.extensions.modules[]
:
json
{
  "id": "candidate-<short-hash>",
  "slots": [
    { "name": "<inferred>", "type": "text|link|image|...", "required": false }
  ],
  "instances": [
    { "slug": "home",   "selector": "..." },
    { "slug": "donate", "selector": "..." }
  ],
  "status": "candidate"
}
The
status: "candidate"
flag distinguishes draft entries from confirmed modules.
direct --prep
is where the user names them and promotes (or prunes).
检测按以下优先级使用单页捕获字段。优先级越高的信号在匹配得分中的权重越大;低优先级信号仅用于打破平局和佐证,而非主要证据。优先级存在的原因是上方字段为显式提取的结构化字段(无解析歧义),而列表底部(
landmarks[].innerText
子字符串搜索)易受捕获变化影响,是2026-04-29 sliccy.com检测不足的原因(
pre-footer-shell
命中0次,
install-tile
命中2次中的1次——这两个模块确实存在于每个页面,但由于搜索的子字符串超出了现已移除的截断边界而无法被检测到)。
  1. pages/<slug>.json#headings[]
    — 跨页面重复出现的相同层级的相同标题文本。最高信号:结构化、显式、无论正文长度如何均完整捕获。
  2. pages/<slug>.json#ctas[]
    标签
    — 跨页面在相似界面上重复出现的相同CTA标签。
  3. pages/<slug>.json#media.cssBackgrounds[]
    URL
    — 同一资产URL出现在多个页面上是强系统组件信号(已在
    reference/brand-surface.md
    中的「跨页面CSS背景复用」章节中指定为系统组件候选;模块检测在更细粒度上使用相同信号)。
  4. pages/<slug>.json#forms[]
    动作
    — 跨页面重复出现的相同表单
    action
    URL。典型命中为通讯订阅/联系/搜索表单。
  5. pages/<slug>.json#components.componentsByLandmark
    (若存在,基于未来的
    current-state-schema.md
    扩展):每个地标的卡片/网格等组件计数。
  6. landmarks[].innerText
    子字符串搜索
    — 最低信号。仅在候选已通过高信号检查后用作佐证;绝不能作为主要检测器。
通过信号1+2且超过阈值的候选为高置信度;仅通过信号6的候选应视为推测性,在
direct --prep
中展示给用户确认。
候选输出为
DESIGN.json.extensions.modules[]
下的草稿条目:
json
{
  "id": "candidate-<short-hash>",
  "slots": [
    { "name": "<inferred>", "type": "text|link|image|...", "required": false }
  ],
  "instances": [
    { "slug": "home",   "selector": "..." },
    { "slug": "donate", "selector": "..." }
  ],
  "status": "candidate"
}
status: "candidate"
标记区分草稿条目与已确认模块。
direct --prep
是用户为模块命名并升级(或删除)的阶段。
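The highest-signal detector (signal 1, cross-page repeats of the same heading text at the same level) can be sketched as a counting pass. Names here are illustrative; thresholds and the lower-signal corroboration steps are omitted:

```typescript
interface PageHeadings {
  slug: string;
  headings: { level: number; text: string }[];
}

// Headings repeated at the same level on >= minPages pages become module
// candidates; lower signals (CTAs, CSS backgrounds, forms) would only
// corroborate or break ties, per the priority order above.
function headingCandidates(
  pages: PageHeadings[],
  minPages: number,
): Map<string, string[]> {
  const seen = new Map<string, Set<string>>();
  for (const page of pages) {
    for (const h of page.headings) {
      const key = `h${h.level}:${h.text.trim().toLowerCase()}`;
      const slugs = seen.get(key) ?? new Set<string>();
      slugs.add(page.slug);
      seen.set(key, slugs);
    }
  }
  const out = new Map<string, string[]>();
  for (const [key, slugs] of seen) {
    if (slugs.size >= minPages) out.set(key, [...slugs]);
  }
  return out;
}
```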

4. Typed content slots

4. 带类型的内容插槽

Per-page JSON (
current/pages/<slug>.json
) gains a
slots
section that identifies content slots per page-type:
  • article
    pages:
    headline
    ,
    deck
    ,
    byline
    ,
    meta
    ,
    lead-image
    ,
    body
    ,
    pullquotes[]
    ,
    related[]
  • listing
    pages:
    index-headline
    ,
    filter-controls
    ,
    card-grid
    with typed sub-slots per card
  • program
    pages:
    program-headline
    ,
    summary
    ,
    feature-grid
    ,
    cta-band
  • landing
    ,
    form
    ,
    static
    — typed slots inferred per content shape
Schema additions live in
reference/current-state-schema.md
§ Typed slots (extend that doc separately).
单页JSON(
current/pages/<slug>.json
)新增
slots
章节,为每个页面类型识别内容插槽:
  • article
    页面:
    headline
    deck
    byline
    meta
    lead-image
    body
    pullquotes[]
    related[]
  • listing
    页面:
    index-headline
    filter-controls
    、带类型子插槽的
    card-grid
  • program
    页面:
    program-headline
    summary
    feature-grid
    cta-band
  • landing
    form
    static
    — 根据内容形态推断类型插槽
Schema扩展位于
reference/current-state-schema.md
中的「带类型的插槽」章节(需单独扩展该文档)。
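As a sketch of what the article slots section might look like in TypeScript terms (a hypothetical shape for illustration only; the canonical schema belongs in reference/current-state-schema.md § Typed slots):

```typescript
// Illustrative shape for an article page's slots section; optional fields
// reflect slots that a given article may legitimately lack.
interface ArticleSlots {
  headline: string;
  deck?: string;
  byline?: string;
  meta?: string;
  leadImage?: string;
  body: string;
  pullquotes: string[];
  related: string[];
}

const exampleSlots: ArticleSlots = {
  headline: "Example headline",
  body: "Example body copy",
  pullquotes: [],
  related: [],
};
```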

5. Prep summary

5. 准备阶段摘要

Replace Phase 6's standard report with the prep summary format:
extract --prep complete
=======================

Inventory:    127 pages crawled (5 prior, 122 new)
Provenance:   127/127 live (every page has Playwright evidence)
Page types:   landing 1 · article 84 · listing 6 · program 12 · form 3 · static 18 · unique 3
              (LLM-inferred; refine in direct --prep)

Module candidates: 8
  hotline-211         5 instances  (home, get-help, donate, news, programs)
  donate-band         12 instances (home, donate, news, all article footers)
  story-card          7 instances  (home, news, programs)
  ...

Typed slots:  filled per page-type (see current/pages/<slug>.json § slots)

Next: $stardust direct --prep  (confirm types, name modules)
The
Provenance: <live>/<total> live
line is mandatory in prep-mode output. When the ratio is anything other than
<total>/<total>
the prep run has failed the synthesis guard; list the affected slugs as a sub-bullet and treat the prep run as incomplete (the cascade-level guard in
prepare-migration
SKILL.md surfaces the same check between phases).
Default mode (no
--prep
) is unchanged. The flag is intended for the
prepare-migration
orchestrator, though direct invocation is supported.
用准备阶段摘要格式替换阶段6的标准报告:
extract --prep complete
=======================

页面清单:    已爬取127个页面(5个已存在,122个新增)
来源验证:   127/127为实时渲染(每个页面都有Playwright证据)
页面类型:   landing 1 · article 84 · listing 6 · program 12 · form 3 · static 18 · unique 3
              (LLM推断;在direct --prep中细化)

模块候选: 8个
  hotline-211         5个实例  (首页、获取帮助、捐赠、新闻、项目)
  donate-band         12个实例 (首页、捐赠、新闻、所有文章页脚)
  story-card          7个实例  (首页、新闻、项目)
  ...

带类型插槽: 按页面类型填充(详见current/pages/<slug>.json § slots)

下一步: $stardust direct --prep (确认类型,为模块命名)
来源验证: <live>/<total> live
行在准备模式输出中为强制要求。当比例不是
<total>/<total>
时,准备运行未通过合成防护;以子条目形式列出受影响的slug,并将准备运行视为未完成(
prepare-migration
SKILL.md中的级联防护会在阶段间进行相同检查)。
默认模式(无
--prep
)保持不变。该标志供
prepare-migration
编排器使用,但也支持直接调用。

References

参考文档

  • reference/playwright-recipe.md
    — viewport, capture list, logo locator chain.
  • reference/ia-extraction.md
    — sitemap + BFS crawl + cap procedure.
  • reference/current-state-schema.md
    — per-page JSON schema.
  • reference/brand-surface.md
    — consolidated brand-surface schema.
  • reference/brand-review-template.md
    — current-state brand-review HTML contract + Tensions detectors.
  • skills/stardust/reference/state-machine.md
    — state.json contract.
  • skills/stardust/reference/artifact-map.md
    — provenance shape.
  • reference/playwright-recipe.md
    — 视口、捕获列表、logo定位链。
  • reference/ia-extraction.md
    — 站点地图 + BFS爬取 + 页数限制流程。
  • reference/current-state-schema.md
    — 单页JSON schema。
  • reference/brand-surface.md
    — 整合后的品牌资产schema。
  • reference/brand-review-template.md
    — 当前状态品牌审核HTML约定 + 张力检测器。
  • skills/stardust/reference/state-machine.md
    — state.json约定。
  • skills/stardust/reference/artifact-map.md
    — 来源信息形态。