swain-search

<!-- swain-model-hint: opus, effort: high -->

Collect, normalize, and cache source materials into reusable troves that swain-design artifacts can reference.

Mode detection

| Signal | Mode |
|---|---|
| No trove exists for the topic, or user says "research X" / "gather sources" | Create: new trove |
| Trove exists and user provides new sources or says "add to" / "extend" | Extend: add sources to existing trove |
| Trove exists and user says "refresh" or sources are past TTL | Refresh: re-fetch stale sources |
| User asks "what troves do we have" or "find sources about X" | Discover: search existing troves by tag |
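The signal-to-mode mapping above can be sketched as a small shell function. This is only an illustration: `detect_mode`, its arguments, and the exact phrases it matches are assumptions, and the TTL-based Refresh trigger is left out because it needs the manifest contents.

```bash
# Hypothetical helper: map the signals listed above to a mode name.
# Arguments: <troves-root> <trove-id> <user-request>
detect_mode() {
  local root="$1" trove_id="$2" request
  # Lowercase the request so phrase matching is case-insensitive.
  request=$(printf '%s' "$3" | tr '[:upper:]' '[:lower:]')

  # Discovery questions apply whether or not a trove exists.
  case "$request" in
    *"what troves"*|*"find sources about"*) echo "discover"; return ;;
  esac

  # Without a manifest there is nothing to extend or refresh.
  if [ ! -f "$root/$trove_id/manifest.yaml" ]; then
    echo "create"; return
  fi

  case "$request" in
    *refresh*)           echo "refresh" ;;
    *"add to"*|*extend*) echo "extend" ;;
    *)                   echo "unknown" ;;  # assumption: ask the user instead of guessing
  esac
}

detect_mode docs/troves websocket-vs-sse "research WebSocket vs SSE"
```

A real agent would usually infer the mode from context rather than from literal substrings; the function only makes the order of the checks explicit.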

Create mode

Build a new trove from scratch.

Step 1 — Gather inputs

Ask the user (or infer from context) for:
  1. Trove ID: a slug for the topic (e.g., `websocket-vs-sse`). Suggest one if the context is clear.
  2. Tags: keywords for discovery (e.g., `real-time`, `websocket`, `sse`)
  3. Sources: any combination of:
    • Web search queries ("search for WebSocket vs SSE comparisons")
    • URLs (web pages, forum threads, docs)
    • Video/audio URLs
    • Local file paths
  4. Freshness TTL overrides: optional; the defaults are fine for most troves
If invoked from swain-design (e.g., spike entering Active), the artifact context provides the topic, tags, and sometimes initial sources.

Step 2 — Collect and normalize

For each source, use the appropriate capability. Read `skills/swain-search/references/normalization-formats.md` for the exact markdown structure per source type.
Web search queries:
  1. Use a web search capability to find relevant results
  2. Select the top 3-5 most relevant results
  3. For each: fetch the page, normalize to markdown per the web page format
  4. If no web search capability is available, tell the user and skip
Web page URLs:
  1. Fetch the page using a browser or page-fetching capability
  2. Strip boilerplate (nav, ads, sidebars, cookie banners)
  3. Normalize to markdown per the web page format
  4. If the fetch fails, record the URL in the manifest with a `failed: true` flag and move on
Video/audio URLs:
  1. Use a media transcription capability to get the transcript
  2. Normalize to markdown per the media format (timestamps, speaker labels, key points)
  3. If no transcription capability is available, tell the user and skip — or accept a pre-made transcript
Local files:
  1. Use a document conversion capability (PDF, DOCX, etc.) or read directly if already markdown
  2. Normalize per the document format
  3. For markdown files: add frontmatter only, preserve content
Forum threads / discussions:
  1. Fetch and normalize per the forum format (chronological, author-attributed)
  2. Flatten nested threads to chronological order with reply-to context
Repositories:
  1. Clone or read the repository contents
  2. Mirror the original directory tree under `sources/<source-id>/`
  3. Default: mirror the full tree. For large repositories (thousands of files), ingest selectively and set `selective: true` in the manifest entry
  4. Populate the `highlights` array with paths to the most important files (relative to the source-id directory)
Documentation sites:
  1. Crawl or fetch the documentation site
  2. Mirror the section hierarchy under `sources/<source-id>/`
  3. Default: mirror the full site. For large sites, ingest selectively and set `selective: true`
  4. Populate the `highlights` array with paths to the most important pages
  5. Preserve internal link structure where possible
Each normalized source gets a slug-based source ID and lives in a directory-per-source layout:
  • Flat sources (web, forum, media, document, local): `sources/<source-id>/<source-id>.md`
  • Hierarchical sources (repository, documentation-site): `sources/<source-id>/` with the original tree mirrored inside
Source ID generation:
  • Derive the source ID as a slug from the source title or URL (e.g., `mdn-websocket-api`, `strangeloop-2025-realtime`)
  • When a slug collides with an existing source ID: append `__word1-word2` using two random words from `skills/swain-search/references/wordlist.txt`
  • If the wordlist is missing, append `__` followed by 4 hex characters (e.g., `__a3f8`) as a fallback

Step 3 — Generate manifest

Create `manifest.yaml` following the schema in `skills/swain-search/references/manifest-schema.md`. Include:
  • Trove metadata (id, created date, tags)
  • Default freshness TTL per source type
  • One entry per source with provenance (URL/path, fetch date, content hash, type)
Compute content hashes as bare hex SHA-256 digests (no prefix) of the normalized markdown content:
bash
shasum -a 256 sources/mdn-websocket-api/mdn-websocket-api.md | cut -d' ' -f1

Step 4 — Generate synthesis

Create `synthesis.md`, a structured distillation of key findings across all sources.
Structure the synthesis by theme, not by source. Group related findings together, cite sources by ID, and surface:
  • Key findings — what the sources collectively say about the topic
  • Points of agreement — where sources converge
  • Points of disagreement — where sources conflict or present alternatives
  • Gaps — what the sources don't cover that might matter
Keep it concise. The synthesis is a starting point, not a comprehensive report — the user or artifact author will refine it.

Step 5 — Commit and stamp

Use the dual-commit pattern (same as swain-design lifecycle stamps) to give the trove a reachable commit hash.
Before Commit A, append a `history` entry to `manifest.yaml` with a `--` placeholder for the commit hash:
yaml
history:
  - event: created
    date: 2026-03-09
    commit: "--"
    sources: 3
Commit A — commit the trove content:
bash
git add docs/troves/<trove-id>/
git commit -m "research(<trove-id>): create trove with N sources"
TROVE_HASH=$(git rev-parse HEAD)
Commit B — back-fill the commit hash into the history entry, then update the referencing artifact's frontmatter (if one exists):
bash
# Replace "--" with the real hash in the history entry
# Update artifact frontmatter: trove: <trove-id>@<TROVE_HASH>
git add docs/troves/<trove-id>/manifest.yaml
git add docs/<artifact-type>/<phase>/<artifact-dir>/  # if artifact exists
git commit -m "docs(<trove-id>): stamp history hash ${TROVE_HASH:0:7}"

If no referencing artifact exists yet (standalone research), Commit B still stamps the history entry — report the hash so it can be referenced later.
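The back-fill in Commit B is described only as a comment, so here is one possible way to do it. `stamp_history_hash` is a hypothetical helper, and it assumes the placeholder appears in the manifest exactly as `commit: "--"` and that the stamped hash should stay quoted.

```bash
# Hypothetical helper: replace the first placeholder hash in a manifest.
# Assumes the placeholder line contains exactly: commit: "--"
stamp_history_hash() {
  local manifest="$1" hash="$2" tmp
  tmp=$(mktemp)
  # Rewrite only the first matching line; already-stamped entries stay untouched.
  awk -v h="$hash" '
    !done && /commit: "--"/ { sub(/"--"/, "\"" h "\""); done = 1 }
    { print }
  ' "$manifest" > "$tmp" && cat "$tmp" > "$manifest"
  rm -f "$tmp"
}

# Typical use between Commit A and Commit B:
#   TROVE_HASH=$(git rev-parse HEAD)
#   stamp_history_hash "docs/troves/<trove-id>/manifest.yaml" "$TROVE_HASH"
```

Writing through a temporary file avoids `sed -i`, whose syntax differs between GNU and BSD versions.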

Step 6 — Report

Tell the user what was created:
Trove `<trove-id>` created with N sources, committed as `<TROVE_HASH:0:7>`.
  • `docs/troves/<trove-id>/manifest.yaml`: provenance and metadata
  • `docs/troves/<trove-id>/sources/`: N normalized source files
  • `docs/troves/<trove-id>/synthesis.md`: thematic distillation
Reference from artifacts with: `trove: <trove-id>@<TROVE_HASH:0:7>`

Extend mode

Add new sources to an existing trove.
  1. Read the existing `manifest.yaml`
  2. Collect and normalize new sources (same as Create step 2)
  3. Assign slug-based source IDs to new sources (following the same ID generation rules)
  4. Append new entries to `manifest.yaml`
  5. Update the `refreshed` date
  6. Regenerate `synthesis.md`, incorporating all sources (old + new)
  7. Append a `history` entry with `event: extended` and a `commit: "--"` placeholder
  8. Commit and stamp (same dual-commit pattern as Create step 5):
    • Commit A: `git commit -m "research(<trove-id>): extend with N new sources"`
    • Capture `TROVE_HASH=$(git rev-parse HEAD)`
    • Commit B: back-fill hash in history entry, update referencing artifact frontmatter (if artifact exists)
  9. Report what was added, including the new commit hash

Refresh mode

Re-fetch stale sources and update changed content.
  1. Read `manifest.yaml`
  2. For each source, check whether the `fetched` date + `freshness-ttl` has elapsed
  3. For stale sources:
    • Re-fetch the raw content
    • Re-normalize to markdown
    • Compute new content hash
    • If hash changed: replace the source file, update manifest entry
    • If hash unchanged: update only the `fetched` date
  4. Update the `refreshed` date in the manifest
  5. If any content changed, regenerate `synthesis.md`
  6. Append a `history` entry with `event: refreshed`, `sources-changed: M`, and a `commit: "--"` placeholder
  7. Commit and stamp (same dual-commit pattern as Create step 5):
    • Commit A: `git commit -m "research(<trove-id>): refresh N sources (M changed)"`
    • Capture `TROVE_HASH=$(git rev-parse HEAD)`
    • Commit B: back-fill hash in history entry, update the frontmatter of every referencing artifact (check `referenced-by` in the manifest for all dependents)
  8. Report: "Refreshed N sources. M had changed content, K were unchanged. New hash: `<TROVE_HASH:0:7>`."
Sources with `freshness-ttl: never` are skipped during refresh.
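The staleness check in step 2 might look like the sketch below. Everything about the value formats is an assumption: dates are taken to be ISO `YYYY-MM-DD`, TTLs are taken to look like `30d` or `never`, and `is_stale` is a hypothetical name. It also relies on GNU `date -d`; BSD/macOS `date` needs a different invocation.

```bash
# Hypothetical helper. Assumes ISO dates, TTLs such as "30d" or "never", and GNU date.
# Arguments: <fetched-date> <freshness-ttl> [today]
# Returns 0 when stale, 1 when fresh, 2 when a date cannot be parsed.
is_stale() {
  local fetched="$1" ttl="$2" today="${3:-$(date -u +%F)}"
  [ "$ttl" = "never" ] && return 1      # never-expiring sources are always skipped
  local days="${ttl%d}"                 # strip the assumed "d" suffix
  local fetched_s today_s
  fetched_s=$(date -u -d "$fetched" +%s) || return 2
  today_s=$(date -u -d "$today" +%s) || return 2
  # Stale once the fetched date plus the TTL has elapsed.
  [ $(( fetched_s + days * 86400 )) -le "$today_s" ]
}

if is_stale 2026-01-01 30d 2026-03-09; then
  echo "source needs a re-fetch"
fi
```

Passing `today` explicitly keeps the check reproducible; check the real TTL syntax in `manifest-schema.md` before relying on the `30d` form.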

Discover mode

Help the user find existing troves relevant to their topic.
  1. Scan `docs/troves/*/manifest.yaml` for all troves
  2. Match against the user's query by:
    • Tag match — trove tags contain query keywords
    • Title match — trove ID slug contains query keywords
  3. For each match, show: trove ID, tags, source count, last refreshed date, referenced-by list
  4. If no matches, suggest creating a new trove
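A crude version of the scan-and-match steps above, for illustration only. `find_troves` is a hypothetical name, and the sketch assumes tags sit on a single line such as `tags: [real-time, websocket, sse]`; a manifest that lists tags one per line would need a real YAML parser.

```bash
# Hypothetical helper: print trove IDs whose ID or tags mention a keyword.
# Assumes single-line tags, e.g.  tags: [real-time, websocket, sse]
# Arguments: <troves-root> <keyword>
find_troves() {
  local root="$1" keyword="$2" manifest trove_id
  for manifest in "$root"/*/manifest.yaml; do
    [ -f "$manifest" ] || continue      # the glob matched nothing
    trove_id=$(basename "$(dirname "$manifest")")
    # Title match on the ID slug, or tag match on the tags line (case-insensitive).
    if printf '%s\n' "$trove_id" | grep -qiF -- "$keyword" ||
       grep -i '^tags:' "$manifest" | grep -qiF -- "$keyword"; then
      echo "$trove_id"
    fi
  done
}

find_troves docs/troves websocket
```

An empty result corresponds to step 4: suggest creating a new trove.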

Graceful degradation

The skill references capabilities generically. When a capability isn't available:
| Capability | Fallback |
|---|---|
| Web search | Skip search-based sources. Tell user: "No web search capability available; provide URLs directly or add a search MCP." |
| Browser / page fetcher | Try basic URL fetch. If that fails: "Can't fetch this URL; paste the content or provide a local file." |
| Media transcription | "No transcription capability available; provide a pre-made transcript file, or add a media conversion tool." |
| Document conversion | "Can't convert this file type; provide a markdown version, or add a document conversion tool." |
Never fail the entire run because one capability is missing. Collect what you can, skip what you can't, and report clearly.

Capability detection

Before collecting sources, check what's available. Look for tools matching these patterns — the exact tool names vary by installation:
  • Web search: tools with "search" in the name (e.g., `brave_web_search`, `bing-search-to-markdown`)
  • Page fetching: tools with "fetch", "webpage", or "browser" in the name (e.g., `fetch_content`, `webpage-to-markdown`, `browser_navigate`)
  • Media transcription: tools with "audio", "video", or "youtube" in the name (e.g., `audio-to-markdown`, `youtube-to-markdown`)
  • Document conversion: tools with "pdf", "docx", "pptx", or "xlsx" in the name (e.g., `pdf-to-markdown`, `docx-to-markdown`)
Report available capabilities at the start of collection so the user knows what will and won't work.
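As an illustration of the name patterns above, the following sketch sorts tool names into capability buckets. `classify_tool` and `report_capabilities` are hypothetical names, and how the list of installed tools is obtained is left open because it depends on the installation.

```bash
# Hypothetical helper: classify one tool name by the substrings listed above.
classify_tool() {
  local name
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *search*)                     echo "web-search" ;;
    *audio*|*video*|*youtube*)    echo "media-transcription" ;;
    *pdf*|*docx*|*pptx*|*xlsx*)   echo "document-conversion" ;;
    *fetch*|*webpage*|*browser*)  echo "page-fetching" ;;
    *)                            echo "unknown" ;;
  esac
}

# Print one "capability: tool" line per tool name, grouped by capability.
report_capabilities() {
  local tool
  for tool in "$@"; do
    printf '%s: %s\n' "$(classify_tool "$tool")" "$tool"
  done | sort
}

report_capabilities brave_web_search fetch_content pdf-to-markdown
```

The `search` pattern is checked first so that a name like `bing-search-to-markdown` counts as web search rather than anything else.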

Linking from artifacts

Artifacts reference troves in frontmatter:
yaml
trove: websocket-vs-sse@abc1234
The format is `<trove-id>@<commit-hash>`. The commit hash pins the trove to a specific version: troves evolve over time as sources are added or refreshed, and the hash ensures reproducibility.
The dual-commit workflow in Create step 5, Extend step 8, and Refresh step 7 handles this automatically — Commit A records the trove content and Commit B stamps the hash into the history entry and referencing artifact's frontmatter. Do not defer this to the operator.
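To show what the pinned reference makes possible, here is a small sketch that splits a reference into its parts. `parse_trove_ref` and the variables `TROVE_ID` and `TROVE_COMMIT` are hypothetical names, not something the skill defines.

```bash
# Hypothetical helper: split "<trove-id>@<commit-hash>" into two shell variables.
parse_trove_ref() {
  local ref="$1"
  case "$ref" in
    *@*) ;;                                   # looks like id@hash, continue
    *) echo "not a trove reference: $ref" >&2; return 1 ;;
  esac
  TROVE_ID="${ref%@*}"        # everything before the last "@"
  TROVE_COMMIT="${ref##*@}"   # everything after the last "@"
}

parse_trove_ref "websocket-vs-sse@abc1234" &&
  echo "trove $TROVE_ID pinned at $TROVE_COMMIT"

# With both parts in hand, the pinned synthesis can be read from history, e.g.:
#   git show "$TROVE_COMMIT:docs/troves/$TROVE_ID/synthesis.md"
```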