swain-search

Collect, normalize, and cache source materials into reusable troves that swain-design artifacts can reference.

Mode detection

| Signal | Mode |
| --- | --- |
| No trove exists for the topic, or user says "research X" / "gather sources" | Create: new trove |
| Trove exists and user provides new sources or says "add to" / "extend" | Extend: add sources to existing trove |
| Trove exists and user says "refresh" or sources are past TTL | Refresh: re-fetch stale sources |
| User asks "what troves do we have" or "find sources about X" | Discover: search existing troves by tag |
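
The signals in the table can be read as a small decision function. The sketch below is illustrative only: the `Mode` enum, the `detect_mode` name, its boolean parameters, the keyword checks, and the precedence between overlapping signals are all assumptions, not part of the skill.

```python
from enum import Enum


class Mode(Enum):
    CREATE = "create"
    EXTEND = "extend"
    REFRESH = "refresh"
    DISCOVER = "discover"


def detect_mode(request: str, trove_exists: bool,
                has_new_sources: bool = False,
                has_stale_sources: bool = False) -> Mode:
    """Map the signals from the table onto a mode (keyword lists are illustrative)."""
    text = request.lower()
    # Discovery questions are about troves in general, so check them first.
    if "what troves" in text or "find sources" in text:
        return Mode.DISCOVER
    if not trove_exists or text.startswith("research ") or "gather sources" in text:
        return Mode.CREATE
    # Assumption: an explicit verb from the user outranks an inferred signal.
    if "refresh" in text:
        return Mode.REFRESH
    if has_new_sources or "add to" in text or "extend" in text:
        return Mode.EXTEND
    if has_stale_sources:
        return Mode.REFRESH
    # Assumption: with an existing trove and no clearer signal, show what exists.
    return Mode.DISCOVER
```

In practice the mode is inferred from conversation context rather than string matching, so treat the keyword checks as placeholders for that judgment.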

Create mode

Build a new trove from scratch.

Step 1 — Gather inputs

Ask the user (or infer from context) for:

1. Trove ID: a slug for the topic (e.g., `websocket-vs-sse`). Suggest one if the context is clear.
2. Tags: keywords for discovery (e.g., `real-time`, `websocket`, `sse`)
3. Sources: any combination of:
   - Web search queries ("search for WebSocket vs SSE comparisons")
   - URLs (web pages, forum threads, docs)
   - Video/audio URLs
   - Local file paths
4. Freshness TTL overrides: optional; the defaults are fine for most troves

If invoked from swain-design (e.g., spike entering Active), the artifact context provides the topic, tags, and sometimes initial sources.

Step 2 — Collect and normalize

For each source, use the appropriate capability. Read `skills/swain-search/references/normalization-formats.md` for the exact markdown structure per source type.

Web search queries:

- Use a web search capability to find relevant results
- Select the 3-5 most relevant results
- For each: fetch the page, then normalize to markdown per the web page format
- If no web search capability is available, tell the user and skip

Web page URLs:

- Fetch the page using a browser or page-fetching capability
- Strip boilerplate (nav, ads, sidebars, cookie banners)
- Normalize to markdown per the web page format
- If the fetch fails, record the URL in the manifest with a `failed: true` flag and move on

Video/audio URLs:

- Use a media transcription capability to get the transcript
- Normalize to markdown per the media format (timestamps, speaker labels, key points)
- If no transcription capability is available, tell the user and skip, or accept a pre-made transcript

Local files:

- Use a document conversion capability (PDF, DOCX, etc.) or read directly if already markdown
- Normalize per the document format
- For markdown files: add frontmatter only, preserve content

Forum threads / discussions:

- Fetch and normalize per the forum format (chronological, author-attributed)
- Flatten nested threads to chronological order with reply-to context

Repositories:

- Clone or read the repository contents
- Mirror the original directory tree under `sources/<source-id>/`
- Default: mirror the full tree. For large repositories (thousands of files), ingest selectively and set `selective: true` in the manifest entry
- Populate the `highlights` array with paths to the most important files (relative to the source-id directory)

Documentation sites:

- Crawl or fetch the documentation site
- Mirror the section hierarchy under `sources/<source-id>/`
- Default: mirror the full site. For large sites, ingest selectively and set `selective: true`
- Populate the `highlights` array with paths to the most important pages
- Preserve internal link structure where possible

Each normalized source gets a slug-based source ID and lives in a directory-per-source layout:

- Flat sources (web, forum, media, document, local): `sources/<source-id>/<source-id>.md`
- Hierarchical sources (repository, documentation-site): `sources/<source-id>/` with the original tree mirrored inside

Source ID generation:

- Derive the source ID as a slug from the source title or URL (e.g., `mdn-websocket-api`, `strangeloop-2025-realtime`)
- When a slug collides with an existing source ID: append `__word1-word2` using two random words from `skills/swain-search/references/wordlist.txt`
- If the wordlist is missing, append `__` followed by 4 hex characters (e.g., `__a3f8`) as a fallback
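
The ID rules above can be sketched as follows. This is a minimal illustration, not the skill's implementation: the function names `slugify` and `make_source_id` are hypothetical, and the exact slug rule (lowercase, runs of non-alphanumerics collapsed to a hyphen) is an assumption, since the text only says "slug".

```python
import random
import re
from pathlib import Path

WORDLIST = Path("skills/swain-search/references/wordlist.txt")


def slugify(title: str) -> str:
    """Lowercase the title and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def make_source_id(title: str, existing: set[str], wordlist: Path = WORDLIST) -> str:
    """Return a slug-based source ID that does not collide with `existing`."""
    base = slugify(title)
    if base not in existing:
        return base
    # Load the wordlist once; an absent or too-short list triggers the hex fallback.
    words = wordlist.read_text().split() if wordlist.is_file() else []
    while True:
        if len(words) >= 2:
            suffix = "-".join(random.sample(words, 2))  # two random words
        else:
            suffix = f"{random.getrandbits(16):04x}"    # 4 hex characters
        candidate = f"{base}__{suffix}"
        if candidate not in existing:
            return candidate
```

For example, `make_source_id("MDN WebSocket API", set())` yields `mdn-websocket-api`; the suffix is only added when that ID is already taken.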

Step 3 — Generate manifest

Create `manifest.yaml` following the schema in `skills/swain-search/references/manifest-schema.md`. Include:

- Trove metadata (id, created date, tags)
- Default freshness TTL per source type
- One entry per source with provenance (URL/path, fetch date, content hash, type)

Compute content hashes as bare hex SHA-256 digests (no prefix) of the normalized markdown content:

```bash
shasum -a 256 sources/mdn-websocket-api/mdn-websocket-api.md | cut -d' ' -f1
```
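
Where `shasum` is not available, the same digest can be computed with Python's standard library. This is a sketch under the assumption that the hash covers the raw bytes of the normalized file, which is what the shell command does; the `content_hash` name is hypothetical.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return the bare hex SHA-256 digest (no prefix) of a normalized source file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

The result is a 64-character lowercase hex string, matching the first field of the `shasum -a 256` output.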

Step 4 — Generate synthesis

Create `synthesis.md`, a structured distillation of key findings across all sources.

Structure the synthesis by theme, not by source. Group related findings together, cite sources by ID, and surface:

- Key findings: what the sources collectively say about the topic
- Points of agreement: where sources converge
- Points of disagreement: where sources conflict or present alternatives
- Gaps: what the sources don't cover that might matter

Keep it concise. The synthesis is a starting point, not a comprehensive report — the user or artifact author will refine it.

Step 5 — Commit and stamp

Use the dual-commit pattern (same as swain-design lifecycle stamps) to give the trove a reachable commit hash.

Before Commit A: append a `history` entry to `manifest.yaml` with a `--` placeholder for the commit hash:

```yaml
history:
  - event: created
    date: 2026-03-09
    commit: "--"
    sources: 3
```

Commit A: commit the trove content:

```bash
git add docs/troves/<trove-id>/
git commit -m "research(<trove-id>): create trove with N sources"
TROVE_HASH=$(git rev-parse HEAD)
```

Commit B: back-fill the commit hash into the history entry, then update the referencing artifact's frontmatter (if one exists):

```bash
# Replace "--" with the real hash in the history entry
# Update artifact frontmatter: trove: <trove-id>@<TROVE_HASH>
git add docs/troves/<trove-id>/manifest.yaml
git add docs/<artifact-type>/<phase>/<artifact-dir>/   # if artifact exists
git commit -m "docs(<trove-id>): stamp history hash ${TROVE_HASH:0:7}"
```

If no referencing artifact exists yet (standalone research), Commit B still stamps the history entry. Report the hash so it can be referenced later.
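
The back-fill between Commit A and Commit B is a plain text substitution. A minimal sketch, assuming exactly one `commit: "--"` placeholder is present in the manifest at that point (earlier history entries having already been stamped); the `stamp_history` name is hypothetical.

```python
def stamp_history(manifest_text: str, commit_hash: str) -> str:
    """Replace the single `commit: "--"` placeholder with the real commit hash."""
    placeholder = 'commit: "--"'
    if manifest_text.count(placeholder) != 1:
        raise ValueError("expected exactly one unstamped history entry")
    # Keep the hash quoted so an all-digit hash is not parsed as a YAML number.
    return manifest_text.replace(placeholder, f'commit: "{commit_hash}"')
```

Editing the text directly, rather than loading and re-dumping the YAML, leaves comments and key order in the manifest untouched.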

Step 6 — Report

Tell the user what was created:

Trove `<trove-id>` created with N sources, committed as `<TROVE_HASH:0:7>`.

- `docs/troves/<trove-id>/manifest.yaml`: provenance and metadata
- `docs/troves/<trove-id>/sources/`: N normalized source files
- `docs/troves/<trove-id>/synthesis.md`: thematic distillation

Reference from artifacts with: `trove: <trove-id>@<TROVE_HASH:0:7>`

Extend mode

Add new sources to an existing trove.

1. Read the existing `manifest.yaml`
2. Collect and normalize new sources (same as Create step 2)
3. Assign slug-based source IDs to new sources (following the same ID generation rules)
4. Append new entries to `manifest.yaml`
5. Update the `refreshed` date
6. Regenerate `synthesis.md` incorporating all sources (old + new)
7. Append a `history` entry with `event: extended` and a `commit: "--"` placeholder
8. Commit and stamp (same dual-commit pattern as Create step 5):
   - Commit A: `git commit -m "research(<trove-id>): extend with N new sources"`
   - Capture `TROVE_HASH=$(git rev-parse HEAD)`
   - Commit B: back-fill hash in history entry, update referencing artifact frontmatter (if artifact exists)
9. Report what was added, including the new commit hash

Refresh mode

Re-fetch stale sources and update changed content.

1. Read `manifest.yaml`
2. For each source, check if `fetched` date + `freshness-ttl` has elapsed
3. For stale sources:
   - Re-fetch the raw content
   - Re-normalize to markdown
   - Compute new content hash
   - If hash changed: replace the source file, update manifest entry
   - If hash unchanged: update only the `fetched` date
4. Update the `refreshed` date in the manifest
5. If any content changed, regenerate `synthesis.md`
6. Append a `history` entry with `event: refreshed`, `sources-changed: M`, and a `commit: "--"` placeholder
7. Commit and stamp (same dual-commit pattern as Create step 5):
   - Commit A: `git commit -m "research(<trove-id>): refresh N sources (M changed)"`
   - Capture `TROVE_HASH=$(git rev-parse HEAD)`
   - Commit B: back-fill hash in history entry, update referencing artifact(s) frontmatter; check `referenced-by` in the manifest for all dependents
8. Report: "Refreshed N sources. M had changed content, K were unchanged. New hash: `<TROVE_HASH:0:7>`."

For sources with `freshness-ttl: never`, skip them during refresh.
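
The staleness check in step 2 can be sketched as below. The TTL representation is an assumption: this example treats `freshness-ttl` as a whole number of days or the string `never`, while the actual format is defined in `skills/swain-search/references/manifest-schema.md` and may differ. The `is_stale` name is hypothetical.

```python
from datetime import date, timedelta


def is_stale(fetched: date, ttl_days, today: date) -> bool:
    """True when fetched + TTL has elapsed; `never` sources are never stale."""
    if ttl_days == "never":
        return False
    # Assumption: a source counts as stale on the day its TTL runs out.
    return fetched + timedelta(days=int(ttl_days)) <= today
```

Passing `today` explicitly, instead of calling `date.today()` inside the function, keeps the check deterministic and easy to verify.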

Discover mode

Help the user find existing troves relevant to their topic.

1. Scan `docs/troves/*/manifest.yaml` for all troves
2. Match against the user's query by:
   - Tag match: trove tags contain query keywords
   - Title match: trove ID slug contains query keywords
3. For each match, show: trove ID, tags, source count, last refreshed date, referenced-by list
4. If no matches, suggest creating a new trove
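
The two matching rules might look like this in code. It is a sketch that assumes each manifest has already been parsed into a dict with `id` and `tags` keys (parsing YAML needs a third-party library, so it is left out); the `match_troves` name and the whitespace keyword split are hypothetical choices.

```python
def match_troves(query: str, troves: list[dict]) -> list[dict]:
    """Return troves whose tags or ID slug match any keyword in the query."""
    keywords = query.lower().split()
    matches = []
    for trove in troves:
        tags = [tag.lower() for tag in trove.get("tags", [])]
        slug = trove["id"].lower()
        tag_hit = any(keyword in tags for keyword in keywords)    # exact tag match
        slug_hit = any(keyword in slug for keyword in keywords)   # substring of the ID
        if tag_hit or slug_hit:
            matches.append(trove)
    return matches
```

An empty result corresponds to step 4: suggest creating a new trove.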

Graceful degradation

The skill references capabilities generically. When a capability isn't available:

| Capability | Fallback |
| --- | --- |
| Web search | Skip search-based sources. Tell user: "No web search capability available. Provide URLs directly or add a search MCP." |
| Browser / page fetcher | Try basic URL fetch. If that fails: "Can't fetch this URL. Paste the content or provide a local file." |
| Media transcription | "No transcription capability available. Provide a pre-made transcript file, or add a media conversion tool." |
| Document conversion | "Can't convert this file type. Provide a markdown version, or add a document conversion tool." |

Never fail the entire run because one capability is missing. Collect what you can, skip what you can't, and report clearly.

Capability detection

Before collecting sources, check what's available. Look for tools matching these patterns — the exact tool names vary by installation:

- Web search: tools with "search" in the name (e.g., `brave_web_search`, `bing-search-to-markdown`)
- Page fetching: tools with "fetch", "webpage", "browser" in the name (e.g., `fetch_content`, `webpage-to-markdown`, `browser_navigate`)
- Media transcription: tools with "audio", "video", "youtube" in the name (e.g., `audio-to-markdown`, `youtube-to-markdown`)
- Document conversion: tools with "pdf", "docx", "pptx", "xlsx" in the name (e.g., `pdf-to-markdown`, `docx-to-markdown`)
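
The name patterns above translate directly into a lookup. In this sketch the capability keys (`web_search`, `page_fetch`, and so on) and the `detect_capabilities` name are hypothetical, and how the list of installed tool names is obtained is left to the host environment.

```python
CAPABILITY_PATTERNS = {
    "web_search": ("search",),
    "page_fetch": ("fetch", "webpage", "browser"),
    "media_transcription": ("audio", "video", "youtube"),
    "document_conversion": ("pdf", "docx", "pptx", "xlsx"),
}


def detect_capabilities(tool_names: list[str]) -> dict[str, list[str]]:
    """Group the available tool names by the capability their name suggests."""
    found = {capability: [] for capability in CAPABILITY_PATTERNS}
    for name in tool_names:
        lowered = name.lower()
        for capability, patterns in CAPABILITY_PATTERNS.items():
            if any(pattern in lowered for pattern in patterns):
                found[capability].append(name)
    return found
```

Capabilities that map to an empty list are the ones to report as unavailable before collection starts.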

Report available capabilities at the start of collection so the user knows what will and won't work.

Linking from artifacts

Artifacts reference troves in frontmatter:

```yaml
trove: websocket-vs-sse@abc1234
```

The format is `<trove-id>@<commit-hash>`. The commit hash pins the trove to a specific version: troves evolve over time as sources are added or refreshed, and the hash ensures reproducibility.

The dual-commit workflow in Create step 5, Extend step 8, and Refresh step 7 handles this automatically — Commit A records the trove content and Commit B stamps the hash into the history entry and referencing artifact's frontmatter. Do not defer this to the operator.
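
A reader of the frontmatter needs to split the reference back into its two parts. A minimal sketch, with `parse_trove_ref` as a hypothetical helper name:

```python
def parse_trove_ref(value: str) -> tuple[str, str]:
    """Split a `<trove-id>@<commit-hash>` reference into its two parts."""
    trove_id, separator, commit_hash = value.partition("@")
    if not separator or not trove_id or not commit_hash:
        raise ValueError(f"not a trove reference: {value!r}")
    return trove_id, commit_hash
```

With the parts in hand, `git show abc1234:docs/troves/websocket-vs-sse/manifest.yaml` prints the manifest as it existed at the pinned commit.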
