archive-crawler

archive-crawler — The Universal Archivist

Convention: see conventions/quality.md for citation rules, exact-phrasing requirements when capturing the user's reactions, and back-link enforcement.
Convention: see _brain-filing-rules.md — this skill is schema-generic: it reads the user's filing rules from the rules JSON instead of hardcoding any specific era / archive layout.

Safety gate (REQUIRED, no exceptions)

archive-crawler refuses to run unless `archive-crawler.scan_paths:` is explicitly set in `gbrain.yml`. This is a deliberate safety fence against the agent over-scoping a scan and ingesting sensitive content (tax PDFs, medical records, credentials).

yaml
# gbrain.yml — the allow-list is mandatory
archive-crawler:
  scan_paths:
    - ~/Documents/writing/
    - ~/Dropbox/Archive/
    - /mnt/backup/old-letters/
  # Optional deny-list inside the allow-list:
  deny_paths:
    - ~/Documents/finances/
    - ~/Documents/medical/

If `scan_paths` is empty or missing, the skill exits with:

archive-crawler: refusing to run. No `archive-crawler.scan_paths:` allow-list in gbrain.yml. Add explicit paths the agent is permitted to scan, then re-run. This is a safety fence — the agent will not infer what's safe to read.

This contract is enforced by `src/core/storage-config.ts` (mirrors the
`db_tracked` / `db_only` allow-list pattern from v0.22.11 storage tiering).
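
As a rough illustration of the gate, the sketch below assumes the `archive-crawler` section of `gbrain.yml` has already been parsed into a plain dict. The names `require_scan_paths` and `is_allowed` are hypothetical and are not taken from `src/core/storage-config.ts`, whose actual logic may differ.

```python
from pathlib import Path

REFUSAL = (
    "archive-crawler: refusing to run. No `archive-crawler.scan_paths:` "
    "allow-list in gbrain.yml."
)


def _norm(p: str) -> Path:
    """Expand ~ and make the path absolute (it does not need to exist)."""
    return Path(p).expanduser().resolve()


def require_scan_paths(config: dict) -> list:
    """Return the normalized allow-list, or exit if it is missing or empty."""
    section = config.get("archive-crawler") or {}
    paths = section.get("scan_paths") or []
    if not paths:
        raise SystemExit(REFUSAL)
    return [_norm(p) for p in paths]


def is_allowed(target: str, config: dict) -> bool:
    """True only if target is under an allowed root and under no denied root."""
    allow = require_scan_paths(config)
    section = config.get("archive-crawler") or {}
    deny = [_norm(p) for p in section.get("deny_paths") or []]
    t = _norm(target)

    def inside(roots):
        return any(t == r or r in t.parents for r in roots)

    return inside(allow) and not inside(deny)
```

Checking the deny-list after the allow-list keeps the default closed: a path that matches nothing is rejected, and a denied folder stays off limits even when its parent is allowed.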

What this is

Generic engine for exploring any tree of personal content within an explicit allow-list. Works on local mounts, Dropbox API targets, Backblaze B2, Gmail takeouts (`.mbox`), and similar archives. Filters for "gold" (the user's own writing, ideas, relationships) and surfaces it interactively for review. Skips noise (system files, configs, binary blobs).

Concepts

Source

A source is any tree of files to explore. Sources have:
  • type: `local` | `dropbox` | `backblaze` | `gmail-takeout` | `mbox` | `pst`
  • root: filesystem path, Dropbox path, B2 prefix, mbox path
  • manifest: a brain page tracking progress at `projects/<archive-slug>/STATUS.md`

Manifest

Every archive exploration gets a manifest brain page that tracks:
  1. Tree inventory — folders / files / sizes / types
  2. Triage status — each item: ⬜ unseen / 👀 reviewed / ✅ ingested / ⏭️ skip / 🔥 high-signal
  3. User reactions — exact quotes when they react (per conventions/quality.md exact-phrasing rule)
  4. Priority queue — what to explore next, ranked
  5. Session log — timestamped record of what was shown per session

Gold filter

Before showing anything to the user, apply the gold filter:

| Keep (show) | Skip (note existence, don't show) |
| --- | --- |
| Personal writing (journals, letters, reflections, essays) | System files, configs, package.json, node_modules |
| Conversations (IM logs, email threads with substance) | Binary blobs (images / video) |
| Ideas, theses, frameworks | Receipts, invoices, tax docs |
| Relationship material (letters to / from people who matter) | Spam, newsletters, mailing-list bulk |
| Creative work (poetry, stories, code with soul) | Corrupted / null files |
| Origin stories (first versions of things that became important) | |
| Emotional content (anger, love, grief, discovery) | |
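
Part of the Skip column is mechanical and can be checked before any file is opened. The sketch below covers only that part; deciding what counts as gold still needs a read of the content. The name `is_obvious_noise` and the three sets are hypothetical and deliberately short.

```python
from pathlib import PurePosixPath

# Illustrative lists only; a real deployment would likely extend them.
NOISE_DIRS = {"node_modules", ".git", "__pycache__"}
NOISE_NAMES = {"package.json", ".DS_Store", "Thumbs.db"}
BINARY_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".mov"}


def is_obvious_noise(path: str, size: int) -> bool:
    """True for items that can be noted in the manifest without being shown."""
    p = PurePosixPath(path)
    if size == 0:  # empty file, nothing to review
        return True
    if NOISE_DIRS.intersection(p.parts):  # anywhere under a noise directory
        return True
    if p.name in NOISE_NAMES:
        return True
    return p.suffix.lower() in BINARY_EXTS
```

Items this pre-pass rejects are still recorded in the manifest as skipped, so their existence is noted even though they are never presented.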

Protocol

Phase 1: Inventory

When pointed at a new source:
  1. Confirm scan_paths is set (safety gate). Exit if not.
  2. Map the tree — list folders + files + sizes + date ranges.
  3. Classify folders — group by likely content type (writing, email, code, photos, docs, system).
  4. Create manifest — write `projects/<archive-slug>/STATUS.md` with the full inventory.
  5. Propose priority queue — rank folders by likely gold density.
  6. Present to user — show the map and proposed order. Let them override.
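
For a `local` source, the tree-mapping step above could look roughly like this. It is a sketch that assumes the root has already passed the safety gate; the function name `inventory` and the row layout are hypothetical, and file modification time stands in for the date range.

```python
import os
from collections import Counter
from datetime import datetime, timezone


def _day(ts: float) -> str:
    """Format a POSIX timestamp as a UTC calendar date."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")


def inventory(root: str) -> list:
    """One row per folder: file count, total bytes, extensions, mtime range."""
    rows = []
    for folder, _dirs, files in os.walk(root):
        sizes, mtimes, exts = [], [], Counter()
        for name in files:
            try:
                st = os.stat(os.path.join(folder, name))
            except OSError:
                continue  # unreadable or vanished file; leave it out
            sizes.append(st.st_size)
            mtimes.append(st.st_mtime)
            exts[os.path.splitext(name)[1].lower() or "(none)"] += 1
        rows.append({
            "folder": os.path.relpath(folder, root),
            "files": len(sizes),
            "bytes": sum(sizes),
            "types": dict(exts),
            "earliest": _day(min(mtimes)) if mtimes else None,
            "latest": _day(max(mtimes)) if mtimes else None,
        })
    return rows
```

The extension counts per folder give a first signal for classifying folders and ranking them in the proposed priority queue.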

Phase 2: Crawl

Work through folders in priority order:
  1. Read before showing — open each candidate file, apply the gold filter, skip noise.
  2. Show one at a time — present gold items individually for review.
  3. Capture exact reaction — track the user's response in the manifest using their exact words (per conventions/quality.md).
  4. Ingest if worth keeping — create a brain page immediately.
  5. Update manifest — mark item status after each interaction.
  6. Never re-show — check the manifest before presenting anything.
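
The "update manifest" and "never re-show" steps can be pictured with an in-memory stand-in for the manifest. This is a sketch under the assumption that the manifest is held as a dict keyed by item path; the real manifest is a brain page, and `Status`, `record`, and `next_to_show` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    UNSEEN = "⬜ unseen"
    REVIEWED = "👀 reviewed"
    INGESTED = "✅ ingested"
    SKIP = "⏭️ skip"
    HIGH_SIGNAL = "🔥 high-signal"


def record(manifest: dict, item: str, status: Status,
           reaction: Optional[str] = None) -> None:
    """Mark an item right after the interaction; store the reaction verbatim."""
    manifest[item] = {"status": status, "reaction": reaction}


def next_to_show(candidates, manifest: dict):
    """Yield only candidates the manifest has not already marked."""
    for item in candidates:
        entry = manifest.get(item)
        if entry is None or entry["status"] is Status.UNSEEN:
            yield item
```

Because `next_to_show` consults the manifest on every call, an item marked in an earlier session is filtered out without any extra bookkeeping.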

Phase 3: Ingest

When an item is worth keeping, file it by primary subject per `_brain-filing-rules.md`:
  • User's own writing / ideas / origin-story content → `originals/<slug>.md`
  • Reflections / personal-life content → `personal/<slug>.md`
  • Product / business ideas → `ideas/<slug>.md`
  • Letters or threads about a specific person → `people/<person>/timeline` back-link plus the letter at `personal/<slug>.md` or `originals/<slug>.md`
The skill is schema-generic. It does NOT bake in any specific era-folder structure (e.g., `originals/archive/` for pre-2003, `originals/yc-era/` for post-2019, etc.). The user's filing rules from `_brain-filing-rules.json` are read at runtime; the agent decides per-page where content lands within those sanctioned directories.
Brain page format:
markdown
---
title: "[Title or first line]"
type: original
source_type: "[local|dropbox|backblaze|gmail-takeout|mbox|pst]"
source_path: "[path within the allow-listed scan_paths]"
date: "YYYY-MM-DD"  # date from the file metadata or content
people: ["person-1", "person-2"]
tags: ["tag-1", "tag-2"]
---

# [Title]

[Summary: what it is, when it's from, why it matters]

User's reaction: [exact quote, no paraphrasing]

## Context

[Cross-links to people, concepts, projects.]

---

[Raw source material below the line — full text]
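
A page in this format could be assembled along the following lines. This is a minimal sketch: `render_brain_page` is a hypothetical helper, it leaves out the Context section, and it assumes the title is rendered as a Markdown heading.

```python
import json


def render_brain_page(title, source_type, source_path, date,
                      people, tags, summary, reaction, raw_text):
    """Return a brain page as one string: front matter, summary, raw source."""
    # json.dumps emits double-quoted strings and [..] lists, which YAML accepts.
    def q(value):
        return json.dumps(value, ensure_ascii=False)

    lines = [
        "---",
        f"title: {q(title)}",
        "type: original",
        f"source_type: {q(source_type)}",
        f"source_path: {q(source_path)}",
        f"date: {q(date)}",
        f"people: {q(people)}",
        f"tags: {q(tags)}",
        "---",
        "",
        f"# {title}",  # assumption: the title becomes a Markdown heading
        "",
        summary,
        f"User's reaction: {reaction}",  # verbatim, never paraphrased
        "",
        raw_text,
    ]
    return "\n".join(lines) + "\n"
```

Where the resulting page is written is a separate decision that follows the filing rules read at runtime; the helper only produces the text.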

File-type handlers

Plain text / HTML / Markdown

Read directly. Strip HTML tags for display.
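
One way to strip tags for display using only the standard library is sketched below. The helper name `strip_html` is hypothetical, and dropping `script` / `style` content is an added assumption about what "for display" should mean.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_html(markup: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

The collapsed text is meant for showing to the user only; the raw file is still what gets stored below the line in the brain page.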

.mbox (email archives)

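python
import mailbox

def decode_part(part):
    """Decode a text part using its declared charset, falling back to UTF-8."""
    payload = part.get_payload(decode=True) or b''
    charset = part.get_content_charset() or 'utf-8'
    try:
        return payload.decode(charset, errors='replace')
    except LookupError:  # unknown or misspelled charset in an old message
        return payload.decode('utf-8', errors='replace')

mbox = mailbox.mbox('/path/to/file.mbox')
for msg in mbox:
    body = ''
    if msg.is_multipart():
        # Take the first text/plain part; skip attachments and HTML alternatives.
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                body = decode_part(part)
                break
    else:
        body = decode_part(msg)
    # Apply gold filter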

.doc / .docx

bash
# .docx (modern)
python3 -c "
import zipfile, xml.etree.ElementTree as ET
with zipfile.ZipFile('/path/to/file.docx') as z:
    tree = ET.parse(z.open('word/document.xml'))
    print(''.join(t.text or '' for t in tree.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t')))
"

# .doc (legacy, requires antiword or catdoc)
antiword /path/to/file.doc 2>/dev/null || catdoc /path/to/file.doc 2>/dev/null

.pst (Outlook archives)

bash
# Validate first; many PSTs are null bytes.
# The magic is '!BDN'; \x21 is '!', escaped so interactive bash skips history expansion.
python3 -c "
with open('/path/to/file.pst', 'rb') as f:
    print('Valid PST' if f.read(4) == b'\x21BDN' else 'CORRUPT/NULL')
"

# If valid:
mkdir -p /tmp/pst-output  # readpst expects the output directory to exist
readpst -o /tmp/pst-output /path/to/file.pst

.zip / .tar / .tar.gz

Extract to a temp dir, then recurse through the extracted tree.
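
Archives pulled from old backups may contain entries that point outside the extraction directory, so a cautious extraction step might look like the sketch below. The names `extract_to_temp` and `_ensure_inside` are hypothetical; skipping links and device entries in tar files is an assumption, not something the skill specifies.

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path


def _ensure_inside(dest: Path, name: str) -> None:
    """Reject member names that would land outside dest (e.g. '../x')."""
    target = (dest / name).resolve()
    if not target.is_relative_to(dest):  # Python 3.9+
        raise ValueError(f"unsafe archive member: {name}")


def extract_to_temp(archive: str) -> Path:
    """Extract a .zip / .tar / .tar.gz into a fresh temp dir and return it."""
    dest = Path(tempfile.mkdtemp(prefix="archive-crawler-")).resolve()
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            for name in zf.namelist():
                _ensure_inside(dest, name)
            zf.extractall(dest)
    elif tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            # Keep regular files and directories; links and devices are skipped.
            members = [m for m in tf.getmembers() if m.isfile() or m.isdir()]
            for m in members:
                _ensure_inside(dest, m.name)
            # Use the stricter 'data' filter where this Python provides it.
            extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tf.extractall(dest, members=members, **extra)
    else:
        raise ValueError(f"not a zip or tar archive: {archive}")
    return dest
```

The returned directory can then be walked like any other tree, with each extracted file going through the same gold filter as files found directly on disk.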

Images

Note existence + metadata (filename, size, date). Don't show unless the user asks. Flag scans / portraits as potentially personal.

Manifest template

markdown
---
title: "[Archive Name] — Ingestion Status"
type: project
created: YYYY-MM-DD
updated: YYYY-MM-DD
source_type: "[local|dropbox|...]"
scan_paths: ["paths from gbrain.yml"]
---

# [Archive Name] — Ingestion Status

## Source

  • Type: [local|dropbox|...]
  • Allow-listed paths: [from gbrain.yml]
  • Total files: [N]
  • Total size: [X GB]
  • Date range: [earliest] — [latest]

## Inventory

### [Folder 1]

| Item | Type | Size | Status | Reaction |
| --- | --- | --- | --- | --- |
| file1.txt | text | 2KB | ✅ ingested | 🔥 "exact quote" |
| file2.doc | doc | 15KB | ⏭️ skip | |
| file3.html | html | 4KB | ⬜ unseen | |

### [Folder 2]

...

## Priority Queue

  1. [Highest priority — why]
  2. [Next — why] ...

## Session Log

### YYYY-MM-DD — [Session topic]

  • Reviewed: [list]
  • Reactions: [exact quotes]
  • Ingested: [brain pages created]
  • Next: [what's queued]

Anti-Patterns

  • ❌ Running without `archive-crawler.scan_paths:` set. Hard refusal. This is the safety contract — never bypass.
  • ❌ Hardcoding era-specific filing paths (e.g., `originals/archive/`, `originals/yc-era/`). Read filing rules at runtime instead.
  • ❌ Re-showing items already marked in the manifest. The user's time is the scarcest resource.
  • ❌ Paraphrasing reactions. Exact words only.
  • ❌ Wrapping found content in lessons or takeaways. Let stories breathe.
  • ❌ Skipping back-links when content references people / companies who have brain pages. Iron Law per conventions/quality.md.

Related skills

  • `skills/voice-note-ingest/SKILL.md` — same exact-phrasing pattern for audio capture
  • `skills/idea-ingest/SKILL.md` — single-link-or-article ingest with the same primary-subject filing rule
  • `skills/conventions/quality.md` — citations, back-links, voice

Contract

This skill guarantees:
  • Routing matches the canonical triggers in the frontmatter.
  • Output written under the directories listed in `writes_to:` (when applicable).
  • Conventions referenced (`quality.md`, `brain-first.md`, `_brain-filing-rules.md`) are followed.
  • Privacy contract preserved: no real names, no fork-specific filesystem path literals, no upstream-fork references.
The full behavior contract is documented in the body sections above; this section exists for the conformance test.

Output Format

The skill's output shape is documented inline in the body sections above (see "Output", "Brain page format", or equivalent). The literal section header here exists for the conformance test (`test/skills-conformance.test.ts`).