data-research
Data Research
Structured research pipeline: search sources, extract structured data,
archive raw, deduplicate, update canonical trackers, backlink entities.
Contract
One skill for any email-to-structured-data pipeline. The only differences
between tracking investor updates, expenses, and company metrics
are the search queries, extraction schemas, and tracker page format.
All three use the same 7-phase pipeline with parameterized recipes.
When to Use
- User wants to track structured data from email, web, or API sources
- User says "research", "track", "extract from email", "build a tracker"
- User mentions investor updates, donations, company metrics, filings
- User wants to set up recurring data collection (with cron recipe)
Phases
Phase 1: Define Research Recipe
Ask the user what they want to track. Either:
- Pick a built-in recipe: investor-updates, expense-tracker, company-updates
- Define a custom recipe with: source queries, classification rules, extraction schema, tracker page path, tracker format
Recipes are YAML files at `~/.gbrain/recipes/{name}.yaml`. Use `gbrain research init` to scaffold a new one.
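
As a rough illustration of what a custom recipe carries, the sketch below models the five parts listed above as a Python dataclass. The class, its field names, and the sample values (including the tracker page path) are assumptions made for illustration; they are not GBrain's actual YAML schema.

```python
from dataclasses import dataclass


@dataclass
class Recipe:
    """Hypothetical in-memory shape of a recipe; all field names are assumed."""
    name: str
    source_queries: list[str]
    classification_rules: dict[str, list[str]]  # label -> regex patterns
    extraction_schema: dict[str, str]           # field name -> regex with one group
    tracker_page: str
    tracker_format: str = "markdown-table"

    def missing_parts(self) -> list[str]:
        """Return the names of required parts that are still empty."""
        required = {
            "source_queries": self.source_queries,
            "classification_rules": self.classification_rules,
            "extraction_schema": self.extraction_schema,
            "tracker_page": self.tracker_page,
        }
        return [key for key, value in required.items() if not value]


# Sample values are invented for the example.
example = Recipe(
    name="investor-updates",
    source_queries=["subject:(investor update)"],
    classification_rules={"update": [r"(?i)investor update"]},
    extraction_schema={"mrr": r"MRR[:\s]+\$?([\d.,]+[KM]?)"},
    tracker_page="trackers/investor-updates",
)
```

A check like `missing_parts()` is one way to reject an incomplete custom recipe before any source is searched.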
Phase 2: Search Sources
Brain first (maybe we already have this data). Then:
- Email via credential gateway: windowed queries (quarterly, monthly if truncated)
- Web via search: public filings, press releases, regulatory data
- APIs: any structured data source the recipe defines
- Attachments: PDF extraction, HTML stripping
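
The windowed email search above can be sketched roughly as follows: query one quarter at a time, and re-run a quarter month by month when the result looks truncated. The `search` callable, the `limit` parameter, and the rule "a result at the limit means truncated" are assumptions for illustration; the real credential gateway may signal truncation differently.

```python
from datetime import date, timedelta


def quarter_windows(year: int) -> list[tuple[date, date]]:
    """Return the four (start, end) quarter windows for a year, end inclusive."""
    starts = [date(year, m, 1) for m in (1, 4, 7, 10)]
    ends = [s - timedelta(days=1) for s in starts[1:]] + [date(year, 12, 31)]
    return list(zip(starts, ends))


def month_windows(start: date, end: date) -> list[tuple[date, date]]:
    """Split an inclusive window into calendar-month windows."""
    windows = []
    cursor = start
    while cursor <= end:
        # First day of the month after the cursor's month.
        next_month = date(cursor.year + cursor.month // 12, cursor.month % 12 + 1, 1)
        windows.append((cursor, min(end, next_month - timedelta(days=1))))
        cursor = next_month
    return windows


def search_windowed(search, year: int, limit: int) -> list:
    """Query quarter by quarter; redo a quarter month by month if it hit the limit."""
    results = []
    for start, end in quarter_windows(year):
        hits = search(start, end)
        if len(hits) >= limit:  # assumed truncation signal
            hits = [h for s, e in month_windows(start, end) for h in search(s, e)]
        results.extend(hits)
    return results
```

Starting with wide windows keeps the number of queries low for quiet senders, and the monthly retry only costs extra calls where the quarterly result was cut off.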
Phase 3: Classify
Deterministic first (regex patterns from recipe), LLM fallback.
Log every LLM fallback for future regex improvement (fail-improve loop).
Skip marketing, newsletters, noise based on recipe's classification rules.
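
A minimal sketch of the deterministic-first classifier described above, assuming the recipe's rules arrive as a mapping from label to regex patterns and the LLM is wrapped in a plain callable. The function name, the `skip` label, and the sample patterns are hypothetical.

```python
import logging
import re
from typing import Callable

logger = logging.getLogger("research.classify")


def classify(
    text: str,
    rules: dict[str, list[str]],
    llm_fallback: Callable[[str], str],
) -> tuple[str, str]:
    """Return (label, method). Regex rules run first; the LLM only sees misses."""
    for label, patterns in rules.items():
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            return label, "regex"
    label = llm_fallback(text)
    # Log the miss so a regex can be written for it later.
    logger.info("llm_fallback label=%s sample=%r", label, text[:80])
    return label, "llm"


# Hypothetical rules; labels are checked in insertion order.
rules = {
    "skip": [r"\bunsubscribe\b", r"\bnewsletter\b"],
    "investor_update": [r"\binvestor update\b", r"\bMRR\b"],
}
```

Rule order matters in this sketch: because `skip` is checked first, a genuine update that carries an unsubscribe footer would be dropped, so noise patterns should be kept narrow.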
Phase 4: Extract Structured Data
EXTRACTION INTEGRITY RULE:
- Save raw source immediately (before any extraction)
- Extract fields using deterministic regex first, LLM fallback
- When summarizing batch results: re-read from saved files
- Never trust LLM working memory after batch processing
This prevents a known hallucination bug where batch-processed amounts were
13/13 wrong from LLM working memory while saved files were correct.
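
The rule above might look like this in practice: write the raw body to disk before extracting anything, persist each extracted record, and build the batch summary only from what was saved. The file layout, the function names, and the single-amount regex are assumptions for illustration.

```python
import json
import re
from pathlib import Path

AMOUNT_RE = re.compile(r"\$([\d,]+(?:\.\d+)?)")


def process_message(msg_id: str, body: str, out_dir: Path) -> None:
    """Save the raw body first, then extract and save the fields beside it."""
    (out_dir / f"{msg_id}.raw.txt").write_text(body, encoding="utf-8")
    match = AMOUNT_RE.search(body)
    amount = float(match.group(1).replace(",", "")) if match else None
    # amount=None marks a regex miss that would go to the LLM fallback.
    record = {"id": msg_id, "amount": amount}
    (out_dir / f"{msg_id}.json").write_text(json.dumps(record), encoding="utf-8")


def summarize(out_dir: Path) -> dict:
    """Build the batch summary from the saved files, not from values held in memory."""
    paths = sorted(out_dir.glob("*.json"))
    records = [json.loads(p.read_text(encoding="utf-8")) for p in paths]
    amounts = [r["amount"] for r in records if r["amount"] is not None]
    return {"count": len(records), "total": sum(amounts)}
```

Because `summarize` takes only a directory, nothing remembered from the batch run can leak into the reported totals.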
Phase 5: Archive Raw Sources
- `put_raw_data` for email bodies, API responses
- `file_upload` for PDF attachments, documents
- Create `.redirect.yaml` pointers for large files in storage
- Every tracker entry must link back to its raw source
Phase 6: Deduplicate
Before adding to tracker:
- Exact match (same key fields) → skip
- Fuzzy match (same entity + date + similar amount within tolerance) → flag for review
- Different amount for same entity+date → add with note (could be correction)
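
The three rules above can be sketched as one decision function. The `Entry` type, the returned labels, and the 1% relative tolerance are assumptions for illustration; a recipe would supply its own key fields and tolerance.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    entity: str
    day: date
    amount: float


def dedupe_decision(new: Entry, existing: list[Entry], tolerance: float = 0.01) -> str:
    """Return 'skip', 'review', 'add_with_note', or 'add' for a candidate entry."""
    same_key = [e for e in existing if e.entity == new.entity and e.day == new.day]
    if not same_key:
        return "add"
    if any(e.amount == new.amount for e in same_key):
        return "skip"  # exact match on all key fields

    def close(a: float, b: float) -> bool:
        # Relative difference against the larger amount; 1% is an assumed default.
        return abs(a - b) <= tolerance * max(abs(a), abs(b))

    if any(close(e.amount, new.amount) for e in same_key):
        return "review"  # near-duplicate, left for a human to decide
    return "add_with_note"  # same entity and date, clearly different amount
```

For example, with an existing entry of 188000 for the same entity and date, a candidate of 188500 falls inside the 1% tolerance and is flagged for review, while 210000 is added with a note.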
Phase 7: Update Canonical Tracker + Backlink
- Parse existing tracker page (markdown table)
- Append new entries in correct section (grouped by year/quarter/entity)
- Compute running totals
- Backlink every mentioned entity (person → people/ page, company → companies/ page)
- Uses enrichment service for entity pages
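
A minimal sketch of the parse, append, and total steps above, assuming a single markdown table with no escaped pipes and amounts written like `$188K` or `$2.3M`. The helper names are hypothetical, and section grouping and backlinking are left out.

```python
def parse_table(markdown: str) -> tuple[list[str], list[dict[str, str]]]:
    """Parse one markdown table into (header, rows); assumes no escaped pipes."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip().startswith("|")]
    cells = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    header, body = cells[0], cells[2:]  # cells[1] is the |---| separator row
    return header, [dict(zip(header, row)) for row in body]


def parse_amount(text: str) -> float:
    """Convert strings such as '$188K' or '$2.3M' to a number."""
    scale = {"K": 1_000, "M": 1_000_000}
    value = text.strip().lstrip("$").replace(",", "")
    if value and value[-1].upper() in scale:
        return float(value[:-1]) * scale[value[-1].upper()]
    return float(value)


def render_table(header: list[str], rows: list[dict[str, str]]) -> str:
    """Render rows back to a markdown table in the same column order."""
    out = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    out += ["| " + " | ".join(row[h] for h in header) + " |" for row in rows]
    return "\n".join(out)
```

New entries are appended to the parsed rows before rendering, and a running total for a numeric column is then `sum(parse_amount(row[column]) for row in rows)`.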
Built-In Recipes
Three example recipes ship with GBrain (see `~/.gbrain/recipes/`):
- investor-updates: extract MRR, ARR, growth, burn, runway, headcount from investor update emails
- expense-tracker: extract amounts, recipients, platforms from receipt emails (subscriptions, services, recurring charges)
- company-updates: extract revenue, users, key metrics from portfolio company update emails
Anti-Patterns
- Trusting LLM working memory for amounts after batch processing (use extraction integrity rule)
- Creating tracker entries without raw source links
- Running without deduplication (leads to double-counted entries)
- Hardcoding source-specific patterns in the pipeline code (use recipes)
Output Format
Brain page at the recipe's `tracker_page` path with markdown tables:

2026

| Date | Company | MRR | ARR | Growth | Status |
|---|---|---|---|---|---|
| 2026-04-01 | Example Co | $188K | $2.3M | +14.7% MoM | Source |

Each entry links to its raw source. Running totals at the bottom of each section.
Conventions
References `skills/conventions/quality.md` for citation and back-linking rules.