data-research

Data Research

Structured research pipeline: search sources, extract structured data, archive raw, deduplicate, update canonical trackers, backlink entities.

Contract

One skill for any email-to-structured-data pipeline. The only differences between tracking investor updates, expenses, and company metrics are the search queries, extraction schemas, and tracker page format. All three use the same 7-phase pipeline with parameterized recipes.

When to Use

  • User wants to track structured data from email, web, or API sources
  • User says "research", "track", "extract from email", "build a tracker"
  • User mentions investor updates, donations, company metrics, filings
  • User wants to set up recurring data collection (with cron recipe)

Phases

Phase 1: Define Research Recipe

Ask the user what they want to track. Either:
  • Pick a built-in recipe: investor-updates, expense-tracker, company-updates
  • Define a custom recipe with: source queries, classification rules, extraction schema, tracker page path, tracker format
Recipes are YAML files at ~/.gbrain/recipes/{name}.yaml. Use gbrain research init to scaffold a new one.
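
As a rough illustration of what a custom recipe carries, the sketch below models the five parts listed above as a Python dataclass. Only tracker_page appears as a name in this document; every other field name, the default tracker format, and the missing_fields helper are assumptions made for illustration and are not GBrain's actual recipe schema.

```python
from dataclasses import dataclass


@dataclass
class Recipe:
    """Hypothetical in-memory shape of a recipe; field names are assumptions."""
    name: str
    source_queries: list[str]          # search queries run in Phase 2
    classification_rules: list[str]    # regex patterns used in Phase 3
    extraction_schema: dict[str, str]  # field name -> regex with one capture group
    tracker_page: str                  # path of the canonical tracker page
    tracker_format: str = "markdown-table"  # assumed default

    def missing_fields(self) -> list[str]:
        """Return the names of required parts that are empty."""
        required = {
            "source_queries": self.source_queries,
            "classification_rules": self.classification_rules,
            "extraction_schema": self.extraction_schema,
            "tracker_page": self.tracker_page,
        }
        return [key for key, value in required.items() if not value]


# Example values are invented for illustration.
example = Recipe(
    name="investor-updates",
    source_queries=["subject:(investor update)"],
    classification_rules=[r"(?i)investor update"],
    extraction_schema={"mrr": r"MRR[:\s]+\$([\d.,]+[KM]?)"},
    tracker_page="trackers/investor-updates",
)
```

Checking for empty parts before the pipeline runs is one way to catch an incomplete custom recipe early.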

Phase 2: Search Sources

Search the brain first, since it may already hold this data. Then:
  • Email via credential gateway: windowed queries (quarterly, monthly if truncated)
  • Web via search: public filings, press releases, regulatory data
  • APIs: any structured data source the recipe defines
  • Attachments: PDF extraction, HTML stripping
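
The windowed email queries can be sketched as follows: query one quarter at a time, and re-query month by month when a quarter looks truncated. The search callable, the limit parameter, and the rule that a result count at or above limit means truncation are all assumptions; the real credential gateway may signal truncation differently.

```python
from datetime import date, timedelta
from typing import Callable

Window = tuple[date, date]


def quarter_windows(year: int) -> list[Window]:
    """Four (start, end) windows covering the year; end dates are inclusive."""
    starts = [date(year, month, 1) for month in (1, 4, 7, 10)]
    ends = [s - timedelta(days=1) for s in starts[1:]] + [date(year, 12, 31)]
    return list(zip(starts, ends))


def month_windows(start: date, end: date) -> list[Window]:
    """Split a window into calendar-month windows; end dates are inclusive."""
    windows = []
    cur = start
    while cur <= end:
        if cur.month == 12:
            nxt = date(cur.year + 1, 1, 1)
        else:
            nxt = date(cur.year, cur.month + 1, 1)
        windows.append((cur, min(end, nxt - timedelta(days=1))))
        cur = nxt
    return windows


def search_windowed(year: int, search: Callable[[date, date], list], limit: int) -> list:
    """Query quarterly; fall back to monthly when a quarter hits the limit."""
    results = []
    for start, end in quarter_windows(year):
        batch = search(start, end)
        if len(batch) >= limit:  # assumed truncation signal
            batch = [r for ms, me in month_windows(start, end) for r in search(ms, me)]
        results.extend(batch)
    return results
```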

Phase 3: Classify

Deterministic first (regex patterns from the recipe), with an LLM fallback. Log every LLM fallback so the regex rules can be improved later (fail-improve loop). Skip marketing, newsletters, and other noise based on the recipe's classification rules.
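
A minimal sketch of the deterministic-first approach, assuming rules map a label to a list of regex patterns and that the LLM is reachable through a plain callable. The function name, the rule layout, and the log record shape are hypothetical.

```python
import re
from typing import Callable


def classify(
    text: str,
    rules: dict[str, list[str]],
    llm_classify: Callable[[str], str],
    fallback_log: list[dict],
) -> str:
    """Return the first label whose regex matches; otherwise ask the LLM and log it."""
    for label, patterns in rules.items():
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            return label
    # No deterministic rule matched: fall back to the LLM and record the case
    # so a regex can be written for it later (fail-improve loop).
    label = llm_classify(text)
    fallback_log.append({"text": text[:200], "label": label})
    return label
```

Reviewing fallback_log periodically shows which message types still lack a regex rule.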

Phase 4: Extract Structured Data

EXTRACTION INTEGRITY RULE:
  1. Save raw source immediately (before any extraction)
  2. Extract fields using deterministic regex first, LLM fallback
  3. When summarizing batch results: re-read from saved files
  4. Never trust LLM working memory after batch processing
This prevents a known hallucination bug: in one batch run, all 13 of 13 amounts recalled from LLM working memory were wrong, while the saved files were correct.
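
The rule can be sketched with plain files: write the raw source first, write the extracted fields next, and build any summary by re-reading what is on disk. The file naming, the amount regex, and the function names are assumptions, and the LLM fallback is omitted for brevity.

```python
import json
import re
import tempfile
from pathlib import Path

# Assumed pattern for a dollar amount such as "$1,250.00".
AMOUNT_RE = re.compile(r"\$\s?([\d,]+(?:\.\d+)?)")


def process_message(msg_id: str, body: str, out_dir: Path) -> None:
    # Step 1: save the raw source before any extraction happens.
    (out_dir / f"{msg_id}.raw.txt").write_text(body, encoding="utf-8")
    # Step 2: deterministic extraction (an LLM fallback would go here).
    match = AMOUNT_RE.search(body)
    amount = float(match.group(1).replace(",", "")) if match else None
    record = {"id": msg_id, "amount": amount}
    (out_dir / f"{msg_id}.json").write_text(json.dumps(record), encoding="utf-8")


def summarize(out_dir: Path) -> list[dict]:
    # Steps 3 and 4: re-read saved files instead of trusting in-memory state.
    return [json.loads(p.read_text(encoding="utf-8")) for p in sorted(out_dir.glob("*.json"))]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    process_message("a1", "Receipt: total $1,250.00", out)
    process_message("a2", "Invoice amount $99", out)
    summary = summarize(out)
```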

Phase 5: Archive Raw Sources

  • put_raw_data for email bodies, API responses
  • file_upload for PDF attachments, documents
  • Create .redirect.yaml pointers for large files in storage
  • Every tracker entry must link back to its raw source

Phase 6: Deduplicate

Before adding to tracker:
  • Exact match (same key fields) → skip
  • Fuzzy match (same entity + date + similar amount within tolerance) → flag for review
  • Different amount for same entity+date → add with note (could be correction)
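
The three cases above can be expressed as one decision function. This sketch assumes the key fields are entity, date, and amount, and that "within tolerance" means a relative difference of at most 1%. The Entry type, the tolerance default, and the returned labels are illustrative, not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    entity: str
    date: str      # ISO date string, e.g. "2026-04-01"
    amount: float


def dedup_decision(new: Entry, existing: list[Entry], tolerance: float = 0.01) -> str:
    """Return "skip", "flag_for_review", "add_with_note", or "add"."""
    same_key = [e for e in existing if e.entity == new.entity and e.date == new.date]
    # Exact match on all key fields: already tracked.
    if any(e.amount == new.amount for e in same_key):
        return "skip"
    # Same entity and date, amount within the relative tolerance: needs a human look.
    if any(
        abs(e.amount - new.amount) <= tolerance * max(abs(e.amount), abs(new.amount))
        for e in same_key
    ):
        return "flag_for_review"
    # Same entity and date, clearly different amount: possibly a correction.
    if same_key:
        return "add_with_note"
    return "add"
```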
添加到追踪器前:
  • 完全匹配(相同关键字段)→ 跳过
  • 模糊匹配(相同实体+日期+金额在容差范围内相似)→ 标记待审核
  • 同一实体+日期但金额不同→ 添加并标注(可能为修正数据)

Phase 7: Update Canonical Tracker + Backlink

  • Parse existing tracker page (markdown table)
  • Append new entries in correct section (grouped by year/quarter/entity)
  • Compute running totals
  • Backlink every mentioned entity (person → people/ page, company → companies/ page)
  • Uses enrichment service for entity pages
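
Parsing the tracker table and computing a running total might look like the sketch below. It assumes a simple pipe-delimited markdown table with a header row and a separator row, and amounts written like $188K or $1,250.00. The column names in the sample table and the helper names are hypothetical.

```python
def parse_money(text: str) -> float:
    """Convert strings like "$188K", "$2.3M", or "$1,250.00" to a float."""
    s = text.strip().lstrip("$").replace(",", "")
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9}
    if s and s[-1].upper() in multipliers:
        return float(s[:-1]) * multipliers[s[-1].upper()]
    return float(s)


def parse_table(md: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in md.splitlines() if ln.strip().startswith("|")]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the header row and the separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def running_total(rows: list[dict[str, str]], column: str) -> float:
    """Sum one money column across all rows."""
    return sum(parse_money(row[column]) for row in rows)


# Sample table with invented values.
sample = """
| Date | Recipient | Amount |
| --- | --- | --- |
| 2026-04-01 | Example Co | $188K |
| 2026-04-15 | Other Co | $12K |
"""
rows = parse_table(sample)
total = running_total(rows, "Amount")
```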

Built-In Recipes

Three example recipes ship with GBrain (see ~/.gbrain/recipes/):
  1. investor-updates — extract MRR, ARR, growth, burn, runway, headcount from investor update emails
  2. expense-tracker — extract amounts, recipients, platforms from receipt emails (subscriptions, services, recurring charges)
  3. company-updates — extract revenue, users, key metrics from portfolio company update emails

Anti-Patterns

  • Trusting LLM working memory for amounts after batch processing (use extraction integrity rule)
  • Creating tracker entries without raw source links
  • Running without deduplication (leads to double-counted entries)
  • Hardcoding source-specific patterns in the pipeline code (use recipes)

Output Format

Brain page at the recipe's tracker_page path with markdown tables:

2026

| Date | Company | MRR | ARR | Growth | Status |
| --- | --- | --- | --- | --- | --- |
| 2026-04-01 | Example Co | $188K | $2.3M | +14.7% MoM | Source |

Each entry links to its raw source. Running totals at the bottom of each section.

Conventions

See skills/conventions/quality.md for citation and back-linking rules.