data-research

Data Research

Structured research pipeline: search sources, extract structured data, archive raw, deduplicate, update canonical trackers, backlink entities.

Contract

One skill for any email-to-structured-data pipeline. The only differences between tracking investor updates, expenses, and company metrics are the search queries, extraction schemas, and tracker page format. All three use the same 7-phase pipeline with parameterized recipes.

When to Use

User wants to track structured data from email, web, or API sources
User says "research", "track", "extract from email", "build a tracker"
User mentions investor updates, donations, company metrics, filings
User wants to set up recurring data collection (with cron recipe)

Phases

Phase 1: Define Research Recipe

Ask the user what they want to track. Either:

Pick a built-in recipe: investor-updates, expense-tracker, company-updates
Define a custom recipe with: source queries, classification rules, extraction schema, tracker page path, tracker format

Recipes are YAML files at `~/.gbrain/recipes/{name}.yaml`. Use `gbrain research init` to scaffold a new one.
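
For orientation, the five parts of a custom recipe can be pictured as one record. The sketch below models them as a Python dataclass. Every field name, query string, regex, and page path in it is an assumption made for illustration, not GBrain's actual recipe schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Recipe:
    """Hypothetical in-memory shape of a research recipe (all names are assumptions)."""

    name: str
    source_queries: list[str]
    classification_rules: dict[str, str]  # label -> regex
    extraction_schema: dict[str, str]     # field -> regex with one capture group
    tracker_page: str
    tracker_format: list[str] = field(default_factory=list)  # table column order

    def missing_parts(self) -> list[str]:
        """Return the names of recipe parts that are still empty."""
        parts = {
            "source_queries": self.source_queries,
            "classification_rules": self.classification_rules,
            "extraction_schema": self.extraction_schema,
            "tracker_page": self.tracker_page,
            "tracker_format": self.tracker_format,
        }
        return [part for part, value in parts.items() if not value]


# Illustrative values only; the real built-in recipe may look different.
example = Recipe(
    name="investor-updates",
    source_queries=["subject:(investor update)"],
    classification_rules={"update": r"(?i)investor update"},
    extraction_schema={"mrr": r"MRR[:\s]+\$?([\d.,]+[KM]?)"},
    tracker_page="trackers/investor-updates",
    tracker_format=["Date", "Company", "MRR", "ARR", "Growth", "Status"],
)
```

A check like `missing_parts()` is one way to catch an incomplete custom recipe before the pipeline runs.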

Phase 2: Search Sources

Check the brain first (the data may already be there). Then:

Email via credential gateway: windowed queries (quarterly, monthly if truncated)
Web via search: public filings, press releases, regulatory data
APIs: any structured data source the recipe defines
Attachments: PDF extraction, HTML stripping
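
The windowed email query can be sketched as follows: search one quarter at a time, and rerun a quarter month by month when the result looks truncated. The `search(start, end, limit)` callable is hypothetical, as is the rule that a batch reaching `limit` counts as truncated. The credential gateway's real interface is not shown here.

```python
from __future__ import annotations

from datetime import date


def quarter_windows(year: int) -> list[tuple[date, date]]:
    """Return four half-open [start, end) windows covering one calendar year."""
    starts = [date(year, m, 1) for m in (1, 4, 7, 10)] + [date(year + 1, 1, 1)]
    return list(zip(starts[:-1], starts[1:]))


def month_windows(start: date, end: date) -> list[tuple[date, date]]:
    """Split a window that starts on the first of a month into monthly windows."""
    windows = []
    cur = start
    while cur < end:
        # First day of the following month, rolling over the year in December.
        nxt = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        windows.append((cur, min(nxt, end)))
        cur = nxt
    return windows


def search_windowed(search, year: int, limit: int = 100) -> list:
    """Query quarter by quarter; retry month by month when a quarter hits the limit.

    `search(start, end, limit)` is a hypothetical callable returning a list of hits.
    """
    hits = []
    for start, end in quarter_windows(year):
        batch = search(start, end, limit)
        if len(batch) >= limit:  # probably truncated: narrow the window
            batch = [
                hit
                for s, e in month_windows(start, end)
                for hit in search(s, e, limit)
            ]
        hits.extend(batch)
    return hits
```

If a monthly window can also hit the limit, the same idea extends to weekly windows. This sketch stops at months, as the phase description does.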

Phase 3: Classify

Deterministic first (regex patterns from recipe), LLM fallback. Log every LLM fallback for future regex improvement (fail-improve loop). Skip marketing, newsletters, noise based on recipe's classification rules.
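
A minimal sketch of the deterministic-first classifier with a logged LLM fallback. The rule labels, the `llm_classify` callable, and the log entry shape are all assumptions. The first matching rule wins, so skip rules should be listed first.

```python
import re


def classify(text, rules, llm_classify, fallback_log):
    """Return a label for `text`, trying recipe regexes before the LLM.

    rules:        mapping of label -> regex pattern, checked in insertion order
    llm_classify: hypothetical callable (text -> label), used only as a fallback
    fallback_log: list collecting every fallback for later regex improvement
    """
    for label, pattern in rules.items():
        if re.search(pattern, text):
            return label
    label = llm_classify(text)
    # Record what the regexes missed so a new pattern can be written later.
    fallback_log.append({"label": label, "excerpt": text[:80]})
    return label


# Illustrative rules; real patterns come from the recipe.
rules = {
    "skip": r"(?i)unsubscribe|newsletter",
    "update": r"(?i)investor update",
}
log = []
first = classify("Q1 Investor Update", rules, lambda t: "unknown", log)
second = classify("Monthly numbers attached", rules, lambda t: "update", log)
```

Here `first` is decided by regex and leaves the log empty, while `second` falls through to the stand-in LLM and adds one log entry.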

Phase 4: Extract Structured Data

EXTRACTION INTEGRITY RULE:

Save raw source immediately (before any extraction)
Extract fields using deterministic regex first, LLM fallback
When summarizing batch results: re-read from saved files
Never trust LLM working memory after batch processing

This prevents a known hallucination bug where batch-processed amounts were 13/13 wrong from LLM working memory while saved files were correct.
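
The rule can be sketched as below, assuming local files stand in for the raw archive. The file naming, the record shape, and the single amount regex are assumptions, and the LLM fallback is left out for brevity. The key point is that `summarize` reads saved records from disk rather than reusing values held in memory.

```python
import json
import re
from pathlib import Path

# Illustrative pattern: a dollar amount such as $1,200.50
AMOUNT = re.compile(r"\$\s?([\d,]+(?:\.\d+)?)")


def process(message_id: str, body: str, out_dir: Path) -> dict:
    """Save the raw body first, then extract and save a structured record."""
    raw_path = out_dir / f"{message_id}.raw.txt"
    raw_path.write_text(body, encoding="utf-8")  # 1. raw source before extraction

    match = AMOUNT.search(body)  # 2. deterministic extraction
    record = {
        "id": message_id,
        "amount": match.group(1) if match else None,
        "raw": raw_path.name,  # link back to the raw source
    }
    (out_dir / f"{message_id}.json").write_text(json.dumps(record), encoding="utf-8")
    return record


def summarize(out_dir: Path) -> list:
    """Build the batch summary by re-reading saved records from disk."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(out_dir.glob("*.json"))
    ]
```

A record whose `amount` is `None` marks a message the regex could not handle, which is where an LLM fallback would take over.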

Phase 5: Archive Raw Sources

`put_raw_data` for email bodies, API responses
`file_upload` for PDF attachments, documents
Create `.redirect.yaml` pointers for large files in storage
Every tracker entry must link back to its raw source

Phase 6: Deduplicate

Before adding to tracker:

Exact match (same key fields) → skip
Fuzzy match (same entity + date + similar amount within tolerance) → flag for review
Different amount for same entity+date → add with note (could be correction)
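
The three rules above can be sketched as a single decision function. The entry shape (`entity`, `date`, `amount`), the use of a relative tolerance, and the 1% default are assumptions. A recipe could just as well define an absolute tolerance or different key fields.

```python
def dedupe_decision(new: dict, existing: list, tolerance: float = 0.01) -> str:
    """Return 'add', 'skip', 'review', or 'add_with_note' for a candidate entry.

    Entries are dicts with 'entity', 'date', and a numeric 'amount'.
    `tolerance` is relative: 0.01 means amounts within 1% count as similar.
    """
    same_key = [
        e for e in existing
        if e["entity"] == new["entity"] and e["date"] == new["date"]
    ]
    if not same_key:
        return "add"
    if any(e["amount"] == new["amount"] for e in same_key):
        return "skip"  # exact match on the key fields
    for e in same_key:
        scale = max(abs(e["amount"]), abs(new["amount"]))
        if abs(e["amount"] - new["amount"]) <= tolerance * scale:
            return "review"  # fuzzy match: flag for a human
    return "add_with_note"  # clearly different amount, possibly a correction


existing = [{"entity": "Example Co", "date": "2026-04-01", "amount": 188000.0}]
```

With this tracker state, an entry for the same company and date at 188500.0 is flagged for review, while one at 250000.0 is added with a note.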

Phase 7: Update Canonical Tracker + Backlink

Parse existing tracker page (markdown table)
Append new entries in correct section (grouped by year/quarter/entity)
Compute running totals
Backlink every mentioned entity (person → people/ page, company → companies/ page)
Uses enrichment service for entity pages
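
A rough sketch of the parse, append, and total steps for a pipe-delimited markdown table. It assumes a single table with a header row and a separator row, cells that contain no `|` characters, and amounts written like `$188K`. Section grouping and backlinking are not covered.

```python
def parse_amount(text: str) -> float:
    """Convert strings like '$188K' or '$12.5K' to a number."""
    units = {"K": 1e3, "M": 1e6, "B": 1e9}
    cleaned = text.strip().lstrip("$").replace(",", "")
    if cleaned and cleaned[-1].upper() in units:
        return float(cleaned[:-1]) * units[cleaned[-1].upper()]
    return float(cleaned)


def _cells(line: str) -> list:
    """Split one table line into trimmed cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_table(markdown: str) -> list:
    """Return the table's data rows as dicts keyed by column header."""
    lines = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    header = _cells(lines[0])
    return [dict(zip(header, _cells(ln))) for ln in lines[2:]]  # skip separator


def append_row(markdown: str, row: dict) -> str:
    """Append `row` as a new last line, using the table's own column order."""
    first = next(ln for ln in markdown.splitlines() if ln.strip().startswith("|"))
    new_line = "| " + " | ".join(row[col] for col in _cells(first)) + " |"
    return markdown.rstrip("\n") + "\n" + new_line + "\n"


def running_total(markdown: str, column: str) -> float:
    """Sum one amount column across all data rows."""
    return sum(parse_amount(r[column]) for r in parse_table(markdown))


table = (
    "| Date | Company | MRR |\n"
    "|---|---|---|\n"
    "| 2026-04-01 | Example Co | $188K |\n"
)
updated = append_row(table, {"Date": "2026-05-01", "Company": "Other Co", "MRR": "$12.5K"})
```

Recomputing the total from the parsed page, rather than adding to a cached number, keeps the total consistent with what is actually on the tracker.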

Built-In Recipes

Three example recipes ship with GBrain (see `~/.gbrain/recipes/`):

investor-updates — extract MRR, ARR, growth, burn, runway, headcount from investor update emails
expense-tracker — extract amounts, recipients, platforms from receipt emails (subscriptions, services, recurring charges)
company-updates — extract revenue, users, key metrics from portfolio company update emails

Anti-Patterns

Trusting LLM working memory for amounts after batch processing (use extraction integrity rule)
Creating tracker entries without raw source links
Running without deduplication (leads to double-counted entries)
Hardcoding source-specific patterns in the pipeline code (use recipes)

Output Format

Brain page at the recipe's `tracker_page` path with markdown tables:

2026

| Date | Company | MRR | ARR | Growth | Status |
|---|---|---|---|---|---|
| 2026-04-01 | Example Co | $188K | $2.3M | +14.7% MoM | Source |


Each entry links to its raw source. Running totals at the bottom of each section.

Conventions

References `skills/conventions/quality.md` for citation and back-linking rules.
