fork-intelligence

Fork Intelligence

Systematic methodology for discovering valuable work in GitHub fork ecosystems. Stars-only filtering misses 60-100% of substantive forks — this skill uses branch-level divergence analysis, upstream PR cross-referencing, and domain-specific heuristics to find what matters.
Validated empirically across 10 repositories spanning Python, Rust, TypeScript, C++/Python, and Node.js (tensortrade, backtesting.py, kokoro, pymoo, firecrawl, barter-rs, pueue, dukascopy-node, ArcticDB, flowsurface).

FIRST — TodoWrite Task Templates

MANDATORY: Select and load the appropriate template before any fork analysis.

Template A — Full Analysis (new repository)

1. Get upstream baseline (stars, forks, default branch, last push)
2. List all forks with pagination, note timestamp clusters
3. Filter to unique-timestamp forks (skip bulk mirrors)
4. Check default branch divergence (ahead_by/behind_by)
5. Check non-default branches for all forks with recent push or >1 branch
6. Evaluate commit content, author emails, tags/releases
7. Cross-reference upstream PR history from fork owners
8. Tier ranking and cross-fork convergence analysis
9. Produce report with actionable recommendations

Template B — Quick Scan (triage only)

1. Get upstream baseline
2. List forks, filter by timestamp clustering
3. Check default branch divergence only
4. Report forks with ahead_by > 0

Template C — Targeted Fork Evaluation (specific fork)

1. Compare fork vs upstream on all branches
2. Examine commit messages and changed files
3. Check for tags/releases, open issues, PRs
4. Assess cherry-pick viability

Signal Priority Order

Ranked by empirical reliability across 10 repositories. See signal-priority.md for details.

| Rank | Signal                      | Reliability | What It Catches                                      |
| ---- | --------------------------- | ----------- | ---------------------------------------------------- |
| 1    | Branch-level divergence     | Highest     | Work on feature branches (50%+ of substantive forks) |
| 2    | Upstream PR cross-reference | High        | Rebased/force-pushed work invisible to compare API   |
| 3    | Tags/releases on fork       | High        | Independent maintenance intent                       |
| 4    | Commit email domains        | High        | Institutional contributors (`@company.com`)          |
| 5    | Timestamp clustering        | Medium      | Eliminates 85%+ mirror noise                         |
| 6    | Cross-fork convergence      | Medium      | Reveals unmet upstream demand                        |
| 7    | Stars                       | Lowest      | Often anti-correlated with actual value              |

Pipeline — 7 Steps

Step 1: Upstream Baseline

```bash
UPSTREAM="OWNER/REPO"
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, default_branch, stargazers_count}'
```

Step 2: List All Forks + Timestamp Clustering

```bash
# List all forks with activity signals
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count, default_branch}'
```

**Timestamp clustering**: Forks sharing exact `pushed_at` with upstream are bulk mirrors created by GitHub's fork mechanism and never touched. Group by `pushed_at`: forks with unique timestamps warrant investigation. This alone eliminates 85%+ of noise.

```bash
# Filter to unique-timestamp forks (skip bulk mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count}' |
  jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten'
```

Step 3: Default Branch Divergence

```bash
BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')

# For each candidate fork
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:$BRANCH" \
  --jq '{ahead_by, behind_by, status}'
```

The `status` field meanings:

- `identical`: pure mirror, skip
- `behind`: stale mirror, skip
- `diverged`: has original commits AND is behind (interesting)
- `ahead`: has original commits, up-to-date with upstream (rare, most valuable)

**Important**: Always compare from the upstream repo's perspective (`repos/UPSTREAM/compare/...`). The reverse direction (`repos/FORK/compare/...`) returns 404 for some repositories.
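
The per-fork check can be wrapped in a small loop. The following is a minimal sketch rather than part of the validated pipeline: `triage_action` and `scan_default_branch` are hypothetical helper names, and the sketch assumes `UPSTREAM` and `BRANCH` are set as in the earlier steps, that fork owner logins arrive one per line on stdin, and that each fork kept upstream's repository name and default branch name.

```bash
# Map a compare-API status to a triage action, following the status list.
# triage_action is a hypothetical helper name, not part of gh.
triage_action() {
  case "$1" in
    identical|behind) echo "skip" ;;
    diverged|ahead)   echo "investigate" ;;
    *)                echo "unknown" ;;
  esac
}

# Read fork owner logins (one per line) from stdin and print
# "owner status ahead_by action" for each.
# Assumes UPSTREAM and BRANCH are already set.
scan_default_branch() {
  while IFS= read -r owner; do
    result=$(gh api "repos/$UPSTREAM/compare/$BRANCH...$owner:$BRANCH" \
      --jq '"\(.status) \(.ahead_by)"' </dev/null 2>/dev/null) || result="error -"
    status=${result%% *}   # text before the first space
    ahead=${result#* }     # text after the first space
    echo "$owner $status $ahead $(triage_action "$status")"
  done
}
```

One possible invocation feeds it every fork owner: `gh api "repos/$UPSTREAM/forks" --paginate --jq '.[].owner.login' | scan_default_branch`. In practice the input would more likely be the unique-timestamp candidates from Step 2, which keeps the number of API calls down.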

Step 4: Non-Default Branch Analysis (CRITICAL)

This is the single biggest methodology improvement. Across all 10 repos tested, 50%+ of the most valuable fork work lived exclusively on feature branches.

Examples:

- flowsurface/aviu16: 7,000-line GPU shader heatmap only on `shader-heatmap`
- ArcticDB/DerThorsten: 147 commits across `conda_build`, `clang`, `apple_changes`
- pueue/FrancescElies: Duration display only on `cesc/duration`
- barter-rs: 6 of 12 top forks had work only on feature branches

```bash
# List branches on a fork
gh api "repos/FORK_OWNER/REPO/branches" --jq '.[].name' | head -20

# Check divergence on a specific branch
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:FEATURE_BRANCH" \
  --jq '{ahead_by, behind_by, status}'
```

**Heuristics for which forks need branch checks**:

- Any fork with `pushed_at` more recent than upstream but `ahead_by == 0` on default branch
- Any fork with more than 1 branch
- Branch count > 10 is suspicious: likely non-trivial work (ArcticDB: Rohan-flutterint had 197 branches)
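
The first two heuristics (a recent push with nothing ahead on the default branch, or more than one branch) can be written as a predicate. This is a sketch under stated assumptions: `needs_branch_check` and `scan_branches` are hypothetical names, timestamps are assumed to be the ISO 8601 UTC strings the GitHub API returns (so stripping the non-digits yields comparable integers), and `UPSTREAM` and `BRANCH` are set as in Step 3.

```bash
# needs_branch_check FORK_PUSHED_AT UPSTREAM_PUSHED_AT AHEAD_BY BRANCH_COUNT
# Returns 0 (true) when a fork warrants a per-branch comparison.
needs_branch_check() {
  # "2024-05-01T12:00:00Z" -> 20240501120000, comparable as an integer
  fork_ts=$(printf '%s' "$1" | tr -cd '0-9')
  upstream_ts=$(printf '%s' "$2" | tr -cd '0-9')
  ahead_by=$3
  branch_count=$4
  # Heuristic: the fork has more than one branch
  [ "$branch_count" -gt 1 ] && return 0
  # Heuristic: pushed after upstream, yet nothing ahead on the default branch
  [ "$fork_ts" -gt "$upstream_ts" ] && [ "$ahead_by" -eq 0 ] && return 0
  return 1
}

# scan_branches FORK_OWNER/REPO
# Prints "branch status ahead_by" for every branch on the fork.
scan_branches() {
  fork=$1
  owner=${fork%%/*}
  gh api "repos/$fork/branches" --paginate --jq '.[].name' </dev/null |
  while IFS= read -r branch; do
    result=$(gh api "repos/$UPSTREAM/compare/$BRANCH...$owner:$branch" \
      --jq '"\(.status) \(.ahead_by)"' </dev/null 2>/dev/null) || result="error -"
    echo "$branch $result"
  done
}
```

Unlike the `head -20` listing above, `scan_branches` paginates, so a fork with hundreds of branches costs one compare call per branch. Running the predicate first keeps that cost to forks that earn it.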

Step 5: Commit Content Evaluation

```bash
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:BRANCH" \
  --jq '.commits[] | {sha: .sha[:8], message: .commit.message | split("\n")[0], date: .commit.committer.date[:10], author: .commit.author.email}'
```

What to look for:

- Commit email domains reveal institutional contributors (`@man.com`, `@quantstack.net`)
- Subtract merge commits from ahead_by count (e.g., akeda2/pueue showed 35 ahead but 28 were upstream merges)
- Build system changes (`CMakeLists.txt`, `Cargo.toml`, `pyproject.toml`) indicate platform enablement
- Protobuf schema changes indicate architectural-level features
- Test files alongside source changes signal production-intent work
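
The first two checks in the list lend themselves to small helpers. The sketch below is illustrative only: `tally_email_domains` and `count_original_commits` are hypothetical names, `UPSTREAM` and `BRANCH` are assumed to be set as in Step 3, and a merge commit is identified by having two or more parents. The count covers only the commits the compare response actually returns.

```bash
# Read email addresses (one per line) from stdin and print
# "count domain" lines, most frequent domain first.
tally_email_domains() {
  sed -n 's/.*@//p' | sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# count_original_commits FORK_OWNER:BRANCH
# Counts commits in the comparison that are not merges (fewer than 2 parents).
count_original_commits() {
  gh api "repos/$UPSTREAM/compare/$BRANCH...$1" \
    --jq '[.commits[] | select(.parents | length < 2)] | length' </dev/null
}
```

For example, `gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:BRANCH" --jq '.commits[].commit.author.email' | tally_email_domains` would summarize who authored the divergent commits, and comparing `count_original_commits` with `ahead_by` shows how much of the lead is merge traffic.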

Step 6: Fork-Specific Signals

```bash
# Tags/releases (strongest independent maintenance signal)
gh api "repos/FORK_OWNER/REPO/tags" --jq '.[].name' | head -10
gh api "repos/FORK_OWNER/REPO/releases" --jq '.[] | {tag_name, name, published_at}' | head -5

# Open issues on the fork (signals independent project maintenance)
# Note: this endpoint also returns pull requests, and without --paginate
# only the first page (30 items) is counted.
gh api "repos/FORK_OWNER/REPO/issues?state=open" --jq 'length'

# Check if repo was renamed (strong divergence intent signal)
gh api "repos/FORK_OWNER/REPO" --jq '.name'
```

| Signal                    | Strength                  | Example                                 |
| ------------------------- | ------------------------- | --------------------------------------- |
| Tags/releases on fork     | Highest                   | pueue/freesrz93 had 6 releases          |
| Open PRs against upstream | High                      | Formal proposals with review context    |
| Open issues on the fork   | High                      | Independent project maintenance         |
| Repo renamed              | Medium                    | flowsurface/sinaha81 became volume_flow |
| Build config changes      | High (compiled languages) | Cargo.toml, CMakeLists.txt diff         |
| Description changed       | Weak                      | Many vanity renames with no code        |
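
Reading `.name` only reveals a rename once it is compared with upstream's repository name. A minimal sketch of that comparison follows. `is_renamed` and `list_renamed_forks` are hypothetical helper names, `UPSTREAM` is assumed to be set as in Step 1, and the comparison is case-sensitive.

```bash
# is_renamed UPSTREAM_FULL_NAME FORK_FULL_NAME
# True when the repository part (after the slash) differs.
is_renamed() {
  [ "${1#*/}" != "${2#*/}" ]
}

# Print the full names of forks whose repository name differs from upstream's.
list_renamed_forks() {
  gh api "repos/$UPSTREAM/forks" --paginate --jq '.[].full_name' </dev/null |
  while IFS= read -r fork; do
    if is_renamed "$UPSTREAM" "$fork"; then
      echo "$fork"
    fi
  done
}
```

Because the fork listing already carries `full_name`, this needs no extra API call per fork.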

Step 7: Cross-Fork Convergence + Upstream PR History

```bash
# Check upstream PRs from fork owners
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
  --jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'
```

**Cross-fork convergence**: When multiple forks independently solve the same problem, it signals unmet upstream demand:

- firecrawl: 3 forks adopted Patchright for anti-detection
- flowsurface: 3 forks added technical indicators independently
- kokoro: 2 independent batched inference implementations
- barter-rs: 4 forks added Bybit support

**Upstream PR cross-reference catches**:

- Rebased/force-pushed work invisible to compare API
- Work that was merged upstream (fork shows 0 ahead but was historically significant)
- Declined PRs with valuable code that the fork still maintains

---

Tier Classification

After running the pipeline, classify forks into tiers:

| Tier                      | Criteria                                                  | Action                                  |
| ------------------------- | --------------------------------------------------------- | --------------------------------------- |
| Tier 1: Major Extensions  | New features, architectural changes, >10 original commits | Deep evaluation, cherry-pick candidates |
| Tier 2: Targeted Features | Focused additions, bug fixes, 2-10 commits                | Cherry-pick individual commits          |
| Tier 3: Infrastructure    | CI/CD, packaging, deployment, docs                        | Evaluate if relevant to your setup      |
| Tier 4: Historical        | Merged upstream or stale but once significant             | Note for context, no action needed      |

Domain-Specific Patterns

Different codebases exhibit different fork behaviors. See domain-patterns.md for full details.

| Domain                  | Key Pattern                                                                | Example                                               |
| ----------------------- | -------------------------------------------------------------------------- | ----------------------------------------------------- |
| Scientific/ML           | Researchers fork-implement-publish-vanish, zero social engagement          | pymoo: 300-file fork with 0 stars                     |
| Trading/Finance         | Exchange connectors dominate; best forks are private                       | barter-rs: 4 independent Bybit impls                  |
| Infrastructure/DevTools | Self-hosting/SaaS-removal is the dominant theme                            | firecrawl: devflowinc/firecrawl-simple (630 stars)    |
| C++/Python Mixed        | Feature work lives on branches; email domains reveal institutions          | ArcticDB: @man.com, @quantstack.net                   |
| Node.js Libraries       | Check npm publication as separate packages                                 | dukascopy-node: kyo06 published `dukascopy-node-plus` |
| Rust CLI                | Cargo.toml diff is reliable quick filter; "superset" forks add subcommands | pueue: freesrz93 added 7 subcommands                  |

Quick-Scan Pipeline (5-minute triage)

For rapid triage of any new repo:

```bash
UPSTREAM="OWNER/REPO"
BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')

# 1. Baseline
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, stargazers_count}'

# 2. Forks with unique timestamps (skip mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count}' |
  jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten | sort_by(.pushed_at) | reverse'

# 3. Check ahead_by for each candidate
# (loop over candidates from step 2)

# 4. Check upstream PRs from fork authors
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
  --jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'
```

---

Known Limitations

| Limitation                                      | Impact                                 | Workaround                                                        |
| ----------------------------------------------- | -------------------------------------- | ----------------------------------------------------------------- |
| GitHub compare API 250-commit limit             | Highly divergent forks may truncate    | Use `gh api repos/FORK/commits?per_page=1` to get total count     |
| Private forks invisible                         | Trading firms keep best work private   | Accepted limitation                                               |
| Force-pushed branches break compare API         | Shows 0 ahead despite significant work | Cross-reference upstream PR history                               |
| Renamed forks may break API calls               | Old URLs may 404                       | Use `gh api repos/FORK_OWNER/REPO --jq '.name'` to detect renames |
| Rate limiting on large fork ecosystems          | >1000 forks = many API calls           | Use timestamp clustering to reduce calls by 85%+                  |
| Maintainer dev forks look like independent work | Branch names 1:1 with upstream PRs     | Cross-reference branch names against upstream PR branch names     |
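
The `per_page=1` workaround does not return a count directly. One common way to read a total from that call is to look at the `Link` response header: with one commit per page, the page number of the `rel="last"` link equals the number of commits on the fork's default branch. The sketch below assumes that behaviour. `last_page_from_link` and `total_commits` are hypothetical names, and `gh api --include` is used to print the response headers.

```bash
# Extract the page number of the rel="last" entry from a Link header value.
# Prints nothing when there is no rel="last" entry (single-page results).
last_page_from_link() {
  printf '%s\n' "$1" | tr ',' '\n' |
    sed -n 's/.*[?&]page=\([0-9][0-9]*\)[^>]*>; *rel="last".*/\1/p'
}

# total_commits FORK_OWNER/REPO
# Prints the commit count on the default branch, derived from pagination.
total_commits() {
  link=$(gh api --include "repos/$1/commits?per_page=1" </dev/null |
    tr -d '\r' | grep -i '^link:')
  last_page_from_link "$link"
}
```

An empty result means the response had no `rel="last"` link, which is what a repository with a single commit would produce.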

Report Template

Use this structure for the final analysis report:

```markdown
# Fork Analysis Report: OWNER/REPO

**Repository**: OWNER/REPO (N stars, M forks)
**Analysis date**: YYYY-MM-DD

## Fork Landscape Summary

| Metric                                | Value  |
| ------------------------------------- | ------ |
| Total forks                           | N      |
| Pure mirrors                          | N (X%) |
| Divergent forks (ahead on any branch) | N      |
| Substantive forks (meaningful work)   | N      |
| Stars-only miss rate                  | X%     |

## Tiered Ranking

### Tier 1: Major Extensions

(fork details with ahead_by, key features, files changed)

### Tier 2: Targeted Features

...

### Tier 3: Infrastructure/Packaging

...

## Cross-Fork Convergence Patterns

(themes that multiple forks independently implemented)

## Actionable Recommendations

- Cherry-pick candidates
- Feature inspiration
- Security fixes
```

---

Post-Change Checklist

After modifying THIS skill:

1. YAML frontmatter valid (no colons in description)
2. Trigger keywords current in description
3. All `./references/` links resolve
4. Pipeline steps numbered consistently
5. Shell commands tested against a real repository
6. Append changes to evolution-log.md