Fork Intelligence

Systematic methodology for discovering valuable work in GitHub fork ecosystems. Stars-only filtering misses 60-100% of substantive forks — this skill uses branch-level divergence analysis, upstream PR cross-referencing, and domain-specific heuristics to find what matters.

Validated empirically across 10 repositories spanning Python, Rust, TypeScript, C++/Python, and Node.js (tensortrade, backtesting.py, kokoro, pymoo, firecrawl, barter-rs, pueue, dukascopy-node, ArcticDB, flowsurface).

FIRST — TodoWrite Task Templates

MANDATORY: Select and load the appropriate template before any fork analysis.

Template A — Full Analysis (new repository)

1. Get upstream baseline (stars, forks, default branch, last push)
2. List all forks with pagination, note timestamp clusters
3. Filter to unique-timestamp forks (skip bulk mirrors)
4. Check default branch divergence (ahead_by/behind_by)
5. Check non-default branches for all forks with recent push or >1 branch
6. Evaluate commit content, author emails, tags/releases
7. Cross-reference upstream PR history from fork owners
8. Tier ranking and cross-fork convergence analysis
9. Produce report with actionable recommendations

Template B — Quick Scan (triage only)

1. Get upstream baseline
2. List forks, filter by timestamp clustering
3. Check default branch divergence only
4. Report forks with ahead_by > 0

Template C — Targeted Fork Evaluation (specific fork)

1. Compare fork vs upstream on all branches
2. Examine commit messages and changed files
3. Check for tags/releases, open issues, PRs
4. Assess cherry-pick viability

Signal Priority Order

Ranked by empirical reliability across 10 repositories. See signal-priority.md for details.

Rank	Signal	Reliability	What It Catches
1	Branch-level divergence	Highest	Work on feature branches (50%+ of substantive forks)
2	Upstream PR cross-reference	High	Rebased/force-pushed work invisible to compare API
3	Tags/releases on fork	High	Independent maintenance intent
4	Commit email domains	High	Institutional contributors ( `@company.com` )
5	Timestamp clustering	Medium	Eliminates 85%+ mirror noise
6	Cross-fork convergence	Medium	Reveals unmet upstream demand
7	Stars	Lowest	Often anti-correlated with actual value

Pipeline — 7 Steps

Step 1: Upstream Baseline

bash

UPSTREAM="OWNER/REPO"
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, default_branch, stargazers_count}'

Step 2: List All Forks + Timestamp Clustering

bash

# List all forks with activity signals
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count, default_branch}'

Timestamp clustering: Forks sharing exact

pushed_at

with upstream are bulk mirrors created by GitHub's fork mechanism and never touched. Group by

pushed_at

— forks with unique timestamps warrant investigation. This alone eliminates 85%+ of noise.

bash

# Filter to unique-timestamp forks (skip bulk mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count}' | \
  jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten'

Step 3: Default Branch Divergence

bash

BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')

# For each candidate fork
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:$BRANCH" \
  --jq '{ahead_by, behind_by, status}'

The

status

field meanings:

```
identical
```
— pure mirror, skip
```
behind
```
— stale mirror, skip
```
diverged
```
— has original commits AND is behind (interesting)
```
ahead
```
— has original commits, up-to-date with upstream (rare, most valuable)

Important: Always compare from the upstream repo's perspective (

repos/UPSTREAM/compare/...

). The reverse direction (

repos/FORK/compare/...

) returns 404 for some repositories.

Step 4: Non-Default Branch Analysis (CRITICAL)

This is the single biggest methodology improvement. Across all 10 repos tested, 50%+ of the most valuable fork work lived exclusively on feature branches.

Examples:

flowsurface/aviu16: 7,000-line GPU shader heatmap only on
```
shader-heatmap
```
ArcticDB/DerThorsten: 147 commits across
```
conda_build
```
,
```
clang
```
,
```
apple_changes
```
pueue/FrancescElies: Duration display only on
```
cesc/duration
```
barter-rs: 6 of 12 top forks had work only on feature branches

bash

# List branches on a fork
gh api "repos/FORK_OWNER/REPO/branches" --jq '.[].name' | head -20

# Check divergence on a specific branch
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:FEATURE_BRANCH" \
  --jq '{ahead_by, behind_by, status}'

Heuristics for which forks need branch checks:

Any fork with
```
pushed_at
```
more recent than upstream but
```
ahead_by == 0
```
on default branch
Any fork with more than 1 branch
Branch count > 10 is suspicious — likely non-trivial work (ArcticDB: Rohan-flutterint had 197 branches)

Step 5: Commit Content Evaluation

bash

gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:BRANCH" \
  --jq '.commits[] | {sha: .sha[:8], message: .commit.message | split("\n")[0], date: .commit.committer.date[:10], author: .commit.author.email}'

What to look for:

Commit email domains reveal institutional contributors (
```
@man.com
```
,
```
@quantstack.net
```
)
Subtract merge commits from ahead_by count (e.g., akeda2/pueue showed 35 ahead but 28 were upstream merges)
Build system changes (
```
CMakeLists.txt
```
,
```
Cargo.toml
```
,
```
pyproject.toml
```
) indicate platform enablement
Protobuf schema changes indicate architectural-level features
Test files alongside source changes signal production-intent work

Step 6: Fork-Specific Signals

bash

# Tags/releases (strongest independent maintenance signal)
gh api "repos/FORK_OWNER/REPO/tags" --jq '.[].name' | head -10
gh api "repos/FORK_OWNER/REPO/releases" --jq '.[] | {tag_name, name, published_at}' | head -5

# Open issues on the fork (signals independent project maintenance)
gh api "repos/FORK_OWNER/REPO/issues?state=open" --jq 'length'

# Check if repo was renamed (strong divergence intent signal)
gh api "repos/FORK_OWNER/REPO" --jq '.name'

Signal	Strength	Example
Tags/releases on fork	Highest	pueue/freesrz93 had 6 releases
Open PRs against upstream	High	Formal proposals with review context
Open issues on the fork	High	Independent project maintenance
Repo renamed	Medium	flowsurface/sinaha81 became volume_flow
Build config changes	High (compiled languages)	Cargo.toml, CMakeLists.txt diff
Description changed	Weak	Many vanity renames with no code

Step 7: Cross-Fork Convergence + Upstream PR History

bash

# Check upstream PRs from fork owners
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
  --jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'

Cross-fork convergence: When multiple forks independently solve the same problem, it signals unmet upstream demand:

firecrawl: 3 forks adopted Patchright for anti-detection
flowsurface: 3 forks added technical indicators independently
kokoro: 2 independent batched inference implementations
barter-rs: 4 forks added Bybit support

Upstream PR cross-reference catches:

Rebased/force-pushed work invisible to compare API
Work that was merged upstream (fork shows 0 ahead but was historically significant)
Declined PRs with valuable code that the fork still maintains

Tier Classification

After running the pipeline, classify forks into tiers:

Tier	Criteria	Action
Tier 1: Major Extensions	New features, architectural changes, >10 original commits	Deep evaluation, cherry-pick candidates
Tier 2: Targeted Features	Focused additions, bug fixes, 2-10 commits	Cherry-pick individual commits
Tier 3: Infrastructure	CI/CD, packaging, deployment, docs	Evaluate if relevant to your setup
Tier 4: Historical	Merged upstream or stale but once significant	Note for context, no action needed

Domain-Specific Patterns

Different codebases exhibit different fork behaviors. See domain-patterns.md for full details.

Domain	Key Pattern	Example
Scientific/ML	Researchers fork-implement-publish-vanish, zero social engagement	pymoo: 300-file fork with 0 stars
Trading/Finance	Exchange connectors dominate; best forks are private	barter-rs: 4 independent Bybit impls
Infrastructure/DevTools	Self-hosting/SaaS-removal is the dominant theme	firecrawl: devflowinc/firecrawl-simple (630 stars)
C++/Python Mixed	Feature work lives on branches; email domains reveal institutions	ArcticDB: @man.com, @quantstack.net
Node.js Libraries	Check npm publication as separate packages	dukascopy-node: kyo06 published `dukascopy-node-plus`
Rust CLI	Cargo.toml diff is reliable quick filter; "superset" forks add subcommands	pueue: freesrz93 added 7 subcommands

Quick-Scan Pipeline (5-minute triage)

For rapid triage of any new repo:

bash

UPSTREAM="OWNER/REPO"
BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')

# 1. Baseline
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, stargazers_count}'

# 2. Forks with unique timestamps (skip mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count}' | \
  jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten | sort_by(.pushed_at) | reverse'

# 3. Check ahead_by for each candidate
# (loop over candidates from step 2)

# 4. Check upstream PRs from fork authors
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
  --jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'

Known Limitations

Limitation	Impact	Workaround
GitHub compare API 250-commit limit	Highly divergent forks may truncate	Use `gh api repos/FORK/commits?per_page=1` to get total count
Private forks invisible	Trading firms keep best work private	Accepted limitation
Force-pushed branches break compare API	Shows 0 ahead despite significant work	Cross-reference upstream PR history
Renamed forks may break API calls	Old URLs may 404	Use `gh api repos/FORK_OWNER/REPO --jq '.name'` to detect renames
Rate limiting on large fork ecosystems	>1000 forks = many API calls	Use timestamp clustering to reduce calls by 85%+
Maintainer dev forks look like independent work	Branch names 1:1 with upstream PRs	Cross-reference branch names against upstream PR branch names

Report Template

Use this structure for the final analysis report:

markdown

# Fork Analysis Report: OWNER/REPO

**Repository**: OWNER/REPO (N stars, M forks)
**Analysis date**: YYYY-MM-DD

## Fork Landscape Summary

| Metric                                | Value  |
| ------------------------------------- | ------ |
| Total forks                           | N      |
| Pure mirrors                          | N (X%) |
| Divergent forks (ahead on any branch) | N      |
| Substantive forks (meaningful work)   | N      |
| Stars-only miss rate                  | X%     |

## Tiered Ranking

### Tier 1: Major Extensions

(fork details with ahead_by, key features, files changed)

### Tier 2: Targeted Features

...

### Tier 3: Infrastructure/Packaging

...

## Cross-Fork Convergence Patterns

(themes that multiple forks independently implemented)

## Actionable Recommendations

- Cherry-pick candidates
- Feature inspiration
- Security fixes

Post-Change Checklist

After modifying THIS skill:

YAML frontmatter valid (no colons in description)
Trigger keywords current in description
All
```
./references/
```
links resolve
Pipeline steps numbered consistently
Shell commands tested against a real repository
Append changes to evolution-log.md

fork-intelligence

NPX Install

Tags

SKILL.md Content

Fork Intelligence

FIRST — TodoWrite Task Templates

Template A — Full Analysis (new repository)

Template B — Quick Scan (triage only)

Template C — Targeted Fork Evaluation (specific fork)

Signal Priority Order

Pipeline — 7 Steps

Step 1: Upstream Baseline

Step 2: List All Forks + Timestamp Clustering

Step 3: Default Branch Divergence

Step 4: Non-Default Branch Analysis (CRITICAL)

Step 5: Commit Content Evaluation

Step 6: Fork-Specific Signals

Step 7: Cross-Fork Convergence + Upstream PR History

Tier Classification

Domain-Specific Patterns

Quick-Scan Pipeline (5-minute triage)

Known Limitations

Report Template

Post-Change Checklist