split-pr

Split PR by CODEOWNERS Groups

Split a large pull request into multiple smaller PRs, where each PR touches the fewest possible CODEOWNERS reviewer groups. The goal is to reduce review burden: a PR that only touches `megatron/core/` needs only the core reviewers, while a PR that also touches `examples/`, `tools/`, and `megatron/training/` pulls in many additional groups.

Workflow

1. Analyze the PR

  1. Fetch the PR details: `gh pr view <number> --repo NVIDIA/Megatron-LM --json title,body,headRefName,author` and the list of changed files with `gh pr diff <number> --repo NVIDIA/Megatron-LM --name-only` (`gh pr diff` does not accept `--stat`). Also determine the current GitHub user with `gh api user --jq .login`.
  2. Parse `.github/CODEOWNERS` to build a mapping from file path patterns to owner groups.
  3. For each changed file in the PR, determine which CODEOWNERS groups would be required to review it.
  4. Build a summary table grouped by CODEOWNERS group, showing which files pull in which groups.
  5. Count the total number of distinct reviewer groups the PR currently requires.
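
The analysis steps above can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the team names in `SAMPLE_CODEOWNERS` are placeholders, and `matches` is a deliberately simplified matcher (root-anchored directory prefixes plus `fnmatch` globs) that does not reproduce every CODEOWNERS pattern rule. It does follow the rule that the last matching line in the file takes precedence.

```python
from fnmatch import fnmatch

# Hypothetical CODEOWNERS content; the team names are placeholders.
SAMPLE_CODEOWNERS = """\
# comment lines and blank lines are skipped
megatron/core/      @example-org/core
megatron/training/  @example-org/training
examples/           @example-org/examples @example-org/training
"""


def parse_codeowners(text):
    """Return (pattern, owners) pairs in file order."""
    rules = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (ignores escaped '#')
        if not line:
            continue
        pattern, *owners = line.split()
        rules.append((pattern, tuple(owners)))
    return rules


def matches(pattern, path):
    """Simplified matching: root-anchored directory prefixes plus fnmatch globs."""
    pat = pattern.lstrip("/")
    if pat.endswith("/"):
        return path.startswith(pat)
    return path == pat or fnmatch(path, pat)


def owners_for(path, rules):
    """Return the owners of the last rule that matches the path."""
    owners = ()
    for pattern, rule_owners in rules:
        if matches(pattern, path):
            owners = rule_owners
    return owners


def files_by_group(paths, rules):
    """Map each owner group to the changed files that pull it in."""
    table = {}
    for path in paths:
        for owner in owners_for(path, rules):
            table.setdefault(owner, []).append(path)
    return table


rules = parse_codeowners(SAMPLE_CODEOWNERS)
changed = ["megatron/core/a.py", "megatron/training/b.py", "examples/run.sh"]
summary = files_by_group(changed, rules)
for group, files in sorted(summary.items()):
    print(f"{group}: {', '.join(files)}")
print("distinct groups:", len(summary))
```

In practice the `changed` list would come from the PR's file list and the rules from the repository's real `.github/CODEOWNERS`.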

2. Propose a split that minimizes reviewer groups per PR

The primary optimization goal: minimize the number of CODEOWNERS reviewer groups required for each resulting PR.
Strategy:
  1. Cluster files by their CODEOWNERS groups. Files owned by the same set of groups naturally belong together.
  2. Identify the largest cluster — this becomes the first (and usually largest) PR.
  3. Remaining files form one or more additional PRs, each ideally requiring only one or two reviewer groups.
  4. If a split creates a dependency (e.g., PR B uses symbols renamed in PR A), the dependent PR must be merged after the first. Note this explicitly.
  5. Each PR must be independently mergeable to main — no broken imports, no missing symbols. Backward-compatible aliases and re-export stubs in the first PR can make this possible.
Present the proposed split as a table:
  • PR name/description
  • Files included
  • CODEOWNERS groups required
  • Dependencies on other PRs (if any)
Wait for user approval before proceeding.
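
One way to sketch the clustering strategy is to key each file by the exact set of groups that own it and list the largest cluster first. The input mapping and team names below are hypothetical, and the tie-breaking order (fewer groups first, then alphabetical) is an assumption added only to keep the output deterministic.

```python
from collections import defaultdict


def cluster_by_owner_set(file_owners):
    """Group files that share exactly the same set of owner groups.

    Returns (owner_set, files) pairs, largest cluster first.
    """
    clusters = defaultdict(list)
    for path, owners in file_owners.items():
        clusters[frozenset(owners)].append(path)
    return sorted(
        clusters.items(),
        # More files first; ties go to the cluster needing fewer groups.
        key=lambda item: (-len(item[1]), len(item[0]), sorted(item[0])),
    )


# Hypothetical mapping from changed file to its CODEOWNERS groups.
file_owners = {
    "megatron/core/a.py": ["@example-org/core"],
    "megatron/core/b.py": ["@example-org/core"],
    "megatron/training/c.py": ["@example-org/training"],
    "examples/run.sh": ["@example-org/examples", "@example-org/training"],
}

proposal = cluster_by_owner_set(file_owners)
for index, (groups, files) in enumerate(proposal, start=1):
    print(f"PR {index}: groups={sorted(groups)} files={files}")
```

This only produces a starting point. Dependencies between clusters and files that are hard to categorize still need a human decision before the table is presented.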

3. Execute the split (after user approval)

For each new PR:
  1. Create a new branch from the appropriate base (`main`, or a dependency PR's branch).
  2. Extract the relevant changes: `git diff upstream/main..<source-branch> -- <file paths> | git apply`.
  3. Stage, commit with a clear message, and push to the user's fork.
  4. Create the PR as a draft (per repo contributing guidelines).
  5. If the original PR needs to be narrowed in scope, confirm with the user before force-pushing.
  6. Report all PR URLs when done.
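
The per-PR steps above could be assembled as a dry-run command plan before anything is executed. The sketch below only builds shell strings and runs nothing. The function name `plan_split`, the remote name `origin` for the user's fork, and the branch and file names in the usage lines are assumptions for illustration.

```python
import shlex


def plan_split(new_branch, base, source_branch, files, title, body, remote="origin"):
    """Return the shell commands for one split PR, in execution order."""
    diff = ["git", "diff", f"upstream/main..{source_branch}", "--", *files]
    return [
        # 1. New branch from main or from a dependency PR's branch.
        shlex.join(["git", "checkout", "-b", new_branch, base]),
        # 2. Apply only the selected files' changes to the working tree.
        shlex.join(diff) + " | git apply",
        # 3. Stage, commit once, and push to the fork (assumed remote name).
        shlex.join(["git", "add", "--", *files]),
        shlex.join(["git", "commit", "-m", title]),
        shlex.join(["git", "push", "-u", remote, new_branch]),
        # 4. Open the PR as a draft against the upstream repository.
        shlex.join([
            "gh", "pr", "create", "--draft",
            "--repo", "NVIDIA/Megatron-LM",
            "--title", title,
            "--body", body,
        ]),
    ]


# Hypothetical branch and file names.
plan = plan_split(
    new_branch="split/core",
    base="upstream/main",
    source_branch="feature",
    files=["megatron/core/a.py"],
    title="Split core changes",
    body="Part 1 of the split.",
)
for command in plan:
    print(command)
```

Printing the plan first gives the user a chance to review it, which fits the requirement to confirm before any force-push.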

Important guidelines

  • Always create PRs as drafts and push to the user's fork, never directly to upstream.
  • Backward-compatible changes (aliases, re-exports, deprecation shims) should go in the first PR so subsequent PRs can depend on them.
  • Test files should go with the production code they test, not in a separate PR.
  • Prefer a single clean commit per split PR over replaying the original commit history.
  • If a file is hard to categorize (e.g., it touches two groups), ask the user which PR it should go in.
  • If the current GitHub user is not the author of the original PR, each new PR's description must explicitly credit the original author (e.g., "Original changes by @<author> in #<number>").
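
The attribution rule in the last guideline can be expressed as a small helper. This is a sketch: the function name `build_pr_body` and the sample logins are hypothetical, and the case-insensitive comparison is an assumption based on GitHub logins not being case-sensitive.

```python
def build_pr_body(summary, current_user, original_author, original_number):
    """Append a credit line when someone other than the author does the split."""
    if current_user.lower() == original_author.lower():
        return summary
    credit = f"Original changes by @{original_author} in #{original_number}."
    return f"{summary}\n\n{credit}"


# Hypothetical logins and PR number.
print(build_pr_body("Split out core changes.", "alice", "bob", 1234))
```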