mcore-split-pr

Split PR by CODEOWNERS Groups

Split a large pull request into multiple smaller PRs, where each PR touches the fewest possible CODEOWNERS reviewer groups. The goal is to reduce review burden: a PR that only touches megatron/core/ needs only the core reviewers, while a PR that also touches examples/, tools/, and megatron/training/ pulls in many additional groups.

Answer-First Constraints

For split-planning questions, lead with these constraints before the full workflow:
  • Minimize CODEOWNERS reviewer groups per PR, but each resulting PR must still be independently mergeable and reviewable.
  • Tests travel with the production code they validate; do not split tests into a separate PR just to reduce reviewer groups.
  • If PR B depends on symbols renamed in PR A, call out the dependency and put backward-compatible aliases, re-exports, or shims in PR A when needed.
  • Wait for user approval before execution.
  • Execution creates draft PRs from the right base, applies file-scoped diffs, pushes to the user's fork, and never pushes directly to upstream. File-scoped diffs are applied with:
    git diff upstream/main..<source-branch> -- <paths> | git apply

Workflow

1. Analyze the PR

  1. Fetch the PR details, list the changed files, and determine the current GitHub user:
    gh pr view <number> --repo NVIDIA/Megatron-LM --json title,body,headRefName,author
    gh pr diff <number> --repo NVIDIA/Megatron-LM --name-only
    gh api user --jq .login
  2. Parse .github/CODEOWNERS to build a mapping from file path patterns to owner groups.
  3. For each changed file in the PR, determine which CODEOWNERS groups would be required to review it.
  4. Build a summary table grouped by CODEOWNERS group, showing which files pull in which groups.
  5. Count the total number of distinct reviewer groups the PR currently requires.
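
Steps 2 through 5 can be sketched in Python as follows. This is a minimal illustration, not a full CODEOWNERS implementation: the matcher handles only directory prefixes and fnmatch-style globs, whereas GitHub's real matching follows gitignore-style rules with more cases. The function names and the @example/... groups are hypothetical. The one rule taken from GitHub's documented behaviour is that the last matching pattern wins.

```python
from fnmatch import fnmatch


def parse_codeowners(text):
    """Return (pattern, owners) rules in file order, skipping comments and blanks."""
    rules = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        pattern, *owners = line.split()
        rules.append((pattern, frozenset(owners)))
    return rules


def pattern_matches(pattern, path):
    """Simplified matcher (assumption): directory prefixes and fnmatch globs only."""
    pat = pattern.lstrip("/")
    if pat.endswith("/"):
        return path.startswith(pat)
    return fnmatch(path, pat) or path.startswith(pat + "/")


def owners_for(path, rules):
    """The last matching rule takes precedence."""
    for pattern, owners in reversed(rules):
        if pattern_matches(pattern, path):
            return owners
    return frozenset()


def groups_to_files(paths, rules):
    """Summary table: owner group -> changed files that pull it in."""
    table = {}
    for path in paths:
        for owner in owners_for(path, rules):
            table.setdefault(owner, []).append(path)
    return table


# Hypothetical CODEOWNERS content and changed files, for illustration only.
SAMPLE_CODEOWNERS = """
# catch-all rule first, more specific rules later
* @example/maintainers
megatron/core/ @example/core
examples/ @example/examples @example/docs
"""
SAMPLE_PATHS = ["megatron/core/a.py", "examples/run.py", "tools/x.py"]

rules = parse_codeowners(SAMPLE_CODEOWNERS)
summary = groups_to_files(SAMPLE_PATHS, rules)
distinct_group_count = len(summary)
```

Here `summary` is the per-group table from step 4 and `distinct_group_count` is the total from step 5.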

2. Propose a split that minimizes reviewer groups per PR

The primary optimization goal: minimize the number of CODEOWNERS reviewer groups required for each resulting PR.
Strategy:
  1. Cluster files by their CODEOWNERS groups. Files owned by the same set of groups naturally belong together.
  2. Identify the largest cluster — this becomes the first (and usually largest) PR.
  3. Remaining files form one or more additional PRs, each ideally requiring only one or two reviewer groups.
  4. If a split creates a dependency (e.g., PR B uses symbols renamed in PR A), the dependent PR must be merged after the first. Note this explicitly.
  5. Each PR must be independently mergeable to main — no broken imports, no missing symbols. Backward-compatible aliases and re-export stubs in the first PR can make this possible.
Present the proposed split as a table:
  • PR name/description
  • Files included
  • CODEOWNERS groups required
  • Dependencies on other PRs (if any)
Wait for user approval before proceeding.
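
The clustering and ordering strategy above could look roughly like this in Python. It assumes the per-file owner sets have already been computed. The helper names, file paths, and group names are hypothetical, and the merge order uses the standard library's graphlib (Python 3.9+), not anything specific to this workflow.

```python
from collections import defaultdict
from graphlib import TopologicalSorter


def cluster_by_owner_set(file_owners):
    """Group files that share exactly the same owner set, largest cluster first."""
    clusters = defaultdict(list)
    for path, owners in file_owners.items():
        clusters[frozenset(owners)].append(path)
    return sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)


def merge_order(depends_on):
    """depends_on maps each PR name to the set of PRs it must be merged after."""
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical input: changed file -> owner groups required to review it.
file_owners = {
    "megatron/core/a.py": {"@example/core"},
    "megatron/core/b.py": {"@example/core"},
    "examples/run.py": {"@example/examples"},
}

clusters = cluster_by_owner_set(file_owners)
order = merge_order({"PR A": set(), "PR B": {"PR A"}})
```

The first entry of `clusters` is the candidate for the first PR, and `order` lists a dependent PR after the PR it relies on, which is the dependency that the proposal table should state explicitly.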

3. Execute the split (after user approval)

For each new PR:
  1. Create a new branch from the appropriate base (main, or a dependency PR's branch).
  2. Extract the relevant changes:
    git diff upstream/main..<source-branch> -- <file paths> | git apply
  3. Stage, commit with a clear message, and push to the user's fork.
  4. Create the PR as a draft (per repo contributing guidelines).
  5. If the original PR needs to be narrowed in scope, confirm with the user before force-pushing.
  6. Report all PR URLs when done.
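
For scripting the extraction step, a small Python wrapper around the same git pipeline might look like this. It is a sketch under assumptions: a remote named upstream exists (as in the command above), the working directory is the checkout of the new branch, and the function names are hypothetical. It stops rather than applying an empty diff, since an empty diff usually means the paths were wrong.

```python
import subprocess


def scoped_diff_command(source_branch, paths, base="upstream/main"):
    """Build the file-scoped `git diff` command for the given paths."""
    return ["git", "diff", f"{base}..{source_branch}", "--", *paths]


def apply_scoped_diff(source_branch, paths, base="upstream/main"):
    """Pipe the file-scoped diff into `git apply` in the current working tree."""
    diff = subprocess.run(
        scoped_diff_command(source_branch, paths, base),
        check=True,
        capture_output=True,
    ).stdout
    if not diff:
        raise ValueError(f"no changes found for paths: {paths}")
    # Equivalent to: git diff <base>..<source-branch> -- <paths> | git apply
    subprocess.run(["git", "apply"], input=diff, check=True)
```

A call such as `apply_scoped_diff("<source-branch>", ["megatron/core/"])` would then be followed by the usual stage, commit, and push to the fork.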

Important guidelines

  • Always create PRs as drafts and push to the user's fork, never directly to upstream.
  • Backward-compatible changes (aliases, re-exports, deprecation shims) should go in the first PR so subsequent PRs can depend on them.
  • Test files should go with the production code they test, not in a separate PR.
  • Prefer a single clean commit per split PR over replaying the original commit history.
  • If a file is hard to categorize (e.g., it touches two groups), ask the user which PR it should go in.
  • If the current GitHub user is not the author of the original PR, each new PR's description must explicitly credit the original author (e.g., "Original changes by @<author> in #<number>").
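
As an illustration of the backward-compatible changes mentioned above, a deprecation shim in the first PR might look like the following. Both function names are hypothetical; the point is only that the old name keeps working and warns, so a later PR can migrate callers without breaking main in between.

```python
import warnings


def build_optimizer_config(lr: float) -> dict:
    """New name introduced by the first PR (hypothetical)."""
    return {"lr": lr}


def get_optimizer_config(lr: float) -> dict:
    """Old name kept as a deprecation shim so existing callers keep working."""
    warnings.warn(
        "get_optimizer_config is deprecated; use build_optimizer_config",
        DeprecationWarning,
        stacklevel=2,
    )
    return build_optimizer_config(lr)
```

Once every dependent PR has merged and no callers of the old name remain, the shim can be removed in a follow-up change.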