mcore-create-issue

Triage CI Failure into a GitHub Issue

Investigate a failing GitHub Actions job, extract the root cause, and file a well-structured bug issue against `NVIDIA/Megatron-LM`.

Workflow

1. Parse the URL

The argument is a GitHub Actions URL. It will be one of:
  • Job URL:
    https://github.com/<owner>/<repo>/actions/runs/<run_id>/job/<job_id>
  • Run URL:
    https://github.com/<owner>/<repo>/actions/runs/<run_id>
Extract `run_id` and, if present, `job_id`.
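
The extraction can be sketched in shell as follows. This is a minimal illustration, not part of the original workflow: `parse_actions_url` is a hypothetical helper name, and it assumes the URL has exactly one of the two shapes listed above.

```bash
# Sketch: split a GitHub Actions URL into run_id and (optional) job_id.
# parse_actions_url is a hypothetical helper name, not part of gh.
parse_actions_url() {
  url=$1
  # Capture the digits after /actions/runs/
  run_id=$(printf '%s\n' "$url" \
    | sed -nE 's#^https://github\.com/[^/]+/[^/]+/actions/runs/([0-9]+).*#\1#p')
  # Capture the digits after /job/; stays empty for a run URL
  job_id=$(printf '%s\n' "$url" \
    | sed -nE 's#^https://github\.com/[^/]+/[^/]+/actions/runs/[0-9]+/job/([0-9]+).*#\1#p')
  # Succeed only if a run_id was found
  [ -n "$run_id" ]
}
```

After a successful call, `run_id` and `job_id` are set as shell variables; an empty `job_id` means only a run URL was given.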

2. Identify failed jobs

  • If a `job_id` was provided, use that job directly.
  • If only a `run_id` was provided, list all failed jobs in the run:

    ```bash
    gh run view <run_id> --repo NVIDIA/Megatron-LM --json jobs \
      --jq '[.jobs[] | select(.conclusion == "failure") | {id: .databaseId, name: .name, url: .url}]'
    ```

    If multiple jobs failed, ask the user which one to triage, or triage all of them if they say so.

3. Fetch the failure logs

For each failed job, retrieve the logs and narrow them down to the failure:

```bash
# Pull the raw log and keep only error-bearing lines
gh api repos/NVIDIA/Megatron-LM/actions/jobs/<job_id>/logs 2>&1 \
  | grep -E "(FAILED|ERROR|\bError\b|assert|Traceback|Exception|##\[error\])" \
  | head -200
```

Also capture the full job name:

```bash
gh run view --job <job_id> --repo NVIDIA/Megatron-LM --json name --jq .name
```

If the grep output is sparse, download the full logs and look for the pytest `FAILURES` section or the last non-zero exit signal.
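
One way to isolate the pytest `FAILURES` section from a saved log is sketched below. It assumes pytest's default section banners (`==== FAILURES ====` and `==== short test summary info ====`); `extract_pytest_failures` is a hypothetical helper name, and the log is read from stdin.

```bash
# Sketch: print the pytest FAILURES section from a job log read on stdin.
# extract_pytest_failures is a hypothetical helper name.
extract_pytest_failures() {
  awk '
    /=+ FAILURES =+/                { printing = 1 }  # section banner starts the block
    /=+ short test summary info =+/ { printing = 0 }  # next banner ends it
    printing                                          # print lines while inside the block
  '
}
```

For example, save the log with `gh api repos/NVIDIA/Megatron-LM/actions/jobs/<job_id>/logs > job.log`, then run `extract_pytest_failures < job.log | head -200`. The patterns are not anchored to the start of the line, so prefixes such as timestamps do not prevent a match.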

4. Resolve the triggering PR and test author

**Triggering PR**: the run's head branch follows the pattern `pull-request/<number>`. Extract it and resolve the PR:

```bash
gh run view <run_id> --repo NVIDIA/Megatron-LM --json headBranch --jq .headBranch
# → e.g. "pull-request/4332"

# Extract PR number and fetch metadata:
gh pr view <pr_number> --repo NVIDIA/Megatron-LM --json number,title,url \
  --jq '{number: .number, title: .title, url: .url}'
```

**Test file author**: find the GitHub login of whoever last touched the failing test file. The file may not exist on `main`, so first determine the PR's base branch, then search from there:

```bash
# 1. Get the PR's base branch (e.g. "main", "dev", "release/X.Y")
gh pr view <pr_number> --repo NVIDIA/Megatron-LM --json baseRefName --jq .baseRefName

# 2. Search commits on that base branch
gh api "repos/NVIDIA/Megatron-LM/commits?path=<test-file-path>&sha=<base-branch>&per_page=1" \
  --jq '.[0] | {login: .author.login, name: .commit.author.name, sha: .sha}'
```

If the result is empty (file was introduced by the PR itself), query the PR's commits instead:

```bash
gh api "repos/NVIDIA/Megatron-LM/pulls/<pr_number>/commits" \
  --jq '[.[] | select(.files? // [] | any(.filename == "<test-file-path>"))] | .[0].author.login'
```

As a last resort (for example, if the query above returns `null`), list the PR commits and pick the author of the commit whose message most closely relates to the failing test file.
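
Deriving `<pr_number>` from the head branch can be sketched as below. `pr_number_from_branch` is a hypothetical helper name; the sketch assumes that a branch not matching `pull-request/<number>` means there is no triggering PR to resolve.

```bash
# Sketch: derive the PR number from a head branch such as "pull-request/4332".
# pr_number_from_branch is a hypothetical helper name.
pr_number_from_branch() {
  case "$1" in
    pull-request/*[!0-9]* | pull-request/) return 1 ;;     # suffix is empty or not purely numeric
    pull-request/*) printf '%s\n' "${1#pull-request/}" ;;  # strip the prefix, keep the number
    *) return 1 ;;                                         # not a pull-request branch
  esac
}
```

A non-zero exit status signals that the PR lookup should be skipped rather than guessed.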

5. Extract the root cause

From the logs, identify:
  • Failed test(s): lines matching `FAILED tests/...::...` give the exact pytest node IDs.
  • Error message: the assertion failure, exception type, or first meaningful traceback frame; keep it under ~30 lines.
  • Job name: the GitHub Actions job name (e.g. `tests/unit_tests/transformer/moe/**/*.py - latest`).
  • Run / job URLs and PR URL: for linking in the issue.
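
Collecting the node IDs from the `FAILED tests/...::...` lines might look like the sketch below. `failed_node_ids` is a hypothetical helper name; it reads log lines on stdin and assumes node IDs contain no spaces, so a parametrized ID with a space in it would be cut short.

```bash
# Sketch: list unique pytest node IDs from "FAILED tests/...::..." lines on stdin.
# failed_node_ids is a hypothetical helper name.
failed_node_ids() {
  grep -oE 'FAILED tests/[^ ]+::[^ ]+' \
    | sed 's/^FAILED //' \
    | sort -u
}
```

The `sort -u` step drops duplicates, which is useful when the same summary line appears more than once in the log.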

6. Check for duplicate issues

Search for open issues that already cover the same test:

```bash
gh issue list --repo NVIDIA/Megatron-LM \
  --state open \
  --search "<failed-test-filename>" \
  --json number,title,url \
  --limit 10
```

  • If a matching open issue exists, do not create a new one. Report the existing issue to the user and stop.
  • If no match is found, proceed to file a new issue.
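
The `<failed-test-filename>` search term can be derived from a pytest node ID, as in this sketch. `failed_test_filename` is a hypothetical helper name, and using the bare file name as the search term is an assumption about what the placeholder means.

```bash
# Sketch: turn a pytest node ID into the file name used as the search term.
# failed_test_filename is a hypothetical helper name; the body runs in a subshell
# so its variables do not leak into the caller.
failed_test_filename() (
  node_id=$1
  path=${node_id%%::*}          # drop the "::Class::test" suffix, keep the file path
  printf '%s\n' "${path##*/}"   # keep only the last path component
)
```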

7. Create the issue

Pass `--assignee <test-author-login>` to assign the issue to the test file's author. Include the triggering PR URL in the issue body.

```bash
gh issue create \
  --repo NVIDIA/Megatron-LM \
  --title "🐛 CI failure: <failed-test-node-id>" \
  --label "bug" \
  --assignee "<test-author-login>" \
  --body "..."
```

Use the bug-report template body structure:

````markdown
**Describe the bug**

CI test `<failed-test-node-id>` failed in job [`<job_name>`](<job_url>).
Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.

**Failing run**

| Field | Value |
|-------|-------|
| PR    | [#<pr_number>: <pr_title>](<pr_url>) |
| Run   | [<run_id>](<run_url>) |
| Job   | [<job_name>](<job_url>) |

**Error**

```
<core error message / traceback, 30 lines max>
```

**Steps/Code to reproduce bug**

Re-run the failing CI job linked above, or locally inside the dev container:

```bash
pytest <failed-test-node-id>
```

**Additional context**

Triaged automatically via `/triage-issue`.
````

If multiple tests failed in the same job, list each one as a separate bullet under "Describe the bug" and include the combined error snippets. Assign the issue to the author of whichever test file appears first in the failure list.
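
Filling the template from values gathered in the earlier steps could be sketched as follows. `render_issue_body` and the shell variable names (`node_id`, `job_name`, `job_url`, `pr_number`, `pr_title`, `pr_url`, `run_id`, `run_url`, `error_snippet`) are hypothetical; the sketch covers only the single-test case.

```bash
# Sketch: render the issue body from shell variables and print it on stdout.
# render_issue_body and the variable names are hypothetical; set them beforehand.
render_issue_body() {
  fence=$(printf '\140\140\140')   # three backticks, built from octal escapes
  cat <<EOF
**Describe the bug**

CI test \`${node_id}\` failed in job [\`${job_name}\`](${job_url}).
Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.

**Failing run**

| Field | Value |
|-------|-------|
| PR    | [#${pr_number}: ${pr_title}](${pr_url}) |
| Run   | [${run_id}](${run_url}) |
| Job   | [${job_name}](${job_url}) |

**Error**

${fence}
${error_snippet}
${fence}

**Steps/Code to reproduce bug**

Re-run the failing CI job linked above, or locally inside the dev container:

${fence}bash
pytest ${node_id}
${fence}

**Additional context**

Triaged automatically via \`/triage-issue\`.
EOF
}
```

Writing the result to a file (`render_issue_body > body.md`) and passing it with `gh issue create --body-file body.md` in place of `--body "..."` avoids quoting problems with multi-line tracebacks.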

8. Report back to the user

Print the URL of the newly created issue (or the duplicate, if found) so the user can review or share it.

Important guidelines

  • Never create an issue if a duplicate already exists; link the existing one instead.
  • Always include the triggering PR link in the issue body.
  • Always assign the issue to the test file's most recent author. If the author lookup fails (e.g. the commit was made by a bot or the login is unavailable), skip `--assignee` and note it in the "Additional context" section.
  • Keep the error snippet concise (≤30 lines). Truncate long tracebacks and note that the full log is available via the job URL.
  • Do not guess the root cause; quote the actual log output verbatim.
  • If the job is still in progress or the logs are unavailable, say so and ask the user to retry once the run completes.
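
The 30-line cap on the error snippet can be enforced mechanically, as in this sketch. `truncate_snippet` is a hypothetical helper name and the wording of the truncation note is illustrative; the job URL is passed as the first argument and the snippet is read from stdin.

```bash
# Sketch: cap an error snippet at 30 lines and flag the truncation.
# truncate_snippet is a hypothetical helper name.
truncate_snippet() {
  awk -v max=30 -v url="$1" '
    NR <= max { print }   # pass through the first 30 lines verbatim
    END {
      if (NR > max)
        printf "... (truncated, %d more lines; full log: %s)\n", NR - max, url
    }
  '
}
```

Because the kept lines are passed through unchanged, the snippet stays a verbatim quote of the log.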