create-issue

Triage CI Failure into a GitHub Issue

Investigate a failing GitHub Actions job, extract the root cause, and file a well-structured bug issue against `NVIDIA/Megatron-LM`.

Workflow

1. Parse the URL

The argument is a GitHub Actions URL. It will be one of:
  • Job URL:
    https://github.com/<owner>/<repo>/actions/runs/<run_id>/job/<job_id>
  • Run URL:
    https://github.com/<owner>/<repo>/actions/runs/<run_id>
Extract `run_id` and, if present, `job_id`.
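
The extraction can be sketched in shell as follows. The function name `parse_actions_url` and the output variables `RUN_ID` and `JOB_ID` are illustrative assumptions, not part of the workflow; the sketch only accepts the two URL shapes listed above, optionally followed by a query string or fragment.

```bash
# Hypothetical helper: split a GitHub Actions URL into RUN_ID and JOB_ID.
# JOB_ID is left empty for run URLs. Returns non-zero for any other URL.
parse_actions_url() {
  # Drop any query string or fragment, then a trailing slash.
  _url=${1%%\?*}
  _url=${_url%%#*}
  _url=${_url%/}
  # Capture the run id from either URL shape.
  RUN_ID=$(printf '%s\n' "$_url" | sed -nE \
    's#^https://github\.com/[^/]+/[^/]+/actions/runs/([0-9]+)(/job/[0-9]+)?$#\1#p')
  # Capture the job id only when the URL has a /job/<job_id> suffix.
  JOB_ID=$(printf '%s\n' "$_url" | sed -nE \
    's#^https://github\.com/[^/]+/[^/]+/actions/runs/[0-9]+/job/([0-9]+)$#\1#p')
  [ -n "$RUN_ID" ]
}
```

After a successful call, an empty `JOB_ID` signals that the failed jobs still have to be listed from the run.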

2. Identify failed jobs

  • If a `job_id` was provided, use that job directly.
  • If only a `run_id` was provided, list all failed jobs in the run:

    ```bash
    gh run view <run_id> --repo NVIDIA/Megatron-LM --json jobs \
      --jq '[.jobs[] | select(.conclusion == "failure") | {id: .databaseId, name: .name, url: .url}]'
    ```

    If multiple jobs failed, ask the user which one to triage, or triage all of them if they say so.

3. Fetch the failure logs

For each failed job, retrieve the logs and narrow them down to the failure:

```bash
# Pull the raw log and keep only error-bearing lines
gh api repos/NVIDIA/Megatron-LM/actions/jobs/<job_id>/logs 2>&1 \
  | grep -E "(FAILED|ERROR|\bError\b|assert|Traceback|Exception|##\[error\])" \
  | head -200
```

Also capture the full job name:

```bash
gh run view --job <job_id> --repo NVIDIA/Megatron-LM --json name --jq .name
```

If the grep output is sparse, download the full logs and look for the pytest `FAILURES` section or the last non-zero exit signal.
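
When the log has been saved to a file, the pytest `FAILURES` section can be cut out with a small filter. This is a minimal sketch that assumes pytest's default section banners (`=== FAILURES ===` and `=== short test summary info ===`); the function name `pytest_failures_section` is hypothetical, and custom reporters or plugins may print different banners.

```bash
# Hypothetical helper: print the pytest FAILURES section of a saved job log.
# $1 = path to the log file. Printing starts at the FAILURES banner and stops
# just before the "short test summary info" banner.
pytest_failures_section() {
  awk '
    /[=]+ FAILURES [=]+/                { printing = 1 }
    /[=]+ short test summary info [=]+/ { printing = 0 }
    printing
  ' "$1"
}
```

A typical use would be to redirect the `gh api .../logs` output from the command above into a file and pass that file to the function.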

4. Resolve the triggering PR and test author

**Triggering PR**: the run's head branch follows the pattern `pull-request/<number>`. Extract it and resolve the PR:

```bash
gh run view <run_id> --repo NVIDIA/Megatron-LM --json headBranch --jq .headBranch
# → e.g. "pull-request/4332"

# Extract PR number and fetch metadata:
gh pr view <pr_number> --repo NVIDIA/Megatron-LM --json number,title,url \
  --jq '{number: .number, title: .title, url: .url}'
```

**Test file author**: find the GitHub login of whoever last touched the failing
test file. The file may not exist on `main`, so first determine the PR's base
branch, then search from there:

```bash
# 1. Get the PR's base branch (e.g. "main", "dev", "release/X.Y")
gh pr view <pr_number> --repo NVIDIA/Megatron-LM --json baseRefName --jq .baseRefName

# 2. Search commits on that base branch
gh api "repos/NVIDIA/Megatron-LM/commits?path=<test-file-path>&sha=<base-branch>&per_page=1" \
  --jq '.[0] | {login: .author.login, name: .commit.author.name, sha: .sha}'
```

If the result is empty (file was introduced by the PR itself), query the PR's
commits instead:

```bash
gh api "repos/NVIDIA/Megatron-LM/pulls/<pr_number>/commits" \
  --jq '[.[] | select(.files? // [] | any(.filename == "<test-file-path>"))] | .[0].author.login'
```

As a last resort (for example, if the query above returns `null`), list the PR commits and pick the author of the commit whose message most closely relates to the failing test file.
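
Turning the head branch into a PR number is plain string handling. The sketch below assumes the `pull-request/<number>` pattern described above; the function name `pr_number_from_branch` is hypothetical.

```bash
# Hypothetical helper: turn a head branch such as "pull-request/4332" into "4332".
# Prints nothing and returns 1 if the branch does not follow that pattern.
pr_number_from_branch() {
  case "$1" in
    pull-request/*) _n=${1#pull-request/} ;;
    *) return 1 ;;
  esac
  # Accept digits only, so "pull-request/abc" is rejected.
  case "$_n" in
    ''|*[!0-9]*) return 1 ;;
  esac
  printf '%s\n' "$_n"
}
```

A non-zero return is a signal that the run was not triggered from a PR branch, in which case there is no PR to link.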

5. Extract the root cause

From the logs, identify:
  • Failed test(s): lines matching `FAILED tests/...::...` give the exact pytest node IDs.
  • Error message: the assertion failure, exception type, or first meaningful traceback frame; keep it under ~30 lines.
  • Job name: the GitHub Actions job name (e.g. `tests/unit_tests/transformer/moe/**/*.py - latest`).
  • Run / job URLs and PR URL: for linking in the issue.
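
The failed node IDs can be pulled out of the log mechanically. This is a sketch under the assumption that node IDs contain no spaces and that each appears on a `FAILED tests/...` line, as in pytest's default short summary; the function name `failed_node_ids` is hypothetical.

```bash
# Hypothetical helper: read a log on stdin and print the unique pytest node IDs
# found on "FAILED tests/..." lines, in order of first appearance.
failed_node_ids() {
  grep -oE 'FAILED tests/[^ ]+' \
    | sed 's/^FAILED //' \
    | awk '!seen[$0]++'
}
```

Because `grep -o` prints only the matching part, timestamps that GitHub Actions prepends to each log line do not get in the way.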

6. Check for duplicate issues

Search for open issues that already cover the same test:

```bash
gh issue list --repo NVIDIA/Megatron-LM \
  --state open \
  --search "<failed-test-filename>" \
  --json number,title,url \
  --limit 10
```

  • If a matching open issue exists, do not create a new one. Report the existing issue to the user and stop.
  • If no match is found, proceed to file a new issue.
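
The stop-or-proceed decision can be wrapped in a small predicate. In this sketch the function name `has_open_duplicate` and the `GH` override variable (useful for dry runs) are assumptions; it only counts search hits, so a hit still needs a quick look to confirm it really covers the same test.

```bash
# Hypothetical helper: succeed when at least one open issue matches the search term.
# $1 = search term, e.g. the failing test's file name.
# GH can be set to a stand-in command for dry runs; it defaults to the real gh CLI.
has_open_duplicate() {
  _count=$("${GH:-gh}" issue list --repo NVIDIA/Megatron-LM \
    --state open --search "$1" --json number --jq 'length')
  [ "${_count:-0}" -gt 0 ]
}
```

A caller might use it as `if has_open_duplicate "<failed-test-filename>"; then ... fi` and report the existing issue instead of filing a new one.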

7. Create the issue

Pass `--assignee <test-author-login>` to assign the issue to the test file's author. Include the triggering PR URL in the issue body.

```bash
gh issue create \
  --repo NVIDIA/Megatron-LM \
  --title "🐛 CI failure: <failed-test-node-id>" \
  --label "bug" \
  --assignee "<test-author-login>" \
  --body "..."
```

Use the bug-report template body structure:

````markdown
**Describe the bug**

CI test `<failed-test-node-id>` failed in job [`<job-name>`](<job-url>).
Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.

**Failing run**

| Field | Value |
|-------|-------|
| PR    | [#<pr_number>: <pr_title>](<pr_url>) |
| Run   | [<run_id>](<run_url>) |
| Job   | [<job_name>](<job_url>) |

**Error**

```
<core error message / traceback, 30 lines max>
```

**Steps/Code to reproduce bug**

Re-run the failing CI job linked above, or locally inside the dev container:

```bash
pytest <failed-test-node-id>
```

**Additional context**

Triaged automatically via `/triage-issue`.
````

If multiple tests failed in the same job, list each one as a separate bullet
under "Describe the bug" and include the combined error snippets. Assign the
issue to the author of whichever test file appears first in the failure list.
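
The `gh issue create` call in this step can be wrapped so that `--assignee` is only passed when an author login was actually found. This is a sketch, not the workflow's required form: the function name `create_ci_issue` and the `GH` override variable are assumptions, and it reads the body from a file with `--body-file` instead of an inline `--body` string, which avoids shell quoting problems with backticks in the template.

```bash
# Hypothetical wrapper around `gh issue create`.
# $1 = issue title, $2 = path to a file holding the issue body,
# $3 = assignee login (optional; pass "" when the author lookup failed).
# GH can be set to a stand-in command for dry runs; it defaults to the real gh CLI.
create_ci_issue() {
  _title=$1
  _body_file=$2
  _assignee=${3:-}
  # Start with the arguments that are always present.
  set -- issue create \
    --repo NVIDIA/Megatron-LM \
    --title "$_title" \
    --label bug \
    --body-file "$_body_file"
  # Add --assignee only when a login is available.
  if [ -n "$_assignee" ]; then
    set -- "$@" --assignee "$_assignee"
  fi
  "${GH:-gh}" "$@"
}
```

Running it with `GH=echo` prints the assembled arguments instead of contacting GitHub, which is a cheap way to check the command before filing anything.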

8. Report back to the user

Print the URL of the newly created issue (or the duplicate, if found) so the user can review or share it.

Important guidelines

  • Never create an issue if a duplicate already exists — link the existing one instead.
  • Always include the triggering PR link in the issue body.
  • Always assign the issue to the test file's most recent author. If the author lookup fails (e.g. the commit was made by a bot or the login is unavailable), skip `--assignee` and note it in the "Additional context" section.
  • Keep the error snippet concise (≤30 lines). Truncate long tracebacks and note that the full log is available via the job URL.
  • Do not guess the root cause — quote the actual log output verbatim.
  • If the job is still in progress or the logs are unavailable, say so and ask the user to retry once the run completes.
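
The snippet-length guideline can be enforced with a small filter. This sketch keeps the quoted log lines verbatim and appends a one-line note when anything was cut; the function name `truncate_snippet` and the wording of the note are assumptions. The note itself is an extra line, so pass 29 as the limit if it has to count toward the 30.

```bash
# Hypothetical helper: keep at most $1 lines of stdin (default 30) and, if
# anything was cut, append a note pointing at the job URL given as $2.
truncate_snippet() {
  awk -v max="${1:-30}" -v url="${2:-}" '
    NR <= max { print }
    END {
      if (NR > max) {
        printf "... (%d more lines truncated; full log: %s)\n", NR - max, url
      }
    }
  '
}
```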