ci-triage

Shared logic for classifying a failed CI workflow run and recording the outcome. Used by humans (or Claude in an interactive session) when triaging a red `main` workflow or a red check on an open PR. The latter goes through the consumer repo's `<repo>-pr-lifecycle` skill (e.g. `onsager-pr-lifecycle`, `lean-spec-pr-lifecycle`, `duhem-pr-lifecycle`).

This skill owns the taxonomy, the de-dup rules for the `main-red` issue, and the issue template. Workflow-specific reproduction steps live in the consumer repo's `<repo>-pr-lifecycle` skill (for build-tool-specific detail) and in `web-testing` (for e2e).

Taxonomy

Every failure lands in exactly one bucket. Be explicit: "unclear" is not a bucket, `needs-human` is.

| Bucket | Signal | Default action |
| --- | --- | --- |
| `regression` | Reproduces deterministically on `main` at HEAD | File/update `main-red`, @-mention suspect author |
| `flake` | Same workflow passed on the previous main commit without code change | Comment on the existing `main-red` issue if one is open; otherwise skip |
| `infra` | Postgres service didn't come up, rustup 403, pnpm registry down | File/update `main-red` labelled `infra`; do not @-mention authors |
| `needs-human` | Logs truncated, auth failed, classification genuinely ambiguous | Open a `main-red` issue with raw log excerpt, label `needs-human` |

Do not invent a fifth bucket. If the signal you see fits nothing above, it's `needs-human`.

Suspect commit identification

The GitHub `workflow_run` payload includes `head_sha`. That's the commit that triggered this run; on a `push: main` event, it is the merge commit.

  1. Fetch the commit via `mcp__github__get_commit`.
  2. If the commit message matches `Merge pull request #N`, the suspect PR is `#N`; its author is the suspect author.
  3. Otherwise (direct push to main, squash-merge), use the commit author directly.

Never blame more than one commit per failure. If the previous main commit's CI was also red, link that issue rather than opening a new one.
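
The three steps above can be sketched as a small pure function. This is a minimal sketch, not the skill's implementation: `identify_suspect` and the `Suspect` dataclass are hypothetical names, and the commit message and author login are assumed to have been fetched already (for example via `mcp__github__get_commit`). For a merge commit the PR author still has to be looked up separately, so the sketch leaves it unset.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Subject line of a merge commit: "Merge pull request #N ..."
MERGE_SUBJECT = re.compile(r"^Merge pull request #(\d+)\b")


@dataclass(frozen=True)
class Suspect:
    sha: str
    pr_number: Optional[int]  # None for direct pushes and squash-merges
    author: Optional[str]     # None until the PR author has been looked up


def identify_suspect(head_sha: str, message: str, commit_author: str) -> Suspect:
    """Pick exactly one suspect for a failed run (hypothetical helper)."""
    subject = message.splitlines()[0] if message else ""
    match = MERGE_SUBJECT.match(subject)
    if match:
        # Merge commit: blame the PR; the caller resolves the PR author.
        return Suspect(sha=head_sha, pr_number=int(match.group(1)), author=None)
    # Direct push or squash-merge: use the commit author directly.
    return Suspect(sha=head_sha, pr_number=None, author=commit_author)
```

Returning a single `Suspect` value, rather than a list, keeps the "never blame more than one commit" rule structural instead of a convention the caller has to remember.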

The rolling `main-red` issue

One open `main-red` issue at a time. If main is broken for three days, that's one issue accumulating comments, not twelve.

Before filing:

  1. `mcp__github__list_issues` with `labels: main-red, state: open`.
  2. If one exists, append a comment: Run #<run-id> also failed. Workflow `<name>`, bucket `<bucket>`, suspect <sha-short> (#<pr-or-none>, @<author>). <one-line-cause>.
  3. Only open a new issue if none is open. Title: `main is red: <workflow> (<bucket>)`. Labels: `main-red`, plus `infra` or `needs-human` if applicable.

When main goes green again (the next successful run on the same workflow), close the issue with a comment naming the green run id. This close step is manual.
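
The de-dup rule above amounts to a single branch on whether an open `main-red` issue already exists. A minimal sketch, assuming the open issues have already been listed (for example with `mcp__github__list_issues`) and are passed in as plain dicts with a `number` key. `plan_main_red_action`, `Failure`, and the returned dict shape are hypothetical, chosen only for illustration. Dropping the `@` for `infra` and `flake` is also an assumption, made to keep the comment consistent with the no-mention rule for those buckets.

```python
from dataclasses import dataclass
from typing import Optional

NO_MENTION_BUCKETS = ("infra", "flake")  # assumption: never @-mention for these


@dataclass(frozen=True)
class Failure:
    run_id: int
    workflow: str
    bucket: str          # regression | flake | infra | needs-human
    sha_short: str
    pr: Optional[int]    # None when there is no PR
    author: str
    cause: str           # one-line cause


def plan_main_red_action(failure: Failure, open_issues: list[dict]) -> dict:
    """Decide whether to comment on the rolling issue or open a new one."""
    if open_issues:
        # An issue is already open: append a comment instead of filing another.
        pr_ref = failure.pr if failure.pr is not None else "none"
        who = failure.author if failure.bucket in NO_MENTION_BUCKETS else f"@{failure.author}"
        body = (
            f"Run #{failure.run_id} also failed. Workflow `{failure.workflow}`, "
            f"bucket `{failure.bucket}`, suspect {failure.sha_short} "
            f"(#{pr_ref}, {who}). {failure.cause}."
        )
        return {"action": "comment", "issue": open_issues[0]["number"], "body": body}
    # No open issue: file one, adding the bucket label only where it applies.
    labels = ["main-red"]
    if failure.bucket in ("infra", "needs-human"):
        labels.append(failure.bucket)
    return {
        "action": "open",
        "title": f"main is red: {failure.workflow} ({failure.bucket})",
        "labels": labels,
    }
```

The function only plans the action; posting the comment or creating the issue stays with whichever GitHub tool the caller uses.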

Issue body template

```markdown
**Workflow**: <workflow-name>
**Run**: <run-url>
**First failed step**: <step-name>
**Bucket**: <regression|flake|infra|needs-human>
**Suspect**: <sha-short> <commit-subject> (PR #<n>, @<author>)

## Failure excerpt

<last 30 lines of the failing step, or the ripgrep-extracted error block>

## Reproduction

<one of:>
- Deterministic: `<exact command from the workflow yaml>`
- Flake: passed on <prev-sha-short>; rerun button: <run-url>/attempts/2
- Infra: <service name> - <symptom>
- Needs human: <why the logs are ambiguous>

## Next action

<one line: "revert #N and reland", "rerun", "fix <specific thing>", "human eyes">
```

Keep the excerpt tight. Dumping the full log helps nobody.
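
Trimming the excerpt to the last 30 lines of the failing step is easy to automate. A minimal sketch, assuming the step's log text is already in hand; `failure_excerpt` is a hypothetical helper name, and the default of 30 lines is taken from the template above.

```python
def failure_excerpt(step_log: str, max_lines: int = 30) -> str:
    """Return at most the last `max_lines` lines of a failing step's log."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    # Drop trailing newlines so they do not count against the line budget.
    lines = step_log.rstrip("\n").splitlines()
    return "\n".join(lines[-max_lines:])
```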

Reproducing locally

A human invoking this skill via the consumer repo's `<repo>-pr-lifecycle` skill should reproduce before filing, using the commands in that skill's CI-triage section. For a `main` failure caught from outside a PR, check out `main` at the suspect SHA and run the same commands locally before filing.

For `e2e` failures specifically, delegate classification to `web-testing`'s triage mode, which handles the regression-vs-flake call for browser-driven tests (the ambiguous case).

Log access

`WebFetch` cannot read authenticated GitHub Actions logs (403). The GitHub MCP gives you:

  • `mcp__github__pull_request_read` with `method: get_check_runs`: step names, status, timings (no log body).
  • The workflow run's `jobs_url`, reached via `mcp__github__get_commit` + `check_suite` traversal: same metadata.

Log bodies are not reliably accessible from the GitHub MCP. When the log body is unavailable, classify from step names + exit codes + the workflow yaml, and bias toward `needs-human` rather than guessing.

Flake-detection heuristic

A failure is `flake` only if both hold:

  1. The same workflow on the previous main commit passed (check via `mcp__github__list_commits` + check runs on the prior sha).
  2. The failing step's logs do not contain a symbol name, file path, or assertion message that appears in the suspect commit's diff.

One of those alone is not enough. A deterministic regression can pass on the prior commit; a real flake can mention a touched file by coincidence.
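
The two-condition rule above can be written as a conjunction. A minimal sketch under stated assumptions: `is_flake` and `mentions_diff` are hypothetical names, the caller has already determined whether the previous main commit passed, and `diff_tokens` is a set of symbol names, file paths, and assertion messages pulled from the suspect commit's diff (how those are extracted is left open here). When the log body is unavailable the second condition cannot be confirmed, so the sketch declines to call the failure a flake.

```python
from typing import Optional


def mentions_diff(step_log: str, diff_tokens: set[str]) -> bool:
    """True if the log contains any symbol, path, or assertion text from the diff."""
    return any(token and token in step_log for token in diff_tokens)


def is_flake(
    previous_commit_passed: bool,
    step_log: Optional[str],
    diff_tokens: set[str],
) -> bool:
    """Apply both flake conditions; either one alone is not enough."""
    if step_log is None:
        # No log body: condition 2 is unverifiable, so do not guess "flake".
        return False
    return previous_commit_passed and not mentions_diff(step_log, diff_tokens)
```

A `False` result does not mean `regression`; it only means the failure has not earned the `flake` label and still needs to be classified into one of the other buckets.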

Constraints

  • Never open a PR from this skill. Triage is read-only on the codebase.
  • Never @-mention for `infra` or `flake` buckets; alert fatigue kills the signal.
  • Never close a `main-red` issue without a green run id to cite.
  • Scope: any repo driven by GitHub Actions. The taxonomy and rolling `main-red` pattern are repo-agnostic; the consumer repo's CLAUDE.md can override triggers and label conventions.

Relationship to other surfaces

| Surface | Role |
| --- | --- |
| `<repo>-pr-lifecycle` (consumer-repo skill, e.g. `onsager-pr-lifecycle`) | Interactive caller; humans use this when triaging a red PR check. |
| `web-testing` | Delegated to for `e2e` workflow classification. |