ci-monitoring

CI Monitoring

Overview

Monitor CI pipeline and resolve failures until green.
CRITICAL: CI is validation, not discovery.
If CI finds a bug you didn't find locally, your local testing was insufficient.
Before blaming CI, ask yourself:
  1. Did you run all tests locally?
  2. Did you test against local services (postgres, redis)?
  3. Did you run the same checks CI runs?
  4. Did you run integration tests, not just unit tests with mocks?
CI should only fail for: environment differences, flaky tests, or infrastructure issues—never for bugs you could have caught locally.
Core principle: CI failures are blockers. But they should never be surprises.
Announce at start: "I'm monitoring CI and will resolve any failures."

The CI Loop

PR Created
       │
       ▼
┌─────────────┐
│ Wait for CI │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ CI Status?  │
└──────┬──────┘
   ┌───┴───┐
   │       │
 Green   Red/Failed
   │       │
   ▼       ▼
┌─────────┐  ┌─────────────┐
│ MERGE   │  │ Diagnose    │
│ THE PR  │  │ failure     │
└────┬────┘  └──────┬──────┘
     │              │
     ▼              ▼
┌─────────┐  ┌─────────────┐
│ Continue│  │ Fixable?    │
│ to next │  └──────┬──────┘
│ issue   │         │
└─────────┘    ┌────┴────┐
               │         │
              Yes        No
               │         │
               ▼         ▼
          ┌─────────┐  ┌─────────────┐
          │ Fix and │  │ Document as │
          │ push    │  │ unresolvable│
          └────┬────┘  └─────────────┘
               └────► Back to "Wait for CI"
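
As a rough sketch of the top of this loop, the function below waits for CI on one PR and prints which branch of the diagram applies. It assumes `gh pr checks --watch` exits non-zero when a check has failed; `next_step` is a hypothetical helper name, not part of this skill.

```bash
# Hypothetical helper: wait for CI on one PR and name the next step.
next_step() {
  local pr="$1"
  # --watch blocks until the checks finish; any non-zero exit is treated as red.
  if gh pr checks "$pr" --watch >/dev/null 2>&1; then
    echo "merge"
  else
    echo "diagnose"
  fi
}
```

Called with a PR number, it prints `merge` or `diagnose`. Anything that makes `gh` exit non-zero, including errors from `gh` itself, lands on the diagnose branch, so the failure still gets looked at.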

CRITICAL: Green CI = Merge Immediately

When CI passes, you MUST merge the PR and continue working.
Do NOT:
  • Stop and report "CI is green, ready for review"
  • Wait for user confirmation
  • Summarize and ask what to do next
DO:
  • Merge the PR immediately:
    gh pr merge [PR_NUMBER] --squash --delete-branch
  • Mark the linked issue as Done
  • Continue to the next issue in scope
bash
# When CI passes
gh pr merge [PR_NUMBER] --squash --delete-branch

# Update linked issue status
gh issue edit [ISSUE_NUMBER] --remove-label "status:in-review" --add-label "status:done"

# Continue to next issue (do not stop)


**The only exception:** PRs with `do-not-merge` label require explicit user action.
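
One way to honor this exception automatically is to check the PR's labels before merging. The sketch below is an illustration only: `merge_unless_blocked` is a hypothetical name, and it assumes the label is spelled exactly `do-not-merge`.

```bash
# Hypothetical helper: merge unless the PR carries the do-not-merge label.
merge_unless_blocked() {
  local pr="$1"
  # Print one label name per line, then look for an exact match.
  if gh pr view "$pr" --json labels --jq '.labels[].name' | grep -qx 'do-not-merge'; then
    echo "skipped: PR $pr is labeled do-not-merge" >&2
    return 0
  fi
  gh pr merge "$pr" --squash --delete-branch
}
```

The skip message goes to stderr so that stdout carries only the output of the merge itself.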

Checking CI Status

Using GitHub CLI

bash
# Check all CI checks
gh pr checks [PR_NUMBER]

# Watch CI in real-time
gh pr checks [PR_NUMBER] --watch

# Get detailed status
gh pr view [PR_NUMBER] --json statusCheckRollup

Expected Output

All checks were successful
0 failing, 0 pending, 5 passing

CHECKS
✓  build          1m23s
✓  lint           45s
✓  test           3m12s
✓  typecheck      1m05s
✓  security-scan  2m30s
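
When the listing is long, it can help to print only the failing checks. The sketch below assumes a recent `gh` release in which `gh pr checks` accepts `--json` with the `name` and `bucket` fields; `failing_checks` is a hypothetical helper name.

```bash
# Hypothetical helper: print the names of failing checks, one per line.
# Assumes gh supports --json on `pr checks` with `name` and `bucket` fields.
failing_checks() {
  gh pr checks "$1" --json name,bucket \
    --jq '.[] | select(.bucket == "fail") | .name'
}
```

`gh` may still exit non-zero when checks have failed, so read the printed names rather than relying on the exit status of this function.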

Handling Failures

Step 1: Identify the Failure

bash
# Get failed check details
gh pr checks [PR_NUMBER]

# View workflow run logs
gh run view [RUN_ID] --log-failed
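
The run ID for `gh run view` can be looked up rather than copied by hand. A minimal sketch, assuming the PR's branch name is known and that the most recent failed run on that branch is the one of interest; `failed_run_logs` is a hypothetical helper name.

```bash
# Hypothetical helper: show failed-step logs for the latest failed run on a branch.
failed_run_logs() {
  local branch="$1" run_id
  # Ask for the newest failed run only; prints nothing if there is none.
  run_id="$(gh run list --branch "$branch" --status failure --limit 1 \
    --json databaseId --jq '.[].databaseId')"
  if [ -z "$run_id" ]; then
    echo "no failed runs found for branch $branch" >&2
    return 1
  fi
  gh run view "$run_id" --log-failed
}
```

If several workflows failed on the same branch, this shows only one of them; raise `--limit` and loop over the IDs to see the rest.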

Step 2: Diagnose the Cause

Common failure types:

| Type | Symptoms | Cause |
|------|----------|-------|
| Test failure | `FAIL` in test output | Code bug or test bug |
| Build failure | Compilation errors | Type errors, syntax errors |
| Lint failure | Style violations | Formatting, conventions |
| Typecheck failure | Type errors | Missing types, wrong types |
| Timeout | Job exceeded time limit | Performance issue or stuck test |
| Flaky test | Passes locally, fails CI | Race condition, environment difference |

Step 3: Fix the Issue

Test Failures

bash
# Reproduce locally
pnpm test

# Run specific failing test
pnpm test --grep "test name"

# Fix the code or test
# Commit and push

Build Failures

bash
# Reproduce locally
pnpm build

# Fix compilation errors
# Commit and push

Lint Failures

bash
# Check lint errors
pnpm lint

# Auto-fix what's possible
pnpm lint:fix

# Manually fix remaining
# Commit and push

Type Failures

bash
# Check type errors
pnpm typecheck

# Fix type issues
# Commit and push
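
The four cases above share one pattern: take the name of the failing check and run the matching command locally. A small sketch of that lookup, assuming the CI check names match the ones in the expected output earlier (`build`, `lint`, `test`, `typecheck`); `local_repro_command` is a hypothetical helper name.

```bash
# Hypothetical helper: map a failing CI check name to its local repro command.
local_repro_command() {
  case "$1" in
    test)      echo "pnpm test" ;;
    build)     echo "pnpm build" ;;
    lint)      echo "pnpm lint" ;;
    typecheck) echo "pnpm typecheck" ;;
    *)
      # Unknown check: nothing to suggest, so signal that to the caller.
      echo "no local command known for check: $1" >&2
      return 1
      ;;
  esac
}
```

A check with no local equivalent returns non-zero, which is a hint that the failure may belong under the unresolvable cases rather than a code fix.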

Step 4: Push Fix and Wait

bash
# Commit fix
git add .
git commit -m "fix(ci): Resolve test failure in user validation"

# Push
git push

# Wait for CI again
gh pr checks [PR_NUMBER] --watch

Step 5: Repeat Until Green

Loop through diagnose → fix → push → wait until all checks pass.

Flaky Tests

Identifying Flakiness

Test passes locally
Test fails in CI
Test passes on retry in CI

Handling Flakiness

  1. Don't just retry - Find the root cause
  2. Check for race conditions - Timing-dependent code
  3. Check for environment differences - Paths, env vars, services
  4. Check for state pollution - Tests affecting each other
typescript
// Common flaky pattern: timing dependency
// BAD
await saveData();
await delay(100);  // Hoping 100ms is enough
const result = await loadData();

// GOOD: Wait for condition
await saveData();
await waitFor(() => dataExists());
const result = await loadData();
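
The same wait-for-condition idea carries over to shell steps, for example waiting for a local service to accept connections before running integration tests instead of sleeping for a fixed time. A minimal sketch; `wait_for` is a hypothetical helper, and the one-second polling interval is an arbitrary choice.

```bash
# Hypothetical helper: retry a command until it succeeds or a timeout (seconds) passes.
wait_for() {
  local timeout="$1" start now
  shift
  start="$(date +%s)"
  # Re-run the given command until it exits 0.
  until "$@"; do
    now="$(date +%s)"
    if [ $(( now - start )) -ge "$timeout" ]; then
      return 1  # Condition never became true within the timeout.
    fi
    sleep 1
  done
}
```

For a PostgreSQL container this might be called as `wait_for 30 pg_isready -h localhost`, assuming `pg_isready` is installed; any command that exits 0 once the condition holds will work.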

Unresolvable Failures

Sometimes failures can't be fixed in the current PR:

Legitimate Unresolvable Cases

| Case | Example |
|------|---------|
| CI infrastructure issue | Service down, rate limited |
| Pre-existing flaky test | Not introduced by this PR |
| Upstream dependency issue | External API changed |
| Requires manual intervention | Needs secrets, permissions |

Process for Unresolvable

  1. Document the issue
bash
gh pr comment [PR_NUMBER] --body "## CI Issue

The \`security-scan\` check is failing due to a known issue with the scanner service (see #999).

This is not related to changes in this PR. The scan passes when run locally.

Requesting bypass approval from @maintainer."
  2. Create issue if new
bash
gh issue create \
  --title "CI: Security scanner service timeout" \
  --body "The security scanner is timing out in CI..."
  3. Request bypass if appropriate
Some teams allow merging with known infrastructure failures.
  4. Do NOT merge with real failures
If the failure is from your code, it must be fixed.
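
The "create issue if new" step implies checking for an existing issue first. One possible sketch of that check is below; `create_issue_if_new` is a hypothetical helper name, and matching open issues by title is an assumption about how duplicates would be recognized.

```bash
# Hypothetical helper: create a CI issue only if no open issue matches the title.
create_issue_if_new() {
  local title="$1" body="$2" existing
  # Search open issues whose title matches; prints the first number, if any.
  existing="$(gh issue list --state open --search "$title in:title" \
    --limit 1 --json number --jq '.[].number')"
  if [ -n "$existing" ]; then
    echo "existing issue #$existing"
    return 0
  fi
  gh issue create --title "$title" --body "$body"
}
```

Title search is fuzzy, so a printed match is worth a quick look before referencing it in the PR comment.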

CI Best Practices

Run Locally First (MANDATORY)

CI is the last resort, not the first check.
Before pushing, run EVERYTHING CI will run:
bash
# Run the same checks CI will run
pnpm lint
pnpm typecheck
pnpm test                # Unit tests
pnpm test:integration    # Integration tests against real services
pnpm build

# If you have database changes
docker-compose up -d postgres
pnpm migrate

**If your project has docker-compose services:**
- Start them before testing: `docker-compose up -d`
- Run integration tests against real services
- Verify migrations apply to real database
- Don't rely on mocks alone

**Skill:** `local-service-testing`
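
To make the pre-push run harder to skip, the checks can be wrapped in one function that stops at the first failure. This is a sketch, assuming the project defines the `lint`, `typecheck`, `test`, `test:integration`, and `build` scripts used above; `run_local_checks` is a hypothetical name.

```bash
# Hypothetical helper: run every local check in order, stopping at the first failure.
run_local_checks() {
  local script
  for script in lint typecheck test test:integration build; do
    echo "==> pnpm $script"
    if ! pnpm "$script"; then
      echo "FAILED: pnpm $script" >&2
      return 1
    fi
  done
  echo "all local checks passed"
}
```

Chaining it as `run_local_checks && git push` keeps a push from happening while any local check is red.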

Commit Incrementally

Don't push 10 commits at once. Push smaller changes:
bash
# Small fix, push, verify
git push

# Wait for CI
gh pr checks --watch

# Then next change

Monitor Actively

Don't "push and forget":
bash
# Watch CI after each push
gh pr checks [PR_NUMBER] --watch

Checklist

For each CI run:
  • Waited for CI to complete
  • All checks examined
  • Failures diagnosed (if any)
  • Fixes implemented (if needed)
  • Re-pushed and re-checked (if fixed)
  • All green
When CI is green:
  • PR merged immediately (`gh pr merge --squash --delete-branch`)
  • Linked issue marked Done
  • Continued to next issue (do NOT stop and report)
For unresolvable issues:
  • Root cause identified
  • Not caused by PR changes
  • Documented in PR comment
  • Issue created if new problem
  • Bypass approval requested if appropriate

Integration

This skill is called by:
  • issue-driven-development - Step 13
  • autonomous-orchestration - Main loop and bootstrap
This skill follows:
  • pr-creation - PR exists
This skill completes:
  • The PR lifecycle - merge is the final step, not "verification-before-merge"
This skill may trigger:
  • error-recovery - If CI reveals deeper issues