fix-sentry-issues
# Fix Sentry Issues
Systematically discover, triage, investigate, and fix production issues using Sentry MCP. One PR per issue, root-cause analysis required.
## Critical Rule: Truth-Seek, Don't Suppress
**NEVER treat log level changes as fixes.** Changing `logger.error` to `logger.warn` or `logger.info` silences Sentry but doesn't fix the user's experience. For every failing code path, ask "Why does this fail?" — not "How do I make Sentry quiet?"
## Anti-patterns to avoid
These are specific failure modes from real experience. Do NOT do these:

- **Batch-classifying issues as "expected" without investigating each one.** Reading an error message and seeing a fallback path does NOT mean you understand the failure. You must trace the full input path to understand what's being sent and why it fails.
- **Treating "has a fallback" as "not a problem."** A fallback means the user gets degraded results. Ask: why does the primary path fail? Can we prevent the failure upstream? Is the input wrong? Is the timeout too tight? Is there a missing filter?
- **Combining multiple issues into one "noise reduction" PR.** Each issue has its own root cause. Investigate and fix them individually. The only exception is issues that share an identical root cause discovered through investigation.
- **Throwing away error details.** Never change `catch (error) { logger.error(..., error) }` to `catch { logger.info(...) }`. The structured error data (status codes, messages, stack traces) is exactly what you need to understand the failure.
- **Deciding the fix during triage.** The triage table should classify issues as "Investigate" or "Ignore" — never pre-decide that the fix is a log level change. You don't know the fix until you've completed investigation.
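The "throwing away error details" anti-pattern can be sketched as follows. This is an illustrative sketch, not code from the real project: `primaryScrape`, `scrapeWithFallback`, and the logger shape are all hypothetical stand-ins.

```typescript
// Hedged sketch: a fallback path that KEEPS the structured error for Sentry.
// All names here (primaryScrape, scrapeWithFallback, logLines) are illustrative.
type ScrapeResult = { content: string; degraded: boolean };

interface LogEntry { level: "error" | "warn" | "info"; message: string; details?: unknown }
const logLines: LogEntry[] = [];
const logger = {
  warn(message: string, details?: unknown) { logLines.push({ level: "warn", message, details }); },
};

// Stand-in for an external API call that fails with a structured error.
async function primaryScrape(url: string): Promise<string> {
  throw Object.assign(new Error("Unsupported content type"), { status: 415, url });
}

async function scrapeWithFallback(url: string): Promise<ScrapeResult> {
  try {
    return { content: await primaryScrape(url), degraded: false };
  } catch (error) {
    // Keep the error object: status code, message, and stack trace stay
    // available for root-cause analysis instead of being swallowed.
    logger.warn("primary scrape failed, using fallback", { url, error });
    return { content: `title-only fallback for ${url}`, degraded: true };
  }
}
```

The `warn` level keeps the failure visible in Sentry until the pattern is understood; the key point is that the `error` object rides along with the log entry rather than being dropped.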
## When a log level change IS valid
A downgrade to `logger.info` is valid ONLY for genuinely expected operational states — NOT for failures with fallbacks. Examples:

- **Valid:** User's Notion database doesn't have an optional "Author" column → property skipped. This is user configuration, not a failure.
- **Valid:** Supabase returns 404 for a link the user deleted. The resource genuinely doesn't exist.
- **Invalid:** Firecrawl scrape fails 300 times/day → downgrade to info. WHY is it failing? Are we sending URLs it can't handle? Are we hitting rate limits?
- **Invalid:** Summary generation times out → downgrade to info. WHY is the API slow? Is the content too large? Is there a network issue?
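One way to encode the valid/invalid distinction is a small level-selection helper. This is a hedged sketch, not project code; the function name and the `resourceDeleted` flag are assumptions:

```typescript
type LogLevel = "error" | "warn" | "info";

// Map a failed fetch to a log level: only a genuinely expected state (a 404
// for a resource the user deleted) drops to info. Everything else stays visible.
function levelForFetchFailure(status: number, resourceDeleted: boolean): LogLevel {
  if (status === 404 && resourceDeleted) return "info"; // expected operational state
  if (status >= 500) return "error"; // unexpected upstream failure, investigate
  return "warn"; // handled failure worth monitoring until the pattern is understood
}
```

The point of the sketch is that the downgrade decision is per-state, not per-issue: a single issue can contain both expected 404s and genuine failures.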
## Phase 1: Discover
Use Sentry MCP to find the org, project, and all unresolved issues. Use `ToolSearch` first to load the Sentry MCP tools.

```
mcp__sentry__find_organizations()
mcp__sentry__find_projects(organizationSlug, regionUrl)
mcp__sentry__search_issues(
  organizationSlug, projectSlugOrId, regionUrl,
  naturalLanguageQuery: "all unresolved issues sorted by events",
  limit: 25
)
```

Build a triage table. The Action column should be Investigate or Ignore — never a pre-decided fix:

```markdown
| ID | Title | Events | Action | Reason |
|----|-------|--------|--------|--------|
| PROJ-A | Error in save | 14 | Investigate | User-facing save failure |
| PROJ-B | GM_register... | 3 | Ignore | Greasemonkey extension |
```

## Phase 2: Triage
Classify every issue before writing any code. Only two categories at this stage:
### Investigate (our code, worth understanding)
- Multiple events establishing a pattern
- User sees degraded experience (error status, missing data, broken UI)
- High-volume warnings that might indicate an upstream problem
- Recurring on every run/sync (stale references, cron-triggered)
### Ignore (third-party noise)
- Browser extension code (`GM_registerMenuCommand`, `CONFIG`, `currentInset`, MetaMask JSON-RPC)
- Stale module imports after deploy (`ChunkLoadError` — self-resolving)
- Single-event transients with no reproduction path
- Issues already fixed by a recent commit

Apply triage decisions:

```
mcp__sentry__update_issue(issueId, organizationSlug, regionUrl, status: "ignored")   // noise
mcp__sentry__update_issue(issueId, organizationSlug, regionUrl, status: "resolved")  // already fixed
```

## Phase 3: Investigate (one issue at a time)
For each "Investigate" issue, work through these steps in order. Do NOT skip steps or batch multiple issues together.
### 3a. Pull event-level data
Issue summaries hide the details you need. Always pull actual events AND the full issue details:

```
mcp__sentry__get_issue_details(issueId, organizationSlug, regionUrl)
mcp__sentry__search_issue_events(
  issueId, organizationSlug, regionUrl,
  naturalLanguageQuery: "all events with extra data",
  limit: 15
)
```

Extract from the events: actual URLs, request parameters, stack traces, timestamps, user context, extra data fields (status codes, content lengths, etc.). These are the real inputs that triggered the failure.
### 3b. Cross-reference with Axiom logs
Axiom events include a `traceId` field that correlates with Sentry errors. Use the Axiom CLI to pull surrounding logs for richer context:

```bash
# Get the traceId from the Sentry event's trace context,
# then query Axiom for all events with that traceId
axiom query "['shiori-events'] | where traceId == '<traceId>'" -f json

# Or search by userId around the error timestamp for broader context
axiom query "['shiori-events'] | where userId == '<userId>' | where _time > datetime('2025-01-01T00:00:00Z') and _time < datetime('2025-01-01T01:00:00Z')" -f json
```

Axiom logs include fields like `authMethod`, `client_version`, `event` type, and request metadata that Sentry often lacks. This helps you understand what the user was doing before and after the error.

### 3c. Read the failing code path
Follow the stack trace. Read every file in the chain. Understand what the code does before proposing changes. Use subagents for parallel file exploration if the stack is deep.
### 3d. Trace the input path upstream
This is the step most often skipped, and the most important:

- **What data reaches the failing function?** Trace backwards from the error to the original input. What URL/payload/parameters were passed?
- **Should this input have reached this code path at all?** Is there a missing filter, validation, or early return upstream?
- **What does the input look like?** For URL-based failures: is it a binary file? A redirect? A localhost URL? Something the API can't handle?
- **Is the failure in our code or an external service?** If external: can we prevent sending bad inputs? Can we add better pre-filtering?
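When upstream tracing shows that unhandleable inputs are reaching an external scraper, the fix is often a pre-filter. The following is a sketch under that assumption — the extension list, hostname checks, and function name are illustrative, not taken from the real codebase:

```typescript
// Sketch: reject inputs the scraper can't handle BEFORE calling the external
// API, instead of downgrading the log when it inevitably fails.
// The extension list here is an assumed example, not an exhaustive rule.
const BINARY_EXTENSIONS = new Set(["png", "jpg", "jpeg", "gif", "pdf", "zip", "mp4"]);

function isScrapableUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable URL at all
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  if (url.hostname === "localhost" || url.hostname === "127.0.0.1") return false;
  const ext = url.pathname.split(".").pop()?.toLowerCase() ?? "";
  return !BINARY_EXTENSIONS.has(ext);
}
```

A filter like this turns a recurring external-API error into a deliberate, loggable skip decision at the point where the input enters the pipeline.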
### 3e. Reproduce and verify
Use the actual failing inputs from Sentry events:

- Call the function with the exact data that failed
- `fetch()` the actual URLs that timed out — are they reachable?
- Add temporary `console.log` statements to verify your understanding of the code flow
- Check if the failure is in our code or an external service
### 3f. Identify root cause
Ask these questions in order:
- Why does this specific input fail? (e.g., "Firecrawl can't scrape a .png URL")
- Why does this input reach this code path? (e.g., "No extension check before calling Firecrawl")
- What's the right fix? (e.g., "Filter binary URLs before calling Firecrawl" — not "suppress the log")
- Should we also improve observability? (e.g., "Add status code to the log so we can see the failure distribution")
Common root causes:
| Pattern | Root Cause | Real Fix |
|---|---|---|
| External API fails on certain URLs | Wrong inputs being sent (binary files, bad formats) | Filter/validate inputs before sending |
| External API timeout | Timeout too tight, or input too large, or missing retry | Investigate what's slow, adjust timeout or input size |
| DB rejects "invalid json" | Unsanitized input (null bytes, control chars) | Sanitize before insert |
| Processing stuck in "error" | Timeout budget doesn't account for full pipeline | Adjust timeouts, save partial results on timeout |
| Same error on every cron run | Stale reference to deleted external resource | Detect staleness, auto-clean |
| Error logged but details not useful | Error object not included, or status code missing | Improve the log to include actionable details |
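For the "invalid json" row above, the sanitize-before-insert fix might look like the following sketch. Postgres `jsonb` rejects `\u0000` outright; stripping the other C0 control characters as well is an assumption here, and the function name is illustrative:

```typescript
// Strip null bytes (which Postgres jsonb rejects) and other C0 control
// characters, keeping tab (\t), newline (\n), and carriage return (\r).
function sanitizeForJson(text: string): string {
  return text.replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, "");
}
```

Applied at the insert boundary, this fixes the root cause (bad bytes in the payload) rather than silencing the DB error log.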
### 3g. Know your log levels
Log levels control what reaches Sentry:

| Level | Sends to Sentry? | Use for |
|---|---|---|
| `logger.error` | Yes (error) | Unexpected bugs, states that should never occur |
| `logger.warn` | Yes (warning) | Handled failures worth monitoring — keep until you understand the pattern |
| `logger.info` | No | Genuinely expected operational states (not "failures with fallbacks") |
## Phase 4: Fix
### 4a. Branch from main
```bash
git checkout main && git pull
git checkout -b fix/<descriptive-name>
```

One branch per issue. Keep fixes focused.
### 4b. Write tests first
Tests must use data derived from actual Sentry events, not hypothetical inputs. The test should fail before the fix and pass after.
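A minimal sketch of this pattern, assuming the failing payload lives in the Sentry event's extra data — the event shape, the Firecrawl-style 415 failure, and `shouldSendToScraper` are all hypothetical:

```typescript
// Regression-test sketch: the fixture mirrors a real Sentry event's extra data
// (shape assumed here) so the test exercises the exact input that failed in prod.
const failingEvent = {
  extra: { url: "https://example.com/docs/report.pdf", firecrawlStatus: 415 },
};

// Fix under test (illustrative): binary URLs must be rejected before the
// scraper is ever called, so the 415 cannot recur.
function shouldSendToScraper(url: string): boolean {
  return !/\.(pdf|png|jpe?g|zip)$/i.test(new URL(url).pathname);
}

// Fails before the fix, passes after.
console.assert(shouldSendToScraper(failingEvent.extra.url) === false);
console.assert(shouldSendToScraper("https://example.com/blog/post") === true);
```

Seeding the fixture from the event, rather than inventing an input, is what makes the test prove the original failure is gone.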
### 4c. Implement the fix
**Fix the root cause, not the symptom.**

**Self-check before committing:** If the fix is primarily a log level change, STOP. Ask yourself:

- Did I investigate why this fails, or did I just see a fallback and suppress?
- Can I prevent the failure upstream instead of silencing it?
- Am I throwing away error details that would help debug future occurrences?
- Would a staff engineer look at this PR and say "but why does it fail in the first place?"
### 4d. Verify
- Run tests (e.g., `bun run test`)
- Run lint
- Confirm the fix handles the actual failing inputs from Sentry events
- Remove any temporary `console.log` statements
### 4e. Create PR
```bash
git push -u origin fix/<descriptive-name>
gh pr create --title "<short title>" --body "$(cat <<'EOF'
## Summary
- Root cause: [What was actually wrong — the upstream reason, not just "it throws an error"]
- Fix: [What changed and why this prevents the failure, not just silences it]

## Test plan
- Tests written using data from Sentry events
- All tests pass
- Lint passes
EOF
)"
```

### 4f. Resolve in Sentry
After PR is merged:

```bash
git checkout main && git pull
```

```
mcp__sentry__update_issue(issueId, organizationSlug, regionUrl, status: "resolved")
```

## Phase 5: Repeat
Work through issues by priority (most events first). After each PR:
- Return to main, pull latest
- Pick next issue from the triage table
- Start Phase 3 again — full investigation for each issue
## Checklist Per Issue
- [ ] Pulled event-level data (not just issue summary)
- [ ] Cross-referenced with Axiom logs using traceId for surrounding context
- [ ] Read the failing code path end-to-end
- [ ] Traced the input path upstream — understood what data triggers the failure
- [ ] Identified root cause (not just "it has a fallback")
- [ ] Fix prevents the failure, not just suppresses the log
- [ ] Tests use real-world data from Sentry events
- [ ] Tests pass, lint passes
- [ ] No error details thrown away (catch variables, status codes, etc.)
- [ ] PR created with upstream root cause explanation
- [ ] Sentry issue resolved after merge