For each candidate that survived Phase 2.5 triage, answer these seven questions.
Question 5 has seven sub-steps (A through G); Step G is the adversarial check.
Findings that fail Step G drop out — they don't get a priority, they don't go in
the Do/Skip tables, they go on the dropped-candidates list with the reason.
1. What happened? One sentence — the symptom, not the fix.
2. Is the scorer correct? (mandatory for score-penalty findings)
- Scorer correct → fix the Printing Press (templates, binary, or skill)
- Scorer wrong → fix the scoring tool, not the Printing Press
- Both → fix both, label which is primary
3. What category?
| Category | Description |
|---|---|
| Bug | Generated code is wrong |
| Scorer bug | Scoring tool reports a false positive |
| Template gap | No template for a common pattern |
| Assumption mismatch | Printing Press assumes X but API uses Y |
| Recurring friction | Happens every generation, might be inherent |
| Missing scaffolding | Feature class the Printing Press could emit but doesn't |
| Default gap | Printing Press emits a wrong or placeholder default |
| Discovered optimization | Improvement found during use |
| Skill instruction gap | Skill told Claude wrong thing or missed a step |
4. Where in the Printing Press does this originate?
Pick exactly one component. The Slug column drives the `comp:<slug>` label applied to
the issue when filed (Phase 6), which is how agents filter related work across retros
(`gh issue list --label comp:<slug>`).
| Component | Slug | Path |
|---|---|---|
| Generator templates | | |
| Spec parser | | |
| OpenAPI parser | | |
| Catalog | | |
| Main skill | | skills/printing-press/SKILL.md |
| Verify/dogfood/scorecard | | CLI commands |
If a finding genuinely spans two components, pick the one where the durable
fix lands. Don't multi-label.
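For example, once a finding's component is chosen, related prior work can be pulled straight from the tracker. The slug below is illustrative; substitute the value from the Slug column:

```bash
# List open findings already filed against one component
# (slug is a placeholder; use the value from the Slug column above)
gh issue list --label "comp:generator-templates" --state open
```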
5. Blast radius and fallback cost — should the Printing Press handle this?
Step A: Cross-API stress test. Test across API shapes (standard REST, proxy-envelope,
RPC-style) and input methods (OpenAPI, browser-sniffed, HAR-sniffed, no spec).
Step B: Name three concrete APIs from the catalog with direct evidence. Not "every
API with multi-word resources" or "any browser-sniffed CLI." Name three specific APIs
already in ~/printing-press/library/ (or the embedded directory) where you can point
to evidence the pattern exists: a path in their spec, a known endpoint shape, a header
the vendor documents, an output you can reproduce. "Stripe, Notion, GitHub probably
have this" is hand-waving; "Stripe (Stripe-Version header in spec line N), GitHub
(X-GitHub-Api-Version on the issues endpoints), Linear (api-version on /v2/*)" is
evidence. If you can name only two with evidence — or three with hand-waving — the
finding drops to P3 max with an annotation, or moves to Drop.
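One way to turn "probably has this" into evidence is to grep the cached specs directly. The sketch below assumes a per-API folder layout under ~/printing-press/library/ with JSON specs, which may not match the actual layout:

```bash
# Confirm the pattern exists in three named specs (paths are assumptions about
# the library layout; the header names come from the evidence examples above)
grep -n "Stripe-Version"        ~/printing-press/library/stripe/*.json
grep -n "X-GitHub-Api-Version"  ~/printing-press/library/github/*.json
grep -n "api-version"           ~/printing-press/library/linear/*.json
```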
Step C: Counter-check question. Ask explicitly: "If I implemented this fix, would it
actively hurt any API that doesn't have this pattern?" If yes, the fix needs a guard or
condition before being P1/P2 — not a default change. Example: turning on client-side
truncation by default would hurt APIs that need server-side pagination for
correctness; it stays P2 only because it's gated on profiler-detected absence of a
paginator. Without that guard the same finding is unsafe to land.
Step D: Recurrence-cost check. Search prior retros under
~/printing-press/manuscripts/*/proofs/*-retro-*.md for the same finding. If the same
finding has been raised in 2+ prior retros without being implemented, the prior
cost-benefit math has been "no" at least twice. Don't re-raise it at the same
priority — either move to P3 with a "raised N times, still not justified" annotation,
or reframe the finding into a smaller incremental fix that addresses part of the
friction. Recurrence at the same priority is a triage failure, not stronger evidence.
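A minimal sketch of that search; the keyword is whatever phrase best identifies the finding, not a fixed string:

```bash
# Find prior retros that mention the same finding (keyword is illustrative)
grep -il "pagination default" ~/printing-press/manuscripts/*/proofs/*-retro-*.md
```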
Capture matched prior retros. When the search returns hits, record each as a
structured tuple — retro CLI name, retro file path (or GitHub issue number if the
retro file's frontmatter contains one), and a one-word classification:
- — the prior retro proposed the same fix direction. Strengthens the case;
reference it in Step F.
- — the prior retro proposed an opposing fix or chose a different
default. Surface this explicitly: a maintainer reading the new finding must see
the disagreement. State in one sentence why this retro reaches a different
conclusion (e.g., "prior retro saw single-paginator APIs; this one saw an
always-paginated API where the prior default would break").
- — the prior retro raised an adjacent finding in the same component
area but a different specific fix. Useful context, doesn't change the case.
These tuples flow forward into the per-finding template ("Related prior retros")
in the retro doc and merge into the issue body's "Related issues" block alongside
the Step 2.5 dedup scan's
outputs. GitHub auto-cross-links any
issue number you write, so contradictions and alignments show up in both retro
timelines without further action.
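A sketch of the tuple shape, with placeholder field names of my own choosing; the retro template's actual field names take precedence:

```bash
# Illustrative tuple per matched prior retro (field names are placeholders)
#   retro CLI:      <cli-name>
#   retro file:     the matched ~/printing-press/manuscripts/*/proofs/*-retro-*.md path
#   issue:          <number from the retro file's frontmatter, if present>
#   classification: <one word>; plus one sentence if it contradicts this finding
```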
Step E: Assess fallback cost. How reliably will Claude catch and fix this across every
future API? A "simple" edit Claude forgets 30% of the time means 30% of printed CLIs
ship with the defect.
Step F: Make the tradeoff. Default is don't change the machine. The burden of
proof is on the finding to justify a machine change. Continue to Step G only when all
three of these are true:
(a) Step B named three concrete APIs with evidence (not speculation).
(b) Step D's recurrence-cost check didn't disqualify the finding.
(c) Step C's counter-check didn't surface a hurts-other-APIs concern that lacks a guard.
If a finding can't clear all three, it doesn't get a priority — it goes to Drop with
the specific reason ("only named 2 APIs with evidence" / "raised 3 times, still not
justified" / "fix would hurt single-paginator APIs without a guard").
When the finding applies to an API subclass, include: Condition (when to activate),
Guard (when to skip), Frequency estimate.
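A sketch of what that annotation could look like; the paginator example is hypothetical, not a finding from this retro:

```bash
# Illustrative Condition/Guard/Frequency block for a subclass-only finding
#   Condition: profiler detects a cursor-style paginator in the spec
#   Guard:     skip when the spec exposes only offset/limit pagination
#   Frequency: rough share of catalog APIs the condition applies to (estimate)
```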
Step G: Construct the case against filing. Before recording the finding, write
1-2 sentences arguing the opposite — what makes this look like a printed-CLI
fix, an iteration artifact, or a wishlist item. Why might a maintainer close this
as "works as designed" or "too narrow for a machine fix"? What's the strongest
version of "this shouldn't be filed"?
If the case-against is stronger than the case-for, drop the finding. If they're
roughly even, drop the finding (default direction is don't-file). Only when the
case-for is clearly stronger does the finding survive to Phase 4.
This step is not a formality. It is the explicit place where weak findings die.
A finding that survives Step G should be able to state, in one sentence, why
the case-against fails — and that sentence is worth quoting in the retro entry.
6. Is this inherent or fixable? Push hard on whether smarter templates, a
post-processing step, or better spec analysis could eliminate the friction. If inherent,
propose the cheapest mitigation.
7. What is the durable fix? Prefer: template fix > binary post-processing > skill instruction.
Mark uncertainty explicitly. If you can't confidently isolate one root cause
or one fix, say so — list the candidate causes (or candidate fixes) and how an
implementer could disambiguate before committing. The issue body surfaces this
uncertainty so the agent picking up the work doesn't lock in a wrong-but-plausible
diagnosis. Confidence isn't a virtue when it's manufactured; an honest "either A
or B; verify by X" is more useful than a wrong prescription.
Strip API-specific details from the proposed fix. The durable fix must work across
APIs, not just the one that surfaced the finding. If the fix includes hardcoded param
names, date formats, chunking strategies (e.g., monthly), or domain-specific logic,
those are printed-CLI details leaking into the machine recommendation. The machine fix
should be parameterized — driven by what the profiler detects in the spec, not by what
one API happens to need.
Example of the anti-pattern:
- Finding: "ESPN sync needs for historical data"
- Bad fix: "Add with format, / flags, and monthly chunking to the sync template"
- Good fix: "When the profiler detects a date-range query param, emit a flag that passes the value through to the API"
The bad fix bakes ESPN's date format, scope params, and chunking strategy into the
machine. The good fix lets the profiler drive behavior from the spec.
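To make the contrast concrete, here is a hypothetical sketch of the printed CLI's surface under each fix; the command and flag names are invented, not anything the Printing Press actually emits:

```bash
# Bad fix baked into the machine: ESPN's date format, scope flags, and chunking
espn-cli sync --format YYYYMMDD --scope season --chunk monthly

# Good fix: the profiler saw a date-range query param in the spec, so the
# generated flag just passes the caller's value straight through to the API
espn-cli sync --date-range "2023-01-01..2023-12-31"
```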