codex-review
Cross-Model Code Review with Codex
Cross-model validation using Codex CLI's MCP tools. Claude writes code, Codex reviews it — different architecture, different training distribution, no self-approval bias.
Core insight: Single-model self-review is systematically biased. Cross-model review catches different bug classes because the reviewer has fundamentally different blind spots than the author.
The Two Codex Tools
The Codex CLI MCP server exposes two tools relevant for reviews:
| Tool | Best For | Key Constraint |
|---|---|---|
| Codex review | Structured diff review with prioritized findings | (none) |
| Codex codex | Freeform deep-dive on specific concerns | Requires explicit diff context in prompt |
Always pass these parameters:
- `model: "gpt-5.3-codex"` — most capable coding model
- `reasoningEffort: "xhigh"` — maximum depth (on the `codex` tool)
- `sandbox: "read-only"` — the reviewer must never modify the working tree (on the `codex` tool)
Review Patterns
Pattern 1: Pre-PR Full Review (Default)
The standard review before opening a PR. Use for any non-trivial change.
Step 1 — Structured review (catches correctness + general issues):
```
Codex review(
  base: "main",
  model: "gpt-5.3-codex",
  title: "Pre-PR Review"
)
```
Step 2 — Security deep-dive (if code touches auth, input handling, or APIs):
```
Codex codex(
  prompt: <security template from references/prompts.md>,
  model: "gpt-5.3-codex",
  reasoningEffort: "xhigh",
  sandbox: "read-only"
)
```
Step 3 — Fix findings, then re-review:
```
Codex review(
  base: "main",
  model: "gpt-5.3-codex",
  title: "Re-review after fixes"
)
```
Step 1 — 结构化评审(发现正确性问题+通用问题):
Codex review(
base: "main",
model: "gpt-5.3-codex",
title: "Pre-PR Review"
)
Step 2 — 安全深度分析(如果代码涉及认证、输入处理或API):
Codex codex(
prompt: <security template from references/prompts.md>,
model: "gpt-5.3-codex",
reasoningEffort: "xhigh",
sandbox: "read-only"
)
Step 3 — 修复问题后重新评审:
Codex review(
base: "main",
model: "gpt-5.3-codex",
title: "Re-review after fixes"
)Pattern 2: Commit-Level Review
Quick check after each meaningful commit. Good for iterative development.
```
Codex review(
  commit: "<SHA>",
  model: "gpt-5.3-codex",
  title: "Commit review"
)
```
Pattern 3: WIP Check
Review uncommitted work mid-development. Catches issues before they're baked in.
```
Codex review(
  uncommitted: true,
  model: "gpt-5.3-codex",
  title: "WIP check"
)
```
Note: `uncommitted: true` cannot be combined with a custom `prompt`.
Pattern 4: Focused Investigation
Surgical deep-dive on a specific concern (error handling, concurrency, data flow).
```
Codex codex(
  prompt: "Analyze [specific concern] in the changes between main and HEAD.
           For each issue found: cite file and line, explain the risk,
           suggest a concrete fix. Confidence threshold: only flag issues
           you are >=70% confident about.",
  model: "gpt-5.3-codex",
  reasoningEffort: "xhigh",
  sandbox: "read-only"
)
```
Pattern 5: Ralph Loop (Implement-Review-Fix)
Iterative quality enforcement — implement, review, fix, repeat. Max 3 iterations.
```
Iteration 1:
  Claude → implement feature
  Codex  → review(base: "main") → findings
  Claude → fix critical/high findings

Iteration 2:
  Codex  → review(base: "main") → verify fixes + catch remaining
  Claude → fix remaining issues

Iteration 3 (final):
  Codex  → review(base: "main") → clean bill of health
           (or accept known trade-offs and document them)
```
STOP after 3 iterations. Diminishing returns beyond this.
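The capped loop can be sketched as orchestration code. This is a minimal Python sketch under stated assumptions: `run_review` and `apply_fixes` are hypothetical stand-ins for the Codex review call and Claude's fix step, and each finding is assumed to be a dict with a `severity` key — none of these names come from the Codex CLI itself.

```python
# Hypothetical sketch of the Ralph loop: implement, review, fix,
# hard-capped at 3 iterations. run_review/apply_fixes are stand-ins
# for real Codex/Claude calls, not actual APIs.
MAX_ITERATIONS = 3

def ralph_loop(run_review, apply_fixes):
    """Run review rounds until clean or the iteration cap is reached.

    Returns whatever findings remain: empty-ish when no critical/high
    findings are left, or the full last round's findings at the cap
    (to be accepted and documented as known trade-offs).
    """
    findings = []
    for iteration in range(1, MAX_ITERATIONS + 1):
        findings = run_review(base="main")
        critical = [f for f in findings
                    if f["severity"] in ("critical", "high")]
        if not critical:
            return findings  # clean enough: no critical/high left
        apply_fixes(critical)  # Claude fixes critical/high findings
    # STOP: cap reached — accept and document remaining trade-offs
    return findings
```

The cap is the important part of the design: without it, the loop chases ever-lower-confidence findings and overbakes the change.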
Multi-Pass Strategy
For thorough reviews, run multiple focused passes instead of one vague pass. Each pass gets a specific persona and concern domain.
| Pass | Focus | Tool | Reasoning |
|---|---|---|---|
| Correctness | Bugs, logic, edge cases, race conditions | | |
| Security | OWASP Top 10, injection, auth, secrets | | |
| Architecture | Coupling, abstractions, API consistency | | |
| Performance | O(n^2), N+1 queries, memory leaks | | |
Run passes sequentially. Fix critical findings between passes to avoid noise compounding.
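One way to drive the sequential passes is a small runner. A sketch under assumptions: `review` is a hypothetical callable wrapping a persona-scoped Codex prompt, and the personas listed here are illustrative choices for each concern domain, not names from the source.

```python
# Hypothetical runner for sequential focused passes. Each pass gets one
# concern domain and one persona; critical findings are fixed between
# passes so later passes don't re-report (or trip over) known issues.
PASSES = [
    ("correctness", "senior backend engineer"),
    ("security", "senior security engineer"),
    ("architecture", "principal architect"),
    ("performance", "performance engineer"),
]

def run_passes(review, fix_critical):
    """Run one focused pass per concern, fixing criticals in between."""
    all_findings = {}
    for focus, persona in PASSES:
        findings = review(focus=focus, persona=persona)
        all_findings[focus] = findings
        # fix critical findings now to avoid noise compounding downstream
        fix_critical([f for f in findings
                      if f.get("severity") == "critical"])
    return all_findings
```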
When to use multi-pass vs single-pass:
| Change Size | Strategy |
|---|---|
| < 50 lines, single concern | Single |
| 50-300 lines, feature work | |
| 300+ lines or architecture change | Full 4-pass |
| Security-sensitive (auth, payments, crypto) | Always include security pass |
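The sizing table maps naturally to a selector function. A sketch of the unambiguous rows only (the table's middle row leaves its strategy unspecified, so this sketch collapses to single-pass below 300 lines); the pass names are assumed, and `security_sensitive` is a hypothetical flag for auth/payments/crypto changes.

```python
# Hypothetical strategy selector based on the sizing table above.
def choose_strategy(changed_lines, security_sensitive=False):
    """Pick review passes from change size; security-sensitive changes
    always get a security pass regardless of size."""
    if changed_lines >= 300:
        # 300+ lines or architecture change: full 4-pass
        passes = ["correctness", "security", "architecture", "performance"]
    else:
        # small, single-concern change: single pass
        passes = ["correctness"]
    if security_sensitive and "security" not in passes:
        passes.append("security")
    return passes
```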
Decision Tree: Which Pattern?
```dot
digraph review_decision {
  rankdir=TB;
  node [shape=diamond];
  "What stage?" -> "Pre-commit" [label="writing code"];
  "What stage?" -> "Pre-PR" [label="ready to submit"];
  "What stage?" -> "Post-commit" [label="just committed"];
  "What stage?" -> "Investigating" [label="specific concern"];
  node [shape=box];
  "Pre-commit" -> "Pattern 3: WIP Check";
  "Pre-PR" -> "How big?";
  "Post-commit" -> "Pattern 2: Commit Review";
  "Investigating" -> "Pattern 4: Focused Investigation";
  "How big?" [shape=diamond];
  "How big?" -> "Pattern 1: Pre-PR Review" [label="< 300 lines"];
  "How big?" -> "Full Multi-Pass" [label=">= 300 lines"];
}
```
Prompt Engineering Rules
- Assign a persona — "senior security engineer" beats "review for security"
- Specify what to skip — "Skip formatting, naming style, minor docs gaps" prevents bikeshedding
- Require confidence scores — Only act on findings with confidence >= 0.7
- Demand file:line citations — Vague findings without location are not actionable
- Ask for concrete fixes — "Suggest a specific fix" not just "this is a problem"
- One domain per pass — Security-only, architecture-only. Mixing dilutes depth.
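A prompt builder that bakes in these rules might look like the following. This is a hypothetical helper for illustration, not an API from this repo; the default skip list and the 0.7 threshold come from the rules above.

```python
# Hypothetical builder applying the rules above: persona, one domain per
# pass, explicit skip list, confidence threshold, file:line citations,
# and a demand for concrete fixes.
def build_review_prompt(domain, persona,
                        skip=("formatting", "naming style", "minor docs gaps"),
                        confidence=0.7):
    return (
        f"You are a {persona}. Review the diff for {domain} issues only.\n"
        f"Skip: {', '.join(skip)}.\n"
        f"Only report findings you are >={confidence:.0%} confident about.\n"
        "For each finding: cite file:line, explain the risk, "
        "and suggest a concrete fix."
    )
```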
Ready-to-use prompt templates are in `references/prompts.md`.
Anti-Patterns
| Anti-Pattern | Why It Fails | Fix |
|---|---|---|
| "Review this code" | Too vague — produces surface-level bikeshedding | Use specific domain prompts with persona |
| Single pass for everything | Context dilution — every dimension gets shallow treatment | Multi-pass with one concern per pass |
| Self-review (Claude reviews Claude's code) | Systematic bias — models approve their own patterns | Cross-model: Claude writes, Codex reviews |
| No confidence threshold | Noise floods signal — 0.3 confidence findings waste time | Only act on >= 0.7 confidence |
| Style comments in review | LLMs default to bikeshedding without explicit skip directives | "Skip: formatting, naming, minor docs" |
| > 3 review iterations | Diminishing returns, increasing noise, overbaking | Stop at 3. Accept trade-offs. |
| Review without project context | Generic advice disconnected from codebase conventions | Codex reads CLAUDE.md/AGENTS.md automatically |
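The confidence-threshold triage from the table can be sketched as a simple filter. Assumption (not from the source): each finding is a dict with `confidence` and `location` keys.

```python
# Hypothetical triage: keep only findings at or above the confidence
# threshold that carry a concrete file:line location, most-confident first.
def triage(findings, threshold=0.7):
    actionable = [
        f for f in findings
        if f.get("confidence", 0) >= threshold and f.get("location")
    ]
    return sorted(actionable, key=lambda f: f["confidence"], reverse=True)
```

Dropping location-less findings enforces the "file:line citations" rule: a finding you cannot point at is a finding you cannot act on.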
What This Skill is NOT
- Not a replacement for human review. Cross-model review catches bugs but can't evaluate product direction or user experience.
- Not a linter. Don't use Codex review for formatting or style — that's what linters are for.
- Not infallible. 5-15% false positive rate is normal. Triage findings, don't blindly fix everything.
- Not for self-approval. The whole point is cross-model validation. Don't use Claude to review Claude's code.
References
For ready-to-use prompt templates, see `references/prompts.md`.