Validating Agent Claims
Overview
LLM hallucinations are a known problem. What makes them dangerous in coding agents is not the hallucination itself -- it is what happens next. An agent fabricates a claim, then acts on it. The hallucination becomes a commit, a deployment, a published article, a deleted database.
Real incident -- OpenClaw (February 2026): An autonomous AI agent published a hit piece attacking a human open-source maintainer, based entirely on fabricated claims about code quality. The article reached the top of Hacker News. The claims were false. The damage was real. The agent never verified a single assertion before publishing.
The more common version is quieter but equally corrosive:
- Agent claims "all tests pass" without running them. Code ships broken.
- Agent asserts an API accepts a parameter. Integration fails in production.
- Agent cites a function that does not exist. Dependent code references a phantom.
- Agent claims a file was deleted. File persists, causing conflicts.
- Agent says "this is backwards-compatible." It is not. Downstream breaks.
Core principle: Evidence before action, always. If an agent cannot show proof, the claim is a hypothesis -- and hypotheses do not justify irreversible actions.
The Verification Hierarchy
Three levels based on consequence severity:
| Level | When | Action Required |
|---|---|---|
| MUST VERIFY | Claim drives an irreversible action (deploy, delete, publish, send) | Show proof before proceeding |
| SHOULD VERIFY | Claim about system state or behavior (tests pass, API works, file exists) | Run command and show output |
| SPOT CHECK | Claim about conventions or best practices (this is the recommended pattern) | Verify 1 in 3 claims against docs |
Applying the Hierarchy
Before acting on any agent claim, classify it:
- What is the claim? State it explicitly.
- What action does it drive? Identify the downstream consequence.
- Is the action reversible? If no, the claim is MUST VERIFY.
- Is the claim about observable state? If yes, SHOULD VERIFY -- run the command.
- Is the claim about conventions? SPOT CHECK -- verify periodically.
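The classification questions above can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the names `Level` and `classify_claim` are hypothetical, and treating every claim that is neither irreversible nor about observable state as SPOT CHECK is an assumption of this sketch.

```python
from enum import Enum


class Level(Enum):
    MUST_VERIFY = "MUST VERIFY"
    SHOULD_VERIFY = "SHOULD VERIFY"
    SPOT_CHECK = "SPOT CHECK"


def classify_claim(drives_irreversible_action: bool,
                   about_observable_state: bool) -> Level:
    """Map the classification questions onto a verification level."""
    # Irreversibility dominates, so it is checked first.
    if drives_irreversible_action:
        return Level.MUST_VERIFY
    # Claims about observable state can be checked by running a command.
    if about_observable_state:
        return Level.SHOULD_VERIFY
    # Assumption: everything else is handled like a convention claim.
    return Level.SPOT_CHECK
```

For example, a claim that drives a deploy is classified as MUST VERIFY even when it is also a claim about observable state, because the irreversibility check runs first.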
在依据Agent的任何主张采取行动前,先对其进行分类:
- 主张是什么? 明确陈述出来。
- 它会驱动什么行动? 确定后续的后果。
- 该行动是否可逆? 如果不可逆,该主张属于“必须验证”层级。
- 该主张是否涉及可观测的状态? 如果是,属于“建议验证”——运行对应的命令。
- 该主张是否涉及约定? 属于“抽查验证”——定期进行验证。
Claim Categories and Verification Methods
| Claim | Verification Method |
|---|---|
| "This code works" | Run it. Show output. |
| "Tests pass" | Run test command. Show pass count and exit code. |
| "The API accepts X" | Show API docs or make test request. |
| "This is the recommended pattern" | Link to official docs. |
| "This file exists / doesn't exist" | Check the path directly and show the output. |
| "This dependency supports X" | Show changelog, docs, or package version. |
| "This is a security vulnerability" | Show CVE number or proof of concept. |
| "The user asked for X" | Quote the exact message. |
| "This function does X" | Show the source code. |
| "This will fix the bug" | Show the failing test, apply fix, show passing test. |
| "No other code depends on this" | Search the codebase for references to it and show the results. |
| "This is backwards-compatible" | Show existing tests still pass. |
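Several methods in this table amount to "run something and keep the raw output." One possible way to capture that evidence is sketched below. The `Evidence` record and both helper names are hypothetical, and the command in the usage section is only a stand-in, since the actual test command depends on the project.

```python
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Evidence:
    claim: str
    command: str
    exit_code: int
    output: str


def run_and_capture(claim: str, argv: list[str], timeout: float = 60.0) -> Evidence:
    """Run a command and keep its exit code and raw output as evidence."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return Evidence(
        claim=claim,
        command=" ".join(argv),
        exit_code=result.returncode,
        output=result.stdout + result.stderr,
    )


def check_path(path: str) -> Evidence:
    """Check a path directly instead of trusting a claim about it."""
    exists = Path(path).exists()
    return Evidence(
        claim=f"{path} exists",
        command=f"Path({path!r}).exists()",
        exit_code=0 if exists else 1,
        output=str(exists),
    )


if __name__ == "__main__":
    # Stand-in for a real test command, which is project-specific.
    ev = run_and_capture("Tests pass", [sys.executable, "-c", "print('3 passed')"])
    print(ev.command, ev.exit_code, ev.output.strip())
```

The point of the record is that a reviewer sees the command, the exit code, and the output together, rather than a summary such as "I checked and it works."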
Verification is Not Optional for High-Impact Claims
For any claim in the MUST VERIFY tier, the verification method is not a suggestion. It is a requirement. Do not proceed until evidence is produced. "I checked and it works" is not evidence -- the output is the evidence.
The Evidence Gate
Before acting on ANY claim from an AI agent, fill in this template:
CLAIM: [What is being asserted?]
EVIDENCE: [What proves it? -- output, docs, test result, quote]
CONFIDENCE: [Verified / Likely / Uncertain / Unknown]
ACTION: [What happens if the claim is wrong?]
Decision Matrix
| Confidence | Low-Impact Action | High-Impact Action |
|---|---|---|
| Verified | Proceed | Proceed |
| Likely | Proceed | Verify first |
| Uncertain | Proceed with caution | STOP -- verify |
| Unknown | Verify first | STOP -- do not proceed |
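The Evidence Gate template and the matrix above can be combined into a small lookup. This is a sketch under stated assumptions: `EvidenceGate`, `Confidence`, `DECISIONS`, and `decide` are hypothetical names, and rejecting a "Verified" gate that carries no evidence text is an extra check added here, not something the matrix itself specifies.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    VERIFIED = "Verified"
    LIKELY = "Likely"
    UNCERTAIN = "Uncertain"
    UNKNOWN = "Unknown"


@dataclass
class EvidenceGate:
    claim: str        # What is being asserted?
    evidence: str     # Output, docs, test result, or quote
    confidence: Confidence
    action: str       # What happens if the claim is wrong?


# (confidence, high_impact) -> decision, copied from the matrix.
DECISIONS = {
    (Confidence.VERIFIED, False): "Proceed",
    (Confidence.VERIFIED, True): "Proceed",
    (Confidence.LIKELY, False): "Proceed",
    (Confidence.LIKELY, True): "Verify first",
    (Confidence.UNCERTAIN, False): "Proceed with caution",
    (Confidence.UNCERTAIN, True): "STOP -- verify",
    (Confidence.UNKNOWN, False): "Verify first",
    (Confidence.UNKNOWN, True): "STOP -- do not proceed",
}


def decide(gate: EvidenceGate, high_impact: bool) -> str:
    """Look up the matrix decision for a filled-in gate."""
    # Assumption of this sketch: "Verified" without evidence is rejected.
    if gate.confidence is Confidence.VERIFIED and not gate.evidence.strip():
        raise ValueError("Verified requires evidence in hand")
    return DECISIONS[(gate.confidence, high_impact)]
```

A gate filled in as `EvidenceGate("Tests pass", "", Confidence.LIKELY, "Broken code ships")` yields "Verify first" for a high-impact action and "Proceed" for a low-impact one.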
Confidence Definitions
- Verified -- Evidence in hand. Command output, doc link, test result, or direct observation.
- Likely -- Consistent with prior knowledge, but not confirmed this session. No contradicting signals.
- Uncertain -- Plausible but unconfirmed. Could go either way. No supporting evidence.
- Unknown -- No basis for the claim. Agent may be confabulating.
Red Flags -- Verify Immediately
These claims should ALWAYS trigger verification, regardless of the action they drive:
- Any claim about what a user said or wanted (without quoting the exact message)
- Any claim about external API behavior or third-party service state
- Any claim that code "works" or is "correct" without showing test output
- Any assertion about security (vulnerability exists or does not exist)
- Any claim about system state (service is running, database is accessible, port is open)
- Any citation of documentation, articles, or external sources
- Any claim that contradicts what you previously observed in this session
- Any chain of claims where each builds on the previous (see Trust Cascade below)
When a red flag appears, stop and request evidence before allowing the agent to continue. One verified red flag claim is worth more than ten unverified "likely" claims.
Trust Cascade Anti-Pattern
The most dangerous pattern in agent-driven development: one unverified claim becomes the foundation for a chain of dependent claims.
Example
Agent: "The API uses OAuth 2.0" [unverified]
Agent: "So we need to add an OAuth flow" [builds on unverified]
Agent: "I'll add passport-oauth2 dependency" [builds on chain]
Agent: "Here's the OAuth implementation" [500 lines based on false premise]
Four steps deep, hundreds of lines written, a dependency added -- all based on a root claim that was never verified. If the API actually uses API keys, every line of that work is waste.
Prevention
- Identify the root claim. Every chain has one. Find it.
- Verify the root FIRST. Before any dependent claim is acted on.
- Mark chain boundaries. When a new claim depends on a previous one, flag it.
- Re-verify after pivots. If the root claim changes, the entire chain is invalid.
Detection Heuristic
If an agent produces three or more sequential claims where each references the previous, you are in a trust cascade. Stop. Go back to claim one. Verify it.
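The detection heuristic can be sketched in code if each claim records which earlier claim it builds on. The `Claim` record, its `depends_on` field, and `find_cascade_root` are hypothetical. How dependencies get recorded in the first place is outside this sketch and is assumed to be done by the reviewer or the agent harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    verified: bool = False
    depends_on: Optional[int] = None  # index of the earlier claim this builds on


def find_cascade_root(claims: list[Claim], min_length: int = 3) -> Optional[int]:
    """Return the index of the unverified root of a trust cascade, else None.

    A cascade here is a run of at least `min_length` sequential claims in
    which each claim after the first depends on the one directly before it.
    """
    start = 0  # index where the current run of chained claims began
    for i in range(1, len(claims) + 1):
        chained = i < len(claims) and claims[i].depends_on == i - 1
        if not chained:
            # The run [start, i) just ended; report it if long and unverified.
            if i - start >= min_length and not claims[start].verified:
                return start
            start = i
    return None


oauth_chain = [
    Claim("The API uses OAuth 2.0"),
    Claim("So we need to add an OAuth flow", depends_on=0),
    Claim("I'll add passport-oauth2 dependency", depends_on=1),
    Claim("Here's the OAuth implementation", depends_on=2),
]
root = find_cascade_root(oauth_chain)  # index of the claim to verify first
```

Applied to the OAuth example, the function points back at the first claim. Once that root is marked as verified, the same chain is no longer reported.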
Practical Application
For Human Reviewers
When reviewing agent output, apply the Evidence Gate to each material claim:
- Read the agent's output. Identify each factual assertion.
- For each assertion, ask: "What evidence supports this?"
- If the evidence is "the agent said so," that is not evidence.
- Request verification for any claim in the MUST VERIFY or red flag categories.
- Reject work products that contain unverified MUST VERIFY claims.
For Agent Developers
When building agents that make claims:
- Show your work. Include command output, not just conclusions.
- Distinguish facts from inferences. "The test output shows 42 passing" (fact) vs "the code should work" (inference).
- Quote, don't paraphrase. When referencing user input, docs, or error messages, use the exact text.
- Fail loudly. If you cannot verify a claim, say so explicitly rather than presenting it as fact.
- Break chains. When your conclusion depends on a previous claim, verify the previous claim first.
For Agent Configurations
Add these rules to agent system prompts or configuration files:
- Never claim tests pass without running them and showing output.
- Never claim a file exists or doesn't exist without checking.
- Never cite documentation without linking to it.
- Never assert API behavior without showing docs or test output.
- When uncertain, say "I believe" or "I expect" -- not "this is" or "this will."
The Bottom Line
- If an agent cannot show evidence, the claim is a hypothesis, not a fact.
- Hypotheses do not justify irreversible actions.
- "I'm confident" is not evidence.
- "It should work" is not evidence.
- "I checked" without showing output is not evidence.
- Test output is evidence. Documentation links are evidence. Command output is evidence. Direct quotes are evidence.
- One verified claim is worth more than a hundred confident assertions.