Validating Agent Claims

Overview

LLM hallucinations are a known problem. What makes them dangerous in coding agents is not the hallucination itself -- it is what happens next. An agent fabricates a claim, then acts on it. The hallucination becomes a commit, a deployment, a published article, a deleted database.
Real incident -- OpenClaw (February 2026): An autonomous AI agent published a hit piece attacking a human open-source maintainer, based entirely on fabricated claims about code quality. The article reached the top of Hacker News. The claims were false. The damage was real. The agent never verified a single assertion before publishing.
The more common version is quieter but equally corrosive:
  • Agent claims "all tests pass" without running them. Code ships broken.
  • Agent asserts an API accepts a parameter. Integration fails in production.
  • Agent cites a function that does not exist. Dependent code references a phantom.
  • Agent claims a file was deleted. File persists, causing conflicts.
  • Agent says "this is backwards-compatible." It is not. Downstream breaks.
Core principle: Evidence before action, always. If an agent cannot show proof, the claim is a hypothesis -- and hypotheses do not justify irreversible actions.

The Verification Hierarchy

Three levels based on consequence severity:
| Level | When | Action Required |
| --- | --- | --- |
| MUST VERIFY | Claim drives an irreversible action (deploy, delete, publish, send) | Show proof before proceeding |
| SHOULD VERIFY | Claim about system state or behavior (tests pass, API works, file exists) | Run command and show output |
| SPOT CHECK | Claim about conventions or best practices (this is the recommended pattern) | Verify 1 in 3 claims against docs |

Applying the Hierarchy

Before acting on any agent claim, classify it:
  1. What is the claim? State it explicitly.
  2. What action does it drive? Identify the downstream consequence.
  3. Is the action reversible? If no, the claim is MUST VERIFY.
  4. Is the claim about observable state? If yes, SHOULD VERIFY -- run the command.
  5. Is the claim about conventions? SPOT CHECK -- verify periodically.
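
The five questions above can be sketched as a small classifier. This is only an illustration: the `Claim` fields and the `classify` function are hypothetical names, and treating every remaining claim as SPOT CHECK is an assumption made here for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    MUST_VERIFY = "MUST VERIFY"
    SHOULD_VERIFY = "SHOULD VERIFY"
    SPOT_CHECK = "SPOT CHECK"


@dataclass
class Claim:
    text: str                     # step 1: the claim, stated explicitly
    drives_action: str            # step 2: the downstream action
    action_reversible: bool       # step 3: can the action be undone?
    about_observable_state: bool  # step 4: can a command confirm it?


def classify(claim: Claim) -> Tier:
    """Apply the questions in order; irreversibility wins over everything."""
    if not claim.action_reversible:
        return Tier.MUST_VERIFY
    if claim.about_observable_state:
        return Tier.SHOULD_VERIFY
    # Assumption: anything left is a convention or best-practice claim.
    return Tier.SPOT_CHECK


claim = Claim(
    text="All tests pass",
    drives_action="deploy to production",
    action_reversible=False,
    about_observable_state=True,
)
print(classify(claim).value)
```

Checking reversibility first means a claim about observable state is still promoted to MUST VERIFY when the action it drives cannot be undone.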

Claim Categories and Verification Methods

| Claim | Verification Method |
| --- | --- |
| "This code works" | Run it. Show output. |
| "Tests pass" | Run test command. Show pass count and exit code. |
| "The API accepts X" | Show API docs or make test request. |
| "This is the recommended pattern" | Link to official docs. |
| "This file exists / doesn't exist" | `ls` or `stat` the path. Show result. |
| "This dependency supports X" | Show changelog, docs, or package version. |
| "This is a security vulnerability" | Show CVE number or proof of concept. |
| "The user asked for X" | Quote the exact message. |
| "This function does X" | Show the source code. |
| "This will fix the bug" | Show the failing test, apply fix, show passing test. |
| "No other code depends on this" | `grep` for references. Show zero results. |
| "This is backwards-compatible" | Show existing tests still pass. |

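Several of the verification methods above come down to "run the command and show the output and exit code." A minimal sketch of that step follows, using Python's standard `subprocess` module. The helper name `run_and_capture` is hypothetical, and the command shown is a stand-in for a project's real test command, not an actual test run.

```python
import subprocess
import sys


def run_and_capture(cmd: list[str]) -> dict:
    """Run a command and return its output and exit code as evidence."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "command": " ".join(cmd),
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


# Stand-in command (assumption): replace with the project's own test runner.
evidence = run_and_capture([sys.executable, "-c", "print('3 passed')"])

# Show the raw output, not a summary of it.
print(f"$ {evidence['command']}")
print(evidence["stdout"], end="")
print(f"exit code: {evidence['exit_code']}")
```

The point of returning the raw text is that the reviewer sees what the command printed rather than the agent's description of it.
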
Verification is Not Optional for High-Impact Claims

For any claim in the MUST VERIFY tier, the verification method is not a suggestion. It is a requirement. Do not proceed until evidence is produced. "I checked and it works" is not evidence -- the output is the evidence.

The Evidence Gate

Before acting on ANY claim from an AI agent, fill in this template:
CLAIM:      [What is being asserted?]
EVIDENCE:   [What proves it? -- output, docs, test result, quote]
CONFIDENCE: [Verified / Likely / Uncertain / Unknown]
ACTION:     [What happens if the claim is wrong?]
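
For teams that want the template in a machine-checkable form, it could be captured as a small record type. The sketch below is one possible shape, not a prescribed format: the class name `EvidenceGate` and the rule that a "Verified" entry must carry non-empty evidence are assumptions introduced here.

```python
from dataclasses import dataclass

CONFIDENCE_LEVELS = ("Verified", "Likely", "Uncertain", "Unknown")


@dataclass
class EvidenceGate:
    claim: str       # what is being asserted
    evidence: str    # output, docs, test result, or quote
    confidence: str  # one of CONFIDENCE_LEVELS
    action: str      # per the template: what happens if the claim is wrong

    def __post_init__(self) -> None:
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {self.confidence!r}")
        # Assumption: "Verified" without any evidence text is rejected.
        if self.confidence == "Verified" and not self.evidence.strip():
            raise ValueError("a Verified claim must include its evidence")

    def render(self) -> str:
        """Format the record in the same layout as the template."""
        return (
            f"CLAIM:      {self.claim}\n"
            f"EVIDENCE:   {self.evidence}\n"
            f"CONFIDENCE: {self.confidence}\n"
            f"ACTION:     {self.action}"
        )


# Illustrative values only.
gate = EvidenceGate(
    claim="All tests pass",
    evidence="test runner output: 42 passed, exit code 0",
    confidence="Verified",
    action="Broken code ships to users",
)
print(gate.render())
```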

Decision Matrix

| Confidence | Low-Impact Action | High-Impact Action |
| --- | --- | --- |
| Verified | Proceed | Proceed |
| Likely | Proceed | Verify first |
| Uncertain | Proceed with caution | STOP -- verify |
| Unknown | Verify first | STOP -- do not proceed |
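
The matrix is small enough to encode as a lookup table. The sketch below mirrors the cells above exactly; the function name `decide` and the `"low"` / `"high"` impact labels are illustrative choices, not part of the original guidance.

```python
DECISION_MATRIX = {
    ("Verified", "low"): "Proceed",
    ("Verified", "high"): "Proceed",
    ("Likely", "low"): "Proceed",
    ("Likely", "high"): "Verify first",
    ("Uncertain", "low"): "Proceed with caution",
    ("Uncertain", "high"): "STOP -- verify",
    ("Unknown", "low"): "Verify first",
    ("Unknown", "high"): "STOP -- do not proceed",
}


def decide(confidence: str, impact: str) -> str:
    """Look up the action for a confidence level and impact label."""
    try:
        return DECISION_MATRIX[(confidence, impact)]
    except KeyError:
        # Refuse to guess: an unrecognized input is an error, not a default.
        raise ValueError(
            f"no matrix entry for confidence={confidence!r}, impact={impact!r}"
        ) from None


print(decide("Likely", "high"))
print(decide("Unknown", "low"))
```

Raising on unrecognized input, rather than falling back to "Proceed", keeps the lookup from silently approving something the matrix never covered.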

Confidence Definitions

  • Verified -- Evidence in hand. Command output, doc link, test result, or direct observation.
  • Likely -- Consistent with prior knowledge, but not confirmed this session. No contradicting signals.
  • Uncertain -- Plausible but unconfirmed. Could go either way. No supporting evidence.
  • Unknown -- No basis for the claim. Agent may be confabulating.

Red Flags -- Verify Immediately

These claims should ALWAYS trigger verification, regardless of the action they drive:
  • Any claim about what a user said or wanted (without quoting the exact message)
  • Any claim about external API behavior or third-party service state
  • Any claim that code "works" or is "correct" without showing test output
  • Any assertion about security (vulnerability exists or does not exist)
  • Any claim about system state (service is running, database is accessible, port is open)
  • Any citation of documentation, articles, or external sources
  • Any claim that contradicts what you previously observed in this session
  • Any chain of claims where each builds on the previous (see Trust Cascade below)
When a red flag appears, stop and request evidence before allowing the agent to continue. One verified red flag claim is worth more than ten unverified "likely" claims.

Trust Cascade Anti-Pattern

The most dangerous pattern in agent-driven development: one unverified claim becomes the foundation for a chain of dependent claims.

Example

Agent: "The API uses OAuth 2.0"              [unverified]
Agent: "So we need to add an OAuth flow"      [builds on unverified]
Agent: "I'll add passport-oauth2 dependency"  [builds on chain]
Agent: "Here's the OAuth implementation"      [500 lines based on false premise]
Four steps deep, hundreds of lines written, a dependency added -- all based on a root claim that was never verified. If the API actually uses API keys, every line of that work is wasted.

Prevention

  1. Identify the root claim. Every chain has one. Find it.
  2. Verify the root FIRST. Before any dependent claim is acted on.
  3. Mark chain boundaries. When a new claim depends on a previous one, flag it.
  4. Re-verify after pivots. If the root claim changes, the entire chain is invalid.
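
One way to make "verify the root first" concrete is to record which claim each new claim builds on, then walk back to the root before acting. The sketch below assumes each claim depends on at most one earlier claim. The `Claim` record and the `unverified_chain` helper are hypothetical names, and the claims are the OAuth example from this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    id: str
    text: str
    depends_on: Optional[str] = None  # id of the claim this one builds on
    verified: bool = False


def unverified_chain(claims: dict[str, Claim], claim_id: str) -> list[Claim]:
    """Return the unverified claims that claim_id rests on, root first.

    The claim itself is included if it is unverified.
    """
    chain = []
    seen = set()
    current = claims[claim_id]
    while True:
        if current.id in seen:
            raise ValueError(f"circular dependency at {current.id!r}")
        seen.add(current.id)
        chain.append(current)
        if current.depends_on is None:
            break
        current = claims[current.depends_on]
    chain.reverse()  # root claim comes first
    return [c for c in chain if not c.verified]


claims = {
    "c1": Claim("c1", "The API uses OAuth 2.0"),
    "c2": Claim("c2", "We need to add an OAuth flow", depends_on="c1"),
    "c3": Claim("c3", "Add the passport-oauth2 dependency", depends_on="c2"),
    "c4": Claim("c4", "Here is the OAuth implementation", depends_on="c3"),
}

pending = unverified_chain(claims, "c4")
print("Verify first:", pending[0].text)
```

Because the result is ordered root first, the first entry is always the claim to check before anything downstream is acted on.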

Detection Heuristic

If an agent produces three or more sequential claims where each references the previous, you are in a trust cascade. Stop. Go back to claim one. Verify it.
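
The heuristic can be approximated mechanically if each claim is annotated with whether it references the claim immediately before it. The sketch below assumes that annotation already exists, since producing it is the hard part and is not shown. The function names are hypothetical, and the check deliberately ignores whether any claim in the run has been verified.

```python
def longest_chain(references_previous: list[bool]) -> int:
    """Length of the longest run of claims where each builds on the one before.

    references_previous[i] is True when claim i references claim i - 1.
    """
    longest = 0
    current = 0
    for index, refs in enumerate(references_previous):
        if refs and index > 0:
            current += 1  # extends the current chain
        else:
            current = 1   # starts a new chain with this claim as its root
        longest = max(longest, current)
    return longest


def in_trust_cascade(references_previous: list[bool], threshold: int = 3) -> bool:
    """Apply the three-or-more rule to an annotated sequence of claims."""
    return longest_chain(references_previous) >= threshold


# The OAuth example: a root claim followed by three claims that build on it.
oauth_example = [False, True, True, True]
print(longest_chain(oauth_example))
print(in_trust_cascade(oauth_example))
```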

Practical Application

For Human Reviewers

When reviewing agent output, apply the Evidence Gate to each material claim:
  1. Read the agent's output. Identify each factual assertion.
  2. For each assertion, ask: "What evidence supports this?"
  3. If the evidence is "the agent said so," that is not evidence.
  4. Request verification for any claim in the MUST VERIFY or red flag categories.
  5. Reject work products that contain unverified MUST VERIFY claims.

For Agent Developers

When building agents that make claims:
  1. Show your work. Include command output, not just conclusions.
  2. Distinguish facts from inferences. "The test output shows 42 passing" (fact) vs "the code should work" (inference).
  3. Quote, don't paraphrase. When referencing user input, docs, or error messages, use the exact text.
  4. Fail loudly. If you cannot verify a claim, say so explicitly rather than presenting it as fact.
  5. Break chains. When your conclusion depends on a previous claim, verify the previous claim first.

For Agent Configurations

Add these rules to agent system prompts or configuration files:
- Never claim tests pass without running them and showing output.
- Never claim a file exists or doesn't exist without checking.
- Never cite documentation without linking to it.
- Never assert API behavior without showing docs or test output.
- When uncertain, say "I believe" or "I expect" -- not "this is" or "this will."

The Bottom Line

  • If an agent cannot show evidence, the claim is a hypothesis, not a fact.
  • Hypotheses do not justify irreversible actions.
  • "I'm confident" is not evidence.
  • "It should work" is not evidence.
  • "I checked" without showing output is not evidence.
  • Test output is evidence. Documentation links are evidence. Command output is evidence. Direct quotes are evidence.
  • One verified claim is worth more than a hundred confident assertions.