empirical-prompt-tuning
Empirical Prompt Tuning
The author of a prompt cannot judge its quality. The clearer the writer thinks something is, the more likely another agent will stumble on it. The core of this skill is to have a bias-free executor actually run the instruction, evaluate it two-sidedly, and iterate. Do not stop until improvements plateau.
When to use
- Right after creating or substantially revising a skill / slash command / task prompt
- When an agent does not behave as expected and you want to attribute the cause to ambiguity on the instruction side
- When hardening high-importance instructions (frequently used skills, automation-core prompts)
When not to use:
- One-off throwaway prompts (evaluation cost does not pay off)
- When the goal is not to improve success rate but merely to reflect the writer's subjective preferences
Workflow
0. Iteration 0 — description / body consistency check (static, no dispatch needed)
   - Read the triggers / use cases the frontmatter `description` claims
   - Read the scope the body actually covers
   - If there is a gap, reconcile description or body before moving to iter 1
   - Example: the description says "navigation / form filling / data extraction" but the body is only a CLI reference for `npx playwright test` — detect that kind of gap
   - If you skip this, the subagent will "reinterpret" the body to match the description, and accuracy will come out high even though the skill does not actually meet the requirements (false positive)
1. Baseline preparation: fix the target prompt and prepare the following two things.
   - Evaluation scenarios, 2 to 3 kinds (1 median + 1 to 2 edge). Realistic tasks that assume actual situations where the target prompt would apply.
   - Requirements checklist (for computing accuracy). For each scenario, enumerate 3 to 7 items the deliverable must satisfy. Accuracy % = items satisfied / total items. Fix this in advance (do not move it afterward).
2. Bias-free read: have a "blank-slate" executor read the instruction. Dispatch a new subagent via the Task tool. Do not substitute a self-reread (it is structurally impossible to view text you just wrote objectively). When running multiple scenarios in parallel, place multiple Agent invocations within a single message. For how to handle environments where dispatch is unavailable, see the "Environment constraints" section.
3. Execution: hand the subagent a prompt that follows the subagent invocation contract described below, and have it execute the scenario. The executor produces an implementation or output and returns a self-report at the end.
4. Two-sided evaluation: record the following from the returned results.
   - Executor self-report (extracted from the body of the subagent's report): unclear points / discretionary fill-ins / places where template application got stuck
   - Trace interpretation: tag each unclear point with the phase it originated in (Understanding / Planning / Execution / Formatting — see "Subagent invocation contract"). Phase-local fixes land better than global "the prompt was unclear" fixes; a single Understanding-phase ambiguity often looks like a chain of Execution-phase failures.
   - Structured reflection: each unclear point must be returned as `Issue / Cause / General Fix Rule`. The `General Fix Rule` is the class-level abstraction that feeds the "Failure pattern ledger" — without it, fixes stay as one-off patches that rediscover the same mistake later.
   - Instruction-side measurements (the judgment rules are defined canonically in this section; refer to it from elsewhere):
     - Success/failure: counts as success (○) only when all requirements tagged `[critical]` are ○. If even one is × or partial, it is failure (×). The label is the binary ○ / × only.
     - Accuracy (achievement rate of the requirements checklist, %. ○ = full score, × = 0, partial = 0.5; sum and divide by total items)
     - Step count (use the `tool_uses` field in the usage meta attached to the Task tool return value as-is. Include Read / Grep, do not exclude them)
     - Duration (use `duration_ms` from the Task tool usage meta)
     - Retry count (how many times the subagent redid the same decision. Extract from the subagent's self-report; not measurable from the instruction side)
     - On failure, add a one-line note to the "unclear points" section of the presentation format stating which `[critical]` item dropped (for root cause tracing)
   - The requirements checklist must include at least one `[critical]`-tagged item (if there are zero, the success judgment becomes vacuous). Do not add or remove `[critical]` tags after the fact.
5. Apply the diff: put the minimum fix into the prompt to eliminate the unclear points. One theme per iteration (multiple related fixes are OK; unrelated fixes go to next time).
   - Before applying the fix, explicitly state which item in the requirements checklist / judgment wording the fix satisfies (fixes inferred from axis names often do not land; see the "Fix propagation patterns" section below).
   - Consult the failure pattern ledger first. If the structured reflection's `General Fix Rule` already matches a known pattern, the first question is "why didn't the existing fix prevent it?" — the fix may need to move closer to the top of the prompt, or be re-worded, before a new ledger entry is added.
6. Re-evaluate: run 2 → 5 again with a new subagent (do not reuse the same agent: it has learned the previous improvements). Increase parallelism if further iteration does not plateau improvements.
7. Convergence check: the rough rule is "stop when 2 consecutive iterations have zero new unclear points AND metric improvements fall below the thresholds (below)". Make it 3 consecutive for high-importance prompts.
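The judgment rules in step 4 can be sketched as a small scoring function. This is an illustrative sketch, not part of the skill: the constants, item texts, and the `score` helper name are all assumptions.

```python
# Sketch of the instruction-side judgment rules from workflow step 4.
OK, FAIL, PARTIAL = "o", "x", "partial"  # stand-ins for ○ / × / partial

def score(checklist: list[tuple[str, bool, str]]) -> tuple[bool, float]:
    """checklist: (item text, is_critical, verdict) triples.

    Returns (success, accuracy_percent). Success is binary: every
    [critical] item must be fully ○; even a partial on one fails.
    Accuracy scores ○ = 1, partial = 0.5, × = 0 over all items.
    """
    if not any(critical for _, critical, _ in checklist):
        raise ValueError("checklist needs at least one [critical] item")
    success = all(v == OK for _, critical, v in checklist if critical)
    points = {OK: 1.0, PARTIAL: 0.5, FAIL: 0.0}
    accuracy = 100 * sum(points[v] for _, _, v in checklist) / len(checklist)
    return success, accuracy

success, acc = score([
    ("report has a summary line", True,  OK),       # the [critical] item
    ("uses the required template", False, PARTIAL),
    ("no leftover TODOs",          False, FAIL),
])
# success is True (the only critical item is ○); accuracy is (1 + 0.5 + 0) / 3 = 50%
```

Note that success and accuracy can disagree in both directions: a run can succeed at 50% accuracy (as above), or fail at 90%+ when one `[critical]` item is only partial.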
Evaluation axes
| Axis | How to capture | Meaning |
|---|---|---|
| Success/failure | Did the executor produce the intended deliverable (binary) | Minimum bar |
| Accuracy | What % of requirements the deliverable satisfies | Degree of partial success |
| Step count | Tool-call / decision-step count used by the executor | Indicator of instruction waste |
| Duration | Executor's `duration_ms` | Proxy indicator of cognitive load |
| Retry count | How many times the same decision was redone | Signal of instruction ambiguity |
| Unclear points (self-report) | Executor enumerates as bullets | Qualitative improvement material |
| Discretionary fill-ins (self-report) | Decisions not fixed by the instruction | Surfaces implicit specification |
Weighting: Qualitative (unclear points / discretionary fill-ins) is primary, quantitative (time / step count) is auxiliary. Chasing only time reduction makes the prompt too thin.
Qualitative interpretation of `tool_uses`
Looking only at accuracy hides skill problems. Using `tool_uses` as a relative value across scenarios reveals structural defects:
- If one scenario's `tool_uses` is 3-5x or more vs the others, that is a sign the skill leans on a decision-tree index with low self-containment. The executor is being forced into a descent through references.
- Typical example: all scenarios have `tool_uses` of 1-3 but one scenario alone has 15+ → there is no recipe for that scenario in the skill itself, so it is cross-searching references/
- Countermeasure: adding an "inline minimum complete example" or "guidance on when to read references" at the top of SKILL.md in iter 2 significantly drops `tool_uses`
Even at 100% accuracy, a skew in `tool_uses` is grounds for triggering iter 2. "Cut off based on accuracy alone" tends to miss structural defects.
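The skew check above can be sketched as a helper that compares each scenario's `tool_uses` against the others. The function name and the choice of `min` of the other scenarios as the baseline are assumptions; the 3x factor is the low end of the 3-5x band named above.

```python
# Sketch: flag a scenario whose tool_uses is >= 3x the smallest value
# among the other scenarios (a conservative proxy for "3-5x vs the others").
def skewed_scenarios(tool_uses: dict[str, int], factor: float = 3.0) -> list[str]:
    flagged = []
    for name, n in tool_uses.items():
        others = [v for k, v in tool_uses.items() if k != name]
        if others and n >= factor * min(others):
            flagged.append(name)
    return flagged

# All scenarios at 1-3 except one at 15+: that scenario has no inline
# recipe in the skill and is descending into references/.
print(skewed_scenarios({"A": 2, "B": 3, "C": 16}))  # ['C']
```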
Fix propagation patterns (conservative / overshoot / zero-shoot)
Fix → effect is not linear. Pre-estimation can play out in the following 3 patterns:
- Conservative swing (estimate > actual): one fix aimed at multiple axes but only moved one. "Aiming at multiple axes tends to miss."
- Overshoot (estimate < actual): one structural piece of information (e.g., a combination of command + config + expected output) satisfied judgment wording across multiple axes at once. "Combinations of information structurally hit multiple axes."
- Zero-shoot (estimate > 0, actual = 0): a fix inferred from the axis name did not reach any of the judgment wording. "Axis names and judgment wording are different things."
To stabilize this, before applying the diff, have the subagent verbalize which judgment wording the fix satisfies. Estimates only become accurate once they are tied to the threshold-level wording. When adding a new evaluation axis, likewise concretize its judgment criteria down to the threshold-wording level (at a granularity the subagent can judge, such as "all explicit" or "full text of a minimum working configuration" — so it knows what constitutes 2 points).
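The three patterns can be sketched by comparing the set of axes a fix was estimated to move against the set it actually moved. The function and axis names are illustrative only.

```python
# Sketch: classify a fix's propagation pattern from estimated vs actual
# moved axes, per the three cases above.
def propagation(estimated: set[str], actual: set[str]) -> str:
    if not actual:
        return "zero-shoot"    # estimate > 0, actual = 0: nothing moved
    if actual - estimated:
        return "overshoot"     # moved axes beyond the estimate
    if estimated - actual:
        return "conservative"  # aimed at several axes, landed on fewer
    return "as-estimated"

print(propagation({"accuracy", "steps"}, {"accuracy"}))             # conservative
print(propagation({"accuracy"}, {"accuracy", "steps", "retries"}))  # overshoot
print(propagation({"steps"}, set()))                                # zero-shoot
```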
Subagent invocation contract
The prompt given to the executor takes the following structure. This is the input contract for "two-sided evaluation".

```
You are an executor reading <target prompt name> with a blank slate.

Target prompt

<Paste the full body of the target prompt, or specify a path for Read>

Scenario

<One paragraph setting the scenario context>

Requirements checklist (items the deliverable must satisfy)

- [critical] <item that belongs to the minimum bar>
- <normal item>
- <normal item>
...
(Judgment rules are canonically defined in "Workflow 4. Two-sided evaluation /
Instruction-side measurements". At least one [critical] is required.)

Task

- Follow the target prompt to execute the scenario and produce the deliverable.
- On completion, respond with the report structure below.

Report structure

- Deliverable: <artifact or execution summary>
- Requirement achievement: ○ / × / partial (with reason) for each item
- Trace (tag OK / stuck / skipped for each phase, one-line reason when not OK):
  - Understanding (reading the instruction and building a mental model)
  - Planning (deciding the approach / ordering)
  - Execution (actually doing the work)
  - Formatting (shaping the deliverable to the expected form)
  - Collapsed form allowed: when all four phases are OK, the single line
    "Trace: all OK" is sufficient. Emit phase-by-phase only when any phase is
    stuck or skipped. (This avoids happy-path boilerplate; the trace structure
    only earns its cost when something actually goes wrong.)
- Unclear points (structured): for each issue, three lines:
  - Issue: <what observably happened>
  - Cause: <why, diagnosed at the instruction level>
  - General Fix Rule: <a class-level rule, not a spot fix, that would prevent
    this class of mistake>
- Discretionary fill-ins: places not fixed by the instruction and filled in by
  your own judgment (bullets)
- Retries: number of times you redid the same decision and why
```

The caller extracts the self-report portion from the report and fills the evaluation-axis table by obtaining `tool_uses` / `duration_ms` from the Agent tool's usage meta.

Environment constraints
In environments where dispatching a new subagent is not possible (already running as a subagent, Task tool is disabled, etc.), do not apply this skill.
- Alternative 1: ask the parent session's user to start a separate Claude Code session and delegate the evaluation there
- Alternative 2: give up on evaluation and explicitly report to the user "empirical evaluation skipped: dispatch unavailable"
- NG: substitute with a self-reread (bias enters, so you must not trust the evaluation result)
Structural review mode: when you want to check only the consistency and clarity of the skill / prompt description rather than run empirical evaluation, carve that out explicitly as structural review mode. Note clearly in the request prompt to the subagent "this round is structural review mode: text consistency check, not execution". That way the subagent will not trip on the skip behavior in the environment-constraints section and can return a static review. Structural review is an aid to empirical evaluation, not a replacement (it cannot be used for the consecutive-clear judgment).
Iteration stopping criteria
- Convergence (stop): 2 consecutive rounds satisfying all of the following:
- New unclear points: 0
- Accuracy improvement vs previous: +3 points or less (saturation such as 5% → 8%)
- Step count variation vs previous: within ±10%
- Duration variation vs previous: within ±15%
- Overfitting check: at convergence judgment, add 1 hold-out scenario not used so far and evaluate. If accuracy drops 15 points or more from the recent average, overfitting. Go back to baseline scenario design and add edges.
- Divergence (suspect the design): if new unclear points do not decrease across 3+ iterations → the design direction of the prompt itself may be wrong. Stop fixing by patches and rewrite the structure
- Resource cutoff: stop when importance and improvement cost no longer balance (the "ship at 80 points" call)
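The convergence criterion can be sketched as a check over the last few rounds. The `Round` record, field names, and the history below are illustrative assumptions; the thresholds (+3 points, ±10% steps, ±15% duration, 2 consecutive rounds) come from the criteria above.

```python
# Sketch of the convergence rule: stop after `needed` consecutive rounds
# with zero new unclear points AND all metric deltas under the thresholds.
from typing import NamedTuple

class Round(NamedTuple):
    new_unclear: int
    accuracy: float    # %
    steps: int
    duration_ms: int

def converged(rounds: list[Round], needed: int = 2) -> bool:
    if len(rounds) < needed + 1:  # each compared round needs a predecessor
        return False
    for prev, cur in zip(rounds[-needed - 1:-1], rounds[-needed:]):
        if cur.new_unclear != 0:
            return False
        if cur.accuracy - prev.accuracy > 3:  # still improving by > +3 points
            return False
        if abs(cur.steps - prev.steps) > 0.10 * prev.steps:
            return False
        if abs(cur.duration_ms - prev.duration_ms) > 0.15 * prev.duration_ms:
            return False
    return True

history = [Round(2, 70, 10, 40000), Round(0, 85, 8, 30000),
           Round(0, 86, 8, 29000), Round(0, 87, 8, 30000)]
print(converged(history))  # True: last two rounds are clear and within thresholds
```

For high-importance prompts, pass `needed=3`; the overfitting hold-out check still applies on top of this.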
Failure pattern ledger
Maintain a cumulative list of failure modes across iterations. Without it, each iteration re-discovers the same class of mistake, and accuracy improvements stall without the operator noticing that the same `General Fix Rule` keeps surfacing under different surface wording.

Entry format:
- **Pattern name**: short descriptive handle (not "ambiguous X"; prefer "over-eager template application when skip clause is absent")
- Example: <representative Issue wording from some iter>
- General Fix Rule: <the class-level rule from that iter's structured reflection>
- Seen in: iter N, iter M, ...

Rules:
- Before generating a fix in Workflow step 5, scan the ledger. If the current `General Fix Rule` matches an existing entry, update `Seen in` and investigate why the existing fix did not prevent recurrence (wording ambiguity? position too late in the prompt? missing example?) before creating a new entry.
- A pattern that recurs 3+ times despite targeted fixes is a structural signal — escalate to the "Divergence" criterion above rather than continuing to patch.
- The ledger is per-target-prompt, not global across all empirical-prompt-tuning runs.
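One way to keep the ledger mechanical is a dict keyed by the `General Fix Rule` text, where a match records a recurrence instead of a new entry. The structure and field names here are an illustrative sketch, not prescribed by the skill.

```python
# Sketch of the per-target-prompt ledger: match on the General Fix Rule,
# append to seen_in on recurrence, add a fresh entry otherwise.
def record(ledger: dict[str, dict], fix_rule: str, iter_n: int) -> str:
    if fix_rule in ledger:
        ledger[fix_rule]["seen_in"].append(iter_n)
        return "re-seen"  # trigger: ask why the existing fix did not prevent it
    ledger[fix_rule] = {"seen_in": [iter_n]}
    return "added"

ledger: dict[str, dict] = {}
record(ledger, "state the skip clause before the template", 1)  # 'added'
record(ledger, "state the skip clause before the template", 3)  # 're-seen'
# 3+ recurrences despite targeted fixes -> escalate to the Divergence criterion
if len(ledger["state the skip clause before the template"]["seen_in"]) >= 3:
    print("escalate: structural signal")
```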
Variant exploration (optional, plateau-breaking)
When iterations approach a plateau but convergence criteria (2 consecutive clears) are not met, suspect local optimum and run a 2-variant round:
- Conservative variant: current prompt + next-best minor fix
- Exploratory variant: current prompt with one structural change — reorder sections, split a dense paragraph, drop a redundant section, or add a missing scaffolding (e.g., a worked example)
Dispatch fresh subagents on the same scenarios in parallel (one message with multiple Agent tool calls). Keep the variant with higher accuracy; on tie, prefer fewer unclear points; on further tie, prefer lower `tool_uses`.

Pairwise-comparison caveats:
- Do not ask a subagent to rate "A vs B" directly. LLM position bias and self-preference bias make such judgments noisy at small n.
- Compare on the objective axes only (accuracy, step count, unclear-points count, phase-weakness counts). Those are reproducible; "which prompt felt better" is not.
- If qualitative comparison is genuinely needed, counterbalance: run both orderings (A,B) and (B,A) and accept a verdict only if both orderings agree.
Cost: variant exploration doubles dispatch count per iteration. Use when plateau is suspected, not by default.
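The selection rule is a lexicographic comparison over the objective axes. A minimal sketch, with an assumed `Variant` record (names and numbers illustrative):

```python
# Sketch of the variant-selection rule: higher accuracy first, then fewer
# unclear points, then lower tool_uses.
from typing import NamedTuple

class Variant(NamedTuple):
    name: str
    accuracy: float
    unclear: int
    tool_uses: int

def pick(a: Variant, b: Variant) -> Variant:
    # max() over a key tuple: accuracy descending, tie-breakers ascending
    return max((a, b), key=lambda v: (v.accuracy, -v.unclear, -v.tool_uses))

conservative = Variant("conservative", 90, 1, 6)
exploratory  = Variant("exploratory", 90, 1, 4)
print(pick(conservative, exploratory).name)  # exploratory (tie broken on tool_uses)
```

Because the key is purely objective, the pick is reproducible and immune to the position and self-preference biases that make direct "A vs B" ratings noisy.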
Presentation format
Record and present to the user with the following form at each iteration:

```
Iteration N

Changes (diff from previous)
- <one-line fix content>
  - Pattern applied: <pattern name from ledger, or "(new)">

Execution results (per scenario)
| Scenario | Success/Failure | Accuracy | steps | duration | retries | Weak phase |
|---|---|---|---|---|---|---|
| A | ○ | 90% | 4 | 20s | 0 | — |
| B | × | 60% | 9 | 41s | 2 | Execution |

Structured reflection (newly surfaced this time)
- <Scenario B>: [critical] item N is × — <one-line reason for drop>
  - Issue: <what observably happened>
  - Cause: <why, at the instruction level>
  - General Fix Rule: <class-level abstraction>
- <Scenario A>: (nothing new)

Discretionary fill-ins (newly surfaced this time)
- <Scenario B>: <fill-in content>

Ledger updates
- Added: <pattern name> (from Scenario B)
- Re-seen: <pattern name> (originally iter K) — existing fix did not prevent
  recurrence because <reason>

Next fix proposal
- <one-line minimum fix>
(Convergence check: X consecutive clears / Y rounds remaining to stop condition)
```

Red flags (beware of rationalization)
| Rationalization that surfaces | Reality |
|---|---|
| "Rereading it myself has the same effect" | You cannot view text you just wrote "objectively". Always dispatch a new subagent. |
| "One scenario is enough" | One scenario overfits. Minimum 2, ideally 3. |
| "Zero unclear points once, so we're done" | Could be coincidence. Finalize with 2 consecutive rounds. |
| "Let's knock out multiple unclear points at once" | You lose track of what worked. One theme per iteration. |
| "Split each related micro-fix strictly into its own iter" | Trap in the opposite direction. "One theme" is a semantic unit. 2-3 related micro-fixes can be bundled into 1 iter. Splitting too far explodes the iter count. |
| "Metrics are good, so ignore qualitative feedback" | Time reduction can also be a sign of being too thin. Keep qualitative primary. |
| "Rewriting from scratch is faster" | Correct if unclear points do not decrease across 3+ iterations. Before that stage, it is escape. |
| "Let's reuse the same subagent" | It has learned the previous improvements. Always dispatch a new one. |
Common failures
- Scenario too easy / too hard: neither produces signal. One at the median of real use, one edge
- Only looking at metrics: chasing only time reduction strips important explanations and makes it fragile
- Too many changes per iteration: you can no longer trace "which fix back then worked". One fix per iteration
- Tuning scenarios to match the fix: making the scenario side easier just to make unclear points look eliminated → putting the cart before the horse
Related
- `superpowers:writing-skills` — the TDD approach for skill creation. Essentially the same loop as this skill's "baseline → fix → rerun with a subagent"
- `retrospective-codify` — codifying learnings after a task ends. This skill runs during prompt development; retrospective-codify runs after a task ends. Use them for different moments
- `superpowers:dispatching-parallel-agents` — conventions for running multiple scenarios in parallel