empirical-prompt-tuning

Empirical Prompt Tuning

The author of a prompt cannot judge its quality. The more "clear" a prompt seems to its writer, the more likely another agent is to get stuck when reading it. The core of this skill is to have an unbiased executor actually run the prompt, evaluate the run from both sides (executor and instruction author), and iterate. Do not stop until improvement plateaus.

When to Use It

- Immediately after creating or significantly revising a skill / slash command / task prompt
- When an agent is not behaving as expected and you suspect the cause is ambiguity in the instructions
- When you want to harden high-importance instructions (frequently used skills, core prompts for automation)

When not to use it:

- One-off, disposable prompts (the evaluation cost is not justified)
- When the goal is not to improve the success rate but only to reflect the writer's subjective preferences

Workflow

0. Iteration 0: Consistency check between description and body (static, no dispatch required)
   - Read the trigger / use case claimed by the frontmatter `description`.
   - Read the scope the body actually covers.
   - If the two diverge, align the description or the body before proceeding to iteration 1.
   - Example of a divergence to catch: the description says "navigation / form filling / data extraction", but the body contains only a CLI reference for `npx playwright test`.
   - If you skip this step, subagents "reinterpret" the body to match the description, and accuracy looks good even though the skill does not actually meet the requirements (a false positive).
1. Baseline preparation: Fix the target prompt and prepare the following two items.
   - 2–3 evaluation scenarios (1 median case + 1–2 edge cases). Use tasks that could realistically occur and that actually exercise the target prompt.
   - A requirements checklist (used to compute accuracy). For each scenario, list 3–7 requirements the deliverable must meet. Accuracy % = items met / total items, with partial credit as defined in step 4. Fix the checklist in advance and do not change it afterwards.
2. Bias-free reading: Have a "blank slate" executor read the instructions. Dispatch a new subagent with the Task tool. Do not substitute self-review (it is structurally impossible to view text you have just written objectively). To run several scenarios in parallel, place multiple Agent calls in a single message. For environments where dispatch is not possible, see the "Environmental Constraints" section.
3. Execution: Pass the subagent a prompt that follows the Subagent Activation Contract (described later) and have it execute the scenario. The executor generates the implementation or output and returns a self-report at the end.
4. Dual-perspective evaluation: Record the following from the returned results.
   - Executor self-report (extracted from the body of the subagent's report): ambiguities / discretionary completions / places where the executor got stuck applying templates
   - Instruction-side metrics (the judgment rules are defined only in this step; other sections refer here):
     - Success/failure: success (○) only if every requirement tagged `[critical]` is ○. If even one is × or partial, the result is failure (×). Only the two labels ○ / × are used.
     - Accuracy (achievement rate of the requirements checklist in %: ○ = 1, partial = 0.5, × = 0, summed and divided by the total number of items)
     - Step count (use `tool_uses` from the usage metadata returned by the Task tool as-is; Read / Grep calls are included, not excluded)
     - Duration (`duration_ms` from the Task tool's usage metadata)
     - Retry count (the number of times the subagent redid the same judgment; extracted from the subagent's self-report, since the instruction side cannot measure it)
     - On failure, add one line to the "Ambiguities" section of the presentation format stating which [critical] item failed (for root-cause tracking)
   - The requirements checklist must contain at least one item tagged `[critical]` (with none, the success judgment becomes vacuous). Do not add or remove `[critical]` tags after the fact.
5. Apply changes: Make the minimal fix to the prompt that resolves the ambiguities. One theme per iteration (several related fixes are fine; unrelated fixes wait for the next iteration).
   - Before making a fix, state explicitly which item of the requirements checklist / judgment wording the fix satisfies (fixes guessed from an axis name often miss; see the "Fix Ripple Patterns" section below).
6. Re-evaluate: Run steps 2 → 5 again with a new subagent (do not reuse the same agent: it has learned from the previous improvement). Increase parallelism if improvement does not plateau as iterations progress.
7. Convergence judgment: Stop when all of the following hold for 2 consecutive iterations (use 3 consecutive iterations for high-importance prompts):
   - 0 new ambiguities
   - Accuracy improvement vs. previous iteration: +3 percentage points or less (saturation, e.g. 5% → 8%)
   - Step count change vs. previous iteration: within ±10%
   - Duration change vs. previous iteration: within ±15%
   - Overfitting check: when judging convergence, evaluate one additional hold-out scenario that has not been used so far. If accuracy drops 15 percentage points or more below the recent average, the prompt is overfitted. Return to baseline scenario design and add edge cases.
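The success and accuracy rules from the dual-perspective evaluation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Requirement` class, the `score` function, and the sample checklist are hypothetical names introduced here, not part of the skill itself.

```python
from dataclasses import dataclass

# Points per mark, following the accuracy rule: met = 1, partial = 0.5, unmet = 0.
MARK_POINTS = {"○": 1.0, "partial": 0.5, "×": 0.0}


@dataclass
class Requirement:
    text: str
    critical: bool
    mark: str  # one of the keys in MARK_POINTS


def score(checklist: list[Requirement]) -> tuple[str, float]:
    """Return (success label, accuracy in %) for one scenario's checklist."""
    criticals = [r for r in checklist if r.critical]
    if not criticals:
        # Without a [critical] item the success judgment would be vacuous.
        raise ValueError("checklist needs at least one [critical] item")
    # Success only if every [critical] item is fully met; partial counts as failure.
    success = "○" if all(r.mark == "○" for r in criticals) else "×"
    accuracy = 100 * sum(MARK_POINTS[r.mark] for r in checklist) / len(checklist)
    return success, accuracy


# Hypothetical checklist for one scenario.
checklist = [
    Requirement("produces a runnable script", critical=True, mark="○"),
    Requirement("covers the form-filling case", critical=False, mark="partial"),
    Requirement("documents required settings", critical=False, mark="×"),
    Requirement("follows the naming style", critical=False, mark="○"),
]
print(score(checklist))  # ('○', 62.5)
```

Note that success and accuracy are independent: a scenario can succeed at 62.5% accuracy, or fail at a high accuracy if a single `[critical]` item is only partially met.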

Evaluation Axes

Axis	How it is measured	Meaning
Success/Failure	Did the executor produce the intended deliverable? (binary)	Minimum bar
Accuracy	What percentage of the requirements did the deliverable meet?	Degree of partial success
Step Count	Number of tool calls / decision steps the executor used	Indicator of waste in the instructions
Duration	Executor's duration_ms	Proxy for cognitive load
Retry Count	Number of times the same judgment was redone	Signal of ambiguity in the instructions
Ambiguities (self-report)	Bulleted list from the executor	Qualitative material for improvement
Discretionary Completions (self-report)	Judgments the instructions left undecided	Surfaces implicit specifications

Weighting: treat the qualitative axes (ambiguities, discretionary completions) as primary and the quantitative axes (time, step count) as supplementary. Chasing time reduction alone makes the prompt too thin.

Qualitative Interpretation of `tool_uses`

Accuracy alone can hide problems in a skill. Comparing `tool_uses` across scenarios as a relative value exposes structural defects:

- If one scenario's `tool_uses` is 3–5x or more that of the other scenarios, the skill leans toward being a decision-tree index and is not self-contained: the executor is being forced to descend into references.
- Typical example: every scenario has `tool_uses` of 1–3, but one scenario has 15+. The skill has no recipe for that scenario, so the executor is searching across references/.
- Fix: in iteration 2, add an inline minimal complete example or guidance on when to read references at the top of SKILL.md; `tool_uses` then drops substantially.

Even at 100% accuracy, a skew in `tool_uses` is grounds for running iteration 2. Stopping on accuracy alone tends to miss structural defects.
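A relative comparison like this is easy to automate. The sketch below flags any scenario whose `tool_uses` is at least 3x the median of the other scenarios; the function name, the choice of median as the baseline, and the default ratio of 3 (the lower end of the 3–5x range) are assumptions made for illustration.

```python
from statistics import median


def skewed_scenarios(tool_uses: dict[str, int], ratio: float = 3.0) -> list[str]:
    """Return scenarios whose tool_uses is >= ratio x the median of the other scenarios."""
    flagged = []
    for name, count in tool_uses.items():
        others = [c for n, c in tool_uses.items() if n != name]
        if not others:
            continue  # a single scenario has nothing to be compared against
        baseline = median(others)
        if baseline > 0 and count >= ratio * baseline:
            flagged.append(name)
    return flagged


# Scenario C matches the "1-3 everywhere, 15+ in one scenario" pattern.
print(skewed_scenarios({"A": 2, "B": 3, "C": 16}))  # ['C']
```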

Fix Ripple Patterns (Conservative / Overperformance / No Impact)

The relationship between a fix and its effect is not linear. Compared with the estimate made beforehand, three outcomes can occur:

- Conservative (estimate > actual): one fix targeted several axes, but only one axis moved. "Targeting multiple axes tends to miss."
- Overperformance (estimate < actual): a single piece of structural information (e.g., the combination of command + settings + expected output) satisfied the judgment wording of several axes at once. "Combined information structurally affects multiple axes."
- No impact (estimate > 0, actual = 0): a fix guessed from an axis name reached none of the judgment wording. "Axis names and judgment wording are different things."

To stabilize this, have the subagent state, before the change is applied, which judgment wording the fix satisfies. Unless the fix is tied to the wording of a specific threshold, the estimate will not be accurate. When adding a new evaluation axis, likewise make the judgment criterion for each score concrete down to the threshold wording (for example "everything stated explicitly" or "full text of a minimal working configuration": a granularity at which a subagent can decide what earns 2 points).
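As a small illustration, the three patterns reduce to a comparison between the estimated and the measured number of judgment items a fix satisfied. The function below and its counting convention are hypothetical, introduced only to make the definitions concrete.

```python
def classify_ripple(estimated: int, actual: int) -> str:
    """Classify a fix by the estimated vs. actual number of judgment items it satisfied."""
    if estimated > 0 and actual == 0:
        return "no impact"        # the fix reached none of the judgment wording
    if actual < estimated:
        return "conservative"     # fewer items moved than estimated
    if actual > estimated:
        return "overperformance"  # more items moved than estimated
    return "as estimated"


print(classify_ripple(3, 1))  # conservative
print(classify_ripple(1, 3))  # overperformance
print(classify_ripple(2, 0))  # no impact
```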

Subagent Activation Contract

The prompt passed to the executor takes the following structure. This is the input contract for the dual-perspective evaluation.

You are an executor reading <target prompt name> with a blank slate.

Target Prompt

<Paste the full text of the target prompt, or specify a path for the executor to Read>

Scenario

<1 paragraph of scenario context>

Requirements Checklist (items the deliverable must meet)

[critical] <item that belongs to the minimum bar>
<regular item>
<regular item>
...

(The judgment rules are defined centrally in "Workflow 4. Dual-perspective evaluation / Instruction-side metrics". At least one [critical] item is required.)

Task

Execute the scenario according to the target prompt and generate a deliverable.
When finished, reply using the report structure below.

Report Structure

Deliverable: <generated output or summary of execution results>
Requirements Met: for each item, ○ / × / partial (with reason)
Ambiguities: places where the target prompt made you get stuck, or wording you were unsure how to interpret (bulleted list)
Discretionary Completions: places the instructions left undecided that you filled in with your own judgment (bulleted list)
Retries: number of times you redid the same judgment, and why

The caller extracts the self-reported parts from the report, takes `tool_uses` / `duration_ms` from the Agent tool's usage metadata, and fills in the evaluation axis table.
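Assembling the executor prompt from the contract can be sketched as plain string building. The helper name, its parameters, the "- " list prefix, and the sample values are assumptions for illustration; how the resulting string is actually handed to a subagent depends on the environment and is not shown.

```python
def build_executor_prompt(
    prompt_name: str,
    prompt_body: str,
    scenario: str,
    checklist: list[tuple[str, bool]],  # (requirement text, is_critical)
) -> str:
    """Build an executor prompt following the contract's section order."""
    if not any(critical for _, critical in checklist):
        raise ValueError("at least one [critical] requirement is required")
    items = "\n".join(
        f"- {'[critical] ' if critical else ''}{text}" for text, critical in checklist
    )
    sections = [
        f"You are an executor reading {prompt_name} with a blank slate.",
        f"Target Prompt\n{prompt_body}",
        f"Scenario\n{scenario}",
        f"Requirements Checklist (items the deliverable must meet)\n{items}",
        "Task\n"
        "Execute the scenario according to the target prompt and generate a deliverable.\n"
        "When finished, reply using the report structure below.",
        "Report Structure\n"
        "Deliverable: <generated output or summary of execution results>\n"
        "Requirements Met: for each item, ○ / × / partial (with reason)\n"
        "Ambiguities: <bulleted list>\n"
        "Discretionary Completions: <bulleted list>\n"
        "Retries: <count and reason>",
    ]
    return "\n\n".join(sections)


# Hypothetical inputs, for illustration only.
prompt = build_executor_prompt(
    "example-skill",
    "Full text of the skill under test goes here.",
    "A user asks for a script that fills in a login form.",
    [("produces a runnable script", True), ("explains required settings", False)],
)
print(prompt)
```

Keeping the checklist in the prompt means the executor grades against the same fixed items the caller uses, so the self-report and the instruction-side judgment refer to one list.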

Environmental Constraints

In environments where a new subagent cannot be dispatched (e.g., you are already running as a subagent, or the Task tool is disabled), do not apply this skill.

- Alternative 1: Ask the user of the parent session to start a separate Claude Code session and run the evaluation there
- Alternative 2: Give up on the evaluation and report explicitly to the user: "empirical evaluation skipped: dispatch unavailable"
- Not acceptable: Substituting self-review (bias creeps in, so the evaluation results cannot be trusted)

Structural review mode: if you want to check only the consistency and clarity of the text of a skill / prompt, rather than run an empirical evaluation, separate that out explicitly as structural review mode. State in the request prompt to the subagent: "This run is structural review mode: a text consistency check, not execution." The subagent can then return a static review without tripping the skip behavior described in this section. Structural review supplements empirical evaluation; it does not replace it and cannot count toward the consecutive-pass judgment.

Iteration Termination Criteria

- Convergence (stop): all of the following hold for 2 consecutive iterations:
  - New ambiguities: 0
  - Accuracy improvement vs. previous iteration: +3 percentage points or less (saturation, e.g. 5% → 8%)
  - Step count change vs. previous iteration: within ±10%
  - Duration change vs. previous iteration: within ±15%
  - Overfitting check: when judging convergence, evaluate one additional hold-out scenario that has not been used so far. If accuracy drops 15 percentage points or more below the recent average, the prompt is overfitted. Return to baseline scenario design and add edge cases.
- Divergence (question the design): if new ambiguities do not decrease after 3 or more iterations, the design approach of the prompt itself may be wrong. Stop patching and rewrite the structure.
- Resource cutoff: stop when the improvement cost is no longer proportionate to the prompt's importance (the decision to ship at 80 points)
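The convergence thresholds can be checked mechanically. The sketch below is one possible reading under stated assumptions: the `IterationMetrics` class and function names are hypothetical, each iteration's metrics are assumed to be averaged over its scenarios (the text does not say how to aggregate), and the accuracy bound is applied literally, so a drop in accuracy also satisfies "improvement of +3 points or less".

```python
from dataclasses import dataclass


@dataclass
class IterationMetrics:
    new_ambiguities: int
    accuracy: float     # percent, assumed averaged over scenarios
    steps: float        # tool_uses, assumed averaged over scenarios
    duration_ms: float  # assumed averaged over scenarios


def is_stable(prev: IterationMetrics, curr: IterationMetrics) -> bool:
    """Check one iteration against the previous one using the stated thresholds."""
    return (
        curr.new_ambiguities == 0
        and curr.accuracy - prev.accuracy <= 3
        and abs(curr.steps - prev.steps) <= 0.10 * prev.steps
        and abs(curr.duration_ms - prev.duration_ms) <= 0.15 * prev.duration_ms
    )


def has_converged(history: list[IterationMetrics], required_passes: int = 2) -> bool:
    """True if the last `required_passes` iterations are each stable vs. their predecessor."""
    if len(history) < required_passes + 1:
        return False
    recent = history[-(required_passes + 1):]
    return all(is_stable(a, b) for a, b in zip(recent, recent[1:]))


def is_overfitted(recent_avg_accuracy: float, holdout_accuracy: float) -> bool:
    """Hold-out check: a drop of 15 points or more signals overfitting."""
    return recent_avg_accuracy - holdout_accuracy >= 15


history = [
    IterationMetrics(2, 80, 10, 40000),
    IterationMetrics(0, 90, 10, 38000),
    IterationMetrics(0, 92, 10, 37000),
    IterationMetrics(0, 93, 9, 36000),
]
print(has_converged(history[:3]))  # False: accuracy jumped 10 points in iteration 2
print(has_converged(history))      # True: two consecutive stable iterations
```

For a high-importance prompt, passing `required_passes=3` applies the stricter three-in-a-row rule.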

Presentation Format

Record each iteration and present it to the user in the following format:

Iteration N

Changes (diff from previous iteration)

<1 line describing the fix>

Execution Results (by scenario)

Scenario	Success/Failure	Accuracy	Steps	Duration	Retries
A	○	90%	4	20s	0
B	×	60%	9	41s	2

New Ambiguities (this iteration)

<Scenario B>: [critical] item N is ×: <1-line reason it failed> # always include on failure
<Scenario B>: <other finding, 1 line>
<Scenario A>: (nothing new)

New Discretionary Completions (this iteration)

<Scenario B>: <what was filled in>

Next Proposed Fix

<1-line minimal fix>

(Convergence judgment: X consecutive passes / Y more iterations until the stop condition)

Red Flags (Watch for Rationalization)

Rationalization	Reality
"Rereading it myself has the same effect"	You cannot view text you just wrote objectively. Always dispatch a new subagent.
"One scenario is enough"	A single scenario overfits. Use at least 2, preferably 3.
"Zero ambiguities came up once, so I'm done"	It may be a coincidence. Confirm with 2 consecutive iterations.
"Let's fix several ambiguities at once"	You will not know what worked. One theme per iteration.
"Let's split even related minor fixes strictly into one per iteration"	The trap in the opposite direction. "One theme" is a unit of meaning. 2–3 related minor fixes can go into one iteration. Splitting too finely makes the iteration count explode.
"The metrics are good, so ignore the qualitative feedback"	Shorter time can also be a sign that the prompt has become too thin. Qualitative signals come first.
"Rewriting would be faster"	Correct if ambiguities have not decreased after 3 or more iterations. Before that point, it is an escape.
"Let's reuse the same subagent"	It has learned from the previous improvement. Dispatch a new one every time.

Common Failures

- Scenarios are too easy / too hard: neither produces a signal. Use one median case and one edge case from real-world usage.
- Looking only at metrics: chasing time reduction alone strips out important explanations and makes the prompt fragile.
- Too many changes per iteration: you can no longer trace which of the fixes worked. Keep to one theme per iteration.
- Tuning the scenarios to fit the fixes: making the scenarios easier so that ambiguities appear resolved defeats the purpose.
