review-llm-annotations-and-improve-prompt

Improve Metric


Analyze human review annotations against machine-generated scores for a single text LLM judge metric. Calculate agreement, identify patterns in disagreements, and propose an improved metric prompt based on transcript analysis and reviewer feedback.

Critical Rules

- NEVER update the metric prompt without explicit user approval. Always present the proposed changes and wait for confirmation before running `coval metrics update`.
- Only operate on one metric at a time. If the user provides a project with multiple metrics, ask which one to focus on.
- Only support text LLM judge metrics. Valid types: `METRIC_LLM_BINARY`, `METRIC_CATEGORICAL`, `METRIC_NUMERICAL_LLM_JUDGE`. Reject audio metrics, toolcall, metadata, regex, pause, and composite types.
- NEVER fabricate data. If a CLI command or API call fails, report the error; do not invent scores or transcripts.
- Use the CLI first; fall back to curl only when the CLI does not return required fields (e.g., the LLM explanation under `result.llm.answer_explanation`).

Usage

/review-llm-annotations-and-improve-prompt <project_id> <metric_id>

Both arguments are required. If the user omits one, ask for it.

Phase 0: Validate Inputs

Step 1: Fetch the metric and validate its type

```bash
coval metrics get <metric_id> --format json
```

Extract from the response:

- `metric_name`: the full display name
- `metric_type`: must be one of `METRIC_LLM_BINARY`, `METRIC_CATEGORICAL`, `METRIC_NUMERICAL_LLM_JUDGE`
- `prompt`: the current evaluation prompt
- `min_value` / `max_value`: for numerical metrics only
- `categories`: for categorical metrics only

If the metric type is not a text LLM judge, STOP and tell the user:

> "This skill only supports text LLM judge metrics (binary, categorical, numerical). The metric {metric_name} is type {metric_type}, which is not supported."
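The type check in this step can be sketched in Python. This is a minimal sketch, assuming the JSON printed by `coval metrics get` has already been parsed into a dict with top-level `metric_name` and `metric_type` keys; the helper name `validate_metric` is hypothetical and not part of the CLI.

```python
from typing import Optional

# The three text LLM judge types this skill supports.
SUPPORTED_TYPES = {
    "METRIC_LLM_BINARY",
    "METRIC_CATEGORICAL",
    "METRIC_NUMERICAL_LLM_JUDGE",
}


def validate_metric(metric: dict) -> Optional[str]:
    """Return None if the metric is supported, else the message to show the user."""
    metric_type = metric.get("metric_type")
    if metric_type in SUPPORTED_TYPES:
        return None
    name = metric.get("metric_name", "<unknown>")
    return (
        "This skill only supports text LLM judge metrics "
        "(binary, categorical, numerical). "
        f"The metric {name} is type {metric_type}, which is not supported."
    )
```

A non-`None` return value means the workflow should stop and relay the message instead of continuing to Step 2.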

Step 2: Fetch the review project

```bash
coval review-projects get <project_id> --format json
```

Verify:

- The `linked_metric_ids` list includes the target metric ID
- Note the assignees and linked simulation IDs

If the metric is not linked to the project, STOP and tell the user.

Step 3: Present a summary before proceeding

Metric: {metric_name} (`{metric_id}`)
Type: {metric_type}
Current prompt: "{prompt}"
Project: {project_display_name} ({N} simulations, {M} assignees)

Proceeding to gather annotations and calculate agreement.

Phase 1: Gather Data

Step 1: Gather annotations with ground-truth values for this metric in the project

Fetch both completed and all annotations in parallel:

```bash
# Completed annotations
coval review-annotations list \
  --filter 'project_id="{project_id}" AND metric_id="{metric_id}" AND completion_status="COMPLETED"' \
  --format json \
  --page-size 100

# All annotations (to find pending ones with ground-truth values)
coval review-annotations list \
  --filter 'project_id="{project_id}" AND metric_id="{metric_id}"' \
  --format json \
  --page-size 100
```


If there are more than 100 annotations in either list, paginate.

From the full list, identify **pending annotations that have a ground-truth value** — where `completion_status` is `PENDING` but `ground_truth_float_value` is not null OR `ground_truth_string_value` is not null. Reviewer notes are optional.

Present what was found:

> **Found {completed_count} completed annotations and {pending_with_value_count} pending annotations with ground-truth values.**

If there are pending annotations with values, ask:

> **{pending_with_value_count} pending annotations have ground-truth values but are not marked as completed. Would you like to include them in the analysis?** (yes/no)

- **If yes** — merge them into the dataset alongside completed annotations
- **If no** — proceed with only the completed annotations
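The selection of pending annotations with ground-truth values can be sketched as follows. This is a minimal sketch, assuming each annotation has been parsed into a dict that carries the `completion_status`, `ground_truth_float_value`, and `ground_truth_string_value` fields named above; the helper names are hypothetical.

```python
from typing import List, Tuple


def has_ground_truth(annotation: dict) -> bool:
    """True if either ground-truth field is set (not None)."""
    return (
        annotation.get("ground_truth_float_value") is not None
        or annotation.get("ground_truth_string_value") is not None
    )


def split_annotations(annotations: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split the full list into (completed, pending-with-ground-truth)."""
    completed = [
        a for a in annotations if a.get("completion_status") == "COMPLETED"
    ]
    pending_with_value = [
        a
        for a in annotations
        if a.get("completion_status") == "PENDING" and has_ground_truth(a)
    ]
    return completed, pending_with_value
```

The check uses `is not None` rather than truthiness because `0.0` is a valid ground-truth value (a binary "fail") and must not be treated as missing.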

**If zero usable annotations exist after this step, STOP:**
> "No annotations with ground-truth values found for this metric in this project. Reviewers need to submit ground-truth values before agreement can be calculated."


Step 2: For each annotation, gather the full picture

For each annotation, collect these four pieces of data:

a) Ground truth and reviewer notes (already in the annotation)

- `ground_truth_float_value` or `ground_truth_string_value`: the human label
- `reviewer_notes`: the reviewer's reasoning (may be null)

b) Machine score

Use the annotation's `simulation_output_id` to get the machine-generated value:

```bash
coval simulations metrics <simulation_output_id> --format json
```

Find the metric output where `metric_id` matches the target metric. Extract `value` as the machine score.
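The lookup of the machine score can be sketched as below. This is a minimal sketch under an assumption about the response shape: it treats the parsed CLI output as a list of metric-output dicts, each with `metric_id` and `value` keys. The actual JSON layout may differ, and the function name `find_machine_score` is hypothetical.

```python
from typing import Any, List, Optional


def find_machine_score(metric_outputs: List[dict], target_metric_id: str) -> Optional[Any]:
    """Return the `value` of the output whose metric_id matches, or None if absent."""
    for output in metric_outputs:
        if output.get("metric_id") == target_metric_id:
            return output.get("value")
    # No matching output: report this to the user rather than inventing a score.
    return None
```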

c) Transcript

```bash
coval simulations get <simulation_output_id> --format json
```

Extract the `transcript` array. Format it as readable text:

```
[role]: content
[role]: content
...
```
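The transcript formatting can be sketched as follows. This is a minimal sketch, assuming each entry of the `transcript` array is a dict with `role` and `content` keys, as the `[role]: content` layout suggests; the function name is hypothetical.

```python
from typing import List


def format_transcript(transcript: List[dict]) -> str:
    """Render each turn as '[role]: content', one turn per line."""
    lines = []
    for turn in transcript:
        role = turn.get("role", "unknown")
        content = turn.get("content", "")
        lines.append(f"[{role}]: {content}")
    return "\n".join(lines)
```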

d) LLM explanation (via API if not in CLI output)

The LLM's reasoning is stored in the metric output's `result` field under `result.llm.answer_explanation`. The CLI's `coval simulations metric-detail` does not return this field.

Attempt to fetch it via the API:

```bash
curl -s "https://api.coval.dev/v1/simulations/{simulation_output_id}/outputs/{simulation_output_id}/metrics/{metric_output_id}" \
  -H "X-API-Key: $COVAL_API_KEY"
```

If the response includes `result.llm.answer_explanation`, use it. If not, proceed without the LLM explanation; the transcript and reviewer notes are sufficient.

Important: Get the API key from the CLI config:

```bash
coval config get api-key
```

If that fails, check `~/.config/coval/config.json` or the `COVAL_API_KEY` environment variable.
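Reading the optional explanation from the API response can be sketched as below. This is a minimal sketch, assuming the response body parses to a dict with `result` at its top level; the helper name `get_llm_explanation` is hypothetical.

```python
from typing import Optional


def get_llm_explanation(metric_output: dict) -> Optional[str]:
    """Safely read result.llm.answer_explanation; return None if any level is missing."""
    node = metric_output
    for key in ("result", "llm", "answer_explanation"):
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    # Treat empty or non-string explanations as unavailable.
    return node if isinstance(node, str) and node.strip() else None
```

Returning `None` lets the analysis proceed with only the transcript and reviewer notes, which matches the fallback described in this step.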

Step 3: Build the analysis dataset

For each annotation, you should have:

| Field | Source |
|---|---|
| `simulation_id` | annotation.simulation_output_id |
| `human_label` | annotation.ground_truth_float_value OR ground_truth_string_value |
| `machine_label` | simulation metric output value |
| `reviewer_notes` | annotation.reviewer_notes |
| `transcript` | simulation.transcript |
| `llm_explanation` | metric output result.llm.answer_explanation (if available) |
| `agrees` | computed in Phase 2 |

Phase 2: Calculate Agreement

Agreement rules by metric type

Binary (`METRIC_LLM_BINARY`)

Machine output is typically `1.0` (pass) or `0.0` (fail). Human ground truth is the same.

```
agrees = (human_label == machine_label)
```

Tolerance: 0 (exact match only).

Categorical (`METRIC_CATEGORICAL`)

Machine and human both produce a category string.

```
agrees = (human_label == machine_label)
```

Tolerance: 0 (exact match only).

Numerical (`METRIC_NUMERICAL_LLM_JUDGE`)

Machine and human both produce a float in the range `[min_value, max_value]`.

```
tolerance = 0  # exact match by default
agrees = (abs(human_label - machine_label) <= tolerance)
```

Tolerance: 0 (exact match). If the agreement rate is very low with exact match, note this in the report and suggest that the user consider a wider tolerance. Do NOT automatically widen it.
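The three agreement rules can be combined into one function. This is a minimal sketch; the function name `agrees` and its signature are illustrative, and the tolerance defaults to 0 so that widening it remains an explicit user decision.

```python
def agrees(metric_type: str, human_label, machine_label, tolerance: float = 0.0) -> bool:
    """Apply the exact-match (or tolerance) rule for the given metric type."""
    if metric_type == "METRIC_CATEGORICAL":
        # Category strings must match exactly, including case.
        return str(human_label) == str(machine_label)
    if metric_type == "METRIC_LLM_BINARY":
        # 1.0 (pass) or 0.0 (fail) on both sides.
        return float(human_label) == float(machine_label)
    if metric_type == "METRIC_NUMERICAL_LLM_JUDGE":
        return abs(float(human_label) - float(machine_label)) <= tolerance
    raise ValueError(f"Unsupported metric type: {metric_type}")
```

For example, `agrees("METRIC_NUMERICAL_LLM_JUDGE", 4.0, 3.0)` is a disagreement under the default tolerance, and becomes an agreement only if the caller passes `tolerance=1.0`.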

Compute and present the agreement report

Present results in this format:

Agreement Report

Metric: {metric_name} ({metric_type})
Total completed annotations: {N}
Agreement rate: {agree_count}/{N} ({percentage}%)

Breakdown

| | Count | % |
|---|---|---|
| Agree | {agree_count} | {agree_pct}% |
| Disagree | {disagree_count} | {disagree_pct}% |

{For numerical metrics only:}
Mean absolute error: {mae}
Score range: {min_value}–{max_value}
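The numbers in the report can be computed as sketched below. This is a minimal sketch, assuming each record is a dict with an `agrees` boolean and, for numerical metrics, float `human_label` and `machine_label` values; the function name `agreement_summary` is hypothetical.

```python
from typing import List


def agreement_summary(records: List[dict], numerical: bool = False) -> dict:
    """Compute counts, agreement percentage, and (optionally) mean absolute error."""
    total = len(records)
    if total == 0:
        raise ValueError("No annotations to summarize")
    agree_count = sum(1 for r in records if r["agrees"])
    summary = {
        "total": total,
        "agree_count": agree_count,
        "disagree_count": total - agree_count,
        "agree_pct": round(100 * agree_count / total, 1),
    }
    if numerical:
        # Mean absolute error between human and machine scores.
        errors = [abs(r["human_label"] - r["machine_label"]) for r in records]
        summary["mae"] = sum(errors) / total
    return summary
```

Raising on an empty list mirrors the STOP condition from Phase 1: with no usable annotations there is nothing to report.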

Phase 3: Analyze Patterns

This is the most critical phase. You are acting as a metric prompt engineer analyzing real disagreements to improve accuracy.

Step 1: Analyze disagreements

For each annotation where `agrees == false`:

1. Read the transcript carefully: understand the full conversation
2. Read the reviewer notes: understand WHY the human disagreed with the machine
3. Read the LLM explanation (if available): understand the machine's reasoning
4. Determine who was more likely correct and why the machine erred

Look for patterns across disagreements:

- Systematic bias: Does the machine consistently score too high or too low?
- Edge cases: Are there conversation patterns the prompt doesn't account for?
- Ambiguity: Is the prompt vague about certain scenarios?
- Missing context: Does the prompt fail to instruct the LLM about important nuances?
- Misinterpretation: Does the LLM misunderstand what the prompt is asking?
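The systematic-bias question can be given a rough quantitative check for binary and numerical metrics, where both labels are floats. This is a minimal sketch with a hypothetical helper name; it does not apply to categorical metrics, and it supplements rather than replaces reading the transcripts.

```python
from typing import List


def bias_summary(disagreements: List[dict]) -> dict:
    """Count how often the machine scored above or below the human label."""
    higher = sum(1 for d in disagreements if d["machine_label"] > d["human_label"])
    lower = sum(1 for d in disagreements if d["machine_label"] < d["human_label"])
    if disagreements:
        mean_signed = sum(
            d["machine_label"] - d["human_label"] for d in disagreements
        ) / len(disagreements)
    else:
        mean_signed = 0.0
    return {
        "machine_higher": higher,
        "machine_lower": lower,
        "mean_signed_error": mean_signed,
    }
```

A strongly lopsided split between `machine_higher` and `machine_lower` hints that the prompt is too lenient or too strict overall, while a balanced split points toward edge cases or ambiguity instead.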

Step 2: Analyze agreements

For each annotation where `agrees == true`:

- Read a sample of reviewer notes (if any): understand what the metric handles well
- Identify what the current prompt does correctly: these strengths must be preserved

Step 3: Synthesize findings

Present a structured analysis:

Analysis

What the metric gets right

{bullet points from agreement analysis}

Disagreement patterns

Pattern 1: {description} (seen in {N} annotations)

- Example: Simulation {sim_id}, Human: {value}, Machine: {value}
- Reviewer note: "{note}"
- Root cause: {why the machine got this wrong}

Pattern 2: ...

Key insights from reviewer notes

{synthesized themes from reviewer feedback}

Phase 3.5: Choose Resolution Path

After presenting the analysis, ask the user how they want to resolve the disagreements. The machine may be wrong (prompt needs fixing), the human labels may be wrong (labels need correcting), or both.

Present the disagreements as a numbered list:

Disagreements

| # | Simulation | Human Label | Machine Label | Reviewer Note |
|---|---|---|---|---|
| 1 | {sim_id} | {human} | {machine} | {note or "—"} |
| 2 | {sim_id} | {human} | {machine} | {note or "—"} |
| ... | | | | |

How would you like to resolve these?

1. Update the metric prompt: fix the prompt so the machine aligns with the human labels
2. Update human labels: correct the human labels to align with the machine scores
3. Both: update some labels and fix the prompt

If choosing option 2 or 3, specify which disagreements to relabel (e.g., "update labels for #1, #3, #5" or "all").

Handling label updates

If the user chooses to update labels (option 2 or 3):

For each annotation the user selects, confirm the new ground-truth value:

> "Annotation #{n} (Simulation {sim_id}): Current human label is `{human_value}`. Update to `{machine_value}` (the machine score)? Or enter a different value."

After confirmation, update each annotation:

```bash
coval review-annotations update <annotation_id> \
  --ground-truth-float <new_value> \
  --notes "Label corrected during metric improvement review. Previous value: {old_value}"
```

For string-based metrics, use `--ground-truth-string` instead of `--ground-truth-float`.

After all label updates, recalculate the agreement rate with the corrected labels and present the updated report.

If the user chose option 2 (labels only), skip to the end; no prompt changes are needed.

If the user chose option 3 (both), continue to Phase 4 with the remaining disagreements that were NOT resolved by label updates.
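Choosing the right flag for a label update can be sketched as below. This is a minimal sketch that only assembles the argument list for the update command shown in this section. It assumes `METRIC_CATEGORICAL` is the only string-based type, and the helper name `build_update_command` is hypothetical. It must be run only after the user has confirmed the new value.

```python
from typing import List


def build_update_command(annotation_id: str, metric_type: str, new_value, old_value) -> List[str]:
    """Build the argv list for `coval review-annotations update` (does not run it)."""
    # Assumption: categorical metrics store string labels; other types store floats.
    if metric_type == "METRIC_CATEGORICAL":
        flag = "--ground-truth-string"
    else:
        flag = "--ground-truth-float"
    note = (
        "Label corrected during metric improvement review. "
        f"Previous value: {old_value}"
    )
    return [
        "coval", "review-annotations", "update", annotation_id,
        flag, str(new_value),
        "--notes", note,
    ]
```

Passing the result to `subprocess.run` as a list avoids shell quoting problems when a label or note contains spaces or quotes.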

Handling prompt updates

If the user chooses option 1 or 3, proceed to Phase 4.

Phase 4: Propose Improved Prompt

Step 1: Draft the improved prompt

Based on the analysis in Phase 3, draft an improved version of the metric prompt that:

- Preserves what works: keeps language that leads to correct evaluations
- Addresses disagreement patterns: adds explicit instructions for edge cases
- Incorporates reviewer feedback: uses the language and reasoning from reviewer notes
- Remains clear and concise: avoids bloating the prompt with excessive detail
- Maintains the same evaluation format: the metric type and output format must not change

Step 2: Present the proposal

Show the current and proposed prompts side by side:

Proposed Prompt Update

Current prompt

{current_prompt}

Proposed prompt

{new_prompt}

Changes explained

{bullet point explaining each significant change and which disagreement pattern it addresses}

Would you like to apply this prompt update? (yes/no)
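When the prompts are long, a line-level diff can make the "Changes explained" part easier to write and review. This is a minimal sketch using Python's standard `difflib` module; the helper name `prompt_diff` is hypothetical, and the diff complements the full side-by-side view rather than replacing it.

```python
import difflib


def prompt_diff(current_prompt: str, new_prompt: str) -> str:
    """Return a unified diff of the two prompts, or an empty string if identical."""
    diff = difflib.unified_diff(
        current_prompt.splitlines(),
        new_prompt.splitlines(),
        fromfile="current prompt",
        tofile="proposed prompt",
        lineterm="",
    )
    return "\n".join(diff)
```

Lines prefixed with `-` were removed from the current prompt and lines prefixed with `+` were added in the proposal.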

Step 3: Handle user response

- If yes → proceed to Phase 5
- If no → ask what they'd like to change. Iterate on the prompt until they approve or decide to stop.
- If they want to edit manually, that's fine too. The analysis is the primary value.

Phase 5: Update the Metric

Run this phase only after the user explicitly approves the prompt.

```bash
coval metrics update <metric_id> --prompt "<approved_prompt>"
```

Verify the update:

```bash
coval metrics get <metric_id> --format json
```

Confirm the prompt matches what was approved, then tell the user:

> Metric prompt updated successfully. You can re-run simulations and create a new review project to validate the improvement.

Reference

CLI Commands Used

| Command | Purpose |
|---|---|
| `coval metrics get <id>` | Fetch metric definition (name, type, prompt) |
| `coval review-projects get <id>` | Fetch project details |
| `coval review-annotations list --filter '...'` | List completed annotations |
| `coval simulations metrics <sim_id>` | Get machine scores for a simulation |
| `coval simulations get <sim_id>` | Get transcript |
| `coval metrics update <id> --prompt "..."` | Update metric prompt (after approval) |

Supported Metric Types

| Type | Machine Output | Human Output | Agreement |
|---|---|---|---|
| `METRIC_LLM_BINARY` | `1.0` (pass) / `0.0` (fail) | Same | Exact match |
| `METRIC_CATEGORICAL` | Category string | Same | Exact match |
| `METRIC_NUMERICAL_LLM_JUDGE` | Float in [min, max] | Same | Exact match (tolerance = 0) |

Data Flow


Review Project
  └─ Annotations (filtered by metric_id, completion_status=COMPLETED)
       ├─ ground_truth_float_value / ground_truth_string_value  ← human label
       ├─ reviewer_notes                                         ← human reasoning
       └─ simulation_output_id
            ├─ coval simulations metrics → value                 ← machine label
            ├─ coval simulations get → transcript                ← conversation
            └─ API: result.llm.answer_explanation                ← machine reasoning

