review-llm-annotations-and-improve-prompt

Improve Metric

Analyze human review annotations against machine-generated scores for a single text LLM judge metric. Calculate agreement, identify patterns in disagreements, and propose an improved metric prompt based on transcript analysis and reviewer feedback.

Critical Rules

  1. NEVER update the metric prompt without explicit user approval. Always present the proposed changes and wait for confirmation before running `coval metrics update`.
  2. Only operate on one metric at a time. If the user provides a project with multiple metrics, ask which one to focus on.
  3. Only support text LLM judge metrics. Valid types: `METRIC_LLM_BINARY`, `METRIC_CATEGORICAL`, `METRIC_NUMERICAL_LLM_JUDGE`. Reject audio metrics, toolcall, metadata, regex, pause, and composite types.
  4. NEVER fabricate data. If a CLI command or API call fails, report the error; do not invent scores or transcripts.
  5. Use the CLI first; fall back to curl only when the CLI does not return required fields (e.g., the LLM explanation under `result.llm.answer_explanation`).

Usage

/review-llm-annotations-and-improve-prompt <project_id> <metric_id>
Both arguments are required. If the user omits one, ask for it.

Phase 0: Validate Inputs

Step 1: Fetch the metric and validate its type

```bash
coval metrics get <metric_id> --format json
```
Extract from the response:
  • `metric_name`: the full display name
  • `metric_type`: must be one of `METRIC_LLM_BINARY`, `METRIC_CATEGORICAL`, `METRIC_NUMERICAL_LLM_JUDGE`
  • `prompt`: the current evaluation prompt
  • `min_value` / `max_value`: for numerical metrics only
  • `categories`: for categorical metrics only
If the metric type is not a text LLM judge, STOP and tell the user:
"This skill only supports text LLM judge metrics (binary, categorical, numerical). The metric `{metric_name}` is type `{metric_type}`, which is not supported."
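The type check above can be sketched in Python as follows. This is a minimal illustration, not part of the skill itself: it assumes the JSON response has already been parsed into a dict with top-level `metric_name` and `metric_type` keys, and `check_metric_type` is a hypothetical helper name.

```python
from typing import Optional

# The three text LLM judge types the skill accepts.
SUPPORTED_TYPES = {
    "METRIC_LLM_BINARY",
    "METRIC_CATEGORICAL",
    "METRIC_NUMERICAL_LLM_JUDGE",
}


def check_metric_type(metric: dict) -> Optional[str]:
    """Return the rejection message for an unsupported type, or None if supported.

    Assumes `metric` is the parsed JSON response with top-level
    `metric_name` and `metric_type` keys (an assumption about the payload shape).
    """
    metric_type = metric.get("metric_type")
    if metric_type in SUPPORTED_TYPES:
        return None
    return (
        "This skill only supports text LLM judge metrics "
        "(binary, categorical, numerical). "
        f"The metric {metric.get('metric_name')} is type {metric_type}, "
        "which is not supported."
    )
```

A return value of `None` means the workflow may continue; any string is the message to show before stopping.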

Step 2: Fetch the review project

```bash
coval review-projects get <project_id> --format json
```
Check the response:
  • Verify that `linked_metric_ids` includes the target metric ID
  • Note the assignees and linked simulation IDs
If the metric is not linked to the project, STOP and tell the user.

Step 3: Present a summary before proceeding

Metric: {metric_name} (`{metric_id}`)
Type: {metric_type}
Current prompt: "{prompt}"
Project: {project_display_name} ({N} simulations, {M} assignees)
Proceeding to gather annotations and calculate agreement.

Phase 1: Gather Data

Step 1: Gather annotations with ground-truth values for this metric in the project

Fetch both completed and all annotations in parallel:

```bash
# Completed annotations
coval review-annotations list \
  --filter 'project_id="{project_id}" AND metric_id="{metric_id}" AND completion_status="COMPLETED"' \
  --format json \
  --page-size 100

# All annotations (to find pending ones with ground-truth values)
coval review-annotations list \
  --filter 'project_id="{project_id}" AND metric_id="{metric_id}"' \
  --format json \
  --page-size 100
```

If there are more than 100 annotations in either list, paginate.

From the full list, identify **pending annotations that have a ground-truth value** — where `completion_status` is `PENDING` but `ground_truth_float_value` is not null OR `ground_truth_string_value` is not null. Reviewer notes are optional.
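A minimal Python sketch of this selection rule, assuming each annotation has been parsed into a dict that carries the field names mentioned above (`split_annotations` is a hypothetical helper name):

```python
def split_annotations(annotations: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split annotations into (completed, pending_with_value).

    A pending annotation counts as usable when either ground-truth field
    is non-null. The check uses `is not None` so that a legitimate score
    of 0.0 is not mistaken for a missing value.
    """
    completed = []
    pending_with_value = []
    for annotation in annotations:
        has_value = (
            annotation.get("ground_truth_float_value") is not None
            or annotation.get("ground_truth_string_value") is not None
        )
        status = annotation.get("completion_status")
        if status == "COMPLETED":
            completed.append(annotation)
        elif status == "PENDING" and has_value:
            pending_with_value.append(annotation)
    return completed, pending_with_value
```

The lengths of the two returned lists give `{completed_count}` and `{pending_with_value_count}` for the summary message.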

Present what was found:

> **Found {completed_count} completed annotations and {pending_with_value_count} pending annotations with ground-truth values.**

If there are pending annotations with values, ask:

> **{pending_with_value_count} pending annotations have ground-truth values but are not marked as completed. Would you like to include them in the analysis?** (yes/no)

- **If yes** — merge them into the dataset alongside completed annotations
- **If no** — proceed with only the completed annotations

**If zero usable annotations exist after this step, STOP:**
> "No annotations with ground-truth values found for this metric in this project. Reviewers need to submit ground-truth values before agreement can be calculated."

Step 2: For each annotation, gather the full picture

For each annotation, collect these four pieces of data:

a) Ground truth and reviewer notes (already in the annotation)

  • `ground_truth_float_value` or `ground_truth_string_value`: the human label
  • `reviewer_notes`: the reviewer's reasoning (may be null)

b) Machine score

Use the annotation's `simulation_output_id` to get the machine-generated value:
```bash
coval simulations metrics <simulation_output_id> --format json
```
Find the metric output where `metric_id` matches the target metric. Extract `value` as the machine score.

c) Transcript

```bash
coval simulations get <simulation_output_id> --format json
```
Extract the `transcript` array. Format it as readable text:
```
[role]: content
[role]: content
...
```
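The formatting step can be sketched as below. This assumes each entry of the `transcript` array is an object with `role` and `content` keys, which the target format suggests but the CLI output may not guarantee; `format_transcript` is a hypothetical helper name.

```python
def format_transcript(transcript: list[dict]) -> str:
    """Render transcript turns as '[role]: content' lines.

    Assumes each turn is a dict with `role` and `content` keys; missing
    keys fall back to placeholders rather than raising.
    """
    lines = []
    for turn in transcript:
        role = turn.get("role", "unknown")
        content = turn.get("content", "")
        lines.append(f"[{role}]: {content}")
    return "\n".join(lines)
```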

d) LLM explanation (via API if not in CLI output)

The LLM's reasoning is stored in the metric output's `result` field under `result.llm.answer_explanation`. The CLI's `coval simulations metric-detail` does not return this field.
Attempt to fetch it via the API:
```bash
curl -s "https://api.coval.dev/v1/simulations/{simulation_output_id}/outputs/{simulation_output_id}/metrics/{metric_output_id}" \
  -H "X-API-Key: $COVAL_API_KEY"
```
If the response includes `result.llm.answer_explanation`, use it. If not, proceed without the LLM explanation; the transcript and reviewer notes are sufficient.
Important: Get the API key from the CLI config:
```bash
coval config get api-key
```
If that fails, check `~/.config/coval/config.json` or the `COVAL_API_KEY` environment variable.

Step 3: Build the analysis dataset

For each annotation, you should have:

| Field | Source |
| --- | --- |
| `simulation_id` | `annotation.simulation_output_id` |
| `human_label` | `annotation.ground_truth_float_value` OR `ground_truth_string_value` |
| `machine_label` | simulation metric output `value` |
| `reviewer_notes` | `annotation.reviewer_notes` |
| `transcript` | `simulation.transcript` |
| `llm_explanation` | metric output `result.llm.answer_explanation` (if available) |
| `agrees` | computed in Phase 2 |
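One way to hold these fields is a small record type. The sketch below is illustrative only: `AnnotationRecord` and `human_label_of` are hypothetical names, and the annotation is assumed to be a parsed dict with the ground-truth fields named above.

```python
from dataclasses import dataclass
from typing import Optional, Union

Label = Union[float, str]


@dataclass
class AnnotationRecord:
    """One row of the analysis dataset."""
    simulation_id: str
    human_label: Label
    machine_label: Label
    transcript: str
    reviewer_notes: Optional[str] = None
    llm_explanation: Optional[str] = None
    agrees: Optional[bool] = None  # filled in during Phase 2


def human_label_of(annotation: dict) -> Optional[Label]:
    """Pick the float ground truth if present, else the string one.

    Uses an explicit None check so a float label of 0.0 is kept.
    """
    value = annotation.get("ground_truth_float_value")
    if value is None:
        value = annotation.get("ground_truth_string_value")
    return value
```

The explicit `is None` test matters for binary metrics, where a human label of `0.0` (fail) is falsy in Python and would be dropped by a plain `or`.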

Phase 2: Calculate Agreement

Agreement rules by metric type

Binary (`METRIC_LLM_BINARY`)

Machine output is typically `1.0` (pass) or `0.0` (fail). Human ground truth is the same.
```
agrees = (human_label == machine_label)
```
Tolerance: 0 (exact match only).

Categorical (`METRIC_CATEGORICAL`)

Machine and human both produce a category string.
```
agrees = (human_label == machine_label)
```
Tolerance: 0 (exact match only).

Numerical (`METRIC_NUMERICAL_LLM_JUDGE`)

Machine and human both produce a float in the range `[min_value, max_value]`.
```
tolerance = 0  # exact match by default
agrees = (abs(human_label - machine_label) <= tolerance)
```
Tolerance: 0 (exact match). If the agreement rate is very low with exact match, note this in the report and suggest that the user consider a wider tolerance. Do NOT automatically widen it.
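The three rules can be combined into one function, sketched below. The name `labels_agree` is hypothetical, and casting numeric labels with `float()` is an assumption made here so that values arriving as `1`, `1.0`, or `"1.0"` compare equal; the source only specifies the comparison itself.

```python
def labels_agree(metric_type: str, human_label, machine_label,
                 tolerance: float = 0.0) -> bool:
    """Apply the per-type agreement rule.

    `tolerance` applies to numerical metrics only and defaults to 0
    (exact match); it should be widened only at the user's request.
    """
    if metric_type == "METRIC_NUMERICAL_LLM_JUDGE":
        return abs(float(human_label) - float(machine_label)) <= tolerance
    if metric_type == "METRIC_LLM_BINARY":
        return float(human_label) == float(machine_label)
    if metric_type == "METRIC_CATEGORICAL":
        return str(human_label) == str(machine_label)
    raise ValueError(f"Unsupported metric type: {metric_type}")
```

Raising on any other type keeps the function consistent with the rule that only text LLM judge metrics are supported.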

Compute and present the agreement report

Present results in this format:

Agreement Report

Metric: {metric_name} ({metric_type})
Total completed annotations: {N}
Agreement rate: {agree_count}/{N} ({percentage}%)

Breakdown

| | Count | % |
| --- | --- | --- |
| Agree | {agree_count} | {agree_pct}% |
| Disagree | {disagree_count} | {disagree_pct}% |

{For numerical metrics only:}
Mean absolute error: {mae}
Score range: {min_value}–{max_value}
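The numbers in this report can be computed as sketched below for numeric labels (binary or numerical metrics). `agreement_summary` is a hypothetical helper; it takes `(human_label, machine_label)` pairs and rounds percentages to one decimal, which is a presentation choice rather than something the source prescribes. Mean absolute error is not meaningful for categorical metrics, so they would need a count-only variant.

```python
def agreement_summary(pairs: list[tuple[float, float]],
                      tolerance: float = 0.0) -> dict:
    """Compute agreement counts, percentages, and mean absolute error.

    `pairs` holds (human_label, machine_label) tuples of numeric labels.
    """
    total = len(pairs)
    if total == 0:
        raise ValueError("no annotations to summarize")
    agree_count = sum(1 for human, machine in pairs
                      if abs(human - machine) <= tolerance)
    disagree_count = total - agree_count
    mae = sum(abs(human - machine) for human, machine in pairs) / total
    return {
        "total": total,
        "agree_count": agree_count,
        "disagree_count": disagree_count,
        "agree_pct": round(100 * agree_count / total, 1),
        "disagree_pct": round(100 * disagree_count / total, 1),
        "mae": mae,
    }
```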

Phase 3: Analyze Patterns

This is the most critical phase. You are acting as a metric prompt engineer analyzing real disagreements to improve accuracy.

Step 1: Analyze disagreements

For each annotation where `agrees == false`:
  1. Read the transcript carefully: understand the full conversation
  2. Read the reviewer notes: understand WHY the human disagreed with the machine
  3. Read the LLM explanation (if available): understand the machine's reasoning
  4. Determine who was more likely correct and why the machine erred
Look for patterns across disagreements:
  • Systematic bias: Does the machine consistently score too high or too low?
  • Edge cases: Are there conversation patterns the prompt doesn't account for?
  • Ambiguity: Is the prompt vague about certain scenarios?
  • Missing context: Does the prompt fail to instruct the LLM about important nuances?
  • Misinterpretation: Does the LLM misunderstand what the prompt is asking?
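Of these patterns, systematic bias is the one that can be checked numerically before reading transcripts. The sketch below is one possible approach for numeric labels, not a method the skill prescribes: it counts how often the machine scores above or below the human and reports the mean signed error over the disagreements. `signed_bias` is a hypothetical helper name.

```python
def signed_bias(pairs: list[tuple[float, float]]) -> dict:
    """Summarize the direction of machine-vs-human disagreements.

    `pairs` holds (human_label, machine_label) tuples. A positive mean
    signed error means the machine tends to score higher than the human.
    """
    diffs = [machine - human for human, machine in pairs if machine != human]
    if not diffs:
        return {"over": 0, "under": 0, "mean_signed_error": 0.0}
    return {
        "over": sum(1 for d in diffs if d > 0),
        "under": sum(1 for d in diffs if d < 0),
        "mean_signed_error": sum(diffs) / len(diffs),
    }
```

A strongly one-sided result is only a hint; the transcripts and reviewer notes still determine whether the prompt or the labels are at fault.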

Step 2: Analyze agreements

For each annotation where `agrees == true`:
  1. Read a sample of reviewer notes (if any): understand what the metric handles well
  2. Identify what the current prompt does correctly: these strengths must be preserved

Step 3: Synthesize findings

Present a structured analysis:

Analysis

What the metric gets right

  • {bullet points from agreement analysis}

Disagreement patterns

  • Pattern 1: {description} (seen in {N} annotations)
    • Example: Simulation {sim_id} — Human: {value}, Machine: {value}
    • Reviewer note: "{note}"
    • Root cause: {why the machine got this wrong}
  • Pattern 2: ...

Key insights from reviewer notes

  • {synthesized themes from reviewer feedback}

Phase 3.5: Choose Resolution Path

After presenting the analysis, ask the user how they want to resolve the disagreements. The machine may be wrong (prompt needs fixing), OR the human labels may be wrong (labels need correcting), OR both.
Present the disagreements as a numbered list:

Disagreements

| # | Simulation | Human Label | Machine Label | Reviewer Note |
| --- | --- | --- | --- | --- |
| 1 | {sim_id} | {human} | {machine} | {note or "—"} |
| 2 | {sim_id} | {human} | {machine} | {note or "—"} |
| ... | | | | |
How would you like to resolve these?
  1. Update the metric prompt — fix the prompt so the machine aligns with the human labels
  2. Update human labels — correct the human labels to align with the machine scores
  3. Both — update some labels and fix the prompt
If choosing option 2 or 3, specify which disagreements to relabel (e.g., "update labels for #1, #3, #5" or "all").

Handling label updates

If the user chooses to update labels (option 2 or 3):
  1. For each annotation the user selects, confirm the new ground-truth value:
    "Annotation #{n} (Simulation {sim_id}): Current human label is `{human_value}`. Update to `{machine_value}` (the machine score)? Or enter a different value."
  2. After confirmation, update each annotation:
    ```bash
    coval review-annotations update <annotation_id> \
      --ground-truth-float <new_value> \
      --notes "Label corrected during metric improvement review. Previous value: {old_value}"
    ```
    For string-based metrics, use `--ground-truth-string` instead of `--ground-truth-float`.
  3. After all label updates, recalculate the agreement rate with the corrected labels and present the updated report.
  4. If the user chose option 2 (labels only), skip to the end; no prompt changes are needed.
  5. If the user chose option 3 (both), continue to Phase 4 with the remaining disagreements that were NOT resolved by label updates.
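When scripting the update in step 2, building the command as an argument list avoids shell-quoting problems in the note text. The sketch below only constructs the arguments, using the flags shown above; `build_update_command` is a hypothetical helper, and choosing the flag from the Python type of the new value is an assumption made for illustration.

```python
def build_update_command(annotation_id: str, new_value, old_value) -> list[str]:
    """Build the argv for a label update without executing it.

    String labels use --ground-truth-string; everything else is treated
    as a float label (an assumption about how labels are represented).
    """
    if isinstance(new_value, str):
        value_flag = "--ground-truth-string"
    else:
        value_flag = "--ground-truth-float"
    note = (
        "Label corrected during metric improvement review. "
        f"Previous value: {old_value}"
    )
    return [
        "coval", "review-annotations", "update", annotation_id,
        value_flag, str(new_value),
        "--notes", note,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` once the user has confirmed the new value, so that a failing command surfaces as an error instead of being silently ignored.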

Handling prompt updates

If the user chooses option 1 or 3, proceed to Phase 4.

Phase 4: Propose Improved Prompt

Step 1: Draft the improved prompt

Based on the analysis in Phase 3, draft an improved version of the metric prompt that:
  1. Preserves what works — keep language that leads to correct evaluations
  2. Addresses disagreement patterns — add explicit instructions for edge cases
  3. Incorporates reviewer feedback — use the language and reasoning from reviewer notes
  4. Remains clear and concise — avoid bloating the prompt with excessive detail
  5. Maintains the same evaluation format — the metric type and output format must not change

Step 2: Present the proposal

Show the current and proposed prompts side by side:

Proposed Prompt Update

Current prompt

{current_prompt}

Proposed prompt

{new_prompt}

Changes explained

  • {bullet point explaining each significant change and which disagreement pattern it addresses}
Would you like to apply this prompt update? (yes/no)

Step 3: Handle user response

  • If yes → proceed to Phase 5
  • If no → ask what they'd like to change. Iterate on the prompt until they approve or decide to stop.
  • If they want to edit manually — that's fine too. The analysis is the primary value.

Phase 5: Update the Metric

Only after the user explicitly approves the prompt.
```bash
coval metrics update <metric_id> --prompt "<approved_prompt>"
```
Verify the update:
```bash
coval metrics get <metric_id> --format json
```
Confirm the prompt matches what was approved.
Metric prompt updated successfully. You can re-run simulations and create a new review project to validate the improvement.
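The verification step can be sketched as a direct comparison against the JSON returned by `coval metrics get`. This assumes the response is a JSON object with a top-level `prompt` field, as implied by the extraction list in Phase 0; `prompt_was_applied` is a hypothetical helper name.

```python
import json


def prompt_was_applied(cli_output: str, approved_prompt: str) -> bool:
    """Check that the stored prompt equals the approved one exactly.

    `cli_output` is the raw JSON text printed by the CLI; a top-level
    `prompt` key is assumed.
    """
    metric = json.loads(cli_output)
    return metric.get("prompt") == approved_prompt
```

An exact comparison is deliberate here: any difference, including whitespace, should be reported to the user rather than treated as a match.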

Reference

CLI Commands Used

| Command | Purpose |
| --- | --- |
| `coval metrics get <id>` | Fetch metric definition (name, type, prompt) |
| `coval review-projects get <id>` | Fetch project details |
| `coval review-annotations list --filter '...'` | List annotations (completed and pending) |
| `coval review-annotations update <id> --ground-truth-float ...` | Correct a human label (after confirmation) |
| `coval simulations metrics <sim_id>` | Get machine scores for a simulation |
| `coval simulations get <sim_id>` | Get transcript |
| `coval metrics update <id> --prompt "..."` | Update metric prompt (after approval) |

Supported Metric Types

| Type | Machine Output | Human Output | Agreement |
| --- | --- | --- | --- |
| `METRIC_LLM_BINARY` | `1.0` (pass) / `0.0` (fail) | Same | Exact match |
| `METRIC_CATEGORICAL` | Category string | Same | Exact match |
| `METRIC_NUMERICAL_LLM_JUDGE` | Float in [min, max] | Same | Exact match (tolerance = 0) |

Data Flow

Review Project
  └─ Annotations (filtered by metric_id, completion_status=COMPLETED)
       ├─ ground_truth_float_value / ground_truth_string_value  ← human label
       ├─ reviewer_notes                                         ← human reasoning
       └─ simulation_output_id
            ├─ coval simulations metrics → value                 ← machine label
            ├─ coval simulations get → transcript                ← conversation
            └─ API: result.llm.answer_explanation                ← machine reasoning