build-review-interface


Build a Custom Annotation Interface

Overview

Build an HTML page that loads traces from a data source (JSON/CSV file), displays one trace at a time with Pass/Fail buttons, a free-text notes field, and Next/Previous navigation. Save labels to a local file (CSV/SQLite/JSON). Then customize it to the domain using the guidelines below.

Data Display

Format all data in the most human-readable representation for the domain. Emails should look like emails. Code should have syntax highlighting. Markdown should be rendered. Tables should be tables. JSON should be pretty-printed and collapsible.
  • Collapse repetitive elements. If every trace shares the same system prompt, put it in a <details> toggle.
  • Extract and surface key metadata. If traces contain a property name, client type, or session ID buried in the data, extract it and display it prominently as a header or badge.
  • Color-code by role or status. Use left-border colors to distinguish user messages, assistant messages, tool calls, and system prompts at a glance.
  • Group related elements visually. Tool calls and their responses should be visually linked (indentation, shared border).
  • Collapse what doesn't help judgment. Verbose tool response JSON, intermediate reasoning steps, and debugging context go behind toggles.
  • Highlight what matters most. Make the primary content reviewers judge visually dominant. Bold key entities (prices, dates, names). Use font size and spacing to create hierarchy.
  • Show the full trace. Include all intermediate steps (tool calls, retrieved context, reasoning), not just the final output. Collapse them by default but keep them accessible.
  • Sanitize rendered content. Strip raw HTML from LLM outputs before rendering. Disable images in rendered markdown if they could be tracking pixels.
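The display rules above can be sketched as a small render helper. The `ROLE_COLORS` values, the `escapeHtml` helper, and the message shape (`{role, content, toolJson}`) are illustrative assumptions, not a prescribed schema:

```javascript
// Map each message role to a left-border color so roles are
// distinguishable at a glance (color values are arbitrary choices).
const ROLE_COLORS = {
  system: "#9ca3af",
  user: "#2563eb",
  assistant: "#16a34a",
  tool: "#d97706",
};

// Escape raw HTML from model output before rendering (sanitization rule above).
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Render one message as an HTML string. Verbose tool-response JSON is
// pretty-printed and collapsed behind a <details> toggle by default.
function renderMessage(msg) {
  const color = ROLE_COLORS[msg.role] || "#6b7280";
  let body = `<p>${escapeHtml(msg.content)}</p>`;
  if (msg.toolJson !== undefined) {
    body += `<details><summary>Tool response JSON</summary><pre>${escapeHtml(
      JSON.stringify(msg.toolJson, null, 2)
    )}</pre></details>`;
  }
  return `<div class="message" style="border-left: 4px solid ${color}">` +
    `<span class="role-badge">${escapeHtml(msg.role)}</span>${body}</div>`;
}
```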

Feedback Collection

Annotate at the trace level. The reviewer judges the whole trace, not individual spans.
  • Binary Pass/Fail buttons as the primary action.
  • Free-text notes field for the reviewer to describe what went wrong (or right).
  • Defer button for uncertain cases.
  • Auto-save on every action.
Once you have established failure categories through error analysis, you can add predefined failure-mode tags as clickable checkboxes, dropdowns, or picklists, so reviewers can select from known categories in addition to writing notes. But don't add these in the initial build.

Navigation and Status

  • Next/Previous buttons and keyboard arrow keys.
  • Trace counter showing position and progress ("12 of 87 remaining").
  • Jump to specific trace by ID.
  • Counts of labeled vs unlabeled traces.

Keyboard Shortcuts

Arrow keys = Navigate traces
1 = Pass              2 = Fail
D = Defer             U = Undo last action
Cmd+S = Save          Cmd+Enter = Save and next
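The shortcut table can be wired with a single keydown handler. A pure key-to-action mapping keeps the binding logic testable; the action names (`"pass"`, `"defer"`, ...) are illustrative:

```javascript
// Map a keydown event to an annotation action, mirroring the shortcut
// table above. Returns null for keys the interface doesn't handle.
function actionForKey({ key, metaKey = false, ctrlKey = false }) {
  const mod = metaKey || ctrlKey; // Cmd on macOS, Ctrl elsewhere
  if (mod && key === "s") return "save";
  if (mod && key === "Enter") return "save-and-next";
  switch (key) {
    case "ArrowRight": return "next";
    case "ArrowLeft": return "prev";
    case "1": return "pass";
    case "2": return "fail";
    case "d": case "D": return "defer";
    case "u": case "U": return "undo";
    default: return null;
  }
}

// In the page, wire it up once:
// document.addEventListener("keydown", (e) => {
//   const action = actionForKey(e);
//   if (action) { e.preventDefault(); dispatch(action); }
// });
```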

Selecting Traces to Load

Build the app to accept traces from any source (JSON/CSV file). Keep sampling logic outside the app in a separate script. Start with random sampling.
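A separate sampling script can be as small as a shuffle-and-take. The `traces.json` and `review_batch.json` file names are assumptions about how the app is fed:

```javascript
// Fisher-Yates shuffle of a copy, then take the first n: an unbiased
// random sample without replacement. The input array is not mutated.
function sampleTraces(traces, n) {
  const pool = traces.slice();
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}

// As a standalone script (file names are assumptions):
//   const fs = require("node:fs");
//   const traces = JSON.parse(fs.readFileSync("traces.json", "utf8"));
//   fs.writeFileSync("review_batch.json",
//     JSON.stringify(sampleTraces(traces, 100), null, 2));
```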

Additional Features

Reference panel: Toggleable panel showing ground truth, expected answers, or rubric definitions alongside the trace.
Filtering: Filter traces by metadata dimensions relevant to the product (channel, user type, pipeline version).
Clustering: Group traces by metadata or semantic similarity. Show representative traces per cluster with drill-down.
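Metadata filtering reduces to a predicate over trace metadata. A minimal sketch; the field names (`channel`, `userType`) are illustrative:

```javascript
// Keep traces whose metadata matches every given criterion. A criterion
// value may be a single value or an array of allowed values.
function filterTraces(traces, criteria) {
  return traces.filter((trace) =>
    Object.entries(criteria).every(([field, allowed]) => {
      const value = (trace.metadata || {})[field];
      return Array.isArray(allowed) ? allowed.includes(value) : value === allowed;
    })
  );
}
```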

Design Checklist

  • Same layout, controls, and terminology on every trace
  • Pass and Fail buttons are visually distinct (color, size)
  • Keyboard shortcuts work for all primary actions
  • Full trace accessible even when sections are collapsed
  • Labels persist automatically without explicit save
  • Trace-level annotation (not span-level) as the default
  • All data rendered in its native format (markdown as HTML, code with highlighting, JSON pretty-printed, tables as HTML tables, URLs as clickable links)

Testing

After building the interface, verify it with Playwright.
Visual review: Take screenshots of the interface with representative trace data loaded. Review each screenshot for:
  • Layout and spacing: is the visual hierarchy clear? Can you immediately see what matters?
  • Readability: is all data rendered in its native format? Are there any raw JSON blobs, unrendered markdown, or unstyled content?
  • Aesthetics: does the interface look professional and clean? Would a domain expert use this?
  • Responsiveness: does the layout hold at different window sizes?
Functional test: Write a Playwright script that performs a full annotation workflow:
  1. Load the app and verify traces are displayed
  2. Click Pass on a trace, verify the label is saved
  3. Click Fail on a trace, add a note, verify both are saved
  4. Click Defer, verify it is recorded
  5. Navigate forward and backward with buttons and keyboard shortcuts
  6. Verify the trace counter updates correctly
  7. Verify auto-save by reloading the page and checking labels persist
  8. Expand collapsed sections (system prompts, tool calls) and verify content is accessible
  9. Test that all keyboard shortcuts trigger the correct actions
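The workflow above might look like the following Playwright script. It assumes the app is served locally and uses the selectors shown; the URL, button names, and `data-testid` values are illustrative, not prescribed, and `@playwright/test` must be installed:

```javascript
const { test, expect } = require("@playwright/test");

// Illustrative URL; adjust to wherever the annotation app is served.
const APP = "http://localhost:8000";

test("full annotation workflow", async ({ page }) => {
  await page.goto(APP);

  // 1. Traces are displayed.
  await expect(page.getByTestId("trace-view")).toBeVisible();

  // 2. Pass a trace; the label is recorded.
  await page.getByRole("button", { name: "Pass" }).click();
  await expect(page.getByTestId("label-status")).toHaveText(/pass/i);

  // 3. Fail the next trace with a note; both are saved.
  await page.keyboard.press("ArrowRight");
  await page.getByRole("button", { name: "Fail" }).click();
  await page.getByTestId("notes").fill("wrong price quoted");
  await expect(page.getByTestId("label-status")).toHaveText(/fail/i);

  // 4-5. Defer, then navigate with buttons and keyboard.
  await page.keyboard.press("ArrowRight");
  await page.getByRole("button", { name: "Defer" }).click();
  await page.getByRole("button", { name: "Previous" }).click();
  await page.keyboard.press("ArrowRight");

  // 6. Trace counter updates.
  await expect(page.getByTestId("trace-counter")).toContainText("of");

  // 7. Auto-save survives a reload.
  await page.reload();
  await expect(page.getByTestId("label-status")).toHaveText(/pass/i);

  // 8. Collapsed sections expand and their content is accessible.
  await page.getByText("System prompt").click();

  // 9. A keyboard shortcut triggers the matching action.
  await page.keyboard.press("1");
  await expect(page.getByTestId("label-status")).toHaveText(/pass/i);
});
```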