Eval-Driven Development with pixie


This skill is about doing the work, not describing it. When a user asks you to set up evals for their app, you should be reading their code, editing their files, running commands, and producing a working test pipeline — not writing a plan for them to follow later.
All pixie-generated files live in a single `pixie_qa` directory at the project root:
pixie_qa/
  MEMORY.md              # your understanding and eval plan
  observations.db        # SQLite trace DB (auto-created by enable_storage)
  datasets/              # golden datasets (JSON files)
  tests/                 # eval test files (test_*.py)
  scripts/               # helper scripts (build_dataset.py, etc.)
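If the directory does not exist yet, a minimal bootstrap sketch (note that `observations.db` is deliberately not created here, since `enable_storage()` creates it automatically):

```shell
# Create the pixie_qa skeleton at the project root.
mkdir -p pixie_qa/datasets pixie_qa/tests pixie_qa/scripts
touch pixie_qa/MEMORY.md
```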


Setup vs. Iteration: when to stop


This is critical. What you do depends on what the user asked for.

"Setup QA" / "set up evals" / "add tests" (setup intent)


The user wants a working eval pipeline. Your job is Stages 0–6: install, understand, instrument, write tests, build dataset, run tests. Stop after the first test run, regardless of whether tests pass or fail. Report:
  1. What you set up (instrumentation, test file, dataset)
  2. The test results (pass/fail, scores)
  3. If tests failed: a brief summary of what failed and likely causes — but do NOT fix anything
Then ask: "QA setup is complete. Tests show N/M passing. Want me to investigate the failures and start iterating?"
Only proceed to Stage 7 (investigation and fixes) if the user confirms.
Exception: If the test run itself errors out (import failures, missing API keys, configuration bugs) — those are setup problems, not eval failures. Fix them and re-run until you get a clean test execution where pass/fail reflects actual app quality, not broken plumbing.

"Fix" / "improve" / "debug" / "why is X failing" (iteration intent)


The user wants you to investigate and fix. Proceed through all stages including Stage 7 — investigate failures, root-cause them, apply fixes, rebuild dataset, re-run tests, iterate.

Ambiguous requests


If the intent is unclear, default to setup only and ask before iterating. It's better to stop early and ask than to make unwanted changes to the user's application code.


The eval boundary: what to evaluate


Eval-driven development focuses on LLM-dependent behaviour. The purpose is to catch quality regressions in the parts of the system that are non-deterministic and hard to test with traditional unit tests — namely, LLM calls and the decisions they drive.

In scope (evaluate this)


  • LLM response quality: factual accuracy, relevance, format compliance, safety
  • Agent routing decisions: did the LLM choose the right tool/handoff/action?
  • Prompt effectiveness: does the prompt produce the desired behaviour?
  • Multi-turn coherence: does the agent maintain context across turns?

Out of scope (do NOT evaluate this with evals)


  • Tool implementations (database queries, API calls, keyword matching, business logic) — these are traditional software; test them with unit tests
  • Infrastructure (authentication, rate limiting, caching, serialization)
  • Deterministic post-processing (formatting, filtering, sorting results)
The boundary is: everything downstream of the LLM call (tools, databases, APIs) produces deterministic outputs that serve as inputs to the LLM-powered system. Eval tests should treat those as given facts and focus on what the LLM does with them.
Example: If an FAQ tool has a keyword-matching bug that returns wrong data, that's a traditional bug — fix it with a regular code change, not by adjusting eval thresholds. The eval tests exist to verify that given correct tool outputs, the LLM agent produces correct user-facing responses.
When building datasets and expected outputs, use the actual tool/system outputs as ground truth. The expected output for an eval case should reflect what a correct LLM response looks like given the tool results the system actually produces.
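The split can be made concrete with a toy FAQ tool. All names here are hypothetical, not from any real codebase: the tool is deterministic and gets a plain unit test, while the eval case's expected output is anchored to the tool's actual result.

```python
# Hypothetical deterministic tool — traditional software; cover it with unit tests.
def faq_lookup_tool(question: str) -> str:
    if "bag" in question.lower():
        return "Each passenger may check two bags."
    return "I'm sorry, I don't know."

# Unit test territory: deterministic in, deterministic out.
assert faq_lookup_tool("How many bags can I check?") == "Each passenger may check two bags."

# Eval territory: given the tool's actual output, judge what the LLM does with it.
# The expected output for an eval case reflects this real tool result,
# not an idealized answer predicated on fixing tool bugs.
eval_case = {
    "eval_input": {"user_message": "How many bags can I check?"},
    "expected_output": "Each passenger may check two bags.",
}
print(eval_case["expected_output"])
```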


Stage 0: Ensure pixie-qa is Installed and API Keys Are Set


Before doing anything else, check that the `pixie-qa` package is available:
```bash
python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"
```
If it's not installed, install it:
```bash
pip install pixie-qa
```
This provides the `pixie` Python module, the `pixie` CLI, and the `pixie test` runner — all required for instrumentation and evals. Don't skip this step; everything else in this skill depends on it.

Verify API keys


The application under test almost certainly needs an LLM provider API key (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`). LLM-as-judge evaluators like `FactualityEval` also need `OPENAI_API_KEY`. Before running anything, verify the key is set:
```bash
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"
```
If not set, ask the user. Do not proceed with running the app or evals without it — you'll get silent failures or import-time errors.
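When several keys are in play, the same preflight check can be done in Python. This is a sketch; the exact set of required keys depends on the app under test:

```python
import os

def missing_keys(required: list[str]) -> list[str]:
    """Return the required environment variables that are unset or empty."""
    return [k for k in required if not os.environ.get(k)]

# Keys needed by the app under test plus the LLM-as-judge evaluators.
missing = missing_keys(["OPENAI_API_KEY"])
if missing:
    # Stop and ask the user rather than hitting silent failures later.
    print("Missing keys:", ", ".join(missing))
```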


Stage 1: Understand the Application


Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would, and it puts you in a much better position to make good decisions about what and how to evaluate.

What to investigate


  1. How the software runs: What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
  2. All inputs to the LLM: This is not limited to the user's message. Trace every piece of data that gets incorporated into any LLM prompt:
    • User input (queries, messages, uploaded files)
    • System prompts (hardcoded or templated)
    • Retrieved context (RAG chunks, search results, database records)
    • Tool definitions and function schemas
    • Conversation history / memory
    • Configuration or feature flags that change prompt behavior
  3. All intermediate steps and outputs: Walk through the code path from input to final output and document each stage:
    • Retrieval / search results
    • Tool calls and their results
    • Agent routing / handoff decisions
    • Intermediate LLM calls (e.g., summarization before final answer)
    • Post-processing or formatting steps
  4. The final output: What does the user see? What format is it in? What are the quality expectations?
  5. Use cases and expected behaviors: What are the distinct things the app is supposed to handle? For each use case, what does a "good" response look like? What would constitute a failure?

Write MEMORY.md


Write your findings down in `pixie_qa/MEMORY.md`. This is the primary working document for the eval effort. It should be human-readable and detailed enough that someone unfamiliar with the project can understand the application and the eval strategy.
CRITICAL: MEMORY.md documents your understanding of the existing application code. It must NOT contain references to pixie commands, instrumentation code you plan to add, or scripts/functions that don't exist yet. Those belong in later sections, only after they've been implemented.
The understanding section should include:

```markdown
# Eval Notes: <Project Name>

## How the application works

### Entry point and execution flow
<Describe how to start/run the app, what happens step by step>

### Inputs to LLM calls
<For each LLM call in the codebase, document:>
- Where it is in the code (file + function name)
- What system prompt it uses (quote it or summarize)
- What user/dynamic content feeds into it
- What tools/functions are available to it

### Intermediate processing
<Describe any steps between input and output:>
- Retrieval, routing, tool execution, etc.
- Include code pointers (file:line) for each step

### Final output
<What the user sees, what format, what the quality bar should be>

### Use cases
<List each distinct scenario the app handles, with examples of good/bad outputs>

## Evaluation plan

### What to evaluate and why
<Quality dimensions: factual accuracy, relevance, format compliance, safety, etc.>

### Evaluation granularity
<Which function/span boundary captures one "test case"? Why that boundary?>

### Evaluators and criteria
<For each eval test, specify: evaluator, dataset, threshold, reasoning>

### Data needed for evaluation
<What data points need to be captured, with code pointers to where they live>
```

If something is genuinely unclear from the code, ask the user — but most questions answer themselves once you've read the code carefully.

---

Stage 2: Decide What to Evaluate


Now that you understand the app, you can make thoughtful choices about what to measure:
  • What quality dimension matters most? Factual accuracy for QA apps, output format for structured extraction, relevance for RAG, safety for user-facing text.
  • Which span to evaluate: the whole pipeline (`root`) or just the LLM call (`last_llm_call`)? If you're debugging retrieval, you might evaluate at a different point than if you're checking final answer quality.
  • Which evaluators fit: see `references/pixie-api.md` → Evaluators. For factual QA: `FactualityEval`. For structured output: `ValidJSONEval` / `JSONDiffEval`. For RAG pipelines: `ContextRelevancyEval` / `FaithfulnessEval`.
  • Pass criteria: `ScoreThreshold(threshold=0.7, pct=0.8)` means 80% of cases must score ≥ 0.7. Think about what "good enough" looks like for this app.
  • Expected outputs: `FactualityEval` needs them. Format evaluators usually don't.
Update `pixie_qa/MEMORY.md` with the plan before writing any code.
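The pass-criteria semantics are worth internalizing. A sketch of what `ScoreThreshold(threshold=0.7, pct=0.8)` means, for illustration only (this is not pixie's actual implementation):

```python
def score_threshold_passes(scores: list[float], threshold: float = 0.7, pct: float = 0.8) -> bool:
    """Pass if at least `pct` of cases score >= `threshold`."""
    if not scores:
        return False
    passing = sum(1 for s in scores if s >= threshold)
    return passing / len(scores) >= pct

# 4 of 5 cases (80%) score >= 0.7, so the test passes.
print(score_threshold_passes([0.9, 0.8, 0.75, 0.7, 0.3]))   # True
# 3 of 5 cases (60%) score >= 0.7, so the test fails.
print(score_threshold_passes([0.9, 0.8, 0.75, 0.5, 0.3]))   # False
```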


Stage 3: Instrument the Application


Add pixie instrumentation to the existing production code. The goal is to capture the inputs and outputs of functions that are already part of the application's normal execution path. Instrumentation must be on the real code path — the same code that runs when the app is used in production — so that traces are captured both during eval runs and real usage.

Add `enable_storage()` at application startup

Call `enable_storage()` once at the beginning of the application's startup code — inside `main()`, or at the top of a server's initialization. Never at module level (top of a file outside any function), because that causes storage setup to trigger on import.
Good places:
  • Inside `if __name__ == "__main__":` blocks
  • In a FastAPI `lifespan` or `on_startup` handler
  • At the top of `main()` / `run()` functions
  • Inside the `runnable` function in test files
```python
# ✅ CORRECT — at application startup
async def main():
    enable_storage()
    ...

# ✅ CORRECT — in a runnable for tests
def runnable(eval_input):
    enable_storage()
    my_function(**eval_input)

# ❌ WRONG — at module level, runs on import
from pixie import enable_storage
enable_storage()  # this runs when any file imports this module!
```

Wrap existing functions with `@observe` or `start_observation`

CRITICAL: Instrument the production code path. Never create separate functions or alternate code paths for testing.
The `@observe` decorator or `start_observation` context manager goes on the existing function that the app actually calls during normal operation. If the app's entry point is an interactive `main()` loop, instrument `main()` or the core function it calls per user turn — not a new helper function that duplicates logic.
```python
# ✅ CORRECT — decorating the existing production function
from pixie import observe

@observe(name="answer_question")
def answer_question(question: str, context: str) -> str:  # existing function
    ...  # existing code, unchanged

# ✅ CORRECT — context manager inside an existing function
from pixie import start_observation

async def main():  # existing function
    ...
    with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
        result = await Runner.run(current_agent, input_items, context=context)
        # ... existing response handling ...
        obs.set_output(response_text)
    ...

# ❌ WRONG — creating a new function that duplicates logic from main()
@observe(name="run_for_eval")
async def run_for_eval(user_messages: list[str]) -> str:
    # This duplicates what main() does, creating a separate code path
    # that diverges from production. Don't do this.
    ...
```

**Rules:**

- **Never add new wrapper functions** to the application code for eval purposes.
- **Never change the function's interface** (arguments, return type, behavior).
- **Never duplicate production logic** into a separate "testable" function.
- The instrumentation is purely additive — if you removed all pixie imports and decorators, the app would work identically.
- After instrumentation, call `flush()` at the end of runs to make sure all spans are written.
- For interactive apps (CLI loops, chat interfaces), instrument the **per-turn processing** function — the one that takes user input and produces a response. The eval `runnable` should call this same function.

**Important**: All pixie symbols are importable from the top-level `pixie` package. Never tell users to import from submodules (`pixie.instrumentation`, `pixie.evals`, `pixie.storage.evaluable`, etc.) — always use `from pixie import ...`.
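A quick way to catch accidental submodule imports in generated files is a small lint pass. This helper is hypothetical, not part of pixie:

```python
import re

# Flags imports like `from pixie.evals import ...` while allowing `from pixie import ...`.
SUBMODULE_IMPORT = re.compile(r"^\s*from\s+pixie\.[\w.]+\s+import\s+.*$", re.MULTILINE)

def bad_pixie_imports(source: str) -> list[str]:
    """Return offending import lines found in a Python source string."""
    return [m.group(0).strip() for m in SUBMODULE_IMPORT.finditer(source)]

print(bad_pixie_imports("from pixie import observe\nfrom pixie.evals import FactualityEval\n"))
# → ['from pixie.evals import FactualityEval']
```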

---

Stage 4: Write the Eval Test File


Write the test file before building the dataset. This might seem backwards, but it forces you to decide what you're actually measuring before you start collecting data — otherwise the data collection has no direction.
Create `pixie_qa/tests/test_<feature>.py`. The pattern is: a `runnable` adapter that calls the app's existing production function, plus an async test function that calls `assert_dataset_pass`:
```python
from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call

from myapp import answer_question


def runnable(eval_input):
    """Replays one dataset item through the app.

    Calls the same function the production app uses.
    enable_storage() here ensures traces are captured during eval runs.
    """
    enable_storage()
    answer_question(**eval_input)


async def test_factuality():
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="<dataset-name>",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )
```
Note that `enable_storage()` belongs inside the `runnable`, not at module level in the test file — it needs to fire on each invocation so the trace is captured for that specific run.
The `runnable` calls the same function that production uses — it does not create a new code path. The only addition is `enable_storage()` to capture traces during eval.
The test runner is `pixie test` (not `pytest`):
```bash
pixie test                           # run all test_*.py in current directory
pixie test pixie_qa/tests/           # specify path
pixie test -k factuality             # filter by name
pixie test -v                        # verbose: shows per-case scores and reasoning
```
`pixie test` automatically finds the project root (the directory containing `pyproject.toml`, `setup.py`, or `setup.cfg`) and adds it to `sys.path` — just like pytest. No `sys.path` hacks are needed in test files.
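That root-finding behaviour can be approximated as follows. This is a sketch of the described behaviour, not pixie's actual code:

```python
import tempfile
from pathlib import Path
from typing import Optional

MARKERS = ("pyproject.toml", "setup.py", "setup.cfg")

def find_project_root(start: Path) -> Optional[Path]:
    """Walk upward from `start` until a directory contains a project marker."""
    for directory in [start, *start.parents]:
        if any((directory / m).is_file() for m in MARKERS):
            return directory
    return None

# Demo on a throwaway tree: the marker at the top is found from a nested dir.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pyproject.toml").write_text("")
    tests_dir = root / "pixie_qa" / "tests"
    tests_dir.mkdir(parents=True)
    print(find_project_root(tests_dir) == root)   # True
```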


Stage 5: Build the Dataset


Create the dataset first, then populate it by actually running the app with representative inputs. This is critical — dataset items should contain real app outputs and trace metadata, not fabricated data.
```bash
pixie dataset create <dataset-name>
pixie dataset list   # verify it exists
```

Run the app and capture traces to the dataset


Write a simple script (`pixie_qa/scripts/build_dataset.py`) that calls the instrumented function for each input, flushes traces, then saves them to the dataset:
```python
import asyncio
from pixie import enable_storage, flush, DatasetStore, Evaluable

from myapp import answer_question

GOLDEN_CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the speed of light?", "299,792,458 meters per second"),
]

async def build_dataset():
    enable_storage()
    store = DatasetStore()
    try:
        store.create("qa-golden-set")
    except FileExistsError:
        pass

    for question, expected in GOLDEN_CASES:
        result = answer_question(question=question)
        flush()

        store.append("qa-golden-set", Evaluable(
            eval_input={"question": question},
            eval_output=result,
            expected_output=expected,
        ))

asyncio.run(build_dataset())
```
Alternatively, use the CLI for per-case capture:
```bash
# Run the app (enable_storage() must be active)
python -c "from myapp import main; main('What is the capital of France?')"

# Save the root span to the dataset
pixie dataset save <dataset-name>

# Or specifically save the last LLM call
pixie dataset save <dataset-name> --select last_llm_call

# Add context
pixie dataset save <dataset-name> --notes "basic geography question"

# Attach expected output for evaluators like FactualityEval
echo '"Paris"' | pixie dataset save <dataset-name> --expected-output
```

**Key rules for dataset building:**

- **Always run the app** — never fabricate `eval_output` manually. The whole point is capturing what the app actually produces.
- **Include expected outputs** for comparison-based evaluators like `FactualityEval`. Expected outputs should reflect the **correct LLM response given what the tools/system actually return** — not an idealized answer predicated on fixing non-LLM bugs.
- **Cover the range** of inputs you care about: normal cases, edge cases, things the app might plausibly get wrong.
- When using `pixie dataset save`, the evaluable's `eval_metadata` will automatically include `trace_id` and `span_id` for later debugging.

---

Stage 6: Run the Tests


```bash
pixie test pixie_qa/tests/ -v
```
The `-v` flag shows per-case scores and reasoning, which makes it much easier to see what's passing and what isn't. Check that the pass rates look reasonable given your `ScoreThreshold`.
After this stage, if the user's intent was "setup" — STOP. Report results and ask before proceeding. See "Setup vs. Iteration" above.


Stage 7: Investigate Failures


Only proceed here if the user asked for iteration/fixing, or explicitly confirmed after setup.
When tests fail, the goal is to understand why, not to adjust thresholds until things pass. Investigation must be thorough and documented — the user needs to see the actual data, your reasoning, and your conclusion.

Step 1: Get the detailed test output


```bash
pixie test pixie_qa/tests/ -v    # shows score and reasoning per case
```
Capture the full verbose output. For each failing case, note:
  • The `eval_input` (what was sent)
  • The `eval_output` (what the app produced)
  • The `expected_output` (what was expected, if applicable)
  • The evaluator score and reasoning

Step 2: Inspect the trace data


For each failing case, look up the full trace to see what happened inside the app:
```python
from pixie import DatasetStore

store = DatasetStore()
ds = store.get("<dataset-name>")
for i, item in enumerate(ds.items):
    print(i, item.eval_metadata)   # trace_id is here
```
Then inspect the full span tree:
```python
import asyncio
from pixie import ObservationStore

async def inspect(trace_id: str):
    store = ObservationStore()
    roots = await store.get_trace(trace_id)
    for root in roots:
        print(root.to_text())   # full span tree: inputs, outputs, LLM messages

asyncio.run(inspect("the-trace-id-here"))
```

Step 3: Root-cause analysis


Walk through the trace and identify exactly where the failure originates. Common patterns:

**LLM-related failures** (fix with prompt/model/eval changes):

| Symptom | Likely cause |
| --- | --- |
| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully |
| Agent routes to wrong tool/handoff | Routing prompt or handoff descriptions are ambiguous |
| Output format is wrong | Missing format instructions in prompt |
| LLM hallucinated instead of using tool | Prompt doesn't enforce tool usage |

**Non-LLM failures** (fix with traditional code changes, out of eval scope):

| Symptom | Likely cause |
| --- | --- |
| Tool returned wrong data | Bug in tool implementation — fix the tool, not the eval |
| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code |
| Database returned stale/wrong records | Data issue — fix independently |
| API call failed with error | Infrastructure issue |

For non-LLM failures: note them in the investigation log and recommend the code fix, but do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code. The eval test should measure LLM quality assuming the rest of the system works correctly.
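Non-LLM failures like a keyword mismatch are usually trivial to reproduce outside the agent. The keyword list below is hypothetical, mirroring the FAQ example used later in this skill:

```python
# A deterministic tool-selection check, reproduced in isolation.
SEAT_KEYWORDS = {"seat", "seats", "seating", "plane"}

def matches_seat_faq(question: str) -> bool:
    words = {w.strip("?.,!").lower() for w in question.split()}
    return bool(words & SEAT_KEYWORDS)

print(matches_seat_faq("How many seats are on the plane?"))   # True
print(matches_seat_faq("What rows have extra legroom?"))      # False — the bug
```

Two lines of plain Python confirm the failure is in the tool's matching logic, not in anything the LLM did.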

Step 4: Document findings in MEMORY.md


Every failure investigation must be documented in
pixie_qa/MEMORY.md
in a structured format:
markdown
undefined
每一次失败调查都必须以结构化格式记录到
pixie_qa/MEMORY.md
markdown
undefined

Investigation: <test_name> failure — <date>

调查:<测试名称> 失败 —— <日期>

Test:
test_faq_factuality
in
pixie_qa/tests/test_customer_service.py
Result: 3/5 cases passed (60%), threshold was 80% ≥ 0.7
测试:
test_faq_factuality
,位于
pixie_qa/tests/test_customer_service.py
结果: 5个用例中3个通过(60%),阈值要求为80%的用例得分≥0.7

Failing case 1: "What rows have extra legroom?"

失败用例1:"What rows have extra legroom?"

  • eval_input:
    {"user_message": "What rows have extra legroom?"}
  • eval_output: "I'm sorry, I don't have the exact row numbers for extra legroom..."
  • expected_output: "rows 5-8 Economy Plus with extra legroom"
  • Evaluator score: 0.1 (FactualityEval)
  • Evaluator reasoning: "The output claims not to know the answer while the reference clearly states rows 5-8..."
Trace analysis: Inspected trace `abc123`. The span tree shows:
  1. Triage Agent routed to FAQ Agent ✓
  2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")`
  3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← root cause
Root cause: `faq_lookup_tool` (customer_service.py:112) uses keyword matching. The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`. The question "What rows have extra legroom?" contains none of these keywords, so it falls through to the default "I don't know" response.
Classification: Non-LLM failure — the keyword-matching tool is broken. The LLM agent correctly routed to the FAQ agent and used the tool; the tool itself returned wrong data.
Fix: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in `faq_lookup_tool` (customer_service.py:130). This is a traditional code fix, not an eval/prompt change.
Verification: After fix, re-run:

```bash
python pixie_qa/scripts/build_dataset.py   # refresh dataset
pixie test pixie_qa/tests/ -k faq -v       # verify
```
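For illustration, the fix might look like the sketch below. This `faq_lookup_tool` is a hypothetical reconstruction of the example tool, not the actual customer_service.py code, and the FAQ answers are placeholders:

```python
import re

# Hypothetical reconstruction of the keyword-matching FAQ tool from the
# example above; the real faq_lookup_tool in customer_service.py may differ.
FAQ_ENTRIES = [
    # Fix: "row", "rows", "legroom" added so seating questions phrased
    # without the word "seat" no longer fall through to the default.
    (("seat", "seats", "seating", "plane", "row", "rows", "legroom"),
     "Rows 5-8 are Economy Plus with extra legroom."),
    (("bag", "bags", "baggage", "luggage"),
     "Each passenger may check one bag."),
]

def faq_lookup_tool(question: str) -> str:
    words = set(re.findall(r"[a-z]+", question.lower()))
    for keywords, answer in FAQ_ENTRIES:
        if words & set(keywords):
            return answer
    return "I'm sorry, I don't know the answer to that."
```

After the change, "What rows have extra legroom?" matches on "rows" instead of hitting the default response.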
  • eval_input: {"user_message": "What rows have extra legroom?"}
  • eval_output: "I'm sorry, I don't have the exact row numbers for extra legroom..."
  • expected_output: "rows 5-8 Economy Plus with extra legroom"
  • 评估器得分: 0.1 (FactualityEval)
  • 评估器判断逻辑: "The output claims not to know the answer while the reference clearly states rows 5-8..."
链路分析: 检查链路 `abc123`,跨度树显示:
  1. 分诊Agent路由到FAQ Agent ✓
  2. FAQ Agent调用 `faq_lookup_tool("What rows have extra legroom?")`
  3. `faq_lookup_tool` 返回"I'm sorry, I don't know..." ← 根因
根因: `faq_lookup_tool`(customer_service.py:112)使用关键词匹配。座位相关FAQ的触发关键词是 `["seat", "seats", "seating", "plane"]`。问题"What rows have extra legroom?"不包含这些关键词,所以返回了默认的"我不知道"响应。
分类: 非LLM失败——关键词匹配工具存在bug。 LLM Agent正确路由到FAQ Agent并使用了工具,问题是工具本身返回了错误数据。
修复方案: 在 `faq_lookup_tool` 的座位关键词列表中添加 `"row"`、`"rows"`、`"legroom"`(customer_service.py:130)。这是传统代码修复,不需要修改评估/Prompt。
验证: 修复后重新运行:

```bash
python pixie_qa/scripts/build_dataset.py   # 刷新数据集
pixie test pixie_qa/tests/ -k faq -v       # 验证修复效果
```

Step 5: Fix and re-run

步骤5:修复并重新运行

Make the targeted change, rebuild the dataset if needed, and re-run. Always finish by giving the user the exact commands to verify:

```bash
pixie test pixie_qa/tests/test_<feature>.py -v
```

做针对性修改,必要时重建数据集,然后重新运行。最后始终给用户提供明确的验证命令:

```bash
pixie test pixie_qa/tests/test_<功能名>.py -v
```

Memory Template

记忆模板


Eval Notes: <Project Name>

评估笔记:<项目名称>

How the application works

应用工作原理

Entry point and execution flow

入口点和执行流程

<How to start/run the app. Step-by-step flow from input to output.>
<如何启动/运行应用。从输入到输出的分步流程。>

Inputs to LLM calls

LLM调用的输入

<For EACH LLM call, document: location in code, system prompt, dynamic content, available tools>
<针对每个LLM调用,记录:代码位置、系统Prompt、动态内容、可用工具>

Intermediate processing

中间处理流程

<Steps between input and output: retrieval, routing, tool calls, etc. Code pointers for each.>
<输入到输出之间的步骤:检索、路由、工具调用等。每个步骤的代码位置。>

Final output

最终输出

<What the user sees. Format. Quality expectations.>
<用户看到的内容。格式。质量预期。>

Use cases

用例

<Each scenario with examples of good/bad outputs:>
  1. <Use case 1>: <description>
    • Input example: ...
    • Good output: ...
    • Bad output: ...
<每个场景附带好/坏输出示例:>
  1. <用例1>: <描述>
    • 输入示例:...
    • 好的输出:...
    • 坏的输出:...
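Each use case above can be captured as an item in a golden dataset under `pixie_qa/datasets/`. The field names and file name below are an illustrative assumption, not a prescribed pixie schema; match them to whatever your eval tests actually read:

```python
import json
from pathlib import Path

# Illustrative item shape (assumed, not a prescribed pixie schema):
# one input, the reference answer, and a tag for the use case it covers.
item = {
    "eval_input": {"user_message": "What rows have extra legroom?"},
    "expected_output": "rows 5-8 Economy Plus with extra legroom",
    "use_case": "seating FAQ",
}

path = Path("pixie_qa/datasets/faq_golden.json")  # hypothetical file name
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps([item], indent=2, ensure_ascii=False))
```

Keeping one good-output example per use case in this form is what lets the eval tests score real runs against a concrete reference.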

Evaluation plan

评估计划

What to evaluate and why

评估内容及原因

<Quality dimensions and rationale>
<质量维度和理由>

Evaluators and criteria

评估器和判定标准

| Test | Dataset | Evaluator | Criteria | Rationale |
| --- | --- | --- | --- | --- |
| ... | ... | ... | ... | ... |

| 测试 | 数据集 | 评估器 | 判定标准 | 理由 |
| --- | --- | --- | --- | --- |
| ... | ... | ... | ... | ... |

Data needed for evaluation

评估所需数据

<What data to capture, with code pointers>
<需要采集的数据,附带代码位置>

Datasets

数据集

| Dataset | Items | Purpose |
| --- | --- | --- |
| ... | ... | ... |

| 数据集 | 用例数量 | 用途 |
| --- | --- | --- |
| ... | ... | ... |
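A typical `build_dataset.py` pulls recent traces out of `observations.db` and writes golden items. The sketch below assumes an `observations` table with `input` and `output` text columns; that schema is a guess for illustration, so inspect the real layout (for example with `sqlite3 pixie_qa/observations.db .schema`) before adapting it:

```python
import json
import sqlite3
from pathlib import Path

def build_dataset(db_path, out_path, limit=50):
    """Turn the most recent captured observations into golden items.

    Assumes an `observations` table with input/output text columns;
    adjust the query to the schema your tracing actually produces.
    """
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT input, output FROM observations ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    items = [{"eval_input": json.loads(i), "expected_output": o}
             for i, o in rows]
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(items, indent=2))
    return len(items)
```

Review the emitted items by hand before treating them as golden; captured outputs are only as trustworthy as the runs that produced them.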

Investigation log

调查日志

<date> — <test_name> failure

<日期> —— <测试名称> 失败

<Full structured investigation as described in Stage 7>

---
<如阶段7中描述的完整结构化调查内容>

---

Reference

参考

See `references/pixie-api.md` for all CLI commands, evaluator signatures, and the Python dataset/store API.
查看 `references/pixie-api.md` 获取所有CLI命令、评估器签名和Python数据集/存储API的说明。