# Eval-Driven Development with pixie

This skill is about doing the work, not describing it. When a user asks you to set up evals for their app, you should be reading their code, editing their files, running commands, and producing a working test pipeline — not writing a plan for them to follow later.

All pixie-generated files live in a single `pixie_qa` directory at the project root:

```
pixie_qa/
  MEMORY.md          # your understanding and eval plan
  observations.db    # SQLite trace DB (auto-created by enable_storage)
  datasets/          # golden datasets (JSON files)
  tests/             # eval test files (test_*.py)
  scripts/           # helper scripts (build_dataset.py, etc.)
```

## Setup vs. Iteration: when to stop
This is critical. What you do depends on what the user asked for.
"Setup QA" / "set up evals" / "add tests" (setup intent)
"搭建QA" / "set up evals" / "添加测试"(搭建类需求)
The user wants a working eval pipeline. Your job is Stages 0–6: install, understand, instrument, write tests, build dataset, run tests. Stop after the first test run, regardless of whether tests pass or fail. Report:
- What you set up (instrumentation, test file, dataset)
- The test results (pass/fail, scores)
- If tests failed: a brief summary of what failed and likely causes — but do NOT fix anything
Then ask: "QA setup is complete. Tests show N/M passing. Want me to investigate the failures and start iterating?"
Only proceed to Stage 7 (investigation and fixes) if the user confirms.
Exception: If the test run itself errors out (import failures, missing API keys, configuration bugs) — those are setup problems, not eval failures. Fix them and re-run until you get a clean test execution where pass/fail reflects actual app quality, not broken plumbing.
"Fix" / "improve" / "debug" / "why is X failing" (iteration intent)
"修复" / "优化" / "调试" / "为什么X运行失败"(迭代类需求)
The user wants you to investigate and fix. Proceed through all stages including Stage 7 — investigate failures, root-cause them, apply fixes, rebuild dataset, re-run tests, iterate.
### Ambiguous requests
If the intent is unclear, default to setup only and ask before iterating. It's better to stop early and ask than to make unwanted changes to the user's application code.
## The eval boundary: what to evaluate
Eval-driven development focuses on LLM-dependent behaviour. The purpose is to catch quality regressions in the parts of the system that are non-deterministic and hard to test with traditional unit tests — namely, LLM calls and the decisions they drive.
### In scope (evaluate this)
- LLM response quality: factual accuracy, relevance, format compliance, safety
- Agent routing decisions: did the LLM choose the right tool/handoff/action?
- Prompt effectiveness: does the prompt produce the desired behaviour?
- Multi-turn coherence: does the agent maintain context across turns?
### Out of scope (do NOT evaluate this with evals)
- Tool implementations (database queries, API calls, keyword matching, business logic) — these are traditional software; test them with unit tests
- Infrastructure (authentication, rate limiting, caching, serialization)
- Deterministic post-processing (formatting, filtering, sorting results)
The boundary is: everything downstream of the LLM call (tools, databases, APIs) produces deterministic outputs that serve as inputs to the LLM-powered system. Eval tests should treat those as given facts and focus on what the LLM does with them.
Example: If an FAQ tool has a keyword-matching bug that returns wrong data, that's a traditional bug — fix it with a regular code change, not by adjusting eval thresholds. The eval tests exist to verify that given correct tool outputs, the LLM agent produces correct user-facing responses.
When building datasets and expected outputs, use the actual tool/system outputs as ground truth. The expected output for an eval case should reflect what a correct LLM response looks like given the tool results the system actually produces.
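The boundary can be sketched in plain Python, with no pixie involved. `faq_lookup_tool` and `fake_llm_response` below are hypothetical stand-ins: the tool output is deterministic ground truth, and the only eval question is what the LLM-side response does with it.

```python
def faq_lookup_tool(question: str) -> str:
    # Deterministic, non-LLM code: cover this with ordinary unit tests.
    return "Rows 5-8 are Economy Plus with extra legroom."

def fake_llm_response(tool_output: str) -> str:
    # Stand-in for the non-deterministic LLM call in a real app.
    return f"Good news: {tool_output}"

tool_output = faq_lookup_tool("What rows have extra legroom?")
response = fake_llm_response(tool_output)

# The eval only asks: did the response faithfully use the tool output?
print("5-8" in response and "legroom" in response)  # True
```

If the tool itself returned the wrong rows, that would be a unit-test bug, not an eval failure.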
## Stage 0: Ensure pixie-qa is Installed and API Keys Are Set
Before doing anything else, check that the `pixie-qa` package is available:

```bash
python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"
```

If it's not installed, install it:

```bash
pip install pixie-qa
```

This provides the `pixie` Python module, the `pixie` CLI, and the `pixie test` runner — all required for instrumentation and evals. Don't skip this step; everything else in this skill depends on it.

### Verify API keys
The application under test almost certainly needs an LLM provider API key (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`). LLM-as-judge evaluators like `FactualityEval` also need `OPENAI_API_KEY`. Before running anything, verify the key is set:

```bash
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"
```

If not set, ask the user. Do not proceed with running the app or evals without it — you'll get silent failures or import-time errors.
## Stage 1: Understand the Application
Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would, and it puts you in a much better position to make good decisions about what and how to evaluate.
### What to investigate
1. **How the software runs**: What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?

2. **All inputs to the LLM**: This is not limited to the user's message. Trace every piece of data that gets incorporated into any LLM prompt:
   - User input (queries, messages, uploaded files)
   - System prompts (hardcoded or templated)
   - Retrieved context (RAG chunks, search results, database records)
   - Tool definitions and function schemas
   - Conversation history / memory
   - Configuration or feature flags that change prompt behavior

3. **All intermediate steps and outputs**: Walk through the code path from input to final output and document each stage:
   - Retrieval / search results
   - Tool calls and their results
   - Agent routing / handoff decisions
   - Intermediate LLM calls (e.g., summarization before final answer)
   - Post-processing or formatting steps

4. **The final output**: What does the user see? What format is it in? What are the quality expectations?

5. **Use cases and expected behaviors**: What are the distinct things the app is supposed to handle? For each use case, what does a "good" response look like? What would constitute a failure?
### Write MEMORY.md
Write your findings down in `pixie_qa/MEMORY.md`. This is the primary working document for the eval effort. It should be human-readable and detailed enough that someone unfamiliar with the project can understand the application and the eval strategy.

**CRITICAL**: MEMORY.md documents your understanding of the existing application code. It must NOT contain references to pixie commands, instrumentation code you plan to add, or scripts/functions that don't exist yet. Those belong in later sections, only after they've been implemented.

The understanding section should include:
```markdown
# Eval Notes: <Project Name>

## How the application works

### Entry point and execution flow
<Describe how to start/run the app, what happens step by step>

### Inputs to LLM calls
<For each LLM call in the codebase, document:>
- Where it is in the code (file + function name)
- What system prompt it uses (quote it or summarize)
- What user/dynamic content feeds into it
- What tools/functions are available to it

### Intermediate processing
<Describe any steps between input and output:>
- Retrieval, routing, tool execution, etc.
- Include code pointers (file:line) for each step

### Final output
<What the user sees, what format, what the quality bar should be>

### Use cases
<List each distinct scenario the app handles, with examples of good/bad outputs>

## Evaluation plan

### What to evaluate and why
<Quality dimensions: factual accuracy, relevance, format compliance, safety, etc.>

### Evaluation granularity
<Which function/span boundary captures one "test case"? Why that boundary?>

### Evaluators and criteria
<For each eval test, specify: evaluator, dataset, threshold, reasoning>

### Data needed for evaluation
<What data points need to be captured, with code pointers to where they live>
```

If something is genuinely unclear from the code, ask the user — but most questions answer themselves once you've read the code carefully.

---

## Stage 2: Decide What to Evaluate
Now that you understand the app, you can make thoughtful choices about what to measure:

- **What quality dimension matters most?** Factual accuracy for QA apps, output format for structured extraction, relevance for RAG, safety for user-facing text.
- **Which span to evaluate**: the whole pipeline (`root`) or just the LLM call (`last_llm_call`)? If you're debugging retrieval, you might evaluate at a different point than if you're checking final answer quality.
- **Which evaluators fit**: see `references/pixie-api.md` → Evaluators. For factual QA: `FactualityEval`. For structured output: `ValidJSONEval` / `JSONDiffEval`. For RAG pipelines: `ContextRelevancyEval` / `FaithfulnessEval`.
- **Pass criteria**: `ScoreThreshold(threshold=0.7, pct=0.8)` means 80% of cases must score ≥ 0.7. Think about what "good enough" looks like for this app.
- **Expected outputs**: `FactualityEval` needs them. Format evaluators usually don't.

Update `pixie_qa/MEMORY.md` with the plan before writing any code.
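To make the pass-criteria semantics concrete, this is the arithmetic that `ScoreThreshold(threshold=0.7, pct=0.8)` implies, sketched in plain Python. It is not pixie's implementation, just the rule it encodes: the run passes when at least 80% of cases score at or above 0.7.

```python
def passes(scores, threshold=0.7, pct=0.8):
    # Fraction of cases meeting the per-case threshold must reach pct.
    ok = sum(1 for s in scores if s >= threshold)
    return ok / len(scores) >= pct

print(passes([0.9, 0.8, 0.75, 0.7, 0.2]))  # True  (4/5 = 80%)
print(passes([0.9, 0.8, 0.6, 0.5, 0.2]))   # False (2/5 = 40%)
```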
## Stage 3: Instrument the Application
Add pixie instrumentation to the existing production code. The goal is to capture the inputs and outputs of functions that are already part of the application's normal execution path. Instrumentation must be on the real code path — the same code that runs when the app is used in production — so that traces are captured both during eval runs and real usage.
### Add `enable_storage()` at application startup

Call `enable_storage()` once at the beginning of the application's startup code — inside `main()`, or at the top of a server's initialization. Never at module level (top of a file outside any function), because that causes storage setup to trigger on import.

Good places:

- Inside `if __name__ == "__main__":` blocks
- In a FastAPI `lifespan` handler or `on_startup`
- At the top of `main()` / `run()` functions
- Inside the `runnable` function in test files

```python
# ✅ CORRECT — at application startup
async def main():
    enable_storage()
    ...

# ✅ CORRECT — in a runnable for tests
def runnable(eval_input):
    enable_storage()
    my_function(**eval_input)

# ❌ WRONG — at module level, runs on import
from pixie import enable_storage
enable_storage()  # this runs when any file imports this module!
```
### Wrap existing functions with `@observe` or `start_observation`

**CRITICAL**: Instrument the production code path. Never create separate functions or alternate code paths for testing.

The `@observe` decorator or `start_observation` context manager goes on the existing function that the app actually calls during normal operation. If the app's entry point is an interactive `main()` loop, instrument `main()` or the core function it calls per user turn — not a new helper function that duplicates logic.
undefined✅ CORRECT — decorating the existing production function
✅ 正确 —— 装饰现有生产函数
from pixie import observe
@observe(name="answer_question")
def answer_question(question: str, context: str) -> str: # existing function
... # existing code, unchanged
```pythonfrom pixie import observe
@observe(name="answer_question")
def answer_question(question: str, context: str) -> str: # 现有函数
... # 现有代码,无需修改
```python✅ CORRECT — context manager inside an existing function
✅ 正确 —— 在现有函数内部使用上下文管理器
from pixie import start_observation
async def main(): # existing function
...
with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
result = await Runner.run(current_agent, input_items, context=context)
# ... existing response handling ...
obs.set_output(response_text)
...
```pythonfrom pixie import start_observation
async def main(): # 现有函数
...
with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
result = await Runner.run(current_agent, input_items, context=context)
# ... 现有响应处理逻辑 ...
obs.set_output(response_text)
...
```python❌ WRONG — creating a new function that duplicates logic from main()
❌ 错误 —— 创建新函数重复main()的逻辑
@observe(name="run_for_eval")
async def run_for_eval(user_messages: list[str]) -> str:
# This duplicates what main() does, creating a separate code path
# that diverges from production. Don't do this.
...
**Rules:**
- **Never add new wrapper functions** to the application code for eval purposes.
- **Never change the function's interface** (arguments, return type, behavior).
- **Never duplicate production logic** into a separate "testable" function.
- The instrumentation is purely additive — if you removed all pixie imports and decorators, the app would work identically.
- After instrumentation, call `flush()` at the end of runs to make sure all spans are written.
- For interactive apps (CLI loops, chat interfaces), instrument the **per-turn processing** function — the one that takes user input and produces a response. The eval `runnable` should call this same function.
**Important**: All pixie symbols are importable from the top-level `pixie` package. Never tell users to import from submodules (`pixie.instrumentation`, `pixie.evals`, `pixie.storage.evaluable`, etc.) — always use `from pixie import ...`.
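For the interactive-app rule above, the shape to aim for can be sketched in plain Python. No pixie appears here, and `handle_turn`, `main_loop`, and `runnable` are hypothetical names; in the real app the per-turn function would carry the `@observe` decorator. The point is that the loop stays thin and one function owns a full user turn, so production and the eval runnable call the same code:

```python
def handle_turn(user_input: str) -> str:
    # ... routing, tool calls, and the LLM call would live here ...
    return f"echo: {user_input}"

def main_loop(inputs):
    # Production path: the loop just feeds turns to handle_turn().
    return [handle_turn(text) for text in inputs]

def runnable(eval_input):
    # Eval path: calls the SAME function — no duplicated logic.
    return handle_turn(**eval_input)

print(main_loop(["hi"]))               # ['echo: hi']
print(runnable({"user_input": "hi"}))  # echo: hi
```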
---

## Stage 4: Write the Eval Test File
Write the test file before building the dataset. This might seem backwards, but it forces you to decide what you're actually measuring before you start collecting data — otherwise the data collection has no direction.

Create `pixie_qa/tests/test_<feature>.py`. The pattern is: a `runnable` adapter that calls the app's existing production function, plus an async test function that calls `assert_dataset_pass`:

```python
from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
from myapp import answer_question

def runnable(eval_input):
    """Replays one dataset item through the app.

    Calls the same function the production app uses.
    enable_storage() here ensures traces are captured during eval runs.
    """
    enable_storage()
    answer_question(**eval_input)

async def test_factuality():
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="<dataset-name>",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )
```

Note that `enable_storage()` belongs inside the `runnable`, not at module level in the test file — it needs to fire on each invocation so the trace is captured for that specific run.

The `runnable` calls the same function that production uses — it does not create a new code path. The only addition is `enable_storage()` to capture traces during eval.

The test runner is `pixie test` (not `pytest`):

```bash
pixie test                    # run all test_*.py in current directory
pixie test pixie_qa/tests/    # specify path
pixie test -k factuality      # filter by name
pixie test -v                 # verbose: shows per-case scores and reasoning
```
## Stage 5: Build the Dataset
Create the dataset first, then populate it by actually running the app with representative inputs. This is critical — dataset items should contain real app outputs and trace metadata, not fabricated data.

```bash
pixie dataset create <dataset-name>
pixie dataset list    # verify it exists
```

### Run the app and capture traces to the dataset
Write a simple script (`pixie_qa/scripts/build_dataset.py`) that calls the instrumented function for each input, flushes traces, then saves them to the dataset:

```python
import asyncio
from pixie import enable_storage, flush, DatasetStore, Evaluable
from myapp import answer_question

GOLDEN_CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the speed of light?", "299,792,458 meters per second"),
]

async def build_dataset():
    enable_storage()
    store = DatasetStore()
    try:
        store.create("qa-golden-set")
    except FileExistsError:
        pass
    for question, expected in GOLDEN_CASES:
        result = answer_question(question=question)
        flush()
        store.append("qa-golden-set", Evaluable(
            eval_input={"question": question},
            eval_output=result,
            expected_output=expected,
        ))

asyncio.run(build_dataset())
```

Alternatively, use the CLI for per-case capture:

```bash
# Run the app (enable_storage() must be active)
python -c "from myapp import main; main('What is the capital of France?')"

# Save the root span to the dataset
pixie dataset save <dataset-name>

# Or specifically save the last LLM call:
pixie dataset save <dataset-name> --select last_llm_call

# Add context:
pixie dataset save <dataset-name> --notes "basic geography question"

# Attach expected output for evaluators like FactualityEval:
echo '"Paris"' | pixie dataset save <dataset-name> --expected-output
```
**Key rules for dataset building:**
- **Always run the app** — never fabricate `eval_output` manually. The whole point is capturing what the app actually produces.
- **Include expected outputs** for comparison-based evaluators like `FactualityEval`. Expected outputs should reflect the **correct LLM response given what the tools/system actually return** — not an idealized answer predicated on fixing non-LLM bugs.
- **Cover the range** of inputs you care about: normal cases, edge cases, things the app might plausibly get wrong.
- When using `pixie dataset save`, the evaluable's `eval_metadata` will automatically include `trace_id` and `span_id` for later debugging.
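As a reading aid, one dataset item has roughly this shape, inferred from the `Evaluable` fields used above and the `eval_metadata` note. This is a sketch of the logical record, not pixie's actual on-disk JSON layout, which may differ:

```python
import json

# Hypothetical item shape; field names come from Evaluable / eval_metadata
# as used in this document, not from inspecting pixie's storage format.
item = {
    "eval_input": {"question": "What is the capital of France?"},
    "eval_output": "The capital of France is Paris.",
    "expected_output": "Paris",
    "eval_metadata": {"trace_id": "<trace-id>", "span_id": "<span-id>"},
}
print(json.dumps(item, indent=2))
```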
---

## Stage 6: Run the Tests
```bash
pixie test pixie_qa/tests/ -v
```

The `-v` flag shows per-case scores and reasoning, which makes it much easier to see what's passing and what isn't. Check that the pass rates look reasonable given your `ScoreThreshold`.

**After this stage, if the user's intent was "setup" — STOP.** Report results and ask before proceeding. See "Setup vs. Iteration" above.
## Stage 7: Investigate Failures
Only proceed here if the user asked for iteration/fixing, or explicitly confirmed after setup.
When tests fail, the goal is to understand why, not to adjust thresholds until things pass. Investigation must be thorough and documented — the user needs to see the actual data, your reasoning, and your conclusion.
### Step 1: Get the detailed test output
```bash
pixie test pixie_qa/tests/ -v    # shows score and reasoning per case
```

Capture the full verbose output. For each failing case, note:

- The `eval_input` (what was sent)
- The `eval_output` (what the app produced)
- The `expected_output` (what was expected, if applicable)
- The evaluator score and reasoning
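A tiny bookkeeping sketch of those four facts as one record, ready to paste into the investigation log. This is plain Python, not a pixie API, and the values are illustrative:

```python
# One failing case captured as a dict for the investigation log.
failing_case = {
    "eval_input": {"user_message": "What rows have extra legroom?"},
    "eval_output": "I'm sorry, I don't have the exact row numbers...",
    "expected_output": "rows 5-8 Economy Plus with extra legroom",
    "score": 0.1,
    "reasoning": "Output claims not to know; reference states rows 5-8.",
}
for field in ("eval_input", "eval_output", "expected_output", "score", "reasoning"):
    print(field, "->", failing_case[field])
```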
### Step 2: Inspect the trace data
For each failing case, look up the full trace to see what happened inside the app:

```python
from pixie import DatasetStore

store = DatasetStore()
ds = store.get("<dataset-name>")
for i, item in enumerate(ds.items):
    print(i, item.eval_metadata)  # trace_id is here
```

Then inspect the full span tree:

```python
import asyncio
from pixie import ObservationStore

async def inspect(trace_id: str):
    store = ObservationStore()
    roots = await store.get_trace(trace_id)
    for root in roots:
        print(root.to_text())  # full span tree: inputs, outputs, LLM messages

asyncio.run(inspect("the-trace-id-here"))
```

### Step 3: Root-cause analysis
Walk through the trace and identify exactly where the failure originates. Common patterns:
**LLM-related failures** (fix with prompt/model/eval changes):
| Symptom | Likely cause |
|---|---|
| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully |
| Agent routes to wrong tool/handoff | Routing prompt or handoff descriptions are ambiguous |
| Output format is wrong | Missing format instructions in prompt |
| LLM hallucinated instead of using tool | Prompt doesn't enforce tool usage |
**Non-LLM failures** (fix with traditional code changes, out of eval scope):
| Symptom | Likely cause |
|---|---|
| Tool returned wrong data | Bug in tool implementation — fix the tool, not the eval |
| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code |
| Database returned stale/wrong records | Data issue — fix independently |
| API call failed with error | Infrastructure issue |
For non-LLM failures: note them in the investigation log and recommend the code fix, but do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code. The eval test should measure LLM quality assuming the rest of the system works correctly.
### Step 4: Document findings in MEMORY.md
Every failure investigation must be documented in `pixie_qa/MEMORY.md` in a structured format:

````markdown
## Investigation: test_faq_factuality failure — <date>

**Test**: `test_faq_factuality` in `pixie_qa/tests/test_customer_service.py`
**Result**: 3/5 cases passed (60%), threshold was 80% ≥ 0.7

### Failing case 1: "What rows have extra legroom?"

- eval_input: `{"user_message": "What rows have extra legroom?"}`
- eval_output: "I'm sorry, I don't have the exact row numbers for extra legroom..."
- expected_output: "rows 5-8 Economy Plus with extra legroom"
- Evaluator score: 0.1 (FactualityEval)
- Evaluator reasoning: "The output claims not to know the answer while the reference clearly states rows 5-8..."

**Trace analysis**: Inspected trace `abc123`. The span tree shows:

- Triage Agent routed to FAQ Agent ✓
- FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓
- `faq_lookup_tool` returned "I'm sorry, I don't know..." ← root cause

**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
The question "What rows have extra legroom?" contains none of these keywords, so it
falls through to the default "I don't know" response.

**Classification**: Non-LLM failure — the keyword-matching tool is broken.
The LLM agent correctly routed to the FAQ agent and used the tool; the tool
itself returned wrong data.

**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix,
not an eval/prompt change.

**Verification**: After fix, re-run:

```bash
python pixie_qa/scripts/build_dataset.py   # refresh dataset
pixie test pixie_qa/tests/ -k faq -v       # verify
```
````

### Step 5: Fix and re-run
Make the targeted change, rebuild the dataset if needed, and re-run. Always finish by giving the user the exact commands to verify:

```bash
pixie test pixie_qa/tests/test_<feature>.py -v
```

## Memory Template
```markdown
# Eval Notes: <Project Name>

## How the application works

### Entry point and execution flow
<How to start/run the app. Step-by-step flow from input to output.>

### Inputs to LLM calls
<For EACH LLM call, document: location in code, system prompt, dynamic content, available tools>

### Intermediate processing
<Steps between input and output: retrieval, routing, tool calls, etc. Code pointers for each.>

### Final output
<What the user sees. Format. Quality expectations.>

### Use cases
<Each scenario with examples of good/bad outputs:>
- <Use case 1>: <description>
  - Input example: ...
  - Good output: ...
  - Bad output: ...

## Evaluation plan

### What to evaluate and why
<Quality dimensions and rationale>

### Evaluators and criteria

| Test | Dataset | Evaluator | Criteria | Rationale |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |

### Data needed for evaluation
<What data to capture, with code pointers>

## Datasets

| Dataset | Items | Purpose |
|---|---|---|
| ... | ... | ... |

## Investigation log

### <date> — <test_name> failure
<Full structured investigation as described in Stage 7>
```

---

## Reference
See `references/pixie-api.md` for all CLI commands, evaluator signatures, and the Python dataset/store API.