Instrument Python LLM apps, build golden datasets, write eval-based tests, run them, and root-cause failures — covering the full eval-driven development cycle. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM, even if they don't say "evals" explicitly. Use for making sure an AI app works correctly, catching regressions after prompt changes, debugging why an agent started behaving differently, or validating output quality before shipping.
Install the skill:

```bash
npx skill4agent add yiouli/pixie-qa
```

The skill (`eval-driven-dev`) keeps its working files in a `pixie_qa/` directory at the project root:

```
pixie_qa/
├── MEMORY.md          # your understanding and eval plan
├── observations.db    # SQLite trace DB (auto-created by enable_storage)
├── datasets/          # golden datasets (JSON files)
├── tests/             # eval test files (test_*.py)
└── scripts/           # helper scripts (build_dataset.py, etc.)
```

Check whether the `pixie-qa` package is installed, and install it if not:

```bash
python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"
pip install pixie-qa
```

Installing `pixie-qa` provides both the `pixie` Python package and the `pixie` CLI (used below for `pixie test` and the dataset commands). LLM-judge evaluators such as `FactualityEval` need an `OPENAI_API_KEY` (or `ANTHROPIC_API_KEY`); verify the key is set:

```bash
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"
```

Record your understanding of the app in `pixie_qa/MEMORY.md`, starting from this template:

# Eval Notes: <Project Name>
## How the application works
### Entry point and execution flow
<Describe how to start/run the app, what happens step by step>
### Inputs to LLM calls
<For each LLM call in the codebase, document:>
- Where it is in the code (file + function name)
- What system prompt it uses (quote it or summarize)
- What user/dynamic content feeds into it
- What tools/functions are available to it
### Intermediate processing
<Describe any steps between input and output:>
- Retrieval, routing, tool execution, etc.
- Include code pointers (file:line) for each step
### Final output
<What the user sees, what format, what the quality bar should be>
### Use cases
<List each distinct scenario the app handles, with examples of good/bad outputs>
## Evaluation plan
### What to evaluate and why
<Quality dimensions: factual accuracy, relevance, format compliance, safety, etc.>
### Evaluation granularity
<Which function/span boundary captures one "test case"? Why that boundary?>
### Evaluators and criteria
<For each eval test, specify: evaluator, dataset, threshold, reasoning>
### Data needed for evaluation
<What data points need to be captured, with code pointers to where they live>

When planning, pick a trace selector for each test: `root` captures the whole run as one test case, while `last_llm_call` captures only the final LLM call. Built-in evaluators include `FactualityEval`, `ValidJSONEval`, `JSONDiffEval`, `ContextRelevancyEval`, and `FaithfulnessEval`; pass criteria are expressed as, for example, `ScoreThreshold(threshold=0.7, pct=0.8)`, meaning at least 80% of cases must score 0.7 or higher. `FactualityEval` compares against a reference answer, so it needs an expected output attached to each case. See `references/pixie-api.md` for the full API, and record the resulting plan in `pixie_qa/MEMORY.md`.
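To make the evaluator idea concrete, here is a minimal pure-Python sketch of what a `ValidJSONEval`-style format check plausibly computes (a guess at the semantics, not pixie's implementation):

```python
import json

def valid_json_score(output: str) -> float:
    """Score 1.0 if the output parses as JSON, else 0.0 (conceptual sketch)."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```

Format evaluators like this are deterministic and need no API key; only LLM-judge evaluators do.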
Call `enable_storage()` exactly once, at application startup: inside `main()` (or whatever runs under `if __name__ == "__main__":`), in a framework `lifespan`/`on_startup` hook, or inside the test `runnable`. Never call it at module level.

```python
# ✅ CORRECT — at application startup
async def main():
    enable_storage()
    ...

# ✅ CORRECT — in a runnable for tests
def runnable(eval_input):
    enable_storage()
    my_function(**eval_input)

# ❌ WRONG — at module level, runs on import
from pixie import enable_storage
enable_storage()  # this runs when any file imports this module!
```

Next, instrument the app with `@observe` or `start_observation`. Always instrument the existing production code path: decorate the function that already does the work, or open `start_observation` inside an existing function such as `main()`. Never build a new wrapper around `main()`'s logic.
```python
# ✅ CORRECT — decorating the existing production function
from pixie import observe

@observe(name="answer_question")
def answer_question(question: str, context: str) -> str:  # existing function
    ...  # existing code, unchanged
```
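Conceptually, a tracing decorator records a call's inputs and output around the unchanged function body. A stdlib-only sketch of the idea (this is not pixie's implementation; `SPANS` stands in for the trace store):

```python
import functools

SPANS = []  # stand-in for the trace store

def observe_sketch(name):
    """Record each call's keyword inputs and output without touching the body."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            SPANS.append({"name": name, "input": kwargs, "output": result})
            return result
        return wrapper
    return decorator

@observe_sketch(name="answer_question")
def answer_question(question: str) -> str:
    return f"Answering: {question}"

answer_question(question="What is the capital of France?")
```

This is why decorating the existing function is safe: the production code path is untouched, and the trace is a side effect.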
```python
# ✅ CORRECT — context manager inside an existing function
from pixie import start_observation

async def main():  # existing function
    ...
    with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
        result = await Runner.run(current_agent, input_items, context=context)
        # ... existing response handling ...
        obs.set_output(response_text)
    ...
```
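The context-manager form can likewise be sketched in stdlib Python (a conceptual model only, not pixie's internals; `SPANS` again stands in for the trace store):

```python
from contextlib import contextmanager

SPANS = []  # stand-in for the trace store

@contextmanager
def start_observation_sketch(input, name):
    span = {"name": name, "input": input, "output": None}
    try:
        yield span           # caller fills in the output
    finally:
        SPANS.append(span)   # recorded even if the body raises

with start_observation_sketch(input={"user_input": "hi"}, name="handle_turn") as obs:
    obs["output"] = "hello!"
```

The `finally` is the point: the span is persisted even when the wrapped code raises, so failed turns still show up in the trace DB.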
```python
# ❌ WRONG — creating a new function that duplicates logic from main()
@observe(name="run_for_eval")
async def run_for_eval(user_messages: list[str]) -> str:
    # This duplicates what main() does, creating a separate code path
    # that diverges from production. Don't do this.
    ...
```

In scripts, call `flush()` after the instrumented code runs so buffered spans are persisted before you read them back. Import everything from the top-level package (`from pixie import ...`), never from internal modules such as `pixie.instrumentation`, `pixie.evals`, or `pixie.storage.evaluable`.
Write each eval test in `pixie_qa/tests/test_<feature>.py`: a `runnable` that replays one dataset item, plus a test that calls `assert_dataset_pass`:

```python
from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
from myapp import answer_question

def runnable(eval_input):
    """Replays one dataset item through the app.

    Calls the same function the production app uses.
    enable_storage() here ensures traces are captured during eval runs.
    """
    enable_storage()
    answer_question(**eval_input)

async def test_factuality():
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="<dataset-name>",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )
```

Keep `enable_storage()` inside the `runnable` so every replay captures a trace.
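What `ScoreThreshold(threshold=0.7, pct=0.8)` asks for can be written out directly. A sketch of the presumed semantics (at least 80% of cases scoring 0.7 or higher):

```python
def score_threshold_passes(scores, threshold=0.7, pct=0.8):
    """True if at least `pct` of scores meet or exceed `threshold`."""
    passing = sum(1 for s in scores if s >= threshold)
    return passing / len(scores) >= pct

# 4 of 5 cases at or above 0.7 -> 0.8 >= 0.8, so the suite passes
score_threshold_passes([0.9, 0.8, 0.75, 0.7, 0.1])
```

The `pct` knob is what lets a suite tolerate a known-flaky case without lowering the per-case quality bar.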
Run the tests with `pixie test`, not plain `pytest`:

```bash
pixie test                       # run all test_*.py in current directory
pixie test pixie_qa/tests/       # specify path
pixie test -k factuality         # filter by name
pixie test -v                    # verbose: shows per-case scores and reasoning
```

`pixie test` locates the project root via `pyproject.toml`, `setup.py`, or `setup.cfg`; if the app is not an installed package, make sure its directory is on `sys.path` so the test file can import it.
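If imports fail, one common remedy is prepending the project root from the test file itself. A sketch assuming the test sits two directory levels below the root (in a real test file you would use `Path(__file__)`; a literal path stands in here, and `parents[...]` must match your layout):

```python
import sys
from pathlib import Path

test_file = Path("pixie_qa/tests/test_feature.py")  # hypothetical location
project_root = test_file.resolve().parents[2]       # -> the repo root
sys.path.insert(0, str(project_root))
```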
Create the dataset before adding cases:

```bash
pixie dataset create <dataset-name>
pixie dataset list               # verify it exists
```

To build a golden dataset programmatically, write `pixie_qa/scripts/build_dataset.py`:
```python
import asyncio
from pixie import enable_storage, flush, DatasetStore, Evaluable
from myapp import answer_question

GOLDEN_CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the speed of light?", "299,792,458 meters per second"),
]

async def build_dataset():
    enable_storage()
    store = DatasetStore()
    try:
        store.create("qa-golden-set")
    except FileExistsError:
        pass
    for question, expected in GOLDEN_CASES:
        result = answer_question(question=question)
        flush()
        store.append("qa-golden-set", Evaluable(
            eval_input={"question": question},
            eval_output=result,
            expected_output=expected,
        ))

asyncio.run(build_dataset())
```
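Each `Evaluable` presumably serializes to JSON under `pixie_qa/datasets/`. A sketch of the shape implied by the constructor above (the real on-disk format may differ):

```python
import json

item = {
    "eval_input": {"question": "What is the capital of France?"},
    "eval_output": "The capital of France is Paris.",
    "expected_output": "Paris",
}
# The three fields round-trip cleanly as JSON
assert json.loads(json.dumps(item)) == item
```

Keeping items JSON-serializable is what makes datasets easy to diff, review, and check into version control.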
Alternatively, capture cases from live runs with the CLI:

```bash
# Run the app (enable_storage() must be active)
python -c "from myapp import main; main('What is the capital of France?')"

# Save the root span to the dataset
pixie dataset save <dataset-name>

# Or specifically save the last LLM call:
pixie dataset save <dataset-name> --select last_llm_call

# Add context:
pixie dataset save <dataset-name> --notes "basic geography question"

# Attach expected output for evaluators like FactualityEval:
echo '"Paris"' | pixie dataset save <dataset-name> --expected-output
```

With this flow, `eval_output` comes from the captured trace; `FactualityEval` needs the attached expected output, which is piped in as JSON (hence the inner quotes in `'"Paris"'`). `pixie dataset save` also records `eval_metadata` with the `trace_id` and `span_id` of the saved span, which is what makes failures traceable later.

When a test fails, re-run with `-v` to see each case's `ScoreThreshold` result:

```bash
pixie test pixie_qa/tests/ -v    # shows score and reasoning per case
```
For each failing case, note its `eval_input`, `eval_output`, and `expected_output`, then look up its trace ID:

```python
from pixie import DatasetStore

store = DatasetStore()
ds = store.get("<dataset-name>")
for i, item in enumerate(ds.items):
    print(i, item.eval_metadata)  # trace_id is here
```
Then inspect the full trace:

```python
import asyncio
from pixie import ObservationStore

async def inspect(trace_id: str):
    store = ObservationStore()
    roots = await store.get_trace(trace_id)
    for root in roots:
        print(root.to_text())  # full span tree: inputs, outputs, LLM messages

asyncio.run(inspect("the-trace-id-here"))
```

Classify the root cause. LLM failures, which you address with prompt or eval changes:

| Symptom | Likely cause |
|---|---|
| Symptom | Likely cause |
|---|---|
| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully |
| Agent routes to wrong tool/handoff | Routing prompt or handoff descriptions are ambiguous |
| Output format is wrong | Missing format instructions in prompt |
| LLM hallucinated instead of using tool | Prompt doesn't enforce tool usage |

Non-LLM failures, where you fix the underlying code rather than the prompt or the eval:

| Symptom | Likely cause |
|---|---|
| Tool returned wrong data | Bug in tool implementation — fix the tool, not the eval |
| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code |
| Database returned stale/wrong records | Data issue — fix independently |
| API call failed with error | Infrastructure issue |
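The keyword-mismatch row is a recurring pattern: the tool only matches literal keywords, so paraphrases fall through to a default answer. A minimal sketch of this bug class, using a hypothetical `faq_lookup` helper (not from any real codebase):

```python
SEAT_ANSWER = "Rows 5-8 are Economy Plus with extra legroom."
SEAT_KEYWORDS = ["seat", "seats", "seating", "plane"]  # note: no "row" or "legroom"

def faq_lookup(question: str) -> str:
    q = question.lower()
    if any(kw in q for kw in SEAT_KEYWORDS):
        return SEAT_ANSWER
    return "I'm sorry, I don't know."  # paraphrases fall through here

faq_lookup("What rows have extra legroom?")  # matches no keyword -> "I don't know"
```

An eval surfaces this as a low factuality score, but the fix is widening the keyword list (or replacing keyword matching entirely), not touching the prompt.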
Log every investigation in `pixie_qa/MEMORY.md` with this structure:

### Investigation: <test_name> failure — <date>
**Test**: `test_faq_factuality` in `pixie_qa/tests/test_customer_service.py`
**Result**: 3/5 cases passed (60%); pass criteria required 80% of cases ≥ 0.7
#### Failing case 1: "What rows have extra legroom?"
- **eval_input**: `{"user_message": "What rows have extra legroom?"}`
- **eval_output**: "I'm sorry, I don't have the exact row numbers for extra legroom..."
- **expected_output**: "rows 5-8 Economy Plus with extra legroom"
- **Evaluator score**: 0.1 (FactualityEval)
- **Evaluator reasoning**: "The output claims not to know the answer while the reference clearly states rows 5-8..."
**Trace analysis**:
Inspected trace `abc123`. The span tree shows:
1. Triage Agent routed to FAQ Agent ✓
2. FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓
3. `faq_lookup_tool` returned "I'm sorry, I don't know..." ← **root cause**
**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
The question "What rows have extra legroom?" contains none of these keywords, so it
falls through to the default "I don't know" response.
**Classification**: Non-LLM failure — the keyword-matching tool is broken.
The LLM agent correctly routed to the FAQ agent and used the tool; the tool
itself returned wrong data.
**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix,
not an eval/prompt change.
**Verification**: After fix, re-run:
```bash
python pixie_qa/scripts/build_dataset.py   # refresh dataset
pixie test pixie_qa/tests/ -k faq -v       # verify
```

After any fix, re-run the affected tests with `pixie test pixie_qa/tests/test_<feature>.py -v`.

For reference, the full `pixie_qa/MEMORY.md` template:

# Eval Notes: <Project Name>
## How the application works
### Entry point and execution flow
<How to start/run the app. Step-by-step flow from input to output.>
### Inputs to LLM calls
<For EACH LLM call, document: location in code, system prompt, dynamic content, available tools>
### Intermediate processing
<Steps between input and output: retrieval, routing, tool calls, etc. Code pointers for each.>
### Final output
<What the user sees. Format. Quality expectations.>
### Use cases
<Each scenario with examples of good/bad outputs:>
1. <Use case 1>: <description>
- Input example: ...
- Good output: ...
- Bad output: ...
## Evaluation plan
### What to evaluate and why
<Quality dimensions and rationale>
### Evaluators and criteria
| Test | Dataset | Evaluator | Criteria | Rationale |
| ---- | ------- | --------- | -------- | --------- |
| ... | ... | ... | ... | ... |
### Data needed for evaluation
<What data to capture, with code pointers>
## Datasets
| Dataset | Items | Purpose |
| ------- | ----- | ------- |
| ... | ... | ... |
## Investigation log
### <date> — <test_name> failure
<Full structured investigation as described in Stage 7>

See `references/pixie-api.md` for the complete pixie API reference.