Phoenix Evals


Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
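To make the code-first point concrete, here is a minimal sketch of a deterministic evaluator written as a plain function. The function name, the pass/fail dict shape, and the JSON check are illustrative assumptions, not an interface defined by these files.

```python
# Minimal sketch of a code-first evaluator: a deterministic pass/fail check
# that needs no LLM call. Function name and return shape are illustrative.
import json


def json_contract_eval(output: str) -> dict:
    """Pass if the model output is valid JSON containing the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {"label": "fail", "score": 0.0, "explanation": "output is not valid JSON"}
    missing = [k for k in ("answer", "sources") if k not in parsed]
    if missing:
        return {"label": "fail", "score": 0.0, "explanation": f"missing keys: {missing}"}
    return {"label": "pass", "score": 1.0, "explanation": "valid JSON with required keys"}
```

Checks like this cover everything deterministic; an LLM judge is added only for criteria code cannot express, and that judge is then validated against human labels (see validation-calibration-{python|typescript}).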

Quick Reference


| Task | Files |
|------|-------|
| Setup | setup-python, setup-typescript |
| Build code evaluator | evaluators-code-{python\|typescript} |
| Build LLM evaluator | evaluators-llm-{python\|typescript}, evaluators-custom-templates |
| Run experiment | experiments-running-{python\|typescript} |
| Create dataset | experiments-datasets-{python\|typescript} |
| Validate evaluator | validation, validation-calibration-{python\|typescript} |
| Analyze errors | error-analysis, axial-coding |
| RAG evals | evaluators-rag |
| Production | production-overview, production-guardrails |
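For the "Build LLM evaluator" and "evaluators-custom-templates" entries above, the sketch below shows the general shape of a binary LLM judge with a hand-written template. It assumes the `llm_classify` helper and `OpenAIModel` wrapper from `phoenix.evals` plus an `OPENAI_API_KEY` in the environment; treat the exact signature as an assumption and confirm it against evaluators-llm-python.

```python
# Sketch of a binary LLM judge with a custom template. The phoenix.evals
# llm_classify/OpenAIModel calls are assumptions about the library API;
# confirm the current signature in evaluators-llm-python before relying on it.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

FAITHFULNESS_TEMPLATE = """You are checking whether an answer is supported by the context.

[Context]: {reference}
[Answer]: {output}

Respond with exactly one word: "faithful" or "unfaithful"."""

df = pd.DataFrame(
    [{"reference": "The store opens at 9am on weekdays.", "output": "It opens at 9am."}]
)

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=FAITHFULNESS_TEMPLATE,
    rails=["faithful", "unfaithful"],  # constrain the judge to binary labels
    provide_explanation=True,
)
print(results[["label", "explanation"]])
```

Binary rails keep the judge consistent with the "Binary > Likert" principle below and make calibration against human labels straightforward.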

Workflows


- Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview
- Building Evaluator: fundamentals → evaluators-{code|llm}-{python|typescript} → validation-calibration-{python|typescript} (experiment loop sketched below)
- RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
- Production: production-overview → production-guardrails → production-continuous
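The experiment step in the Building Evaluator workflow is, at its core, a loop: run the task over a fixed dataset and apply every evaluator to each output. The framework-agnostic sketch below shows only that shape; the dataset rows, task, and helper names are invented for illustration, and experiments-running-{python|typescript} covers the real experiment APIs.

```python
# Framework-agnostic sketch of the experiment loop: task over a fixed dataset,
# evaluators over every output. All names here are illustrative assumptions;
# see experiments-running-python for the actual experiment runner.
from statistics import mean
from typing import Callable

dataset = [
    {"input": "What time does the store open?", "expected": "9am"},
    {"input": "Do you ship internationally?", "expected": "yes"},
]


def task(example: dict) -> str:
    """Placeholder for the LLM application under test."""
    return "9am" if "open" in example["input"] else "yes"


def exact_match(output: str, example: dict) -> dict:
    """Trivial code evaluator: pass/fail on exact match with the expected answer."""
    return {"score": 1.0 if output.strip().lower() == example["expected"] else 0.0}


def run_experiment(dataset: list[dict], task: Callable, evaluators: list[Callable]) -> dict:
    scores = {e.__name__: [] for e in evaluators}
    for example in dataset:
        output = task(example)
        for evaluator in evaluators:
            scores[evaluator.__name__].append(evaluator(output, example)["score"])
    return {name: mean(vals) for name, vals in scores.items()}


print(run_experiment(dataset, task, evaluators=[exact_match]))
```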

Rule Categories


| Prefix | Description |
|--------|-------------|
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Calibrating judges |
| production-* | CI/CD, monitoring |

Key Principles


| Principle | Action |
|-----------|--------|
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
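"Validate judges: >80% TPR/TNR" is a concrete arithmetic gate: compare the judge's binary labels against human labels on the same examples and require both the true-positive rate and the true-negative rate to exceed 0.8. A minimal sketch with invented labels (validation-calibration-{python|typescript} covers the full procedure):

```python
# Sketch of judge calibration against human labels: both TPR and TNR must
# exceed 0.8 before the judge is trusted. The labels below are invented.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail"]

tp = sum(h == "pass" and j == "pass" for h, j in zip(human, judge))
fn = sum(h == "pass" and j == "fail" for h, j in zip(human, judge))
tn = sum(h == "fail" and j == "fail" for h, j in zip(human, judge))
fp = sum(h == "fail" and j == "pass" for h, j in zip(human, judge))

tpr = tp / (tp + fn)  # of the examples humans passed, share the judge also passed
tnr = tn / (tn + fp)  # of the examples humans failed, share the judge also failed
print(f"TPR={tpr:.2f} TNR={tnr:.2f} calibrated={tpr > 0.8 and tnr > 0.8}")
```

Here the judge agrees on 6 of 8 examples yet still fails the gate (TPR and TNR are both 0.75), which is exactly the situation where the template or rails get revised before the judge is trusted.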