Build and run evaluators for AI/LLM applications using Phoenix.

Install with `npx skill4agent add github/awesome-copilot phoenix-evals`.

| Task | Files |
|---|---|
| Setup | setup-python, setup-typescript |
| Decide what to evaluate | evaluators-overview |
| Choose a judge model | fundamentals-model-selection |
| Use pre-built evaluators | evaluators-pre-built |
| Build code evaluator (sketch below) | evaluators-code-python, evaluators-code-typescript |
| Build LLM evaluator (sketch below) | evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates |
| Batch evaluate DataFrame | evaluate-dataframe-python |
| Run experiment (sketch below) | experiments-running-python, experiments-running-typescript |
| Create dataset | experiments-datasets-python, experiments-datasets-typescript |
| Generate synthetic data | experiments-synthetic-python, experiments-synthetic-typescript |
| Validate evaluator accuracy | validation, validation-evaluators-python, validation-evaluators-typescript |
| Sample traces for review | observe-sampling-python, observe-sampling-typescript |
| Analyze errors | error-analysis, error-analysis-multi-turn, axial-coding |
| RAG evals | evaluators-rag |
| Avoid common mistakes | common-mistakes-python, fundamentals-anti-patterns |
| Production | production-overview, production-guardrails, production-continuous |
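
For the "Build code evaluator" task, start with a deterministic check before reaching for an LLM judge (see the "Code first" principle below). A minimal sketch; the citation rule and the return shape are illustrative, not an API the skill prescribes:

```python
import re

def contains_citation(output: str) -> dict:
    """Deterministic code evaluator: pass if the answer cites at least one source.

    Returns a binary pass/fail result (see "Binary > Likert" below), which is
    easy to aggregate and to compare against human labels later.
    """
    # Illustrative rule: a citation is any bracketed reference like [1] or [doc-3].
    passed = bool(re.search(r"\[[^\]]+\]", output or ""))
    return {"label": "pass" if passed else "fail", "score": int(passed)}

print(contains_citation("Phoenix traces every span [doc-3]."))  # {'label': 'pass', 'score': 1}
print(contains_citation("Phoenix traces every span."))          # {'label': 'fail', 'score': 0}
```
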
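
For the "Build LLM evaluator" and "Batch evaluate DataFrame" tasks, phoenix.evals can grade an entire DataFrame against a custom binary template. A sketch assuming the arize-phoenix-evals `llm_classify` API (keyword names vary slightly across releases, and `OPENAI_API_KEY` must be set):

```python
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Custom binary template; {input} and {output} are filled from DataFrame columns.
CORRECTNESS_TEMPLATE = """You are grading an answer to a question.
Question: {input}
Answer: {output}
Respond with a single word: "correct" if the answer is factually correct and
responsive to the question, otherwise "incorrect"."""

df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "output": ["Paris is the capital of France."],
    }
)

results = llm_classify(
    df,                                      # one row per example to grade
    model=OpenAIModel(model="gpt-4o-mini"),  # judge model; needs OPENAI_API_KEY
    template=CORRECTNESS_TEMPLATE,
    rails=["correct", "incorrect"],          # constrain the judge to binary labels
    provide_explanation=True,                # keep the judge's reasoning for review
)
print(results[["label", "explanation"]])
```
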
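
For the "Create dataset" and "Run experiment" tasks, a sketch assuming arize-phoenix's datasets/experiments client (`px.Client().upload_dataset`, `phoenix.experiments.run_experiment`) and a running Phoenix server; `answer_question` is a stand-in for your own application:

```python
import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

def answer_question(question: str) -> str:
    # Stand-in for your application; replace with your real pipeline.
    return "Paris"

client = px.Client()  # assumes a Phoenix server is running and reachable

examples = pd.DataFrame(
    {
        "question": ["What is the capital of France?"],
        "expected_answer": ["Paris"],
    }
)

dataset = client.upload_dataset(
    dataset_name="qa-smoke-test",
    dataframe=examples,
    input_keys=["question"],
    output_keys=["expected_answer"],
)

def task(input):
    # Phoenix passes each example's input fields to the task.
    return answer_question(input["question"])

def matches_expected(output, expected):
    # Code evaluator applied to every experiment run: binary pass/fail score.
    return float(expected["expected_answer"].lower() in str(output).lower())

run_experiment(dataset, task, evaluators=[matches_expected], experiment_name="baseline")
```
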
| Prefix | Description |
|---|---|
| fundamentals | Types, scores, anti-patterns |
| observe | Tracing, sampling |
| error-analysis | Finding failures |
| axial-coding | Categorizing failures |
| evaluators | Code, LLM, RAG evaluators |
| experiments | Datasets, running experiments |
| validation | Validating evaluator accuracy against human labels |
| production | CI/CD, monitoring |

| Principle | Action |
|---|---|
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges (sketch below) | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
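
A minimal sketch of the "Validate judges" check: compare the judge's labels to a hand-labeled sample and compute true positive and true negative rates. The labels below are illustrative; real validation sets should be much larger:

```python
import pandas as pd

# Judge labels vs. human labels on the same hand-reviewed examples.
labels = pd.DataFrame(
    {
        "human": ["pass", "pass", "fail", "fail", "pass", "fail"],
        "judge": ["pass", "fail", "fail", "fail", "pass", "pass"],
    }
)

tp = ((labels.human == "pass") & (labels.judge == "pass")).sum()
fn = ((labels.human == "pass") & (labels.judge == "fail")).sum()
tn = ((labels.human == "fail") & (labels.judge == "fail")).sum()
fp = ((labels.human == "fail") & (labels.judge == "pass")).sum()

tpr = tp / (tp + fn)  # judge agreement with human "pass"
tnr = tn / (tn + fp)  # judge agreement with human "fail"
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")  # trust the judge only when both exceed 0.80
```
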