Arize Evaluator Skill
This skill covers designing, creating, and running LLM-as-judge evaluators on Arize. An evaluator defines the judge; a task is how you run it against real data.
Prerequisites
Three things are needed: the `ax` CLI, an API key (env var or profile), and a space ID. If `ax` is not installed, not on PATH, or below the required version, see ax-setup.md.
macOS/Linux:
bash
ax --version && echo "--- env ---" && if [ -n "$ARIZE_API_KEY" ]; then echo "ARIZE_API_KEY: (set)"; else echo "ARIZE_API_KEY: (not set)"; fi && echo "ARIZE_SPACE_ID: ${ARIZE_SPACE_ID:-(not set)}" && echo "--- profiles ---" && ax profiles show 2>&1
Windows (PowerShell):
powershell
ax --version; Write-Host "--- env ---"; Write-Host "ARIZE_API_KEY: $(if ($env:ARIZE_API_KEY) { '(set)' } else { '(not set)' })"; Write-Host "ARIZE_SPACE_ID: $env:ARIZE_SPACE_ID"; Write-Host "--- profiles ---"; ax profiles show 2>&1
Proceed immediately if env var or profile has an API key. Only ask the user if both are missing.
- No API key in env and no profile → AskQuestion: "Arize API key (https://app.arize.com/admin > API Keys)"
- Space ID unknown → list all accessible spaces with the `ax` CLI, or AskQuestion
Concepts
What is an Evaluator?
An evaluator is an LLM-as-judge definition. It contains:
| Field | Description |
|---|---|
| Template | The judge prompt. Uses placeholders (e.g. `{input}`, `{output}`, `{context}`) that get filled in at run time via a task's column mappings. |
| Classification choices | The set of allowed output labels (e.g. `correct` / `incorrect`). Binary is the default and most common. Each choice can optionally carry a numeric score. |
| AI Integration | Stored LLM provider credentials (OpenAI, Anthropic, Bedrock, etc.) the evaluator uses to call the judge model. |
| Model | The specific judge model (e.g. `gpt-4o`). |
| Invocation params | Optional JSON of model settings like `temperature`. Low temperature is recommended for reproducibility. |
| Optimization direction | Whether higher scores are better or worse. Sets how the UI renders trends. |
| Data granularity | Whether the evaluator runs at the span, trace, or session level. Most evaluators run at the span level. |
Evaluators are versioned — every prompt or model change creates a new immutable version. The most recent version is active.
What is a Task?
A task is how you run one or more evaluators against real data. Tasks are attached to a project (live traces/spans) or a dataset (experiment runs). A task contains:
| Field | Description |
|---|---|
| Evaluators | List of evaluators to run. You can run multiple in one task. |
| Column mappings | Maps each evaluator's template variables to actual field paths on spans or experiment runs (e.g. `"input" → "attributes.input.value"`). This is what makes evaluators portable across projects and experiments. |
| Query filter | SQL-style expression to select which spans/runs to evaluate (e.g. `span_kind = 'LLM'`). Optional but important for precision. |
| Continuous | For project tasks: whether to automatically score new spans as they arrive. |
| Sampling rate | For continuous project tasks: fraction of new spans to evaluate (0–1). |
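To make the relationship between a template and a task's column mappings concrete, here is a minimal sketch of the substitution step. It is an illustration only, not Arize's implementation: `resolve_path` and `fill_template` are hypothetical helpers, the sample span is made up, and treating numeric path segments as list indexes is an assumption based on paths such as `attributes.llm.output_messages.0.message.content`.

```python
import re
from typing import Any

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def resolve_path(record: Any, path: str) -> Any:
    """Walk a dotted path; numeric segments index into lists (assumption)."""
    current = record
    for segment in path.split("."):
        if isinstance(current, list):
            current = current[int(segment)]
        else:
            current = current[segment]
    return current


def fill_template(template: str, column_mappings: dict, record: dict) -> str:
    """Replace each {variable} with the value found at its mapped path."""
    def substitute(match):
        variable = match.group(1)
        return str(resolve_path(record, column_mappings[variable]))
    return PLACEHOLDER.sub(substitute, template)


# Made-up span, used only for illustration.
span = {"attributes": {"input": {"value": "What is 2 + 2?"},
                       "output": {"value": "4"}}}
mappings = {"input": "attributes.input.value",
            "output": "attributes.output.value"}
prompt = fill_template("Question: {input}\nAnswer: {output}", mappings, span)
print(prompt)
```

Because the template only names `{input}` and `{output}`, the same evaluator can be pointed at a different project by changing the mappings alone.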
Basic CRUD
AI Integrations
AI integrations store the LLM provider credentials the evaluator uses. For full CRUD — listing, creating for all providers (OpenAI, Anthropic, Azure, Bedrock, Vertex, Gemini, NVIDIA NIM, custom), updating, and deleting — use the arize-ai-provider-integration skill.
Quick reference for the common case (OpenAI):
bash
# Check for an existing integration first
ax ai-integrations list --space-id SPACE_ID
# Create if none exists
ax ai-integrations create \
--name "My OpenAI Integration" \
--provider openAI \
--api-key "sk-..."
Copy the returned integration ID; it is required for `ax evaluators create --ai-integration-id`.
Evaluators
bash
# List / Get
ax evaluators list --space-id SPACE_ID
ax evaluators get EVALUATOR_ID
ax evaluators list-versions EVALUATOR_ID
ax evaluators get-version VERSION_ID
# Create (creates the evaluator and its first version)
ax evaluators create \
--name "Answer Correctness" \
--space-id SPACE_ID \
--description "Judges if the model answer is correct" \
--template-name "correctness" \
--commit-message "Initial version" \
--ai-integration-id INT_ID \
--model-name "gpt-4o" \
--include-explanations \
--use-function-calling \
--template 'You are an evaluator. Given the user question and the model response, decide if the response correctly answers the question.
User question: {input}
Model response: {output}
Respond with exactly one of these labels: correct, incorrect'
# Create a new version (for prompt or model changes — versions are immutable)
ax evaluators create-version EVALUATOR_ID \
--commit-message "Added context grounding" \
--template-name "correctness" \
--ai-integration-id INT_ID \
--model-name "gpt-4o" \
--include-explanations \
--template 'Updated prompt...
{input} / {output} / {context}'
# Update metadata only (name, description — not prompt)
ax evaluators update EVALUATOR_ID \
--name "New Name" \
--description "Updated description"
# Delete (permanent — removes all versions)
ax evaluators delete EVALUATOR_ID
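| Flag | Required | Description |
|---|---|---|
| `--name` | yes | Evaluator name (unique within space) |
| `--space-id` | yes | Space to create in |
| `--template-name` | yes | Eval column name: alphanumeric, spaces, hyphens, underscores |
| `--commit-message` | yes | Description of this version |
| `--ai-integration-id` | yes | AI integration ID (from above) |
| `--model-name` | yes | Judge model (e.g. `gpt-4o`) |
| `--template` | yes | Prompt with `{variable}` placeholders (single-quoted in bash) |
| `--description` | no | Human-readable description |
| `--include-explanations` | no | Include reasoning alongside the label |
| `--use-function-calling` | no | Prefer structured function-call output |
| `--invocation-params` | no | JSON of model params e.g. `{"temperature": 0}` |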
Tasks
bash
# List / Get
ax tasks list --space-id SPACE_ID
ax tasks list --project-id PROJ_ID
ax tasks list --dataset-id DATASET_ID
ax tasks get TASK_ID
# Create (project — continuous)
ax tasks create \
--name "Correctness Monitor" \
--task-type template_evaluation \
--project-id PROJ_ID \
--evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
--is-continuous \
--sampling-rate 0.1
# Create (project — one-time / backfill)
ax tasks create \
--name "Correctness Backfill" \
--task-type template_evaluation \
--project-id PROJ_ID \
--evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
--no-continuous
# Create (experiment / dataset)
ax tasks create \
--name "Experiment Scoring" \
--task-type template_evaluation \
--dataset-id DATASET_ID \
--experiment-ids "EXP_ID_1,EXP_ID_2" \
--evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"output": "output"}}]' \
--no-continuous
# Trigger a run (project task — use data window)
ax tasks trigger-run TASK_ID \
--data-start-time "2026-03-20T00:00:00" \
--data-end-time "2026-03-21T23:59:59" \
--wait
# Trigger a run (experiment task — use experiment IDs)
ax tasks trigger-run TASK_ID \
--experiment-ids "EXP_ID_1" \
--wait
# Monitor
ax tasks list-runs TASK_ID
ax tasks get-run RUN_ID
ax tasks wait-for-run RUN_ID --timeout 300
ax tasks cancel-run RUN_ID --force
Time format for trigger-run: follow the examples above (e.g. `2026-03-20T00:00:00`), with no trailing timezone suffix.
Additional trigger-run flags:
| Flag | Description |
|---|---|
| | Cap processed spans (default 10,000) |
| | Re-score spans that already have labels |
| `--wait` | Block until the run finishes |
| | Seconds to wait with `--wait` (default 600) |
| | Poll interval in seconds when waiting (default 5) |
Run status guide:
| Status | Meaning |
|---|---|
| , 0 spans | No spans in eval index for that window — widen time range |
| ~1s | Integration credentials invalid |
| ~3min | Found spans but LLM call failed — check model name or key |
| , N > 0 | Success — check scores in UI |
Workflow A: Create an evaluator for a project
Use this when the user says something like "create an evaluator for my Playground Traces project".
Step 1: Resolve the project name to an ID
The CLI requires a project ID, not a name; passing a name causes a validation error. Always look up the ID first:
bash
ax projects list --space-id SPACE_ID -o json
Find the entry whose name matches (case-insensitive) and copy its ID (a base64 string).
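The lookup can be scripted instead of done by eye. The sketch below assumes the JSON output is an array of objects with `name` and `id` keys; those key names, the sample IDs, and the `find_project_id` helper are all hypothetical, so check them against your own `-o json` output first.

```python
import json


def find_project_id(projects_json: str, name: str) -> str:
    """Return the ID of the project whose name matches, ignoring case."""
    projects = json.loads(projects_json)
    matches = [p for p in projects if p["name"].casefold() == name.casefold()]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one project named {name!r}, found {len(matches)}"
        )
    return matches[0]["id"]


# Made-up sample standing in for the CLI output (assumed shape).
sample = json.dumps([
    {"id": "UHJvamVjdDox", "name": "Playground Traces"},
    {"id": "UHJvamVjdDoy", "name": "Checkout Bot"},
])
print(find_project_id(sample, "playground traces"))
```

Raising on zero or multiple matches keeps an ambiguous name from silently selecting the wrong project.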
Step 2: Understand what to evaluate
If the user specified the evaluator type (hallucination, correctness, relevance, etc.) → skip to Step 3.
If not, sample recent spans to base the evaluator on actual data:
bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 10 --days 30 --stdout
Inspect the input and output values, span kinds, and any existing annotations. Identify failure modes (e.g. hallucinated facts, off-topic answers, missing context) and propose 1–3 concrete evaluator ideas. Let the user pick.
Each suggestion must include: the evaluator name (bold), a one-sentence description of what it judges, and the binary label pair in parentheses. Format each like:
- **Name** — Description of what is being judged. (label A / label B)
Example:
- **Response Correctness** — Does the agent's response correctly address the user's financial query? (`correct` / `incorrect`)
- **Hallucination** — Does the response fabricate facts not grounded in retrieved context? (`factual` / `hallucinated`)
Step 3: Confirm or create an AI integration
bash
ax ai-integrations list --space-id SPACE_ID -o json
If a suitable integration exists, note its ID. If not, create one using the arize-ai-provider-integration skill. Ask the user which provider/model they want for the judge.
Step 4: Create the evaluator
Use the template design best practices below. Keep the evaluator name and variables generic; the task handles project-specific wiring via `column_mappings` (Steps 6 and 7).
bash
ax evaluators create \
--name "Hallucination" \
--space-id SPACE_ID \
--template-name "hallucination" \
--commit-message "Initial version" \
--ai-integration-id INT_ID \
--model-name "gpt-4o" \
--include-explanations \
--use-function-calling \
--template 'You are an evaluator. Given the user question and the model response, decide if the response is factual or contains unsupported claims.
User question: {input}
Model response: {output}
Respond with exactly one of these labels: hallucinated, factual'
Step 5: Ask — backfill, continuous, or both?
Before creating the task, ask:
"Would you like to:
(a) Run a backfill on historical spans (one-time)?
(b) Set up continuous evaluation on new spans going forward?
(c) Both — backfill now and keep scoring new spans automatically?"
Step 6: Determine column mappings from real span data
Do not guess paths. Pull a sample and inspect what fields are actually present:
bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 5 --days 7 --stdout
For each template variable (`{input}`, `{output}`, `{context}`), find the matching JSON path. The paths below are common starting points; always verify them on your actual data before using:
| Template var | LLM span | CHAIN span |
|---|---|---|
| `input` | `attributes.input.value` | |
| `output` | `attributes.llm.output_messages.0.message.content` | |
| `context` | `attributes.retrieval.documents.contents` | — |
| (fallback) | | |
Validate span kind alignment: If the evaluator prompt assumes LLM final text but the task targets CHAIN spans (or vice versa), runs can cancel or score the wrong text. Make sure the `query_filter` on the task matches the span kind you mapped.
json
[
{
"evaluator_id": "EVAL_ID",
"query_filter": "span_kind = 'LLM'",
"column_mappings": {
"input": "attributes.input.value",
"output": "attributes.llm.output_messages.0.message.content",
"context": "attributes.retrieval.documents.contents"
}
}
]
Include a mapping for every variable the template references. Omitting one causes runs to produce no valid scores.
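A quick pre-flight check can catch a missing mapping before a run is triggered. This is a minimal sketch that assumes placeholders are single-brace names such as `{input}`; `missing_mappings` is a hypothetical helper, not part of the `ax` CLI.

```python
import re


def missing_mappings(template: str, column_mappings: dict) -> set:
    """Return template variables that have no entry in column_mappings."""
    variables = set(re.findall(r"\{(\w+)\}", template))
    return variables - set(column_mappings)


template = "Question: {input}\nAnswer: {output}\nContext: {context}"
mappings = {
    "input": "attributes.input.value",
    "output": "attributes.llm.output_messages.0.message.content",
}
print(sorted(missing_mappings(template, mappings)))  # prints ['context']
```

An empty result means every variable the template references has a mapping; it does not prove the mapped paths exist on your spans.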
Step 7: Create the task
Backfill only (a):
bash
ax tasks create \
--name "Hallucination Backfill" \
--task-type template_evaluation \
--project-id PROJECT_ID \
--evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
--no-continuous
Continuous only (b):
bash
ax tasks create \
--name "Hallucination Monitor" \
--task-type template_evaluation \
--project-id PROJECT_ID \
--evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
--is-continuous \
--sampling-rate 0.1
Both (c): Use `--is-continuous` on create, then also trigger a backfill run in Step 8.
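Hand-writing the `--evaluators` JSON inside single quotes gets awkward once a `query_filter` such as `span_kind = 'LLM'` brings its own single quotes. One option, sketched here, is to build the value with `json.dumps` and let `shlex.quote` produce a shell-safe string. The keys mirror the JSON shown in Step 6; `build_evaluators_arg` is a hypothetical helper.

```python
import json
import shlex
from typing import Optional


def build_evaluators_arg(evaluator_id: str, column_mappings: dict,
                         query_filter: Optional[str] = None) -> str:
    """Serialize one evaluator entry as the JSON array --evaluators expects."""
    entry = {"evaluator_id": evaluator_id, "column_mappings": column_mappings}
    if query_filter is not None:
        entry["query_filter"] = query_filter
    return json.dumps([entry])


arg = build_evaluators_arg(
    "EVAL_ID",
    {"input": "attributes.input.value", "output": "attributes.output.value"},
    query_filter="span_kind = 'LLM'",
)
# shlex.quote escapes the embedded single quotes for pasting into bash.
print(shlex.quote(arg))
```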
Step 8: Trigger a backfill run (if requested)
First find what time range has data:
bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 100 --days 1 --stdout # try last 24h first
ax spans export PROJECT_ID --space-id SPACE_ID -l 100 --days 7 --stdout # widen if empty
Use the start and end timestamps from real spans to set the window. Use the most recent data for your first test run.
bash
ax tasks trigger-run TASK_ID \
--data-start-time "2026-03-20T00:00:00" \
--data-end-time "2026-03-21T23:59:59" \
--wait
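The window can be derived from the exported spans rather than typed by hand. The sketch below makes several assumptions: each span carries an ISO 8601 timestamp string without a timezone suffix under a `start_time` key (a hypothetical field name), and the output format matches the trigger-run examples above. `backfill_window` is a hypothetical helper.

```python
from datetime import datetime, timedelta

TRIGGER_FORMAT = "%Y-%m-%dT%H:%M:%S"  # matches the examples, no timezone suffix


def backfill_window(spans: list, time_field: str = "start_time") -> tuple:
    """Return (start, end) strings that cover every span timestamp."""
    times = [datetime.fromisoformat(span[time_field]) for span in spans]
    start = min(times)
    # Add one second so fractional seconds on the last span stay inside the window.
    end = max(times) + timedelta(seconds=1)
    return start.strftime(TRIGGER_FORMAT), end.strftime(TRIGGER_FORMAT)


# Made-up timestamps, used only for illustration.
spans = [
    {"start_time": "2026-03-21T17:45:10"},
    {"start_time": "2026-03-20T08:15:30.250000"},
]
print(backfill_window(spans))
```

The two returned strings can be passed to `--data-start-time` and `--data-end-time`.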
Workflow B: Create an evaluator for an experiment
Use this when the user says something like "create an evaluator for my experiment" or "evaluate my dataset runs".
If the user says "dataset" but doesn't have an experiment: A task must target an experiment (not a bare dataset). Ask:
"Evaluation tasks run against experiment runs, not datasets directly. Would you like help creating an experiment on that dataset first?"
If yes, use the arize-experiment skill to create one, then return here.
Step 1: Resolve dataset and experiment
bash
ax datasets list --space-id SPACE_ID -o json
ax experiments list --dataset-id DATASET_ID -o json
Note the dataset ID and the experiment ID(s) to score.
Step 2: Understand what to evaluate
If the user specified the evaluator type → skip to Step 3.
If not, inspect a recent experiment run to base the evaluator on actual data:
bash
ax experiments export EXPERIMENT_ID --stdout | python3 -c "import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))"
Look at the fields present on the run. Identify gaps (metrics the user cares about but doesn't have yet) and propose 1–3 evaluator ideas. Each suggestion must include: the evaluator name (bold), a one-sentence description, and the binary label pair in parentheses, in the same format as Workflow A, Step 2.
Step 3: Confirm or create an AI integration
Same as Workflow A, Step 3.
Step 4: Create the evaluator
Same as Workflow A, Step 4. Keep variables generic.
Step 5: Determine column mappings from real run data
Run data shape differs from span data. Inspect:
bash
ax experiments export EXPERIMENT_ID --stdout | python3 -c "import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))"
Common mapping for experiment runs:
- `output` → `output` (top-level field on each run)
- `input` → check if it's on the run or embedded in the linked dataset examples
If `input` is not on the run JSON, export dataset examples to find the path:
bash
ax datasets export DATASET_ID --stdout | python3 -c "import sys,json; ex=json.load(sys.stdin); print(json.dumps(ex[0], indent=2))"
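When a run or dataset example is deeply nested, listing every leaf path makes it easier to pick column mappings without guessing. This is a minimal sketch: `iter_paths` is a hypothetical helper, the sample run is made up, and joining list indexes with dots is an assumption based on the span paths used earlier in this document.

```python
from typing import Any, Iterator


def iter_paths(value: Any, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every leaf value in nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_paths(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from iter_paths(child, f"{prefix}.{index}" if prefix else str(index))
    else:
        yield prefix


# Made-up run record; real exports will have different fields.
run = {"output": "Paris",
       "example": {"input": "Capital of France?", "tags": ["geo"]}}
paths = sorted(iter_paths(run))
print(paths)  # prints ['example.input', 'example.tags.0', 'output']
```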
Step 6: Create the task
bash
ax tasks create \
--name "Experiment Correctness" \
--task-type template_evaluation \
--dataset-id DATASET_ID \
--experiment-ids "EXP_ID" \
--evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"output": "output"}}]' \
--no-continuous
Step 7: Trigger and monitor
bash
ax tasks trigger-run TASK_ID \
--experiment-ids "EXP_ID" \
--wait
ax tasks list-runs TASK_ID
ax tasks get-run RUN_ID
Best Practices for Template Design
1. Use generic, portable variable names
Use `input`, `output`, and `context`, not names tied to a specific project or span attribute. The evaluator itself stays abstract; the task's `column_mappings` is where you wire it to the actual fields in a specific project or experiment. This lets the same evaluator run across multiple projects and experiments without modification.
2. Default to binary labels
Use exactly two clear string labels (e.g. `correct` / `incorrect`, `hallucinated` / `factual`). Binary labels are:
- Easiest for the judge model to produce consistently
- Most common in the industry
- Simplest to interpret in dashboards
If the user insists on more than two choices, that's fine — but recommend binary first and explain the tradeoff (more labels → more ambiguity → lower inter-rater reliability).
3. Be explicit about what the model must return
The template must tell the judge model to respond with only the label string — nothing else. The label strings in the prompt must exactly match the classification choices configured in the UI (same spelling, same casing).
Good:
Respond with exactly one of these labels: hallucinated, factual
Bad (too open-ended):
Is this hallucinated? Answer yes or no.
4. Keep temperature low
Pass `--invocation-params '{"temperature": 0}'` for reproducible scoring. Higher temperatures introduce noise into evaluation results.
5. Use `--include-explanations` for debugging
During initial setup, always include explanations so you can verify the judge is reasoning correctly before trusting the labels at scale.
6. Pass the template in single quotes in bash
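Single quotes pass the template to the CLI verbatim, so `{variable}` placeholders and any `$` or backticks in the prompt arrive unchanged. Inside double quotes the shell still expands `$` and backticks, which can silently corrupt the prompt:
bash
# Correct: single quotes, nothing is expanded
--template 'Judge this: {input} → {output}'
# Risky: $ and backticks inside double quotes are expanded by the shell
--template "Judge this: {input} → {output}"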
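When the command is launched from a script rather than typed in a terminal, shell quoting can be sidestepped entirely by passing an argument list. The sketch below only builds the list; the flags are the ones shown in the Evaluators section, while `evaluator_create_argv` and the sample template are hypothetical.

```python
import shlex


def evaluator_create_argv(name: str, space_id: str, template_name: str,
                          commit_message: str, ai_integration_id: str,
                          model_name: str, template: str) -> list:
    """Build the argument list for `ax evaluators create` (required flags only)."""
    return [
        "ax", "evaluators", "create",
        "--name", name,
        "--space-id", space_id,
        "--template-name", template_name,
        "--commit-message", commit_message,
        "--ai-integration-id", ai_integration_id,
        "--model-name", model_name,
        "--template", template,
    ]


template = ("Prices such as $5 must be quoted exactly. Judge this: {input} -> {output}\n"
            "Respond with exactly one of these labels: correct, incorrect")
argv = evaluator_create_argv("Answer Correctness", "SPACE_ID", "correctness",
                             "Initial version", "INT_ID", "gpt-4o", template)
# subprocess.run(argv, check=True) would hand the template over untouched,
# because no shell parses the arguments.
print(shlex.join(argv))  # the equivalent, safely quoted shell command
```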
7. Validate that choices match your template labels
Reconcile the prompt labels and the Choices panel so they stay in sync. Mismatches cause runs to produce no valid scores.
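A small check can confirm the two stay in sync. This sketch assumes the prompt uses the exact instruction sentence from this document's examples and that `choices` is a hand-copied list of the configured classification choices; both helpers are hypothetical.

```python
MARKER = "Respond with exactly one of these labels:"


def labels_in_prompt(template: str) -> list:
    """Extract the comma-separated labels that follow the marker sentence."""
    for line in template.splitlines():
        stripped = line.strip()
        if stripped.startswith(MARKER):
            return [label.strip()
                    for label in stripped[len(MARKER):].split(",")
                    if label.strip()]
    return []


def label_mismatches(template: str, choices: list) -> set:
    """Return labels present on only one side (comparison is case-sensitive)."""
    return set(labels_in_prompt(template)) ^ set(choices)


template = ("Decide if the response is factual.\n"
            "Respond with exactly one of these labels: hallucinated, factual")
print(label_mismatches(template, ["hallucinated", "factual"]))  # prints set()
print(sorted(label_mismatches(template, ["Hallucinated", "factual"])))
```

The second call reports both spellings, which points directly at a casing mismatch.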
Troubleshooting
| Problem | Solution |
|---|---|
| | See ax-setup.md |
| | API key may not have access to this space. Verify at https://app.arize.com/admin > API Keys |
| | `ax evaluators list --space-id SPACE_ID` |
| | `ax ai-integrations list --space-id SPACE_ID` |
| | `ax tasks list --space-id SPACE_ID` |
| `project-id and dataset-id are mutually exclusive` | Use only one when creating a task |
| `experiment-ids required for dataset tasks` | Add `--experiment-ids` to `ax tasks create` and `ax tasks trigger-run` |
| `sampling-rate only valid for project tasks` | Remove `--sampling-rate` from dataset tasks |
| Validation error on the project ID | Pass project ID (base64), not project name; look it up via `ax projects list --space-id SPACE_ID -o json` |
| Template validation errors | Use a single-quoted `--template` in bash; single braces (`{input}`), not double |
| Run stuck | Check the run status; then `ax tasks cancel-run RUN_ID` |
| Run ends in ~1s | Integration credentials invalid; check AI integration |
| Run ends in ~3min | Found spans but LLM call failed; wrong model name or bad key |
| Run reports 0 spans | Widen time window; eval index may not cover older data |
| No scores in UI | Fix `column_mappings` to match real paths on your spans/runs |
| Scores look wrong | Add `--include-explanations` and inspect judge reasoning on a few samples |
| Evaluator cancels on wrong span kind | Match `query_filter` and `column_mappings` to LLM vs CHAIN spans |
| Time format error on trigger-run | Use the format from the examples (e.g. `2026-03-20T00:00:00`), with no trailing timezone suffix |
Related Skills
- arize-ai-provider-integration: Full CRUD for LLM provider integrations (create, update, delete credentials)
- arize-trace: Export spans to discover column paths and time ranges
- arize-experiment: Create experiments and export runs for experiment column mappings
- arize-dataset: Export dataset examples to find input fields when runs omit them
- arize-link: Deep links to evaluators and tasks in the Arize UI
Save Credentials for Future Use
At the end of the session, if the user manually provided any credentials during this conversation and those values were NOT already loaded from a saved profile or environment variable, offer to save them.
Skip this entirely if:
- The API key was already loaded from an existing profile or env var
- The space ID was already set via env var
How to offer: Use AskQuestion: "Would you like to save your Arize credentials so you don't have to enter them next time?" with a yes option and a no option.
If the user says yes:
- API key: See ax-profiles.md. Run `ax profiles show` to check the current state, then use the profile commands documented there with the appropriate flags to save the key (and region if relevant).
- Space ID: See ax-profiles.md (Space ID section) to persist it as an environment variable.