Arize Evaluator Skill
This skill covers designing, creating, and running LLM-as-judge evaluators on Arize. An evaluator defines the judge; a task is how you run it against real data.
Prerequisites
Three things are needed: the `ax` CLI, an API key (env var or profile), and a space ID.
If `ax` is not installed, not on PATH, or below version `0.8.0`, see ax-setup.md.

macOS/Linux:

```bash
ax --version && echo "--- env ---" && if [ -n "$ARIZE_API_KEY" ]; then echo "ARIZE_API_KEY: (set)"; else echo "ARIZE_API_KEY: (not set)"; fi && echo "ARIZE_SPACE_ID: ${ARIZE_SPACE_ID:-(not set)}" && echo "--- profiles ---" && ax profiles show 2>&1
```

Windows (PowerShell):

```powershell
ax --version; Write-Host "--- env ---"; Write-Host "ARIZE_API_KEY: $(if ($env:ARIZE_API_KEY) { '(set)' } else { '(not set)' })"; Write-Host "ARIZE_SPACE_ID: $env:ARIZE_SPACE_ID"; Write-Host "--- profiles ---"; ax profiles show 2>&1
```

Proceed immediately if env var or profile has an API key. Only ask the user if both are missing.

- No API key in env and no profile → AskQuestion: "Arize API key (https://app.arize.com/admin > API Keys)"
- Space ID unknown → run `ax spaces list -o json` to list all accessible spaces, or AskQuestion
Concepts
What is an Evaluator?
An evaluator is an LLM-as-judge definition. It contains:
| Field | Description |
|---|---|
| Template | The judge prompt. Uses `{variable}` placeholders such as `{input}` and `{output}`. |
| Classification choices | The set of allowed output labels (e.g. `correct` / `incorrect`). |
| AI Integration | Stored LLM provider credentials (OpenAI, Anthropic, Bedrock, etc.) the evaluator uses to call the judge model. |
| Model | The specific judge model (e.g. `gpt-4o`). |
| Invocation params | Optional JSON of model settings like `temperature`. |
| Optimization direction | Whether higher scores are better or worse. |
| Data granularity | Whether the evaluator runs at the span, trace, or session level. Most evaluators run at the span level. |
Evaluators are versioned — every prompt or model change creates a new immutable version. The most recent version is active.
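Because the most recent version is the active one, picking it out of JSON output is a small exercise. A minimal sketch, assuming `ax evaluators list-versions EVALUATOR_ID -o json` returns a list of version objects carrying an ISO-8601 `created_at` field (the field name is an assumption; verify against your actual CLI output):

```python
import json
from datetime import datetime

def active_version(versions_json: str) -> dict:
    """Return the most recently created version, i.e. the active one.

    Assumes each version object carries an ISO-8601 'created_at' field;
    adjust the key to match the real `ax` JSON output.
    """
    versions = json.loads(versions_json)
    return max(versions, key=lambda v: datetime.fromisoformat(v["created_at"]))

# Hypothetical payload shaped like `ax evaluators list-versions EVAL_ID -o json`
sample = json.dumps([
    {"id": "v1", "created_at": "2026-03-01T10:00:00", "commit_message": "Initial version"},
    {"id": "v2", "created_at": "2026-03-20T09:30:00", "commit_message": "Added context grounding"},
])
print(active_version(sample)["id"])  # → v2
```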
What is a Task?
A task is how you run one or more evaluators against real data. Tasks are attached to a project (live traces/spans) or a dataset (experiment runs). A task contains:
| Field | Description |
|---|---|
| Evaluators | List of evaluators to run. You can run multiple in one task. |
| Column mappings | Maps each evaluator's template variables to actual field paths on spans or experiment runs (e.g. `attributes.input.value`). |
| Query filter | SQL-style expression to select which spans/runs to evaluate (e.g. `span_kind = 'LLM'`). |
| Continuous | For project tasks: whether to automatically score new spans as they arrive. |
| Sampling rate | For continuous project tasks: fraction of new spans to evaluate (0–1). |
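To illustrate what the sampling rate means (this is a model of the semantics, not the platform's actual implementation), a per-span decision at rate 0.1 scores roughly one new span in ten:

```python
import random

def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Evaluate a span with probability `sampling_rate` (0-1)."""
    return rng.random() < sampling_rate

rng = random.Random(42)
# rate 1.0 scores every span; rate 0.0 scores none
assert all(should_evaluate(1.0, rng) for _ in range(100))
assert not any(should_evaluate(0.0, rng) for _ in range(100))
print("sampling semantics hold")
```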
Basic CRUD
AI Integrations
AI integrations store the LLM provider credentials the evaluator uses. For full CRUD — listing, creating for all providers (OpenAI, Anthropic, Azure, Bedrock, Vertex, Gemini, NVIDIA NIM, custom), updating, and deleting — use the arize-ai-provider-integration skill.
Quick reference for the common case (OpenAI):
```bash
# Check for an existing integration first
ax ai-integrations list --space-id SPACE_ID

# Create if none exists
ax ai-integrations create \
  --name "My OpenAI Integration" \
  --provider openAI \
  --api-key "sk-..."
```

Copy the returned integration ID — it is required for `ax evaluators create --ai-integration-id`.

Evaluators
```bash
# List / Get
ax evaluators list --space-id SPACE_ID
ax evaluators get EVALUATOR_ID
ax evaluators list-versions EVALUATOR_ID
ax evaluators get-version VERSION_ID

# Create (creates the evaluator and its first version)
ax evaluators create \
  --name "Answer Correctness" \
  --space-id SPACE_ID \
  --description "Judges if the model answer is correct" \
  --template-name "correctness" \
  --commit-message "Initial version" \
  --ai-integration-id INT_ID \
  --model-name "gpt-4o" \
  --include-explanations \
  --use-function-calling \
  --template 'You are an evaluator. Given the user question and the model response, decide if the response correctly answers the question.
User question: {input}
Model response: {output}
Respond with exactly one of these labels: correct, incorrect'

# Create a new version (for prompt or model changes — versions are immutable)
ax evaluators create-version EVALUATOR_ID \
  --commit-message "Added context grounding" \
  --template-name "correctness" \
  --ai-integration-id INT_ID \
  --model-name "gpt-4o" \
  --include-explanations \
  --template 'Updated prompt...
{input} / {output} / {context}'

# Update metadata only (name, description — not prompt)
ax evaluators update EVALUATOR_ID \
  --name "New Name" \
  --description "Updated description"

# Delete (permanent — removes all versions)
ax evaluators delete EVALUATOR_ID
```

**Key flags for `create`:**

| Flag | Required | Description |
|------|----------|-------------|
| `--name` | yes | Evaluator name (unique within space) |
| `--space-id` | yes | Space to create in |
| `--template-name` | yes | Eval column name — alphanumeric, spaces, hyphens, underscores |
| `--commit-message` | yes | Description of this version |
| `--ai-integration-id` | yes | AI integration ID (from above) |
| `--model-name` | yes | Judge model (e.g. `gpt-4o`) |
| `--template` | yes | Prompt with `{variable}` placeholders (single-quoted in bash) |
| `--description` | no | Human-readable description |
| `--include-explanations` | no | Include reasoning alongside the label |
| `--use-function-calling` | no | Prefer structured function-call output |
| `--invocation-params` | no | JSON of model params e.g. `'{"temperature": 0}'` |
Tasks
```bash
# List / Get
ax tasks list --space-id SPACE_ID
ax tasks list --project-id PROJ_ID
ax tasks list --dataset-id DATASET_ID
ax tasks get TASK_ID

# Create (project — continuous)
ax tasks create \
  --name "Correctness Monitor" \
  --task-type template_evaluation \
  --project-id PROJ_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --is-continuous \
  --sampling-rate 0.1

# Create (project — one-time / backfill)
ax tasks create \
  --name "Correctness Backfill" \
  --task-type template_evaluation \
  --project-id PROJ_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --no-continuous

# Create (experiment / dataset)
ax tasks create \
  --name "Experiment Scoring" \
  --task-type template_evaluation \
  --dataset-id DATASET_ID \
  --experiment-ids "EXP_ID_1,EXP_ID_2" \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"output": "output"}}]' \
  --no-continuous

# Trigger a run (project task — use data window)
ax tasks trigger-run TASK_ID \
  --data-start-time "2026-03-20T00:00:00" \
  --data-end-time "2026-03-21T23:59:59" \
  --wait

# Trigger a run (experiment task — use experiment IDs)
ax tasks trigger-run TASK_ID \
  --experiment-ids "EXP_ID_1" \
  --wait

# Monitor
ax tasks list-runs TASK_ID
ax tasks get-run RUN_ID
ax tasks wait-for-run RUN_ID --timeout 300
ax tasks cancel-run RUN_ID --force
```
**Time format for trigger-run:** `2026-03-21T09:00:00` — no trailing `Z`.
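The required wall-clock string can be produced with standard date formatting; for example, in Python:

```python
from datetime import datetime

def trigger_run_timestamp(dt: datetime) -> str:
    """Format a datetime for --data-start-time / --data-end-time.

    The CLI expects `YYYY-MM-DDTHH:MM:SS` with no trailing 'Z'.
    """
    return dt.strftime("%Y-%m-%dT%H:%M:%S")

print(trigger_run_timestamp(datetime(2026, 3, 21, 9, 0, 0)))  # → 2026-03-21T09:00:00
```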
**Additional trigger-run flags:**
| Flag | Description |
|------|-------------|
| `--max-spans` | Cap processed spans (default 10,000) |
| `--override-evaluations` | Re-score spans that already have labels |
| `--wait` / `-w` | Block until the run finishes |
| `--timeout` | Seconds to wait with `--wait` (default 600) |
| `--poll-interval` | Poll interval in seconds when waiting (default 5) |
**Run status guide:**
| Status | Meaning |
|--------|---------|
| `completed`, 0 spans | No spans in eval index for that window — widen time range |
| `cancelled` ~1s | Integration credentials invalid |
| `cancelled` ~3min | Found spans but LLM call failed — check model name or key |
| `completed`, N > 0 | Success — check scores in UI |
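The status guide can be condensed into a small triage helper. The status strings and diagnoses come from the table above; the 5-second threshold separating the two `cancelled` cases is an illustrative cutoff, not a documented constant:

```python
def triage_run(status: str, spans_processed: int, duration_seconds: float) -> str:
    """Map a finished run's status, span count, and duration to a likely diagnosis."""
    if status == "completed" and spans_processed == 0:
        return "No spans in eval index for that window — widen the time range"
    if status == "completed":
        return "Success — check scores in the UI"
    if status == "cancelled" and duration_seconds <= 5:
        return "Integration credentials invalid"
    if status == "cancelled":
        return "Found spans but the LLM call failed — check model name or key"
    return f"Unrecognized state: {status}"

print(triage_run("completed", 0, 12.0))
print(triage_run("cancelled", 0, 1.2))
```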
---

Workflow A: Create an evaluator for a project
Use this when the user says something like "create an evaluator for my Playground Traces project".
Step 1: Resolve the project name to an ID
```bash
ax projects list --space-id SPACE_ID -o json
```

Find the entry whose `"name"` matches (case-insensitive). Copy its `"id"` (a base64 string).

Step 2: Understand what to evaluate
If the user specified the evaluator type (hallucination, correctness, relevance, etc.) → skip to Step 3.
If not, sample recent spans to base the evaluator on actual data:
```bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 10 --days 30 --stdout
```

Inspect `attributes.input`, `attributes.output`, span kinds, and any existing annotations. Identify failure modes (e.g. hallucinated facts, off-topic answers, missing context) and propose 1–3 concrete evaluator ideas. Let the user pick.

Each suggestion must include: the evaluator name (bold), a one-sentence description of what it judges, and the binary label pair in parentheses. Format each like:

- **Name** — Description of what is being judged. (`label_a` / `label_b`)

Example:

- **Response Correctness** — Does the agent's response correctly address the user's financial query? (`correct` / `incorrect`)
- **Hallucination** — Does the response fabricate facts not grounded in retrieved context? (`factual` / `hallucinated`)
Step 3: Confirm or create an AI integration
```bash
ax ai-integrations list --space-id SPACE_ID -o json
```

If a suitable integration exists, note its ID. If not, create one using the arize-ai-provider-integration skill. Ask the user which provider/model they want for the judge.
Step 4: Create the evaluator
Use the template design best practices below. Keep the evaluator name and variables generic — the task (Step 6) handles project-specific wiring via `column_mappings`.

```bash
ax evaluators create \
  --name "Hallucination" \
  --space-id SPACE_ID \
  --template-name "hallucination" \
  --commit-message "Initial version" \
  --ai-integration-id INT_ID \
  --model-name "gpt-4o" \
  --include-explanations \
  --use-function-calling \
  --template 'You are an evaluator. Given the user question and the model response, decide if the response is factual or contains unsupported claims.
User question: {input}
Model response: {output}
Respond with exactly one of these labels: hallucinated, factual'
```
Step 5: Ask — backfill, continuous, or both?
Before creating the task, ask:
"Would you like to: (a) Run a backfill on historical spans (one-time)? (b) Set up continuous evaluation on new spans going forward? (c) Both — backfill now and keep scoring new spans automatically?"
Step 6: Determine column mappings from real span data
Do not guess paths. Pull a sample and inspect what fields are actually present:
```bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 5 --days 7 --stdout
```

For each template variable (`{input}`, `{output}`, `{context}`), find the matching JSON path. Common starting points — always verify on your actual data before using:

| Template var | LLM span | CHAIN span |
|---|---|---|
| `{input}` | `attributes.input.value` | `attributes.input.value` |
| `{output}` | `attributes.llm.output_messages.0.message.content` | `attributes.output.value` |
| `{context}` | `attributes.retrieval.documents.contents` | — |

Validate span kind alignment: If the evaluator prompt assumes LLM final text but the task targets CHAIN spans (or vice versa), runs can cancel or score the wrong text. Make sure the `query_filter` on the task matches the span kind you mapped.

Full example `--evaluators` JSON:

```json
[
  {
    "evaluator_id": "EVAL_ID",
    "query_filter": "span_kind = 'LLM'",
    "column_mappings": {
      "input": "attributes.input.value",
      "output": "attributes.llm.output_messages.0.message.content",
      "context": "attributes.retrieval.documents.contents"
    }
  }
]
```

Include a mapping for every variable the template references. Omitting one causes runs to produce no valid scores.
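Because omitting a mapping for any referenced variable produces no valid scores, it can pay to check the payload before creating the task. A minimal sketch using this document's `{variable}` template syntax and the `--evaluators` JSON shape shown above (the helper name is hypothetical):

```python
import json
import re

def missing_mappings(template: str, evaluators_json: str) -> list[str]:
    """Return template variables that have no column mapping in the payload."""
    variables = set(re.findall(r"\{(\w+)\}", template))
    mapped = set()
    for entry in json.loads(evaluators_json):
        mapped |= entry.get("column_mappings", {}).keys()
    return sorted(variables - mapped)

template = "Question: {input}\nResponse: {output}\nContext: {context}"
payload = '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]'
print(missing_mappings(template, payload))  # → ['context']
```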
Step 7: Create the task
Backfill only (a):

```bash
ax tasks create \
  --name "Hallucination Backfill" \
  --task-type template_evaluation \
  --project-id PROJECT_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --no-continuous
```

Continuous only (b):

```bash
ax tasks create \
  --name "Hallucination Monitor" \
  --task-type template_evaluation \
  --project-id PROJECT_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --is-continuous \
  --sampling-rate 0.1
```

Both (c): Use `--is-continuous` on create, then also trigger a backfill run in Step 8.
Step 8: Trigger a backfill run (if requested)
First find what time range has data:

```bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 100 --days 1 --stdout  # try last 24h first
ax spans export PROJECT_ID --space-id SPACE_ID -l 100 --days 7 --stdout  # widen if empty
```

Use the `start_time` / `end_time` fields from real spans to set the window. Use the most recent data for your first test run.

```bash
ax tasks trigger-run TASK_ID \
  --data-start-time "2026-03-20T00:00:00" \
  --data-end-time "2026-03-21T23:59:59" \
  --wait
```
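Assuming the span export emits a JSON array where each span carries ISO-8601 `start_time` / `end_time` strings (verify against your actual `ax spans export` output before relying on this), a backfill window can be derived like this:

```python
import json

def backfill_window(spans_json: str) -> tuple[str, str]:
    """Return (earliest start_time, latest end_time) across exported spans.

    ISO-8601 timestamps in a uniform format compare correctly as strings.
    """
    spans = json.loads(spans_json)
    starts = [s["start_time"] for s in spans]
    ends = [s["end_time"] for s in spans]
    return min(starts), max(ends)

# Hypothetical export shaped like `ax spans export ... --stdout`
sample = json.dumps([
    {"start_time": "2026-03-20T08:15:00", "end_time": "2026-03-20T08:15:04"},
    {"start_time": "2026-03-21T19:02:00", "end_time": "2026-03-21T19:02:09"},
])
print(backfill_window(sample))  # → ('2026-03-20T08:15:00', '2026-03-21T19:02:09')
```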
Workflow B: Create an evaluator for an experiment
Use this when the user says something like "create an evaluator for my experiment" or "evaluate my dataset runs".
If the user says "dataset" but doesn't have an experiment: A task must target an experiment (not a bare dataset). Ask:
"Evaluation tasks run against experiment runs, not datasets directly. Would you like help creating an experiment on that dataset first?"
If yes, use the arize-experiment skill to create one, then return here.
Step 1: Resolve dataset and experiment
```bash
ax datasets list --space-id SPACE_ID -o json
ax experiments list --dataset-id DATASET_ID -o json
```

Note the dataset ID and the experiment ID(s) to score.
Step 2: Understand what to evaluate
If the user specified the evaluator type → skip to Step 3.
If not, inspect a recent experiment run to base the evaluator on actual data:
```bash
ax experiments export EXPERIMENT_ID --stdout | python3 -c "import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))"
```

Look at the `output`, `input`, `evaluations`, and `metadata` fields. Identify gaps (metrics the user cares about but doesn't have yet) and propose 1–3 evaluator ideas. Each suggestion must include: the evaluator name (bold), a one-sentence description, and the binary label pair in parentheses — same format as Workflow A, Step 2.

Step 3: Confirm or create an AI integration
Same as Workflow A, Step 3.
Step 4: Create the evaluator
Same as Workflow A, Step 4. Keep variables generic.
Step 5: Determine column mappings from real run data
Run data shape differs from span data. Inspect:
```bash
ax experiments export EXPERIMENT_ID --stdout | python3 -c "import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))"
```

Common mapping for experiment runs:

- `output` → `"output"` (top-level field on each run)
- `input` → check if it's on the run or embedded in the linked dataset examples

If `input` is not on the run JSON, export dataset examples to find the path:

```bash
ax datasets export DATASET_ID --stdout | python3 -c "import sys,json; ex=json.load(sys.stdin); print(json.dumps(ex[0], indent=2))"
```

Step 6: Create the task
```bash
ax tasks create \
  --name "Experiment Correctness" \
  --task-type template_evaluation \
  --dataset-id DATASET_ID \
  --experiment-ids "EXP_ID" \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"output": "output"}}]' \
  --no-continuous
```
Step 7: Trigger and monitor
```bash
ax tasks trigger-run TASK_ID \
  --experiment-ids "EXP_ID" \
  --wait

ax tasks list-runs TASK_ID
ax tasks get-run RUN_ID
```

Best Practices for Template Design
1. Use generic, portable variable names
Use `{input}`, `{output}`, and `{context}` — not names tied to a specific project or span attribute (e.g. do not use `{attributes_input_value}`). The evaluator itself stays abstract; the **task's `column_mappings`** is where you wire it to the actual fields in a specific project or experiment. This lets the same evaluator run across multiple projects and experiments without modification.

2. Default to binary labels
Use exactly two clear string labels (e.g. `hallucinated` / `factual`, `correct` / `incorrect`, `pass` / `fail`). Binary labels are:

- Easiest for the judge model to produce consistently
- Most common in the industry
- Simplest to interpret in dashboards

If the user insists on more than two choices, that's fine — but recommend binary first and explain the tradeoff (more labels → more ambiguity → lower inter-rater reliability).
3. Be explicit about what the model must return
The template must tell the judge model to respond with only the label string — nothing else. The label strings in the prompt must exactly match the classification choices configured in the UI (same spelling, same casing).
Good:

```
Respond with exactly one of these labels: hallucinated, factual
```

Bad (too open-ended):

```
Is this hallucinated? Answer yes or no.
```

4. Keep temperature low
Pass `--invocation-params '{"temperature": 0}'` for reproducible scoring. Higher temperatures introduce noise into evaluation results.

5. Use `--include-explanations` for debugging
During initial setup, always include explanations so you can verify the judge is reasoning correctly before trusting the labels at scale.
6. Pass the template in single quotes in bash
Single quotes prevent the shell from interpolating `{variable}` placeholders. Double quotes will cause issues:

```bash
# Correct
--template 'Judge this: {input} → {output}'

# Wrong — shell may interpret { } or fail
--template "Judge this: {input} → {output}"
```
undefined7. Validate that choices match your template labels
Reconcile the prompt labels and the Choices panel so they stay in sync. Mismatches cause runs to produce no valid scores.
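This check can be automated by parsing the label list out of the prompt's final instruction and comparing it, case-sensitively, against the configured choices. The sketch below assumes the "Respond with exactly one of these labels:" phrasing used throughout this document; adjust the pattern if your template differs:

```python
import re

def labels_match(template: str, choices: list[str]) -> bool:
    """True if the prompt's label list exactly equals the configured choices."""
    m = re.search(r"Respond with exactly one of these labels:\s*(.+)", template)
    if not m:
        return False
    prompt_labels = [label.strip() for label in m.group(1).rstrip("'\"").split(",")]
    return sorted(prompt_labels) == sorted(choices)

template = "...\nRespond with exactly one of these labels: hallucinated, factual"
print(labels_match(template, ["factual", "hallucinated"]))  # → True
print(labels_match(template, ["Factual", "Hallucinated"]))  # → False (casing differs)
```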
Troubleshooting
| Problem | Solution |
|---|---|
| `ax` not installed, not on PATH, or below the required version | See ax-setup.md |
| Permission errors on a space | API key may not have access to this space. Verify at https://app.arize.com/admin > API Keys |
| | |
| | |
| | |
| Both `--project-id` and `--dataset-id` supplied | Use only one when creating a task |
| | Add |
| | Remove |
| Validation error on project ID | Pass project ID (base64), not project name — look up via `ax projects list` |
| Template validation errors | Use single-quoted `--template` in bash |
| Run stuck in | |
| Run `cancelled` after ~1s | Integration credentials invalid — check AI integration |
| Run `cancelled` after ~3min | Found spans but LLM call failed — wrong model name or bad key |
| Run `completed` with 0 spans | Widen time window; eval index may not cover older data |
| No scores in UI | Fix column mappings and make sure choices match the template labels |
| Scores look wrong | Add `--include-explanations` and inspect the judge's reasoning |
| Evaluator cancels on wrong span kind | Match `query_filter` to the span kind mapped in `column_mappings` |
| Time format error on trigger-run | Use `2026-03-21T09:00:00` — no trailing `Z` |
Related Skills
- arize-ai-provider-integration: Full CRUD for LLM provider integrations (create, update, delete credentials)
- arize-trace: Export spans to discover column paths and time ranges
- arize-experiment: Create experiments and export runs for experiment column mappings
- arize-dataset: Export dataset examples to find input fields when runs omit them
- arize-link: Deep links to evaluators and tasks in the Arize UI
Save Credentials for Future Use
At the end of the session, if the user manually provided any credentials during this conversation and those values were NOT already loaded from a saved profile or environment variable, offer to save them.
Skip this entirely if:
- The API key was already loaded from an existing profile or `ARIZE_API_KEY` env var
- The space ID was already set via `ARIZE_SPACE_ID` env var

How to offer: Use AskQuestion: "Would you like to save your Arize credentials so you don't have to enter them next time?" with options `"Yes, save them"` / `"No thanks"`.

If the user says yes:

- API key — See ax-profiles.md. Run `ax profiles show` to check the current state, then use `ax profiles create` or `ax profiles update` with the appropriate flags to save the key (and region if relevant).
- Space ID — See ax-profiles.md (Space ID section) to persist it as an environment variable.