Arize Evaluator Skill
This skill covers designing, creating, and running LLM-as-judge evaluators on Arize. An evaluator defines the judge; a task is how you run it against real data.
Prerequisites
Three things are needed: the `ax` CLI, an API key (env var or profile), and a space ID.
If `ax` is not installed, not on PATH, or below version `0.8.0`, see ax-setup.md.

macOS/Linux:

```bash
ax --version && echo "--- env ---" && if [ -n "$ARIZE_API_KEY" ]; then echo "ARIZE_API_KEY: (set)"; else echo "ARIZE_API_KEY: (not set)"; fi && echo "ARIZE_SPACE_ID: ${ARIZE_SPACE_ID:-(not set)}" && echo "--- profiles ---" && ax profiles show 2>&1
```

Windows (PowerShell):

```powershell
ax --version; Write-Host "--- env ---"; Write-Host "ARIZE_API_KEY: $(if ($env:ARIZE_API_KEY) { '(set)' } else { '(not set)' })"; Write-Host "ARIZE_SPACE_ID: $env:ARIZE_SPACE_ID"; Write-Host "--- profiles ---"; ax profiles show 2>&1
```

Proceed immediately if env var or profile has an API key. Only ask the user if both are missing.

- No API key in env and no profile → AskQuestion: "Arize API key (https://app.arize.com/admin > API Keys)"
- Space ID unknown → run `ax spaces list -o json` to list all accessible spaces, or AskQuestion
Concepts
What is an Evaluator?
An evaluator is an LLM-as-judge definition. It contains:
| Field | Description |
|---|---|
| Template | The judge prompt. Uses `{variable}` placeholders such as `{input}` and `{output}`. |
| Classification choices | The set of allowed output labels (e.g. `correct` / `incorrect`). |
| AI Integration | Stored LLM provider credentials (OpenAI, Anthropic, Bedrock, etc.) the evaluator uses to call the judge model. |
| Model | The specific judge model (e.g. `gpt-4o`). |
| Invocation params | Optional JSON of model settings like `temperature`. |
| Optimization direction | Whether higher scores are better or worse. |
| Data granularity | Whether the evaluator runs at the span, trace, or session level. Most evaluators run at the span level. |
Evaluators are versioned — every prompt or model change creates a new immutable version. The most recent version is active.
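Because the most recent version is the active one, picking it out of JSON output is a small exercise. A minimal sketch, assuming `ax evaluators list-versions EVALUATOR_ID -o json` returns a list of version objects carrying an ISO-8601 `created_at` field (the field name is an assumption; verify against your actual CLI output):

```python
import json
from datetime import datetime

def active_version(versions_json: str) -> dict:
    """Return the most recently created version, i.e. the active one.

    Assumes each version object carries an ISO-8601 'created_at' field;
    adjust the key to match the real `ax` JSON output.
    """
    versions = json.loads(versions_json)
    return max(versions, key=lambda v: datetime.fromisoformat(v["created_at"]))

# Hypothetical payload shaped like `ax evaluators list-versions EVAL_ID -o json`
sample = json.dumps([
    {"id": "v1", "created_at": "2026-03-01T10:00:00", "commit_message": "Initial version"},
    {"id": "v2", "created_at": "2026-03-20T09:30:00", "commit_message": "Added context grounding"},
])
print(active_version(sample)["id"])  # → v2
```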
What is a Task?
A task is how you run one or more evaluators against real data. Tasks are attached to a project (live traces/spans) or a dataset (experiment runs). A task contains:
| Field | Description |
|---|---|
| Evaluators | List of evaluators to run. You can run multiple in one task. |
| Column mappings | Maps each evaluator's template variables to actual field paths on spans or experiment runs (e.g. `attributes.input.value`). |
| Query filter | SQL-style expression to select which spans/runs to evaluate (e.g. `span_kind = 'LLM'`). |
| Continuous | For project tasks: whether to automatically score new spans as they arrive. |
| Sampling rate | For continuous project tasks: fraction of new spans to evaluate (0–1). |
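To illustrate what the sampling rate means (this is a model of the semantics, not the platform's actual implementation), a per-span decision at rate 0.1 scores roughly one new span in ten:

```python
import random

def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Evaluate a span with probability `sampling_rate` (0-1)."""
    return rng.random() < sampling_rate

rng = random.Random(42)
# rate 1.0 scores every span; rate 0.0 scores none
assert all(should_evaluate(1.0, rng) for _ in range(100))
assert not any(should_evaluate(0.0, rng) for _ in range(100))
print("sampling semantics hold")
```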
Basic CRUD
AI Integrations
AI integrations store the LLM provider credentials the evaluator uses. For full CRUD — listing, creating for all providers (OpenAI, Anthropic, Azure, Bedrock, Vertex, Gemini, NVIDIA NIM, custom), updating, and deleting — use the arize-ai-provider-integration skill.
Quick reference for the common case (OpenAI):
```bash
# Check for an existing integration first
ax ai-integrations list --space-id SPACE_ID

# Create if none exists
ax ai-integrations create \
  --name "My OpenAI Integration" \
  --provider openAI \
  --api-key "sk-..."
```

Copy the returned integration ID — it is required for `ax evaluators create --ai-integration-id`.

Evaluators
```bash
# List / Get
ax evaluators list --space-id SPACE_ID
ax evaluators get EVALUATOR_ID
ax evaluators list-versions EVALUATOR_ID
ax evaluators get-version VERSION_ID

# Create (creates the evaluator and its first version)
ax evaluators create \
  --name "Answer Correctness" \
  --space-id SPACE_ID \
  --description "Judges if the model answer is correct" \
  --template-name "correctness" \
  --commit-message "Initial version" \
  --ai-integration-id INT_ID \
  --model-name "gpt-4o" \
  --include-explanations \
  --use-function-calling \
  --template 'You are an evaluator. Given the user question and the model response, decide if the response correctly answers the question.
User question: {input}
Model response: {output}
Respond with exactly one of these labels: correct, incorrect'

# Create a new version (for prompt or model changes — versions are immutable)
ax evaluators create-version EVALUATOR_ID \
  --commit-message "Added context grounding" \
  --template-name "correctness" \
  --ai-integration-id INT_ID \
  --model-name "gpt-4o" \
  --include-explanations \
  --template 'Updated prompt...
{input} / {output} / {context}'

# Update metadata only (name, description — not prompt)
ax evaluators update EVALUATOR_ID \
  --name "New Name" \
  --description "Updated description"

# Delete (permanent — removes all versions)
ax evaluators delete EVALUATOR_ID
```

**Key flags for `create`:**

| Flag | Required | Description |
|------|----------|-------------|
| `--name` | yes | Evaluator name (unique within space) |
| `--space-id` | yes | Space to create in |
| `--template-name` | yes | Eval column name — alphanumeric, spaces, hyphens, underscores |
| `--commit-message` | yes | Description of this version |
| `--ai-integration-id` | yes | AI integration ID (from above) |
| `--model-name` | yes | Judge model (e.g. `gpt-4o`) |
| `--template` | yes | Prompt with `{variable}` placeholders (single-quoted in bash) |
| `--description` | no | Human-readable description |
| `--include-explanations` | no | Include reasoning alongside the label |
| `--use-function-calling` | no | Prefer structured function-call output |
| `--invocation-params` | no | JSON of model params e.g. `'{"temperature": 0}'` |
Tasks
```bash
# List / Get
ax tasks list --space-id SPACE_ID
ax tasks list --project-id PROJ_ID
ax tasks list --dataset-id DATASET_ID
ax tasks get TASK_ID

# Create (project — continuous)
ax tasks create \
  --name "Correctness Monitor" \
  --task-type template_evaluation \
  --project-id PROJ_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --is-continuous \
  --sampling-rate 0.1

# Create (project — one-time / backfill)
ax tasks create \
  --name "Correctness Backfill" \
  --task-type template_evaluation \
  --project-id PROJ_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --no-continuous

# Create (experiment / dataset)
ax tasks create \
  --name "Experiment Scoring" \
  --task-type template_evaluation \
  --dataset-id DATASET_ID \
  --experiment-ids "EXP_ID_1,EXP_ID_2" \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"output": "output"}}]' \
  --no-continuous

# Trigger a run (project task — use data window)
ax tasks trigger-run TASK_ID \
  --data-start-time "2026-03-20T00:00:00" \
  --data-end-time "2026-03-21T23:59:59" \
  --wait

# Trigger a run (experiment task — use experiment IDs)
ax tasks trigger-run TASK_ID \
  --experiment-ids "EXP_ID_1" \
  --wait

# Monitor
ax tasks list-runs TASK_ID
ax tasks get-run RUN_ID
ax tasks wait-for-run RUN_ID --timeout 300
ax tasks cancel-run RUN_ID --force
```
**Time format for trigger-run:** `2026-03-21T09:00:00` — no trailing `Z`.
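The required wall-clock string can be produced with standard date formatting; for example, in Python:

```python
from datetime import datetime

def trigger_run_timestamp(dt: datetime) -> str:
    """Format a datetime for --data-start-time / --data-end-time.

    The CLI expects `YYYY-MM-DDTHH:MM:SS` with no trailing 'Z'.
    """
    return dt.strftime("%Y-%m-%dT%H:%M:%S")

print(trigger_run_timestamp(datetime(2026, 3, 21, 9, 0, 0)))  # → 2026-03-21T09:00:00
```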
**Additional trigger-run flags:**
| Flag | Description |
|------|-------------|
| `--max-spans` | Cap processed spans (default 10,000) |
| `--override-evaluations` | Re-score spans that already have labels |
| `--wait` / `-w` | Block until the run finishes |
| `--timeout` | Seconds to wait with `--wait` (default 600) |
| `--poll-interval` | Poll interval in seconds when waiting (default 5) |
**Run status guide:**
| Status | Meaning |
|--------|---------|
| `completed`, 0 spans | No spans in eval index for that window — widen time range |
| `cancelled` ~1s | Integration credentials invalid |
| `cancelled` ~3min | Found spans but LLM call failed — check model name or key |
| `completed`, N > 0 | Success — check scores in UI |
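The status guide can be condensed into a small triage helper. The status strings and diagnoses come from the table above; the 5-second threshold separating the two `cancelled` cases is an illustrative cutoff, not a documented constant:

```python
def triage_run(status: str, spans_processed: int, duration_seconds: float) -> str:
    """Map a finished run's status, span count, and duration to a likely diagnosis."""
    if status == "completed" and spans_processed == 0:
        return "No spans in eval index for that window — widen the time range"
    if status == "completed":
        return "Success — check scores in the UI"
    if status == "cancelled" and duration_seconds <= 5:
        return "Integration credentials invalid"
    if status == "cancelled":
        return "Found spans but the LLM call failed — check model name or key"
    return f"Unrecognized state: {status}"

print(triage_run("completed", 0, 12.0))
print(triage_run("cancelled", 0, 1.2))
```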
---

Workflow A: Create an evaluator for a project
Use this when the user says something like "create an evaluator for my Playground Traces project".
Step 1: Resolve the project name to an ID
```bash
ax projects list --space-id SPACE_ID -o json
```

Find the entry whose `"name"` matches (case-insensitive). Copy its `"id"` (a base64 string).

Step 2: Understand what to evaluate
If the user specified the evaluator type (hallucination, correctness, relevance, etc.) → skip to Step 3.
If not, sample recent spans to base the evaluator on actual data:
```bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 10 --days 30 --stdout
```

Inspect `attributes.input`, `attributes.output`, span kinds, and any existing annotations. Identify failure modes (e.g. hallucinated facts, off-topic answers, missing context) and propose 1–3 concrete evaluator ideas. Let the user pick.

Each suggestion must include: the evaluator name (bold), a one-sentence description of what it judges, and the binary label pair in parentheses. Format each like:

- **Name** — Description of what is being judged. (`label_a` / `label_b`)

Example:

- **Response Correctness** — Does the agent's response correctly address the user's financial query? (`correct` / `incorrect`)
- **Hallucination** — Does the response fabricate facts not grounded in retrieved context? (`factual` / `hallucinated`)
Step 3: Confirm or create an AI integration
```bash
ax ai-integrations list --space-id SPACE_ID -o json
```

If a suitable integration exists, note its ID. If not, create one using the arize-ai-provider-integration skill. Ask the user which provider/model they want for the judge.
Step 4: Create the evaluator
Use the template design best practices below. Keep the evaluator name and variables generic — the task (Step 6) handles project-specific wiring via `column_mappings`.

```bash
ax evaluators create \
  --name "Hallucination" \
  --space-id SPACE_ID \
  --template-name "hallucination" \
  --commit-message "Initial version" \
  --ai-integration-id INT_ID \
  --model-name "gpt-4o" \
  --include-explanations \
  --use-function-calling \
  --template 'You are an evaluator. Given the user question and the model response, decide if the response is factual or contains unsupported claims.
User question: {input}
Model response: {output}
Respond with exactly one of these labels: hallucinated, factual'
```
Step 5: Ask — backfill, continuous, or both?
Before creating the task, ask:
"Would you like to: (a) Run a backfill on historical spans (one-time)? (b) Set up continuous evaluation on new spans going forward? (c) Both — backfill now and keep scoring new spans automatically?"
Step 6: Determine column mappings from real span data
Do not guess paths. Pull a sample and inspect what fields are actually present:
```bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 5 --days 7 --stdout
```

For each template variable (`{input}`, `{output}`, `{context}`), find the matching JSON path. Common starting points — always verify on your actual data before using:

| Template var | LLM span | CHAIN span |
|---|---|---|
| `{input}` | `attributes.input.value` | `attributes.input.value` |
| `{output}` | `attributes.llm.output_messages.0.message.content` | `attributes.output.value` |
| `{context}` | `attributes.retrieval.documents.contents` | — |

Validate span kind alignment: If the evaluator prompt assumes LLM final text but the task targets CHAIN spans (or vice versa), runs can cancel or score the wrong text. Make sure the `query_filter` on the task matches the span kind you mapped.

Full example `--evaluators` JSON:

```json
[
  {
    "evaluator_id": "EVAL_ID",
    "query_filter": "span_kind = 'LLM'",
    "column_mappings": {
      "input": "attributes.input.value",
      "output": "attributes.llm.output_messages.0.message.content",
      "context": "attributes.retrieval.documents.contents"
    }
  }
]
```

Include a mapping for every variable the template references. Omitting one causes runs to produce no valid scores.
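Because omitting a mapping for any referenced variable produces no valid scores, it can pay to check the payload before creating the task. A minimal sketch using this document's `{variable}` template syntax and the `--evaluators` JSON shape shown above (the helper name is hypothetical):

```python
import json
import re

def missing_mappings(template: str, evaluators_json: str) -> list[str]:
    """Return template variables that have no column mapping in the payload."""
    variables = set(re.findall(r"\{(\w+)\}", template))
    mapped = set()
    for entry in json.loads(evaluators_json):
        mapped |= entry.get("column_mappings", {}).keys()
    return sorted(variables - mapped)

template = "Question: {input}\nResponse: {output}\nContext: {context}"
payload = '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]'
print(missing_mappings(template, payload))  # → ['context']
```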
Step 7: Create the task
Backfill only (a):

```bash
ax tasks create \
  --name "Hallucination Backfill" \
  --task-type template_evaluation \
  --project-id PROJECT_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --no-continuous
```

Continuous only (b):

```bash
ax tasks create \
  --name "Hallucination Monitor" \
  --task-type template_evaluation \
  --project-id PROJECT_ID \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"input": "attributes.input.value", "output": "attributes.output.value"}}]' \
  --is-continuous \
  --sampling-rate 0.1
```

Both (c): Use `--is-continuous` on create, then also trigger a backfill run in Step 8.
Step 8: Trigger a backfill run (if requested)
First find what time range has data:

```bash
ax spans export PROJECT_ID --space-id SPACE_ID -l 100 --days 1 --stdout  # try last 24h first
ax spans export PROJECT_ID --space-id SPACE_ID -l 100 --days 7 --stdout  # widen if empty
```

Use the `start_time` / `end_time` fields from real spans to set the window. Use the most recent data for your first test run.

```bash
ax tasks trigger-run TASK_ID \
  --data-start-time "2026-03-20T00:00:00" \
  --data-end-time "2026-03-21T23:59:59" \
  --wait
```
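Assuming the span export emits a JSON array where each span carries ISO-8601 `start_time` / `end_time` strings (verify against your actual `ax spans export` output before relying on this), a backfill window can be derived like this:

```python
import json

def backfill_window(spans_json: str) -> tuple[str, str]:
    """Return (earliest start_time, latest end_time) across exported spans.

    ISO-8601 timestamps in a uniform format compare correctly as strings.
    """
    spans = json.loads(spans_json)
    starts = [s["start_time"] for s in spans]
    ends = [s["end_time"] for s in spans]
    return min(starts), max(ends)

# Hypothetical export shaped like `ax spans export ... --stdout`
sample = json.dumps([
    {"start_time": "2026-03-20T08:15:00", "end_time": "2026-03-20T08:15:04"},
    {"start_time": "2026-03-21T19:02:00", "end_time": "2026-03-21T19:02:09"},
])
print(backfill_window(sample))  # → ('2026-03-20T08:15:00', '2026-03-21T19:02:09')
```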
Workflow B: Create an evaluator for an experiment
Use this when the user says something like "create an evaluator for my experiment" or "evaluate my dataset runs".
If the user says "dataset" but doesn't have an experiment: A task must target an experiment (not a bare dataset). Ask:
"Evaluation tasks run against experiment runs, not datasets directly. Would you like help creating an experiment on that dataset first?"
If yes, use the arize-experiment skill to create one, then return here.
Step 1: Resolve dataset and experiment
```bash
ax datasets list --space-id SPACE_ID -o json
ax experiments list --dataset-id DATASET_ID -o json
```

Note the dataset ID and the experiment ID(s) to score.
Step 2: Understand what to evaluate
If the user specified the evaluator type → skip to Step 3.
If not, inspect a recent experiment run to base the evaluator on actual data:
```bash
ax experiments export EXPERIMENT_ID --stdout | python3 -c "import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))"
```

Look at the `output`, `input`, `evaluations`, and `metadata` fields. Identify gaps (metrics the user cares about but doesn't have yet) and propose 1–3 evaluator ideas. Each suggestion must include: the evaluator name (bold), a one-sentence description, and the binary label pair in parentheses — same format as Workflow A, Step 2.

Step 3: Confirm or create an AI integration
Same as Workflow A, Step 3.
Step 4: Create the evaluator
Same as Workflow A, Step 4. Keep variables generic.
Step 5: Determine column mappings from real run data
Run data shape differs from span data. Inspect:
```bash
ax experiments export EXPERIMENT_ID --stdout | python3 -c "import sys,json; runs=json.load(sys.stdin); print(json.dumps(runs[0], indent=2))"
```

Common mapping for experiment runs:

- `output` → `"output"` (top-level field on each run)
- `input` → check if it's on the run or embedded in the linked dataset examples

If `input` is not on the run JSON, export dataset examples to find the path:

```bash
ax datasets export DATASET_ID --stdout | python3 -c "import sys,json; ex=json.load(sys.stdin); print(json.dumps(ex[0], indent=2))"
```

Step 6: Create the task
```bash
ax tasks create \
  --name "Experiment Correctness" \
  --task-type template_evaluation \
  --dataset-id DATASET_ID \
  --experiment-ids "EXP_ID" \
  --evaluators '[{"evaluator_id": "EVAL_ID", "column_mappings": {"output": "output"}}]' \
  --no-continuous
```
Step 7: Trigger and monitor
```bash
ax tasks trigger-run TASK_ID \
  --experiment-ids "EXP_ID" \
  --wait

ax tasks list-runs TASK_ID
ax tasks get-run RUN_ID
```

Best Practices for Template Design
1. Use generic, portable variable names
Use `{input}`, `{output}`, and `{context}` — not names tied to a specific project or span attribute (e.g. do not use `{attributes_input_value}`). The evaluator itself stays abstract; the **task's `column_mappings`** is where you wire it to the actual fields in a specific project or experiment. This lets the same evaluator run across multiple projects and experiments without modification.

2. Default to binary labels
Use exactly two clear string labels (e.g. `hallucinated` / `factual`, `correct` / `incorrect`, `pass` / `fail`). Binary labels are:

- Easiest for the judge model to produce consistently
- Most common in the industry
- Simplest to interpret in dashboards

If the user insists on more than two choices, that's fine — but recommend binary first and explain the tradeoff (more labels → more ambiguity → lower inter-rater reliability).
3. Be explicit about what the model must return
The template must tell the judge model to respond with only the label string — nothing else. The label strings in the prompt must exactly match the classification choices configured in the UI (same spelling, same casing).
Good:

```
Respond with exactly one of these labels: hallucinated, factual
```

Bad (too open-ended):

```
Is this hallucinated? Answer yes or no.
```

4. Keep temperature low
Pass `--invocation-params '{"temperature": 0}'` for reproducible scoring. Higher temperatures introduce noise into evaluation results.

5. Use `--include-explanations` for debugging
During initial setup, always include explanations so you can verify the judge is reasoning correctly before trusting the labels at scale.
6. Pass the template in single quotes in bash
Single quotes prevent the shell from interpolating `{variable}` placeholders. Double quotes will cause issues:

```bash
# Correct
--template 'Judge this: {input} → {output}'

# Wrong — shell may interpret { } or fail
--template "Judge this: {input} → {output}"
```
undefined7. Validate that choices match your template labels
Reconcile the prompt labels and the Choices panel so they stay in sync. Mismatches cause runs to produce no valid scores.
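This check can be automated by parsing the label list out of the prompt's final instruction and comparing it, case-sensitively, against the configured choices. The sketch below assumes the "Respond with exactly one of these labels:" phrasing used throughout this document; adjust the pattern if your template differs:

```python
import re

def labels_match(template: str, choices: list[str]) -> bool:
    """True if the prompt's label list exactly equals the configured choices."""
    m = re.search(r"Respond with exactly one of these labels:\s*(.+)", template)
    if not m:
        return False
    prompt_labels = [label.strip() for label in m.group(1).rstrip("'\"").split(",")]
    return sorted(prompt_labels) == sorted(choices)

template = "...\nRespond with exactly one of these labels: hallucinated, factual"
print(labels_match(template, ["factual", "hallucinated"]))  # → True
print(labels_match(template, ["Factual", "Hallucinated"]))  # → False (casing differs)
```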
Troubleshooting
| Problem | Solution |
|---|---|
| `ax` not installed, not on PATH, or below the required version | See ax-setup.md |
| Permission errors on a space | API key may not have access to this space. Verify at https://app.arize.com/admin > API Keys |
| | |
| | |
| | |
| Both `--project-id` and `--dataset-id` supplied | Use only one when creating a task |
| | Add |
| | Remove |
| Validation error on project ID | Pass project ID (base64), not project name — look up via `ax projects list` |
| Template validation errors | Use single-quoted `--template` in bash |
| Run stuck in | |
| Run `cancelled` after ~1s | Integration credentials invalid — check AI integration |
| Run `cancelled` after ~3min | Found spans but LLM call failed — wrong model name or bad key |
| Run `completed` with 0 spans | Widen time window; eval index may not cover older data |
| No scores in UI | Fix column mappings and make sure choices match the template labels |
| Scores look wrong | Add `--include-explanations` and inspect the judge's reasoning |
| Evaluator cancels on wrong span kind | Match `query_filter` to the span kind mapped in `column_mappings` |
| Time format error on trigger-run | Use `2026-03-21T09:00:00` — no trailing `Z` |
Related Skills
- arize-ai-provider-integration: Full CRUD for LLM provider integrations (create, update, delete credentials)
- arize-trace: Export spans to discover column paths and time ranges
- arize-experiment: Create experiments and export runs for experiment column mappings
- arize-dataset: Export dataset examples to find input fields when runs omit them
- arize-link: Deep links to evaluators and tasks in the Arize UI
Save Credentials for Future Use
At the end of the session, if the user manually provided any credentials during this conversation and those values were NOT already loaded from a saved profile or environment variable, offer to save them.
Skip this entirely if:
- The API key was already loaded from an existing profile or `ARIZE_API_KEY` env var
- The space ID was already set via `ARIZE_SPACE_ID` env var

How to offer: Use AskQuestion: "Would you like to save your Arize credentials so you don't have to enter them next time?" with options `"Yes, save them"` / `"No thanks"`.

If the user says yes:

- API key — See ax-profiles.md. Run `ax profiles show` to check the current state, then use `ax profiles create` or `ax profiles update` with the appropriate flags to save the key (and region if relevant).
- Space ID — See ax-profiles.md (Space ID section) to persist it as an environment variable.