phoenix-observability

Phoenix - AI Observability Platform

Open-source AI observability and evaluation platform for LLM applications with tracing, evaluation, datasets, experiments, and real-time monitoring.

When to use Phoenix

Use Phoenix when:
  • Debugging LLM application issues with detailed traces
  • Running systematic evaluations on datasets
  • Monitoring production LLM systems in real-time
  • Building experiment pipelines for prompt/model comparison
  • Self-hosted observability without vendor lock-in
Key features:
  • Tracing: OpenTelemetry-based trace collection for any LLM framework
  • Evaluation: LLM-as-judge evaluators for quality assessment
  • Datasets: Versioned test sets for regression testing
  • Experiments: Compare prompts, models, and configurations
  • Playground: Interactive prompt testing with multiple models
  • Open-source: Self-hosted with PostgreSQL or SQLite
Consider alternatives instead:
  • LangSmith: Managed platform with LangChain-first integration
  • Weights & Biases: Deep learning experiment tracking focus
  • Arize Cloud: Managed Phoenix with enterprise features
  • MLflow: General ML lifecycle, model registry focus

Quick start

Installation

```bash
pip install arize-phoenix
```

With specific backends

```bash
pip install arize-phoenix[embeddings]  # Embedding analysis
pip install arize-phoenix-otel         # OpenTelemetry config
pip install arize-phoenix-evals        # Evaluation framework
pip install arize-phoenix-client       # Lightweight REST client
```

Launch Phoenix server

```python
import phoenix as px

# Launch in notebook (ThreadServer mode)
session = px.launch_app()

# View UI
session.view()      # Embedded iframe
print(session.url)  # http://localhost:6006
```

Command-line server (production)

```bash
# Start Phoenix server
phoenix serve

# With PostgreSQL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve --port 6006
```

Basic tracing

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure OpenTelemetry with Phoenix
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces"
)

# Instrument OpenAI SDK
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# All OpenAI calls are now traced
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

Core concepts

Traces and spans

A trace represents a complete execution flow, while spans are individual operations within that trace.

```python
from phoenix.otel import register
from opentelemetry import trace

# Setup tracing
tracer_provider = register(project_name="my-app")
tracer = trace.get_tracer(__name__)

# Create custom spans
with tracer.start_as_current_span("process_query") as span:
    span.set_attribute("input.value", query)

    # Child spans are automatically nested
    with tracer.start_as_current_span("retrieve_context"):
        context = retriever.search(query)

    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(query, context)

    span.set_attribute("output.value", response)
```

Projects

Projects organize related traces:

```python
import os

os.environ["PHOENIX_PROJECT_NAME"] = "production-chatbot"

# Or per-trace
from phoenix.otel import register

tracer_provider = register(project_name="experiment-v2")
```

Framework instrumentation

OpenAI

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```

LangChain

```python
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# All LangChain operations traced
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("Hello!")
```

LlamaIndex

```python
from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

tracer_provider = register()
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
```

Anthropic

```python
from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

tracer_provider = register()
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)
```

Evaluation framework

Built-in evaluators

```python
from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    RelevanceEvaluator,
    ToxicityEvaluator,
    llm_classify
)

# Setup model for evaluation
eval_model = OpenAIModel(model="gpt-4o")

# Evaluate hallucination
hallucination_eval = HallucinationEvaluator(eval_model)
results = hallucination_eval.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    reference="Paris is the capital of France."
)
```

Custom evaluators

```python
from phoenix.evals import llm_classify

# Define custom evaluation
def evaluate_helpfulness(input_text, output_text):
    template = """
    Evaluate if the response is helpful for the given question.

    Question: {input}
    Response: {output}

    Is this response helpful? Answer 'helpful' or 'not_helpful'.
    """

    result = llm_classify(
        model=eval_model,
        template=template,
        input=input_text,
        output=output_text,
        rails=["helpful", "not_helpful"]
    )
    return result
```
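
The evaluation template is an ordinary Python format string, so the rendered prompt can be previewed locally before spending an LLM call. This is a standalone sketch (the sample question and answer are illustrative, and `llm_classify` is not invoked):

```python
# Same template literal as the custom evaluator above
template = """
Evaluate if the response is helpful for the given question.

Question: {input}
Response: {output}

Is this response helpful? Answer 'helpful' or 'not_helpful'.
"""

# Render with sample data to inspect what the judge model would see
rendered = template.format(
    input="What is Phoenix?",
    output="Phoenix is an open-source AI observability platform.",
)
print(rendered)
```

Previewing rendered templates this way makes it easy to catch placeholder typos before running a full evaluation pass.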

Run evaluations on dataset

```python
from phoenix import Client
from phoenix.evals import run_evals

client = Client()

# Get spans to evaluate
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'"
)

# Run evaluations
eval_results = run_evals(
    dataframe=spans_df,
    evaluators=[
        HallucinationEvaluator(eval_model),
        RelevanceEvaluator(eval_model)
    ],
    provide_explanation=True
)

# Log results back to Phoenix
client.log_evaluations(eval_results)
```

Datasets and experiments

Create dataset

```python
from phoenix import Client

client = Client()

# Create dataset
dataset = client.create_dataset(
    name="qa-test-set",
    description="QA evaluation dataset"
)

# Add examples
client.add_examples_to_dataset(
    dataset_name="qa-test-set",
    examples=[
        {
            "input": {"question": "What is Python?"},
            "output": {"answer": "A programming language"}
        },
        {
            "input": {"question": "What is ML?"},
            "output": {"answer": "Machine learning"}
        }
    ]
)
```

Run experiment

```python
from phoenix import Client
from phoenix.experiments import run_experiment

client = Client()

def my_model(input_data):
    """Your model function."""
    question = input_data["question"]
    return {"answer": generate_answer(question)}

def accuracy_evaluator(input_data, output, expected):
    """Custom evaluator."""
    correct = expected["answer"].lower() in output["answer"].lower()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect"
    }

# Run experiment
results = run_experiment(
    dataset_name="qa-test-set",
    task=my_model,
    evaluators=[accuracy_evaluator],
    experiment_name="baseline-v1"
)
print(f"Average accuracy: {results.aggregate_metrics['accuracy']}")
```
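
Because `accuracy_evaluator` is plain Python, it can be sanity-checked with stub data before wiring it into `run_experiment`. The inputs below are illustrative, not from a real dataset:

```python
def accuracy_evaluator(input_data, output, expected):
    """Substring-match evaluator, as defined above."""
    correct = expected["answer"].lower() in output["answer"].lower()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect",
    }

# Sanity-check with stub data (no Phoenix server needed)
hit = accuracy_evaluator(
    {"question": "What is Python?"},
    {"answer": "Python is a programming language."},
    {"answer": "A programming language"},
)
miss = accuracy_evaluator(
    {"question": "What is ML?"},
    {"answer": "I don't know."},
    {"answer": "Machine learning"},
)
print(hit["label"], miss["label"])  # correct incorrect
```

Unit-testing evaluators this way catches scoring bugs cheaply, since experiment runs involve real model calls.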

Client API

Query traces and spans

```python
from phoenix import Client

client = Client(endpoint="http://localhost:6006")

# Get spans as DataFrame
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
    limit=1000
)

# Get specific span
span = client.get_span(span_id="abc123")

# Get trace
trace = client.get_trace(trace_id="xyz789")
```

Log feedback

```python
from phoenix import Client

client = Client()

# Log user feedback
client.log_annotation(
    span_id="abc123",
    name="user_rating",
    annotator_kind="HUMAN",
    score=0.8,
    label="helpful",
    metadata={"comment": "Good response"}
)
```

Export data

```python
# Export to pandas
df = client.get_spans_dataframe(project_name="my-app")

# Export traces
traces = client.list_traces(project_name="my-app")
```

Production deployment

Docker

```bash
docker run -p 6006:6006 arizephoenix/phoenix:latest
```

With PostgreSQL

```bash
# Set database URL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix"

# Start server
phoenix serve --host 0.0.0.0 --port 6006
```

Environment variables

| Variable | Description | Default |
|----------|-------------|---------|
| PHOENIX_PORT | HTTP server port | 6006 |
| PHOENIX_HOST | Server bind address | 127.0.0.1 |
| PHOENIX_GRPC_PORT | gRPC/OTLP port | 4317 |
| PHOENIX_SQL_DATABASE_URL | Database connection | SQLite temp |
| PHOENIX_WORKING_DIR | Data storage directory | OS temp |
| PHOENIX_ENABLE_AUTH | Enable authentication | false |
| PHOENIX_SECRET | JWT signing secret | Required if auth enabled |
| PHOENIX_ADMIN_SECRET | Admin bootstrap token | Required if auth enabled |

With authentication

```bash
export PHOENIX_ENABLE_AUTH=true
export PHOENIX_SECRET="your-secret-key-min-32-chars"
export PHOENIX_ADMIN_SECRET="admin-bootstrap-token"

phoenix serve
```

Best practices

  1. Use projects: Separate traces by environment (dev/staging/prod)
  2. Add metadata: Include user IDs, session IDs for debugging
  3. Evaluate regularly: Run automated evaluations in CI/CD
  4. Version datasets: Track test set changes over time
  5. Monitor costs: Track token usage via Phoenix dashboards
  6. Self-host: Use PostgreSQL for production deployments
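
A minimal sketch of practices 1 and 2: derive the project name from the deployment environment and collect per-request metadata for debugging. The helper names and the `APP_ENV` variable are illustrative, not Phoenix APIs; in real code the metadata dict would be attached via `span.set_attribute` as in the tracing examples:

```python
import os

def project_name(app: str) -> str:
    """Practice 1: one Phoenix project per environment (dev/staging/prod)."""
    env = os.environ.get("APP_ENV", "dev")
    return f"{app}-{env}"

def trace_metadata(user_id: str, session_id: str) -> dict:
    """Practice 2: metadata to attach to spans for later filtering/debugging."""
    return {
        "user.id": user_id,
        "session.id": session_id,
    }

print(project_name("chatbot"))
print(trace_metadata("u-42", "s-9001"))
```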

Common issues

**Traces not appearing:**

```python
from phoenix.otel import register

# Verify endpoint
tracer_provider = register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces"  # Correct endpoint
)

# Force flush
from opentelemetry import trace
trace.get_tracer_provider().force_flush()
```

**High memory in notebook:**

```python
# Close session when done
session = px.launch_app()

# ... do work ...

session.close()
px.close_app()
```

**Database connection issues:**

```bash
# Verify PostgreSQL connection
psql $PHOENIX_SQL_DATABASE_URL -c "SELECT 1"

# Check Phoenix logs
phoenix serve --log-level debug
```

References

  • Advanced Usage - Custom evaluators, experiments, production setup
  • Troubleshooting - Common issues, debugging, performance

Resources
