phoenix-observability

Phoenix - AI Observability Platform

Open-source AI observability and evaluation platform for LLM applications with tracing, evaluation, datasets, experiments, and real-time monitoring.

When to use Phoenix

Use Phoenix when:
  • Debugging LLM application issues with detailed traces
  • Running systematic evaluations on datasets
  • Monitoring production LLM systems in real-time
  • Building experiment pipelines for prompt/model comparison
  • Self-hosted observability without vendor lock-in
Key features:
  • Tracing: OpenTelemetry-based trace collection for any LLM framework
  • Evaluation: LLM-as-judge evaluators for quality assessment
  • Datasets: Versioned test sets for regression testing
  • Experiments: Compare prompts, models, and configurations
  • Playground: Interactive prompt testing with multiple models
  • Open-source: Self-hosted with PostgreSQL or SQLite
Consider alternatives instead:
  • LangSmith: Managed platform with LangChain-first integration
  • Weights & Biases: Deep learning experiment tracking focus
  • Arize Cloud: Managed Phoenix with enterprise features
  • MLflow: General ML lifecycle, model registry focus

Quick start

Installation

```bash
pip install arize-phoenix
```

With specific backends

```bash
pip install arize-phoenix[embeddings]  # Embedding analysis
pip install arize-phoenix-otel         # OpenTelemetry config
pip install arize-phoenix-evals        # Evaluation framework
pip install arize-phoenix-client       # Lightweight REST client
```

Launch Phoenix server

```python
import phoenix as px

# Launch in notebook (ThreadServer mode)
session = px.launch_app()

# View UI
session.view()      # Embedded iframe
print(session.url)  # http://localhost:6006
```

Command-line server (production)

```bash
# Start Phoenix server
phoenix serve

# With PostgreSQL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve --port 6006
```

Basic tracing

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure OpenTelemetry with Phoenix
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces"
)

# Instrument OpenAI SDK
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# All OpenAI calls are now traced
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

Core concepts

Traces and spans

A trace represents a complete execution flow, while spans are individual operations within that trace.

```python
from phoenix.otel import register
from opentelemetry import trace

# Setup tracing
tracer_provider = register(project_name="my-app")
tracer = trace.get_tracer(__name__)

# Create custom spans
with tracer.start_as_current_span("process_query") as span:
    span.set_attribute("input.value", query)

    # Child spans are automatically nested
    with tracer.start_as_current_span("retrieve_context"):
        context = retriever.search(query)

    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(query, context)

    span.set_attribute("output.value", response)
```

Projects

Projects organize related traces:

```python
import os

os.environ["PHOENIX_PROJECT_NAME"] = "production-chatbot"

# Or per-trace
from phoenix.otel import register

tracer_provider = register(project_name="experiment-v2")
```

Framework instrumentation

OpenAI

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```

LangChain

```python
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# All LangChain operations traced
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("Hello!")
```

LlamaIndex

```python
from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

tracer_provider = register()
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
```

Anthropic

```python
from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

tracer_provider = register()
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)
```

Evaluation framework

Built-in evaluators

```python
from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    RelevanceEvaluator,
    ToxicityEvaluator,
    llm_classify
)

# Setup model for evaluation
eval_model = OpenAIModel(model="gpt-4o")

# Evaluate hallucination
hallucination_eval = HallucinationEvaluator(eval_model)
results = hallucination_eval.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    reference="Paris is the capital of France."
)
```

Custom evaluators

```python
from phoenix.evals import llm_classify

# Define custom evaluation
def evaluate_helpfulness(input_text, output_text):
    template = """
    Evaluate if the response is helpful for the given question.

    Question: {input}
    Response: {output}

    Is this response helpful? Answer 'helpful' or 'not_helpful'.
    """

    result = llm_classify(
        model=eval_model,
        template=template,
        input=input_text,
        output=output_text,
        rails=["helpful", "not_helpful"]
    )
    return result
```
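
The evaluation template is an ordinary Python format string, so the rendered prompt can be previewed locally before spending an LLM call. This is a standalone sketch (the sample question and answer are illustrative, and `llm_classify` is not invoked):

```python
# Same template literal as the custom evaluator above
template = """
Evaluate if the response is helpful for the given question.

Question: {input}
Response: {output}

Is this response helpful? Answer 'helpful' or 'not_helpful'.
"""

# Render with sample data to inspect what the judge model would see
rendered = template.format(
    input="What is Phoenix?",
    output="Phoenix is an open-source AI observability platform.",
)
print(rendered)
```

Previewing rendered templates this way makes it easy to catch placeholder typos before running a full evaluation pass.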

Run evaluations on dataset

```python
from phoenix import Client
from phoenix.evals import run_evals

client = Client()

# Get spans to evaluate
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'"
)

# Run evaluations
eval_results = run_evals(
    dataframe=spans_df,
    evaluators=[
        HallucinationEvaluator(eval_model),
        RelevanceEvaluator(eval_model)
    ],
    provide_explanation=True
)

# Log results back to Phoenix
client.log_evaluations(eval_results)
```

Datasets and experiments

Create dataset

```python
from phoenix import Client

client = Client()

# Create dataset
dataset = client.create_dataset(
    name="qa-test-set",
    description="QA evaluation dataset"
)

# Add examples
client.add_examples_to_dataset(
    dataset_name="qa-test-set",
    examples=[
        {
            "input": {"question": "What is Python?"},
            "output": {"answer": "A programming language"}
        },
        {
            "input": {"question": "What is ML?"},
            "output": {"answer": "Machine learning"}
        }
    ]
)
```

Run experiment

```python
from phoenix import Client
from phoenix.experiments import run_experiment

client = Client()

def my_model(input_data):
    """Your model function."""
    question = input_data["question"]
    return {"answer": generate_answer(question)}

def accuracy_evaluator(input_data, output, expected):
    """Custom evaluator."""
    correct = expected["answer"].lower() in output["answer"].lower()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect"
    }

# Run experiment
results = run_experiment(
    dataset_name="qa-test-set",
    task=my_model,
    evaluators=[accuracy_evaluator],
    experiment_name="baseline-v1"
)
print(f"Average accuracy: {results.aggregate_metrics['accuracy']}")
```
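
Because `accuracy_evaluator` is plain Python, it can be sanity-checked with stub data before wiring it into `run_experiment`. The inputs below are illustrative, not from a real dataset:

```python
def accuracy_evaluator(input_data, output, expected):
    """Substring-match evaluator, as defined above."""
    correct = expected["answer"].lower() in output["answer"].lower()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect",
    }

# Sanity-check with stub data (no Phoenix server needed)
hit = accuracy_evaluator(
    {"question": "What is Python?"},
    {"answer": "Python is a programming language."},
    {"answer": "A programming language"},
)
miss = accuracy_evaluator(
    {"question": "What is ML?"},
    {"answer": "I don't know."},
    {"answer": "Machine learning"},
)
print(hit["label"], miss["label"])  # correct incorrect
```

Unit-testing evaluators this way catches scoring bugs cheaply, since experiment runs involve real model calls.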

Client API

Query traces and spans

```python
from phoenix import Client

client = Client(endpoint="http://localhost:6006")

# Get spans as DataFrame
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
    limit=1000
)

# Get specific span
span = client.get_span(span_id="abc123")

# Get trace
trace = client.get_trace(trace_id="xyz789")
```

Log feedback

```python
from phoenix import Client

client = Client()

# Log user feedback
client.log_annotation(
    span_id="abc123",
    name="user_rating",
    annotator_kind="HUMAN",
    score=0.8,
    label="helpful",
    metadata={"comment": "Good response"}
)
```

Export data

```python
# Export to pandas
df = client.get_spans_dataframe(project_name="my-app")

# Export traces
traces = client.list_traces(project_name="my-app")
```

Production deployment

Docker

```bash
docker run -p 6006:6006 arizephoenix/phoenix:latest
```

With PostgreSQL

```bash
# Set database URL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix"

# Start server
phoenix serve --host 0.0.0.0 --port 6006
```

Environment variables

| Variable | Description | Default |
|----------|-------------|---------|
| PHOENIX_PORT | HTTP server port | 6006 |
| PHOENIX_HOST | Server bind address | 127.0.0.1 |
| PHOENIX_GRPC_PORT | gRPC/OTLP port | 4317 |
| PHOENIX_SQL_DATABASE_URL | Database connection | SQLite temp |
| PHOENIX_WORKING_DIR | Data storage directory | OS temp |
| PHOENIX_ENABLE_AUTH | Enable authentication | false |
| PHOENIX_SECRET | JWT signing secret | Required if auth enabled |
| PHOENIX_ADMIN_SECRET | Admin bootstrap token | Required if auth enabled |

With authentication

```bash
export PHOENIX_ENABLE_AUTH=true
export PHOENIX_SECRET="your-secret-key-min-32-chars"
export PHOENIX_ADMIN_SECRET="admin-bootstrap-token"

phoenix serve
```

Best practices

  1. Use projects: Separate traces by environment (dev/staging/prod)
  2. Add metadata: Include user IDs, session IDs for debugging
  3. Evaluate regularly: Run automated evaluations in CI/CD
  4. Version datasets: Track test set changes over time
  5. Monitor costs: Track token usage via Phoenix dashboards
  6. Self-host: Use PostgreSQL for production deployments
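
A minimal sketch of practices 1 and 2: derive the project name from the deployment environment and collect per-request metadata for debugging. The helper names and the `APP_ENV` variable are illustrative, not Phoenix APIs; in real code the metadata dict would be attached via `span.set_attribute` as in the tracing examples:

```python
import os

def project_name(app: str) -> str:
    """Practice 1: one Phoenix project per environment (dev/staging/prod)."""
    env = os.environ.get("APP_ENV", "dev")
    return f"{app}-{env}"

def trace_metadata(user_id: str, session_id: str) -> dict:
    """Practice 2: metadata to attach to spans for later filtering/debugging."""
    return {
        "user.id": user_id,
        "session.id": session_id,
    }

print(project_name("chatbot"))
print(trace_metadata("u-42", "s-9001"))
```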

Common issues

**Traces not appearing:**

```python
from phoenix.otel import register

# Verify endpoint
tracer_provider = register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces"  # Correct endpoint
)

# Force flush
from opentelemetry import trace
trace.get_tracer_provider().force_flush()
```

**High memory in notebook:**

```python
# Close session when done
session = px.launch_app()

# ... do work ...

session.close()
px.close_app()
```

**Database connection issues:**

```bash
# Verify PostgreSQL connection
psql $PHOENIX_SQL_DATABASE_URL -c "SELECT 1"

# Check Phoenix logs
phoenix serve --log-level debug
```

References

  • Advanced Usage - Custom evaluators, experiments, production setup
  • Troubleshooting - Common issues, debugging, performance

Resources
