langsmith-observability

LangSmith - LLM Observability Platform

Development platform for debugging, evaluating, and monitoring language models and AI applications.

When to use LangSmith

Use LangSmith when:
  • Debugging LLM application issues (prompts, chains, agents)
  • Evaluating model outputs systematically against datasets
  • Monitoring production LLM systems
  • Building regression testing for AI features
  • Analyzing latency, token usage, and costs
  • Collaborating on prompt engineering
Key features:
  • Tracing: Capture inputs, outputs, latency for all LLM calls
  • Evaluation: Systematic testing with built-in and custom evaluators
  • Datasets: Create test sets from production traces or manually
  • Monitoring: Track metrics, errors, and costs in production
  • Integrations: Works with OpenAI, Anthropic, LangChain, LlamaIndex
Use alternatives instead:
  • Weights & Biases: Deep learning experiment tracking, model training
  • MLflow: General ML lifecycle, model registry focus
  • Arize/WhyLabs: ML monitoring, data drift detection

Quick start

Installation

```bash
pip install langsmith
```

Set environment variables

```bash
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true
```

Basic tracing with @traceable

```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Automatically traced to LangSmith
result = generate_response("What is machine learning?")
```

OpenAI wrapper (automatic tracing)

```python
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap client for automatic tracing
client = wrap_openai(OpenAI())

# All calls automatically traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

Core concepts

Runs and traces

A run is a single execution unit (LLM call, chain, tool). Runs form hierarchical traces showing the full execution flow.

```python
from langsmith import traceable

@traceable(run_type="chain")
def process_query(query: str) -> str:
    # Parent run
    context = retrieve_context(query)  # Child run
    response = generate_answer(query, context)  # Child run
    return response

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list:
    return vector_store.search(query)  # assumes a configured vector store

@traceable(run_type="llm")
def generate_answer(query: str, context: list) -> str:
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")  # assumes an LLM client
```

Projects

Projects organize related runs. Set via environment or code:

```python
import os
os.environ["LANGSMITH_PROJECT"] = "my-project"

# Or per-function
@traceable(project_name="my-project")
def my_function():
    pass
```

Client API

```python
from langsmith import Client

client = Client()

# List runs
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=100
))

# Get run details
run = client.read_run(run_id="...")

# Create feedback
client.create_feedback(
    run_id="...",
    key="correctness",
    score=0.9,
    comment="Good answer"
)
```

Datasets and evaluation

Create dataset

```python
from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset("qa-test-set", description="QA evaluation")

# Add examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "What is ML?"}
    ],
    outputs=[
        {"answer": "A programming language"},
        {"answer": "Machine learning"}
    ],
    dataset_id=dataset.id
)
```

Run evaluation

```python
from langsmith import evaluate

def my_model(inputs: dict) -> dict:
    # Your model logic
    return {"answer": generate_answer(inputs["question"])}

def correctness_evaluator(run, example):
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[correctness_evaluator],
    experiment_prefix="v1"
)

print(f"Average score: {results.aggregate_metrics['correctness']}")
```

Built-in evaluators

```python
from langsmith import evaluate
from langsmith.evaluation import LangChainStringEvaluator

# Use LangChain evaluators
results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[
        LangChainStringEvaluator("qa"),
        LangChainStringEvaluator("cot_qa")
    ]
)
```

Advanced tracing

Tracing context

```python
from langsmith import tracing_context

with tracing_context(
    project_name="experiment-1",
    tags=["production", "v2"],
    metadata={"version": "2.0"}
):
    # All traceable calls inherit context
    result = my_function()
```

Manual runs

```python
from langsmith import trace

with trace(
    name="custom_operation",
    run_type="tool",
    inputs={"query": "test"}
) as run:
    result = do_something()
    run.end(outputs={"result": result})
```

Process inputs/outputs

```python
from langsmith import traceable

def sanitize_inputs(inputs: dict) -> dict:
    if "password" in inputs:
        inputs["password"] = "***"
    return inputs

@traceable(process_inputs=sanitize_inputs)
def login(username: str, password: str):
    return authenticate(username, password)
```
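
The decorator also accepts a symmetric process_outputs hook for redacting return values (verify it exists in your installed langsmith version). A sketch continuing the example above; session_token is an illustrative field name:

```python
def redact_outputs(outputs: dict) -> dict:
    # Strip the raw token before it is written to the trace
    if "session_token" in outputs:
        outputs["session_token"] = "***"
    return outputs

@traceable(process_outputs=redact_outputs)
def login_session(username: str, password: str) -> dict:
    return {"user": username, "session_token": authenticate(username, password)}
```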

Sampling

```python
import os
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling
```

LangChain integration

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Tracing enabled automatically with LANGSMITH_TRACING=true
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])
chain = prompt | llm

# All chain runs traced automatically
response = chain.invoke({"input": "Hello!"})
```
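
Per-call tags and metadata can ride along on the standard RunnableConfig and show up on the resulting trace. Reusing the chain above; the tag and metadata values are illustrative:

```python
response = chain.invoke(
    {"input": "Hello!"},
    config={"tags": ["demo"], "metadata": {"user_id": "user-123"}},
)
```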

Production monitoring

Hub prompts

```python
from langsmith import Client

client = Client()

# Pull prompt from hub
prompt = client.pull_prompt("my-org/qa-prompt")

# Use in application
result = prompt.invoke({"question": "What is AI?"})
```
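
Prompts can be versioned in the other direction too. Newer langsmith releases expose push_prompt; treat this as a sketch and verify the method against your installed version:

```python
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise QA assistant."),
    ("user", "{question}"),
])

# Push (or update) the prompt under your workspace handle
url = client.push_prompt("my-org/qa-prompt", object=qa_prompt)
print(url)  # link to the new prompt commit
```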

Async client

```python
import asyncio

from langsmith import AsyncClient

async def main():
    client = AsyncClient()

    runs = []
    async for run in client.list_runs(project_name="my-project"):
        runs.append(run)

    return runs

runs = asyncio.run(main())
```

Feedback collection

```python
from langsmith import Client

client = Client()

# Collect user feedback
def record_feedback(run_id: str, user_rating: int, comment: str = None):
    client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=user_rating / 5.0,  # Normalize to 0-1
        comment=comment
    )

# In your application
record_feedback(run_id="...", user_rating=4, comment="Helpful response")
```

Testing integration

Pytest integration

```python
from langsmith import test

@test
def test_qa_accuracy():
    result = my_qa_function("What is Python?")
    assert "programming" in result.lower()
```

Evaluation in CI/CD

```python
from langsmith import evaluate

def run_evaluation():
    results = evaluate(
        my_model,
        data="regression-test-set",
        evaluators=[accuracy_evaluator]
    )

    # Fail CI if accuracy drops
    assert results.aggregate_metrics["accuracy"] >= 0.9, \
        f"Accuracy {results.aggregate_metrics['accuracy']} below threshold"
```

Best practices

  1. Structured naming - Use consistent project/run naming conventions
  2. Add metadata - Include version, environment, and user info
  3. Sample in production - Use sampling rate to control volume
  4. Create datasets - Build test sets from interesting production cases
  5. Automate evaluation - Run evaluations in CI/CD pipelines
  6. Monitor costs - Track token usage and latency trends (naming, metadata, and cost tracking are sketched below)
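
A sketch tying items 1, 2, and 6 together: one consistently named project with tags and metadata on every trace, then a rough daily token/latency pull. It assumes @traceable accepts tags and metadata, that list_runs accepts run_type and start_time filters, and that Run objects expose total_tokens and start/end times, as recent SDK schemas do; names and values are illustrative:

```python
from datetime import datetime, timedelta, timezone

from langsmith import Client, traceable

# Items 1-2: consistent project name, tags and metadata on every trace
@traceable(
    project_name="qa-bot-prod",  # illustrative naming convention
    tags=["v2", "production"],
    metadata={"git_sha": "abc123", "env": "prod"},
)
def answer(question: str) -> str:
    ...  # your model logic

# Item 6: rough daily token usage and average latency
client = Client()
since = datetime.now(timezone.utc) - timedelta(days=1)
runs = list(client.list_runs(
    project_name="qa-bot-prod",
    run_type="llm",
    start_time=since,
))
total_tokens = sum(r.total_tokens or 0 for r in runs)
latencies = [(r.end_time - r.start_time).total_seconds() for r in runs if r.end_time]
avg_latency = sum(latencies) / max(len(latencies), 1)
print(f"Tokens: {total_tokens}, avg latency: {avg_latency:.2f}s")
```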

Common issues

**Traces not appearing:**
```python
import os

# Ensure tracing is enabled
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-key"

# Verify connection
from langsmith import Client
client = Client()
print(list(client.list_projects()))  # Should print your projects
```

**High latency from tracing:**
```python
import os

# Enable background batching (default)
from langsmith import Client
client = Client(auto_batch_tracing=True)

# Or use sampling
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"
```

**Large payloads:**
```python
from langsmith import traceable

# Hide sensitive/large fields
@traceable(
    process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
)
def my_function(data):
    pass
```

References

  • Advanced Usage - Custom evaluators, distributed tracing, hub prompts
  • Troubleshooting - Common issues, debugging, performance

Resources
