langsmith-evaluator

<oneliner> Three core components: **(1) Creating Evaluators** - LLM-as-Judge, custom code; **(2) Defining Run Functions** - capture agent outputs/trajectories for evaluation; **(3) Running Evaluations** - locally with `evaluate()` or auto-run via uploaded evaluators. Python and TypeScript examples included. </oneliner>

<setup>

Environment Variables

```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here          # Required
LANGSMITH_PROJECT=your-project-name                  # Check this to know which project has traces
LANGSMITH_WORKSPACE_ID=your-workspace-id             # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                       # For LLM-as-Judge
```

IMPORTANT: Always check the environment variables or `.env` file for `LANGSMITH_PROJECT` before querying or interacting with LangSmith. This tells you which project contains the relevant traces and data. If the LangSmith project is not available, use your best judgement to identify the right one.
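As a small sketch of that pre-flight check (the `"default"` fallback here is illustrative only, not a LangSmith default):

```python
import os

def resolve_project(env) -> str:
    # Prefer the explicit LANGSMITH_PROJECT; when nothing is set,
    # fall back and use judgement to pick the right project.
    return env.get("LANGSMITH_PROJECT") or "default"

project = resolve_project(os.environ)
```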
Python Dependencies

```bash
pip install langsmith langchain-openai python-dotenv
```

CLI Tool (for uploading evaluators)

```bash
curl -sSL https://raw.githubusercontent.com/langchain-ai/langsmith-cli/main/scripts/install.sh | sh
```

JavaScript Dependencies

```bash
npm install langsmith openai
```
</setup>
<crucial_requirement>

Golden Rule: Inspect Before You Implement


CRITICAL: Before writing ANY evaluator or extraction logic, you MUST:
  1. Run your agent on sample inputs and capture the actual output
  2. Inspect the output - print it, query LangSmith traces, understand the exact structure
  3. Only then write code that processes that output
Output structures vary significantly by framework, agent type, and configuration. Never assume the shape - always verify first. When outputs don't contain the needed data, query LangSmith traces to understand how to extract it from the execution. </crucial_requirement>
<evaluator_format>

Offline vs Online Evaluators


Offline Evaluators (attached to datasets):
  • Function signature: `(run, example)` - receives both run outputs and dataset example
  • Use case: Comparing agent outputs to expected values in a dataset
  • Upload with: `--dataset "Dataset Name"`
Online Evaluators (attached to projects):
  • Function signature: `(run)` - receives only run outputs, NO example parameter
  • Use case: Real-time quality checks on production runs (no reference data)
  • Upload with: `--project "Project Name"`
CRITICAL - Return Format:
  • Each evaluator returns ONE metric only. For multiple metrics, create multiple evaluator functions.
  • Do NOT return `{"metric_name": value}` or lists of metrics - this will error.
CRITICAL - Local vs Uploaded Differences:

| | Local `evaluate()` | Uploaded to LangSmith |
|---|---|---|
| Column name | Python: auto-derived from function name. TypeScript: must include `key` field or column is untitled | Comes from evaluator name set at upload time. Do NOT include `key` — it creates a duplicate column |
| Python `run` type | `RunTree` object → `run.outputs` (attribute) | `dict` → `run["outputs"]` (subscript). Handle both: `run.outputs if hasattr(run, "outputs") else run.get("outputs", {})` |
| TypeScript `run` type | Always attribute access: `run.outputs?.field` | Always attribute access: `run.outputs?.field` |
| Python return | `{"score": value, "comment": "..."}` | `{"score": value, "comment": "..."}` |
| TypeScript return | `{ key: "name", score: value, comment: "..." }` | `{ score: value, comment: "..." }` |
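Since the examples later in this document use the offline `(run, example)` signature, here is a minimal sketch of the online `(run)` shape — the `answer` field name is a hypothetical; verify it against your own run outputs:

```python
def nonempty_answer(run):
    # Online evaluators receive only the run - there is no example parameter.
    # Handle both RunTree (attribute) and dict (subscript) shapes:
    outputs = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {}
    answer = outputs.get("answer", "")  # "answer" is a hypothetical field name
    ok = isinstance(answer, str) and bool(answer.strip())
    return {"score": 1 if ok else 0, "comment": f"answer length: {len(str(answer))}"}
```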
</evaluator_format>
<evaluator_types>
  • LLM as Judge - Uses an LLM to grade outputs. Best for subjective quality (accuracy, helpfulness, relevance).
  • Custom Code - Deterministic logic. Best for objective checks (exact match, trajectory validation, format compliance). </evaluator_types>
<llm_judge>

LLM as Judge Evaluators


NOTE: LLM-as-Judge upload is currently not supported by the CLI — only code evaluators are supported. For evaluations against a dataset, STRONGLY PREFER defining local evaluators to use with `evaluate(evaluators=[...])`.
<python>
```python
from typing import TypedDict, Annotated
from langchain_openai import ChatOpenAI

class Grade(TypedDict):
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    Grade, method="json_schema", strict=True
)

async def accuracy_evaluator(run, example):
    run_outputs = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {}
    example_outputs = example.outputs if hasattr(example, "outputs") else example.get("outputs", {}) or {}
    grade = await judge.ainvoke([{
        "role": "user",
        "content": f"Expected: {example_outputs}\nActual: {run_outputs}\nIs this accurate?"
    }])
    return {"score": 1 if grade["is_accurate"] else 0, "comment": grade["reasoning"]}
```
</python>

<typescript>
```javascript
import OpenAI from "openai";

const openai = new OpenAI();

async function accuracyEvaluator(run, example) {
    const runOutputs = run.outputs ?? {};
    const exampleOutputs = example.outputs ?? {};

    const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [
        { role: "system", content: 'Respond with JSON: {"is_accurate": boolean, "reasoning": string}' },
        { role: "user", content: `Expected: ${JSON.stringify(exampleOutputs)}\nActual: ${JSON.stringify(runOutputs)}\nIs this accurate?` }
    ]
    });

    const grade = JSON.parse(response.choices[0].message.content);
    return { score: grade.is_accurate ? 1 : 0, comment: grade.reasoning };
}
```
</typescript> </llm_judge>
<code_evaluators>

Custom Code Evaluators


Before writing an evaluator:
  1. Inspect your dataset to understand expected field names (see Golden Rule above)
  2. Test your run function and verify its output structure matches the dataset schema
  3. Query LangSmith traces to debug any mismatches
<python>
```python
def trajectory_evaluator(run, example):
    run_outputs = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {}
    example_outputs = example.outputs if hasattr(example, "outputs") else example.get("outputs", {}) or {}
    # IMPORTANT: Replace these placeholders with your actual field names
    # 1. Query your LangSmith trace to see what fields exist in run outputs
    # 2. Check your dataset schema for expected field names
    # Note: Trajectory data may not appear in default output - verify against trace!
    actual = run_outputs.get("YOUR_TRAJECTORY_FIELD", [])
    expected = example_outputs.get("YOUR_EXPECTED_FIELD", [])
    return {"score": 1 if actual == expected else 0, "comment": f"Expected {expected}, got {actual}"}
```
</python>
<typescript>
```javascript
function trajectoryEvaluator(run, example) {
    const runOutputs = run.outputs ?? {};
    const exampleOutputs = example.outputs ?? {};
    // IMPORTANT: Replace these placeholders with your actual field names
    // 1. Query your LangSmith trace to see what fields exist in run outputs
    // 2. Check your dataset schema for expected field names
    const actual = runOutputs.YOUR_TRAJECTORY_FIELD ?? [];
    const expected = exampleOutputs.YOUR_EXPECTED_FIELD ?? [];
    const match = JSON.stringify(actual) === JSON.stringify(expected);
    return { score: match ? 1 : 0, comment: `Expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}` };
}
```
</typescript> </code_evaluators>
<run_functions>

Defining Run Functions


Run functions execute your agent and return outputs for evaluation.
CRITICAL - Test Your Run Function First: Before writing evaluators, you MUST test your run function and inspect the actual output structure. Output shapes vary by framework, agent type, and configuration.
Debugging workflow:
  1. Run your agent once on sample input
  2. Query the trace to see the execution structure
  3. Print the raw output and verify against the trace that the output contains the right data
  4. Adjust the run function as needed
  5. Verify your output matches your dataset schema
Try your hardest to match your run function output to your dataset schema. This makes evaluators simple and reusable. If matching isn't possible, your evaluator must know how to extract and compare the right fields from each side.
<python>
```python
def run_agent(inputs: dict) -> dict:
    result = your_agent.run(inputs)
    # ALWAYS inspect output shape first - run this, check the print, query traces
    print(f"DEBUG - type: {type(result)}, keys: {result.keys() if hasattr(result, 'keys') else 'N/A'}")
    print(f"DEBUG - value: {result}")
    return {"output": result}  # Adjust to match your dataset schema
```
</python>
<typescript>
```javascript
async function runAgent(inputs) {
    const result = await yourAgent.invoke(inputs);
    // ALWAYS inspect output shape first
    console.log("DEBUG - type:", typeof result, "keys:", Object.keys(result));
    console.log("DEBUG - value:", result);
    return { output: result };  // Adjust to match your dataset schema
}
```
</typescript>

Capturing Trajectories


For trajectory evaluation, your run function must capture tool calls during execution.
CRITICAL: Run output formats vary significantly by framework and agent type. You MUST inspect before implementing:
LangGraph agents (LangChain OSS): Use `stream_mode="debug"` with `subgraphs=True` to capture nested subagent tool calls.

```python
import uuid

def run_agent_with_trajectory(agent, inputs: dict) -> dict:
    config = {"configurable": {"thread_id": f"eval-{uuid.uuid4()}"}}
    trajectory = []
    final_result = None

    for chunk in agent.stream(inputs, config=config, stream_mode="debug", subgraphs=True):
        # STEP 1: Print chunks to understand the structure
        print(f"DEBUG chunk: {chunk}")

        # STEP 2: Write extraction based on YOUR observed structure
        # ... your extraction logic here ...

    # IMPORTANT: After running, query the LangSmith trace to verify
    # your trajectory data is complete. Default output may be missing
    # tool calls that appear in the trace.
    return {"output": final_result, "trajectory": trajectory}
```

Custom / Non-LangChain Agents:
  1. Inspect output first - Run your agent and inspect the result structure. Trajectory data may already be included in the output (e.g., `result.tool_calls`, `result.steps`, etc.)
  2. Callbacks/Hooks - If your framework supports execution callbacks, register a hook that records tool names on each invocation
  3. Parse execution logs - As a last resort, extract tool names from structured logs or trace data
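Option 2 above (callbacks/hooks) can be sketched with a simple wrapper that records each tool's name when it actually runs — the tool functions here are hypothetical stand-ins:

```python
trajectory = []

def record_tool(fn):
    # Wrap each tool so its name is appended at execution time,
    # not at definition time
    def wrapper(*args, **kwargs):
        trajectory.append(fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

@record_tool
def search(query: str) -> str:
    return f"results for {query}"

@record_tool
def summarize(text: str) -> str:
    return text[:20]

search("langsmith evaluators")
summarize("a long document body")
```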
The key is to capture the tool name at execution time, not at definition time. </run_functions>
<upload>

Uploading Evaluators to LangSmith


IMPORTANT - Auto-Run Behavior: Evaluators uploaded to a dataset automatically run when you run experiments on that dataset. You do NOT need to pass them to `evaluate()` - just run your agent against the dataset and the uploaded evaluators execute automatically.
IMPORTANT - Local vs Uploaded: Uploaded evaluators run in a sandboxed environment with very limited package access. Only use built-in/standard library imports, and place all imports inside the evaluator function body. For dataset (offline) evaluators, prefer running locally with `evaluate(evaluators=[...])` first — this gives you full package access.
IMPORTANT - Code vs Structured Evaluators:
  • Code evaluators (what the CLI uploads): Run in a limited environment without external packages. Use for deterministic logic (exact match, trajectory validation).
  • Structured evaluators (LLM-as-Judge): Configured via the LangSmith UI, using a specific payload format with model/prompt/schema. The CLI does not support this format yet.
IMPORTANT - Choose the right target:
  • `--dataset`: Offline evaluator with `(run, example)` signature - for comparing to expected values
  • `--project`: Online evaluator with `(run)` signature - for real-time quality checks
You must specify one. Global evaluators are not supported.
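The sandbox restriction can be illustrated with an upload-safe code evaluator sketch — only a standard-library import, placed inside the function body (the `output` field name is a hypothetical; check your own schema):

```python
def json_format_check(run, example):
    # Sandbox-safe: standard-library import inside the function body
    import json

    outputs = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {}
    raw = outputs.get("output", "")  # "output" is a hypothetical field name
    try:
        json.loads(raw)
        return {"score": 1, "comment": "output is valid JSON"}
    except (TypeError, ValueError):
        return {"score": 0, "comment": "output is not valid JSON"}
```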

List all evaluators

```bash
langsmith evaluator list
```

Upload offline evaluator (attached to dataset)

```bash
langsmith evaluator upload my_evaluators.py \
  --name "Trajectory Match" --function trajectory_evaluator \
  --dataset "My Dataset" --replace
```

Upload online evaluator (attached to project)

```bash
langsmith evaluator upload my_evaluators.py \
  --name "Quality Check" --function quality_check \
  --project "Production Agent" --replace
```

Delete

```bash
langsmith evaluator delete "Trajectory Match"
```

**IMPORTANT - Safety Prompts:**
- The CLI prompts for confirmation before destructive operations
- **NEVER use `--yes` flag unless the user explicitly requests it**
</upload>

<best_practices>
1. **Use structured output for LLM judges** - More reliable than parsing free-text
2. **Match evaluator to dataset type**
   - Final Response → LLM as Judge for quality
   - Trajectory → Custom Code for sequence
3. **Use async for LLM judges** - Enables parallel evaluation
4. **Test evaluators independently** - Validate on known good/bad examples first
5. **Choose the right language**
   - Python: Use for Python agents, langchain integrations
   - JavaScript: Use for TypeScript/Node.js agents
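To illustrate practice 3, a toy sketch of parallel grading with asyncio — the `grade` coroutine stands in for a real LLM-judge call:

```python
import asyncio

async def grade(item: str) -> dict:
    await asyncio.sleep(0.01)  # stands in for LLM judge latency
    return {"score": 1, "comment": f"graded {item}"}

async def grade_all(items):
    # Grade all examples concurrently rather than one at a time
    return await asyncio.gather(*(grade(i) for i in items))

results = asyncio.run(grade_all(["a", "b", "c"]))
```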
</best_practices>

<running_evaluations>

Running Evaluations

Uploaded evaluators auto-run when you run experiments - no code needed. Local evaluators are passed directly for development/testing.
<python>
```python
from langsmith import evaluate

# Uploaded evaluators run automatically
results = evaluate(run_agent, data="My Dataset", experiment_prefix="eval-v1")

# Or pass local evaluators for testing
results = evaluate(run_agent, data="My Dataset", evaluators=[my_evaluator], experiment_prefix="eval-v1")
```
</python>

<typescript>
```javascript
import { evaluate } from "langsmith/evaluation";

// Uploaded evaluators run automatically
const results = await evaluate(runAgent, {
  data: "My Dataset",
  experimentPrefix: "eval-v1",
});

// Or pass local evaluators for testing
const resultsLocal = await evaluate(runAgent, {
  data: "My Dataset",
  evaluators: [myEvaluator],
  experimentPrefix: "eval-v1",
});
```
</typescript> </running_evaluations> <troubleshooting>

Common Issues


Output doesn't match what you expect: Query the LangSmith trace. It shows exact inputs/outputs at each step - compare what you find to what you're trying to extract.
One metric per evaluator: Return `{"score": value, "comment": "..."}`. For multiple metrics, create separate functions.
Field name mismatch: Your run function output must match the dataset schema exactly. Inspect the dataset first with `client.read_example(example_id)`.
RunTree vs dict (Python only): Local `evaluate()` passes `RunTree`, uploaded evaluators receive `dict`. Handle both:

```python
run_outputs = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {}
```

TypeScript always uses attribute access: `run.outputs?.field`
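The one-metric rule above can be sketched as two separate evaluator functions instead of one multi-metric return — the `output` field name and 500-char budget are hypothetical:

```python
def exact_match(run, example):
    run_out = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {}
    ex_out = example.outputs if hasattr(example, "outputs") else example.get("outputs", {}) or {}
    return {"score": 1 if run_out == ex_out else 0, "comment": "exact comparison"}

def length_ok(run, example):
    run_out = run.outputs if hasattr(run, "outputs") else run.get("outputs", {}) or {}
    # "output" and the 500-char budget are hypothetical - adjust to your schema
    return {"score": 1 if len(str(run_out.get("output", ""))) <= 500 else 0,
            "comment": "length within budget"}
```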
</troubleshooting>
<resources> - [LangSmith Evaluation Concepts](https://docs.langchain.com/langsmith/evaluation-concepts) - [Custom Code Evaluators](https://changelog.langchain.com/announcements/custom-code-evaluators-in-langsmith) - [OpenEvals - Readymade Evaluators](https://github.com/langchain-ai/openevals) </resources>