langsmith-code-eval

LangSmith Code Evaluator Creation

Create code-based evaluators for LangSmith-traced agents through a 9-step collaborative process.

Workflow

Step 1: Locate the Agent

Ask: "Where is your agent file located?"

Step 2: Understand the Agent

Read the agent file. Identify:
  • Main entry point function
  • Tools/functions it calls
  • Return format (string? dict with messages?)
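As a reference point for this step, here is the kind of shape to look for — a hypothetical agent whose async entry point returns a dict with a `messages` list (a stub, not a real implementation; real agents call an LLM and tools here):

```python
import asyncio


async def run_agent(question: str) -> dict:
    """Hypothetical entry point: takes a question, returns a dict with 'messages'."""
    # A real agent would call an LLM and tools here; this stub only
    # mirrors the return format that evaluators will need to parse.
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": "stub answer", "tool_calls": []},
        ]
    }


result = asyncio.run(run_agent("What is the stock level?"))
```

Once you know which of these pieces the real agent has, the rest of the workflow follows from its return format.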

Step 3: Check for Traces

Ask: "Do you have recent traces in LangSmith?"
  • If yes: Get project name
  • If no: Ask to run agent once to generate a trace

Step 4: Inspect Trace Structure

Run:
python scripts/inspect_trace.py PROJECT_NAME
This shows where the data lives:
  • Are tool calls in run.outputs["messages"]?
  • Are tool calls in run.child_runs?
  • What's in inputs/outputs?
Use the returned structure dict programmatically:
python
from inspect_trace import inspect_trace_structure

structure = inspect_trace_structure("project-name")
if "extract_from_messages" in structure["recommendations"]:
    # Tool calls are in run.outputs["messages"]
    ...

Step 5: Clarify Evaluation Goals

Ask: "What behavior do you want to test for?"
  • If stated: Confirm understanding
  • If unclear: Ask clarifying questions
  • Understand: Pass vs fail criteria? Different categories? Metadata?
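One lightweight way to pin down the answers before writing any evaluator is to capture them as data. A sketch, using the made-up stock/weather categories from the example below — the names and expectations are illustrative, not from a real dataset:

```python
# Hypothetical evaluation spec, captured as data before coding the evaluator.
CRITERIA = {
    "stock":   {"expects_db_call": True,  "description": "must query the database"},
    "weather": {"expects_db_call": False, "description": "must NOT query the database"},
}


def expected_db_call(category: str) -> bool:
    """Pass/fail rule for a category; unknown categories default to False."""
    return CRITERIA.get(category, {}).get("expects_db_call", False)


print(expected_db_call("stock"))    # True
print(expected_db_call("weather"))  # False
```

Writing the criteria down this way makes the clarifying questions concrete: every row needs an agreed expectation before Step 6.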

Step 6: Create the Evaluator

Write eval_[name].py using this signature:
python
from langsmith.schemas import Run, Example

def evaluate_[name](run: Run, example: Example) -> dict:
    """Evaluate [specific behavior]."""

    # Extract data (based on Step 4)
    messages = run.outputs.get("messages", [])
    category = example.metadata.get("category") if example.metadata else None

    # Evaluation logic (based on Step 5) sets score to 1 or 0
    score = ...

    return {
        "key": "evaluator_name",
        "score": score,  # 1 = pass, 0 = fail
        "comment": "Specific feedback explaining the score"
    }
Extract tool calls from messages (note the json import):
python
import json

for msg in messages:
    if msg.get("role") == "assistant" and msg.get("tool_calls"):
        for tc in msg["tool_calls"]:
            tool_name = tc["function"]["name"]
            args = json.loads(tc["function"]["arguments"])
Category-based evaluation:
python
category = example.metadata.get("category", "unknown")
if category == "stock":
    score = 1 if made_db_call else 0
elif category == "weather":
    score = 1 if not made_db_call else 0
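Putting these pieces together, a complete evaluator can be exercised locally before wiring it into LangSmith. A sketch: `SimpleNamespace` stands in for the real Run/Example objects, and the `query_db` tool name and `db_routing` key are hypothetical:

```python
import json
from types import SimpleNamespace


def evaluate_db_routing(run, example) -> dict:
    """Pass if and only if 'stock' questions trigger a query_db tool call."""
    messages = run.outputs.get("messages", [])
    category = example.metadata.get("category") if example.metadata else None

    made_db_call = False
    for msg in messages:
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                if tc["function"]["name"] == "query_db":
                    made_db_call = True

    # Score 1 when the observed behavior matches the category's expectation.
    score = 1 if (category == "stock") == made_db_call else 0
    return {
        "key": "db_routing",
        "score": score,
        "comment": f"category={category}, made_db_call={made_db_call}",
    }


# Exercise it with fake data before running a real experiment.
fake_run = SimpleNamespace(outputs={"messages": [
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "query_db", "arguments": json.dumps({"sku": "A1"})}}
    ]},
]})
fake_example = SimpleNamespace(metadata={"category": "stock"})
print(evaluate_db_routing(fake_run, fake_example)["score"])  # 1
```

Testing the evaluator on hand-built fakes like this catches extraction bugs (Step 4 mistakes) before they show up as uniformly wrong experiment scores.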

Step 7: Create/Update Experiment Runner

Check if run_experiment_with_eval.py exists. If not, create:
python
import asyncio
from langsmith import aevaluate
from [agent_module] import [agent_function]
from eval_[name] import evaluate_[name]
from dotenv import load_dotenv

load_dotenv()

async def agent_wrapper(inputs: dict) -> dict:
    result = await [agent_function](inputs["question"])
    return result

async def main():
    results = await aevaluate(
        agent_wrapper,
        data="DATASET_NAME",
        evaluators=[evaluate_[name]],
        experiment_prefix="eval-test",
        max_concurrency=5,
    )
    print(f"Results: {results}")

if __name__ == "__main__":
    asyncio.run(main())

Step 8: Configure Dataset

Ask: "What's your dataset name?" Then ask the user to update the dataset name in the experiment runner, and wait for confirmation before running.

Step 9: Run the Evaluation

Execute:
uv run python run_experiment_with_eval.py
Share the LangSmith experiment URL once it is printed.

Key Patterns

Extracting from messages (most reliable):
python
messages = run.outputs.get("messages", [])
for msg in messages:
    if msg.get("role") == "assistant" and msg.get("tool_calls"):
        ...  # Tool calls are here
Extracting from child_runs (if messages not available):
python
def traverse_runs(run):
    if run.name == "tool_name":
        ...  # Found it
    if hasattr(run, 'child_runs') and run.child_runs:
        for child in run.child_runs:
            traverse_runs(child)
Using metadata:
python
category = example.metadata.get("category") if example.metadata else None
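The message-based pattern above can be packaged as a small reusable helper. A sketch, assuming the OpenAI-style message format shown in Step 6 (the sample tool name is made up):

```python
import json


def extract_tool_calls(messages: list) -> list:
    """Return (tool_name, parsed_args) pairs from assistant messages."""
    calls = []
    for msg in messages:
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                name = tc["function"]["name"]
                # Arguments arrive as a JSON string; tolerate empty strings.
                args = json.loads(tc["function"]["arguments"] or "{}")
                calls.append((name, args))
    return calls


msgs = [
    {"role": "user", "content": "check stock"},
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "query_db", "arguments": '{"sku": "A1"}'}}
    ]},
]
print(extract_tool_calls(msgs))  # [('query_db', {'sku': 'A1'})]
```

Centralizing extraction in one helper means a trace-structure change (Step 4) only has to be fixed in one place across all evaluators.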

Troubleshooting

  • Can't find tool calls: Re-run inspect_trace.py to see the actual structure
  • child_runs empty: Agent should return messages in outputs
  • Same score always: Debug evaluation logic with print statements
  • Dataset not found: Verify the name in the LangSmith UI
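For the "same score always" case, a one-line summary of what the evaluator actually sees is often enough. A sketch (`debug_summary` is a hypothetical helper; print its result inside the evaluator while debugging, then remove it):

```python
from types import SimpleNamespace


def debug_summary(run) -> str:
    """One-line summary of the run data an evaluator would extract."""
    messages = run.outputs.get("messages", [])
    roles = [m.get("role") for m in messages]
    n_tool_calls = sum(len(m.get("tool_calls") or []) for m in messages)
    return f"{len(messages)} messages, roles={roles}, tool_calls={n_tool_calls}"


sample_run = SimpleNamespace(outputs={"messages": [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "t", "arguments": "{}"}}
    ]},
]})
print(debug_summary(sample_run))
```

If the summary shows zero messages or zero tool calls, the problem is extraction (Step 4), not the scoring logic.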

Reference

Documentation:
Important: Extensive LangSmith documentation exists. If unsure about trace structure, SDK usage, or evaluation patterns, search documentation rather than assuming.
See parent project for complete example:
  • agent_v4.py - Returns messages in outputs
  • eval_tool_call_schema.py - Tool call + schema discovery evaluator
  • run_experiment_with_code_eval.py - Experiment runner