langsmith-code-eval

LangSmith Code Evaluator Creation

Create code-based evaluators for LangSmith-traced agents through a 9-step collaborative process.

Workflow

Step 1: Locate the Agent

Ask: "Where is your agent file located?"

Step 2: Understand the Agent

Read the agent file. Identify:
  • Main entry point function
  • Tools/functions it calls
  • Return format (string? dict with messages?)
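As a reference point for this step, here is the kind of shape to look for — a hypothetical agent whose async entry point returns a dict with a `messages` list (a stub, not a real implementation; real agents call an LLM and tools here):

```python
import asyncio


async def run_agent(question: str) -> dict:
    """Hypothetical entry point: takes a question, returns a dict with 'messages'."""
    # A real agent would call an LLM and tools here; this stub only
    # mirrors the return format that evaluators will need to parse.
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": "stub answer", "tool_calls": []},
        ]
    }


result = asyncio.run(run_agent("What is the stock level?"))
```

Once you know which of these pieces the real agent has, the rest of the workflow follows from its return format.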

Step 3: Check for Traces

Ask: "Do you have recent traces in LangSmith?"
  • If yes: Get project name
  • If no: Ask to run agent once to generate a trace

Step 4: Inspect Trace Structure

Run:
python scripts/inspect_trace.py PROJECT_NAME
This shows where the data lives:
  • Are tool calls in run.outputs["messages"]?
  • Are tool calls in run.child_runs?
  • What's in inputs/outputs?
Use the returned structure dict programmatically:
python
from inspect_trace import inspect_trace_structure

structure = inspect_trace_structure("project-name")
if "extract_from_messages" in structure["recommendations"]:
    # Tool calls are in run.outputs["messages"]
    ...

Step 5: Clarify Evaluation Goals

Ask: "What behavior do you want to test for?"
  • If stated: Confirm understanding
  • If unclear: Ask clarifying questions
  • Understand: Pass vs fail criteria? Different categories? Metadata?
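One lightweight way to pin down the answers before writing any evaluator is to capture them as data. A sketch, using the made-up stock/weather categories from the example below — the names and expectations are illustrative, not from a real dataset:

```python
# Hypothetical evaluation spec, captured as data before coding the evaluator.
CRITERIA = {
    "stock":   {"expects_db_call": True,  "description": "must query the database"},
    "weather": {"expects_db_call": False, "description": "must NOT query the database"},
}


def expected_db_call(category: str) -> bool:
    """Pass/fail rule for a category; unknown categories default to False."""
    return CRITERIA.get(category, {}).get("expects_db_call", False)


print(expected_db_call("stock"))    # True
print(expected_db_call("weather"))  # False
```

Writing the criteria down this way makes the clarifying questions concrete: every row needs an agreed expectation before Step 6.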

Step 6: Create the Evaluator

Write eval_[name].py using this signature:
python
from langsmith.schemas import Run, Example

def evaluate_[name](run: Run, example: Example) -> dict:
    """Evaluate [specific behavior]."""

    # Extract data (based on Step 4)
    messages = run.outputs.get("messages", [])
    category = example.metadata.get("category") if example.metadata else None

    # Evaluation logic (based on Step 5) sets score to 1 or 0
    score = ...

    return {
        "key": "evaluator_name",
        "score": score,  # 1 = pass, 0 = fail
        "comment": "Specific feedback explaining the score"
    }
Extract tool calls from messages (note the json import):
python
import json

for msg in messages:
    if msg.get("role") == "assistant" and msg.get("tool_calls"):
        for tc in msg["tool_calls"]:
            tool_name = tc["function"]["name"]
            args = json.loads(tc["function"]["arguments"])
Category-based evaluation:
python
category = example.metadata.get("category", "unknown")
if category == "stock":
    score = 1 if made_db_call else 0
elif category == "weather":
    score = 1 if not made_db_call else 0
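Putting these pieces together, a complete evaluator can be exercised locally before wiring it into LangSmith. A sketch: `SimpleNamespace` stands in for the real Run/Example objects, and the `query_db` tool name and `db_routing` key are hypothetical:

```python
import json
from types import SimpleNamespace


def evaluate_db_routing(run, example) -> dict:
    """Pass if and only if 'stock' questions trigger a query_db tool call."""
    messages = run.outputs.get("messages", [])
    category = example.metadata.get("category") if example.metadata else None

    made_db_call = False
    for msg in messages:
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                if tc["function"]["name"] == "query_db":
                    made_db_call = True

    # Score 1 when the observed behavior matches the category's expectation.
    score = 1 if (category == "stock") == made_db_call else 0
    return {
        "key": "db_routing",
        "score": score,
        "comment": f"category={category}, made_db_call={made_db_call}",
    }


# Exercise it with fake data before running a real experiment.
fake_run = SimpleNamespace(outputs={"messages": [
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "query_db", "arguments": json.dumps({"sku": "A1"})}}
    ]},
]})
fake_example = SimpleNamespace(metadata={"category": "stock"})
print(evaluate_db_routing(fake_run, fake_example)["score"])  # 1
```

Testing the evaluator on hand-built fakes like this catches extraction bugs (Step 4 mistakes) before they show up as uniformly wrong experiment scores.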

Step 7: Create/Update Experiment Runner

Check if run_experiment_with_eval.py exists. If not, create:
python
import asyncio
from langsmith import aevaluate
from [agent_module] import [agent_function]
from eval_[name] import evaluate_[name]
from dotenv import load_dotenv

load_dotenv()

async def agent_wrapper(inputs: dict) -> dict:
    result = await [agent_function](inputs["question"])
    return result

async def main():
    results = await aevaluate(
        agent_wrapper,
        data="DATASET_NAME",
        evaluators=[evaluate_[name]],
        experiment_prefix="eval-test",
        max_concurrency=5,
    )
    print(f"Results: {results}")

if __name__ == "__main__":
    asyncio.run(main())

Step 8: Configure Dataset

Ask: "What's your dataset name?" Then ask the user to update the dataset name in the experiment runner, and wait for confirmation before running.

Step 9: Run the Evaluation

Execute:
uv run python run_experiment_with_eval.py
Share the LangSmith experiment URL once it is printed.

Key Patterns

Extracting from messages (most reliable):
python
messages = run.outputs.get("messages", [])
for msg in messages:
    if msg.get("role") == "assistant" and msg.get("tool_calls"):
        ...  # Tool calls are here
Extracting from child_runs (if messages not available):
python
def traverse_runs(run):
    if run.name == "tool_name":
        ...  # Found it
    if hasattr(run, 'child_runs') and run.child_runs:
        for child in run.child_runs:
            traverse_runs(child)
Using metadata:
python
category = example.metadata.get("category") if example.metadata else None
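The message-based pattern above can be packaged as a small reusable helper. A sketch, assuming the OpenAI-style message format shown in Step 6 (the sample tool name is made up):

```python
import json


def extract_tool_calls(messages: list) -> list:
    """Return (tool_name, parsed_args) pairs from assistant messages."""
    calls = []
    for msg in messages:
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                name = tc["function"]["name"]
                # Arguments arrive as a JSON string; tolerate empty strings.
                args = json.loads(tc["function"]["arguments"] or "{}")
                calls.append((name, args))
    return calls


msgs = [
    {"role": "user", "content": "check stock"},
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "query_db", "arguments": '{"sku": "A1"}'}}
    ]},
]
print(extract_tool_calls(msgs))  # [('query_db', {'sku': 'A1'})]
```

Centralizing extraction in one helper means a trace-structure change (Step 4) only has to be fixed in one place across all evaluators.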

Troubleshooting

  • Can't find tool calls: Re-run inspect_trace.py to see the actual structure
  • child_runs empty: Agent should return messages in outputs
  • Same score always: Debug evaluation logic with print statements
  • Dataset not found: Verify the name in the LangSmith UI
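For the "same score always" case, a one-line summary of what the evaluator actually sees is often enough. A sketch (`debug_summary` is a hypothetical helper; print its result inside the evaluator while debugging, then remove it):

```python
from types import SimpleNamespace


def debug_summary(run) -> str:
    """One-line summary of the run data an evaluator would extract."""
    messages = run.outputs.get("messages", [])
    roles = [m.get("role") for m in messages]
    n_tool_calls = sum(len(m.get("tool_calls") or []) for m in messages)
    return f"{len(messages)} messages, roles={roles}, tool_calls={n_tool_calls}"


sample_run = SimpleNamespace(outputs={"messages": [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "t", "arguments": "{}"}}
    ]},
]})
print(debug_summary(sample_run))
```

If the summary shows zero messages or zero tool calls, the problem is extraction (Step 4), not the scoring logic.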

Reference

Documentation:
Important: Extensive LangSmith documentation exists. If unsure about trace structure, SDK usage, or evaluation patterns, search documentation rather than assuming.
See parent project for complete example:
  • agent_v4.py - Returns messages in outputs
  • eval_tool_call_schema.py - Tool call + schema discovery evaluator
  • run_experiment_with_code_eval.py - Experiment runner