langsmith-code-eval
LangSmith Code Evaluator Creation
Create code-based evaluators for LangSmith-traced agents through a 9-step collaborative process.
Workflow
Step 1: Locate the Agent
Ask: "Where is your agent file located?"
Step 2: Understand the Agent
Read the agent file. Identify:
- Main entry point function
- Tools/functions it calls
- Return format (string? dict with messages?)
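As a point of reference for the "return format" question, here is a stubbed entry point that returns the "dict with messages" shape the later steps assume. All names here (`run_agent`, the tool name, the content) are hypothetical illustrations, not from any particular codebase:

```python
# Hypothetical agent entry point; model and tool calls are stubbed out.
async def run_agent(question: str) -> dict:
    """Main entry point: returns a dict with a 'messages' list."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {"function": {"name": "query_database",
                                  "arguments": '{"item": "ACME"}'}}
                ],
            },
            {"role": "tool", "content": "42 units in stock"},
            {"role": "assistant", "content": "ACME has 42 units in stock."},
        ]
    }
```

An agent that returns a bare string instead would need a wrapper (see Step 7) before its output can be inspected this way.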
Step 3: Check for Traces
Ask: "Do you have recent traces in LangSmith?"
- If yes: Get project name
- If no: Ask to run agent once to generate a trace
Step 4: Inspect Trace Structure
Run:

```bash
python scripts/inspect_trace.py PROJECT_NAME
```

This shows where data lives:
- Tool calls in `run.outputs["messages"]`?
- Tool calls in `run.child_runs`?
- What's in inputs/outputs?

Use the returned structure dict programmatically:

```python
from inspect_trace import inspect_trace_structure

structure = inspect_trace_structure("project-name")
if "extract_from_messages" in structure["recommendations"]:
    # Tool calls are in run.outputs["messages"]
    ...
```
Step 5: Clarify Evaluation Goals
Ask: "What behavior do you want to test for?"
- If stated: Confirm understanding
- If unclear: Ask clarifying questions
- Understand: Pass vs fail criteria? Different categories? Metadata?
Step 6: Create the Evaluator
Write `eval_[name].py` using this signature:

```python
from langsmith.schemas import Run, Example

def evaluate_[name](run: Run, example: Example) -> dict:
    """Evaluate [specific behavior]."""
    # Extract data (based on Step 4)
    messages = run.outputs.get("messages", [])
    category = example.metadata.get("category") if example.metadata else None

    # Evaluation logic (based on Step 5)
    # ...

    return {
        "key": "evaluator_name",
        "score": 1 or 0,  # 1 = pass, 0 = fail
        "comment": "Specific feedback explaining the score"
    }
```

Extract tool calls from messages:

```python
import json

for msg in messages:
    if msg.get("role") == "assistant" and msg.get("tool_calls"):
        for tc in msg["tool_calls"]:
            tool_name = tc["function"]["name"]
            args = json.loads(tc["function"]["arguments"])
```

Category-based evaluation:

```python
category = example.metadata.get("category", "unknown")
if category == "stock":
    score = 1 if made_db_call else 0
elif category == "weather":
    score = 1 if not made_db_call else 0
```
Step 7: Create/Update Experiment Runner
Check if `run_experiment_with_eval.py` exists. If not, create:

```python
import asyncio
from langsmith import aevaluate
from [agent_module] import [agent_function]
from eval_[name] import evaluate_[name]
from dotenv import load_dotenv

load_dotenv()

async def agent_wrapper(inputs: dict) -> dict:
    result = await [agent_function](inputs["question"])
    return result

async def main():
    results = await aevaluate(
        agent_wrapper,
        data="DATASET_NAME",
        evaluators=[evaluate_[name]],
        experiment_prefix="eval-test",
        max_concurrency=5,
    )
    print(f"Results: {results}")

if __name__ == "__main__":
    asyncio.run(main())
```
Step 8: Configure Dataset
Ask: "What's your dataset name?"
Ask: "Please update the dataset name in the experiment runner"
Wait for confirmation.
Step 9: Run the Evaluation
Execute:

```bash
uv run python run_experiment_with_eval.py
```

Show the LangSmith URL when printed.
Key Patterns
Extracting from messages (most reliable):

```python
messages = run.outputs.get("messages", [])
for msg in messages:
    if msg.get("role") == "assistant" and msg.get("tool_calls"):
        # Tool calls are here
        ...
```

Extracting from child_runs (if messages not available):

```python
def traverse_runs(run):
    if run.name == "tool_name":
        # Found it
        ...
    if hasattr(run, 'child_runs') and run.child_runs:
        for child in run.child_runs:
            traverse_runs(child)
```

Using metadata:

```python
category = example.metadata.get("category") if example.metadata else None
```
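The child_runs traversal can be exercised against a mock run tree. This is a sketch only: `SimpleNamespace` stands in for langsmith `Run` objects, and the run names (`agent`, `llm_call`, `query_database`) are illustrative:

```python
from types import SimpleNamespace

def find_tool_runs(run, tool_name, found=None):
    """Same recursive traversal as the pattern above, but
    collecting every run whose name matches tool_name."""
    if found is None:
        found = []
    if run.name == tool_name:
        found.append(run)
    if getattr(run, "child_runs", None):
        for child in run.child_runs:
            find_tool_runs(child, tool_name, found)
    return found

# Mock run tree standing in for a traced agent run:
tree = SimpleNamespace(name="agent", child_runs=[
    SimpleNamespace(name="llm_call", child_runs=[]),
    SimpleNamespace(name="query_database", child_runs=[]),
])
matches = find_tool_runs(tree, "query_database")
```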
Troubleshooting
- Can't find tool calls: Re-run `inspect_trace.py` to see the actual structure
- `child_runs` empty: Agent should return messages in outputs
- Same score always: Debug evaluation logic with print statements
- Dataset not found: Verify the name in the LangSmith UI
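For the "same score always" case, one way to apply the print-statement advice is temporary instrumentation inside the evaluator before scoring. A sketch with hypothetical names:

```python
def evaluate_with_debug(run, example) -> dict:
    """Temporary instrumentation: print what the evaluator actually
    receives, to spot empty or misshapen inputs."""
    messages = run.outputs.get("messages", [])
    print(f"got {len(messages)} messages")
    for msg in messages:
        print(f"  role={msg.get('role')!r} tool_calls={bool(msg.get('tool_calls'))}")
    score = 1 if any(m.get("tool_calls") for m in messages) else 0
    return {"key": "debug_check", "score": score,
            "comment": f"saw {len(messages)} messages"}
```

If the printout shows zero messages on every run, the problem is extraction (Step 4), not the scoring logic.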
Reference
Documentation:
- Code Evaluator SDK - Writing evaluators
- Evaluate LLM Applications - Running experiments
Important: Extensive LangSmith documentation exists. If unsure about trace structure, SDK usage, or evaluation patterns, search documentation rather than assuming.
See parent project for complete examples:
- `agent_v4.py` - Returns messages in outputs
- `eval_tool_call_schema.py` - Tool call + schema discovery evaluator
- `run_experiment_with_code_eval.py` - Experiment runner