# LangSmith Evaluator
Create evaluators to measure agent performance on your datasets. LangSmith supports two types: LLM as Judge (uses LLM to grade outputs) and Custom Code (deterministic logic).
## Setup

### Environment Variables

```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here   # Required
LANGSMITH_WORKSPACE_ID=your-workspace-id      # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                # For LLM as Judge
```

### Dependencies

```bash
pip install langsmith langchain-openai python-dotenv
```

## Evaluator Format
Evaluators support two function signatures:
**Method 1: Dict Parameters** (for running evaluations locally):

```python
def evaluator_name(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Evaluate a single prediction."""
    user_query = inputs.get("query", "")
    agent_response = outputs.get("expected_response", "")
    expected = reference_outputs.get("expected_response", "") if reference_outputs else None
    return {
        "key": "metric_name",    # Metric identifier
        "score": 0.85,           # Number or boolean
        "comment": "Reason..."   # Optional explanation
    }
```

**Method 2: Run/Example Parameters** (for uploading to LangSmith):

```python
def evaluator_name(run, example):
    """Evaluate using run/example dicts.

    Args:
        run: Dict with run["outputs"] containing agent outputs
        example: Dict with example["outputs"] containing expected outputs
    """
    agent_response = run["outputs"].get("expected_response", "")
    expected = example["outputs"].get("expected_response", "")
    return {
        "metric_name": 0.85,     # Metric name as key directly
        "comment": "Reason..."   # Optional explanation
    }
```
## LLM as Judge Evaluators
Use structured output for reliable grading:
```python
from typing import TypedDict, Annotated
from langchain_openai import ChatOpenAI

class AccuracyGrade(TypedDict):
    """Structured evaluation output."""
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]
    confidence: Annotated[float, ..., "Confidence 0.0-1.0"]

# Configure model with structured output
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    AccuracyGrade, method="json_schema", strict=True
)

async def accuracy_evaluator(run, example):
    """Evaluate factual accuracy for LangSmith upload."""
    expected = example["outputs"].get("expected_response", "")
    agent_output = run["outputs"].get("expected_response", "")
    prompt = f"""Expected: {expected}

Agent Output: {agent_output}

Evaluate accuracy:"""
    grade = await judge.ainvoke([{"role": "user", "content": prompt}])
    return {
        "accuracy": 1 if grade["is_accurate"] else 0,
        "comment": f"{grade['reasoning']} (confidence: {grade['confidence']})"
    }
```

**Common Metrics:** Completeness, correctness, helpfulness, professionalism

## Custom Code Evaluators
### Exact Match

```python
def exact_match_evaluator(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
```

### Trajectory Validation
```python
def trajectory_evaluator(run, example):
    """Evaluate tool call sequence."""
    trajectory = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])
    # Exact sequence match
    exact = trajectory == expected
    # All required tools used (order-agnostic)
    all_tools = set(expected).issubset(set(trajectory))
    # Efficiency: count extra steps
    extra_steps = len(trajectory) - len(expected)
    return {
        "trajectory_match": 1 if exact else 0,
        "comment": f"Exact: {exact}, All tools: {all_tools}, Extra: {extra_steps}"
    }
```
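A quick local check of the trajectory evaluator shows how the three signals combine; the evaluator is repeated here so the snippet is self-contained, and the tool names are made up for illustration:

```python
def trajectory_evaluator(run, example):
    """Evaluate tool call sequence (same logic as above)."""
    trajectory = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])
    exact = trajectory == expected
    all_tools = set(expected).issubset(set(trajectory))
    extra_steps = len(trajectory) - len(expected)
    return {
        "trajectory_match": 1 if exact else 0,
        "comment": f"Exact: {exact}, All tools: {all_tools}, Extra: {extra_steps}"
    }

# Agent called all required tools but added a redundant search step
run = {"outputs": {"expected_trajectory": ["search", "search", "summarize"]}}
example = {"outputs": {"expected_trajectory": ["search", "summarize"]}}

result = trajectory_evaluator(run, example)
print(result["comment"])  # Exact: False, All tools: True, Extra: 1
```

The comment preserves the partial-credit signals (all tools used, one extra step) even though the binary `trajectory_match` score is 0.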
### Single Step Validation
```python
def single_step_evaluator(run, example):
    """Evaluate single node output."""
    output = run["outputs"].get("output", {})
    expected = example["outputs"].get("expected_output", {})
    node_name = run["outputs"].get("node_name", "")
    # For classification nodes
    if "classification" in node_name:
        classification = output.get("classification", "")
        expected_class = expected.get("classification", "")
        match = classification.lower() == expected_class.lower()
        return {
            "classification_correct": 1 if match else 0,
            "comment": f"Output: {classification}, Expected: {expected_class}"
        }
    # For other nodes
    match = output == expected
    return {
        "output_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
```
## Running Evaluations
```python
from langsmith import Client

client = Client()

# Define your agent function
def run_agent(inputs: dict) -> dict:
    """Your agent invocation logic."""
    result = your_agent.invoke(inputs)
    return {"expected_response": result}

# Run evaluation
results = await client.aevaluate(
    run_agent,
    data="Skills: Final Response",  # Dataset name
    evaluators=[
        exact_match_evaluator,
        accuracy_evaluator,
        trajectory_evaluator
    ],
    experiment_prefix="skills-eval-v1",
    max_concurrency=4
)
```
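Before launching a full LangSmith experiment, it can help to smoke-test the agent-plus-evaluator pipeline on a few in-memory examples. A sketch, assuming the Method 1 dict signature and a stand-in agent (`echo_agent` is a placeholder, not a real API):

```python
def echo_agent(inputs: dict) -> dict:
    """Stand-in agent for local testing -- replace with your real agent."""
    return {"expected_response": inputs["query"].upper()}

def exact_match(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Method 1 style exact-match evaluator."""
    output = outputs.get("expected_response", "").strip().lower()
    expected = (reference_outputs or {}).get("expected_response", "").strip().lower()
    return {"key": "exact_match", "score": 1 if output == expected else 0}

# Tiny in-memory dataset mirroring LangSmith's inputs/outputs example shape
dataset = [
    {"inputs": {"query": "hello"}, "outputs": {"expected_response": "HELLO"}},
    {"inputs": {"query": "hi"}, "outputs": {"expected_response": "goodbye"}},
]

scores = []
for ex in dataset:
    prediction = echo_agent(ex["inputs"])
    grade = exact_match(ex["inputs"], prediction, ex["outputs"])
    scores.append(grade["score"])

print(sum(scores) / len(scores))  # 0.5
```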
## Upload Evaluators to LangSmith
The upload script is a utility for deploying your custom evaluators to LangSmith. Write evaluators specific to your use case, then upload them.

Navigate to `skills/langsmith-evaluator/scripts/` to upload evaluators.

**Important:** The LangSmith API requires evaluators to use the `(run, example)` signature, where:

- `run`: dict with `run["outputs"]` containing agent outputs
- `example`: dict with `example["outputs"]` containing expected outputs
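If you already have evaluators written in the Method 1 dict signature, a small adapter can convert them to the `(run, example)` form required for upload. This is a sketch; `to_run_example` is a hypothetical helper, not part of the LangSmith SDK:

```python
def to_run_example(dict_evaluator):
    """Wrap an (inputs, outputs, reference_outputs) evaluator into (run, example) form."""
    def wrapped(run, example):
        result = dict_evaluator(
            run.get("inputs", {}),
            run["outputs"],
            example["outputs"],
        )
        # Method 1 returns {"key": ..., "score": ...}; flatten to {metric_name: score}
        return {result["key"]: result["score"], "comment": result.get("comment", "")}
    return wrapped

def length_check(inputs, outputs, reference_outputs=None):
    """Toy Method 1 evaluator: pass if the response is non-empty."""
    ok = len(outputs.get("expected_response", "")) > 0
    return {"key": "non_empty", "score": 1 if ok else 0}

upload_ready = to_run_example(length_check)
result = upload_ready({"outputs": {"expected_response": "hi"}}, {"outputs": {}})
print(result)  # {'non_empty': 1, 'comment': ''}
```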
### Create Evaluator File

```python
# my_project/evaluators/custom_evals.py

def my_custom_evaluator(run, example):
    """Your custom evaluation logic.

    Args:
        run: Dict with run["outputs"] - agent outputs
        example: Dict with example["outputs"] - expected outputs

    Returns:
        Dict with metric_name as key, score as value, optional comment
    """
    # Extract relevant data
    agent_output = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])
    # Your custom logic here
    match = agent_output == expected
    return {
        "my_metric": 1 if match else 0,
        "comment": "Custom reasoning here"
    }
```

### Upload
```bash
# List existing evaluators
python upload_evaluators.py list

# Upload evaluator
python upload_evaluators.py upload my_evaluators.py \
    --name "Trajectory Match" \
    --function trajectory_match \
    --dataset "Skills: Trajectory" \
    --replace

# Delete evaluator (will prompt for confirmation)
python upload_evaluators.py delete "Trajectory Match"

# Skip confirmation prompts (use with caution)
python upload_evaluators.py delete "Trajectory Match" --yes
python upload_evaluators.py upload my_evaluators.py \
    --name "Trajectory Match" \
    --function trajectory_match \
    --replace --yes
```

**Options:**
- `--name` - Display name in LangSmith
- `--function` - Function name to extract
- `--dataset` - Target dataset name
- `--project` - Target project name
- `--sample-rate` - Sampling rate (0.0-1.0)
- `--replace` - Replace if exists (will prompt for confirmation)
- `--yes` - Skip confirmation prompts for replace/delete operations

**IMPORTANT - Safety Prompts:**
- The script prompts for confirmation before any destructive operations (delete, replace)
- **ALWAYS respect these prompts** - wait for user input before proceeding
- **NEVER use the `--yes` flag unless the user explicitly requests it**
- The `--yes` flag skips all safety prompts and should only be used in automated workflows when explicitly authorized by the user

## Best Practices
- Use structured output for LLM judges - more reliable than parsing free-text
- Match evaluator to dataset type:
  - Final Response → LLM as Judge for quality, Custom Code for format
  - Single Step → Custom Code for exact match
  - Trajectory → Custom Code for sequence/efficiency
- Combine multiple evaluators - run both subjective (LLM) and objective (code) checks
- Use async for LLM judges - enables parallel evaluation, which is much faster
- Test evaluators independently - validate on known good/bad examples first
- Upload to LangSmith - for automatic evaluation on new runs
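Testing an evaluator independently can be as simple as a few asserts against known good and bad pairs before uploading. A minimal sketch reusing the exact-match evaluator from above (the `make_pair` helper is just for this illustration):

```python
def exact_match_evaluator(run, example):
    """Check if output exactly matches expected (same logic as above)."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    match = output == expected
    return {"exact_match": 1 if match else 0, "comment": f"Match: {match}"}

def make_pair(agent_text, expected_text):
    """Build minimal run/example dicts for local testing."""
    return (
        {"outputs": {"expected_response": agent_text}},
        {"outputs": {"expected_response": expected_text}},
    )

# A known-good pair should score 1, a known-bad pair should score 0
good = exact_match_evaluator(*make_pair("Paris", "  paris  "))
bad = exact_match_evaluator(*make_pair("London", "Paris"))
assert good["exact_match"] == 1
assert bad["exact_match"] == 0
print("evaluator sanity checks passed")
```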
## Example Workflow

```bash
# 1. Create evaluators file
cat > evaluators.py <<'EOF'
def exact_match(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
EOF

# 2. Upload to LangSmith
python upload_evaluators.py upload evaluators.py \
    --name "Exact Match" \
    --function exact_match \
    --dataset "Skills: Final Response" \
    --replace

# 3. Evaluator runs automatically on new dataset runs
```