
LangSmith Evaluator


Create evaluators to measure agent performance on your datasets. LangSmith supports two types: LLM as Judge (uses LLM to grade outputs) and Custom Code (deterministic logic).

Setup


Environment Variables


bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here          # Required
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                        # For LLM as Judge

Dependencies


bash
pip install langsmith langchain-openai python-dotenv

Evaluator Format


Evaluators support two function signatures:
Method 1: Dict Parameters (For running evaluations locally):
python
def evaluator_name(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Evaluate a single prediction."""
    user_query = inputs.get("query", "")
    agent_response = outputs.get("expected_response", "")
    expected = reference_outputs.get("expected_response", "") if reference_outputs else None

    return {
        "key": "metric_name",    # Metric identifier
        "score": 0.85,           # Number or boolean
        "comment": "Reason..."   # Optional explanation
    }
Method 2: Run/Example Parameters (For uploading to LangSmith):
python
def evaluator_name(run, example):
    """Evaluate using run/example dicts.

    Args:
        run: Dict with run["outputs"] containing agent outputs
        example: Dict with example["outputs"] containing expected outputs
    """
    agent_response = run["outputs"].get("expected_response", "")
    expected = example["outputs"].get("expected_response", "")

    return {
        "metric_name": 0.85,      # Metric name as key directly
        "comment": "Reason..."    # Optional explanation
    }
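The two signatures can be sanity-checked side by side with hand-built dicts. A minimal sketch — the evaluator bodies mirror the templates above, and the sample data is hypothetical:

```python
def correctness_eval(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Method 1: dict-parameter signature, for running evaluations locally."""
    expected = reference_outputs.get("expected_response", "") if reference_outputs else None
    match = outputs.get("expected_response", "") == expected
    return {"key": "correctness", "score": 1 if match else 0}

def correctness_eval_run(run, example):
    """Method 2: run/example signature, for uploading to LangSmith."""
    agent_response = run["outputs"].get("expected_response", "")
    expected = example["outputs"].get("expected_response", "")
    return {"correctness": 1 if agent_response == expected else 0}

# Hypothetical sample data illustrating both call shapes
result1 = correctness_eval(
    {"query": "capital of France?"},
    {"expected_response": "Paris"},
    {"expected_response": "Paris"},
)
result2 = correctness_eval_run(
    {"outputs": {"expected_response": "Paris"}},
    {"outputs": {"expected_response": "Paris"}},
)
print(result1)  # {'key': 'correctness', 'score': 1}
print(result2)  # {'correctness': 1}
```

Note the return shapes differ: Method 1 nests the metric under "key"/"score", while Method 2 uses the metric name directly as the dict key.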

LLM as Judge Evaluators


Use structured output for reliable grading:
python
from typing import TypedDict, Annotated
from langchain_openai import ChatOpenAI

class AccuracyGrade(TypedDict):
    """Structured evaluation output."""
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]
    confidence: Annotated[float, ..., "Confidence 0.0-1.0"]

Configure the model with structured output:

python
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    AccuracyGrade, method="json_schema", strict=True
)

async def accuracy_evaluator(run, example):
    """Evaluate factual accuracy for LangSmith upload."""
    expected = example["outputs"].get("expected_response", "")
    agent_output = run["outputs"].get("expected_response", "")

    prompt = f"""Expected: {expected}
Agent Output: {agent_output}

Evaluate accuracy:"""

    grade = await judge.ainvoke([{"role": "user", "content": prompt}])

    return {
        "accuracy": 1 if grade["is_accurate"] else 0,
        "comment": f"{grade['reasoning']} (confidence: {grade['confidence']})"
    }

**Common Metrics:** Completeness, correctness, helpfulness, professionalism

Custom Code Evaluators


Exact Match


python
def exact_match_evaluator(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()

    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
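A quick local check with hand-built run/example dicts (the definition is repeated from above so the snippet runs standalone; sample values are illustrative):

```python
# exact_match_evaluator repeated from above so the snippet is self-contained
def exact_match_evaluator(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    match = output == expected
    return {"exact_match": 1 if match else 0, "comment": f"Match: {match}"}

# Whitespace and case are normalized, so these two still count as a match
result = exact_match_evaluator(
    {"outputs": {"expected_response": "  Paris "}},
    {"outputs": {"expected_response": "paris"}},
)
print(result["exact_match"])  # 1
```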

Trajectory Validation


python
def trajectory_evaluator(run, example):
    """Evaluate tool call sequence."""
    trajectory = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Exact sequence match
    exact = trajectory == expected

    # All required tools used (order-agnostic)
    all_tools = set(expected).issubset(set(trajectory))

    # Efficiency: count extra steps
    extra_steps = len(trajectory) - len(expected)

    return {
        "trajectory_match": 1 if exact else 0,
        "comment": f"Exact: {exact}, All tools: {all_tools}, Extra: {extra_steps}"
    }
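For example, an agent that uses all the required tools but adds a redundant step scores zero on exact match while still covering every tool (definition repeated from above so the snippet runs standalone; tool names are illustrative):

```python
# trajectory_evaluator repeated from above so the snippet is self-contained
def trajectory_evaluator(run, example):
    """Evaluate tool call sequence."""
    trajectory = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])
    exact = trajectory == expected
    all_tools = set(expected).issubset(set(trajectory))
    extra_steps = len(trajectory) - len(expected)
    return {
        "trajectory_match": 1 if exact else 0,
        "comment": f"Exact: {exact}, All tools: {all_tools}, Extra: {extra_steps}",
    }

# Agent repeated "search" once before answering
result = trajectory_evaluator(
    {"outputs": {"expected_trajectory": ["search", "lookup", "search", "answer"]}},
    {"outputs": {"expected_trajectory": ["search", "lookup", "answer"]}},
)
print(result["trajectory_match"])  # 0
print(result["comment"])           # Exact: False, All tools: True, Extra: 1
```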

Single Step Validation


python
def single_step_evaluator(run, example):
    """Evaluate single node output."""
    output = run["outputs"].get("output", {})
    expected = example["outputs"].get("expected_output", {})
    node_name = run["outputs"].get("node_name", "")

    # For classification nodes
    if "classification" in node_name:
        classification = output.get("classification", "")
        expected_class = expected.get("classification", "")
        match = classification.lower() == expected_class.lower()

        return {
            "classification_correct": 1 if match else 0,
            "comment": f"Output: {classification}, Expected: {expected_class}"
        }

    # For other nodes
    match = output == expected
    return {
        "output_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
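Exercising the classification branch with a hand-built run dict (definition repeated from above so the snippet runs standalone; node and label names are illustrative):

```python
# single_step_evaluator repeated from above so the snippet is self-contained
def single_step_evaluator(run, example):
    """Evaluate single node output."""
    output = run["outputs"].get("output", {})
    expected = example["outputs"].get("expected_output", {})
    node_name = run["outputs"].get("node_name", "")

    if "classification" in node_name:
        classification = output.get("classification", "")
        expected_class = expected.get("classification", "")
        match = classification.lower() == expected_class.lower()
        return {"classification_correct": 1 if match else 0,
                "comment": f"Output: {classification}, Expected: {expected_class}"}

    match = output == expected
    return {"output_match": 1 if match else 0, "comment": f"Match: {match}"}

# Classification node: labels compare case-insensitively
result = single_step_evaluator(
    {"outputs": {"node_name": "classification_node",
                 "output": {"classification": "Billing"}}},
    {"outputs": {"expected_output": {"classification": "billing"}}},
)
print(result["classification_correct"])  # 1
```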

Running Evaluations


python
from langsmith import Client

client = Client()

Define your agent function:

python
def run_agent(inputs: dict) -> dict:
    """Your agent invocation logic."""
    result = your_agent.invoke(inputs)
    return {"expected_response": result}

Run the evaluation:

python
results = await client.aevaluate(
    run_agent,
    data="Skills: Final Response",  # Dataset name
    evaluators=[
        exact_match_evaluator,
        accuracy_evaluator,
        trajectory_evaluator
    ],
    experiment_prefix="skills-eval-v1",
    max_concurrency=4
)

Upload Evaluators to LangSmith


The upload script is a utility for deploying your custom evaluators to LangSmith. Write evaluators specific to your use case, then upload them.
Navigate to skills/langsmith-evaluator/scripts/ to upload evaluators.
**Important:** The LangSmith API requires evaluators to use the (run, example) signature, where:
  • run: dict with run["outputs"] containing agent outputs
  • example: dict with example["outputs"] containing expected outputs
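If you already have a dict-parameter evaluator (Method 1), a small adapter can wrap it into the (run, example) shape before upload. A sketch, assuming the example dict carries dataset inputs under example["inputs"] and the Method-1 return format shown earlier; the length_check evaluator is hypothetical:

```python
def to_run_example(dict_evaluator):
    """Wrap an (inputs, outputs, reference_outputs) evaluator into (run, example) form."""
    def wrapped(run, example):
        result = dict_evaluator(
            example.get("inputs", {}),
            run.get("outputs", {}),
            example.get("outputs", {}),
        )
        # Convert {"key": ..., "score": ...} into the {metric_name: score} shape
        out = {result["key"]: result["score"]}
        if "comment" in result:
            out["comment"] = result["comment"]
        return out
    return wrapped

# Hypothetical Method-1 evaluator being adapted
def length_check(inputs, outputs, reference_outputs=None):
    ok = len(outputs.get("expected_response", "")) > 0
    return {"key": "non_empty", "score": 1 if ok else 0}

adapted = to_run_example(length_check)
print(adapted({"outputs": {"expected_response": "hi"}}, {"inputs": {}, "outputs": {}}))
# {'non_empty': 1}
```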

Create Evaluator File


python
# my_project/evaluators/custom_evals.py

def my_custom_evaluator(run, example):
    """Your custom evaluation logic.

    Args:
        run: Dict with run["outputs"] - agent outputs
        example: Dict with example["outputs"] - expected outputs

    Returns:
        Dict with metric_name as key, score as value, optional comment
    """
    # Extract relevant data
    agent_output = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Your custom logic here
    match = agent_output == expected

    return {
        "my_metric": 1 if match else 0,
        "comment": "Custom reasoning here"
    }

Upload


bash
# List existing evaluators
python upload_evaluators.py list

# Upload evaluator
python upload_evaluators.py upload my_evaluators.py \
    --name "Trajectory Match" \
    --function trajectory_match \
    --dataset "Skills: Trajectory" \
    --replace

# Delete evaluator (will prompt for confirmation)
python upload_evaluators.py delete "Trajectory Match"

# Skip confirmation prompts (use with caution)
python upload_evaluators.py delete "Trajectory Match" --yes
python upload_evaluators.py upload my_evaluators.py \
    --name "Trajectory Match" \
    --function trajectory_match \
    --replace --yes

**Options:**
- `--name` - Display name in LangSmith
- `--function` - Function name to extract
- `--dataset` - Target dataset name
- `--project` - Target project name
- `--sample-rate` - Sampling rate (0.0-1.0)
- `--replace` - Replace if exists (will prompt for confirmation)
- `--yes` - Skip confirmation prompts for replace/delete operations

**IMPORTANT - Safety Prompts:**
- The script prompts for confirmation before any destructive operations (delete, replace)
- **ALWAYS respect these prompts** - wait for user input before proceeding
- **NEVER use `--yes` flag unless the user explicitly requests it**
- The `--yes` flag skips all safety prompts and should only be used in automated workflows when explicitly authorized by the user

Best Practices


  1. Use structured output for LLM judges - More reliable than parsing free-text
  2. Match evaluator to dataset type
    • Final Response → LLM as Judge for quality, Custom Code for format
    • Single Step → Custom Code for exact match
    • Trajectory → Custom Code for sequence/efficiency
  3. Combine multiple evaluators - Run both subjective (LLM) and objective (code)
  4. Use async for LLM judges - Enables parallel evaluation, much faster
  5. Test evaluators independently - Validate on known good/bad examples first
  6. Upload to LangSmith - Automatic evaluation on new runs
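Practice 5 in action: a minimal sketch that validates an evaluator against known good/bad pairs before uploading. The evaluator and test cases below are illustrative:

```python
def exact_match(run, example):
    """Check if output exactly matches expected (case/whitespace normalized)."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    return {"exact_match": 1 if output == expected else 0}

# (run_output, expected_output, expected_score) triples with known answers
cases = [
    ("Paris", "paris", 1),   # should match: normalization handles case
    ("Paris ", "Paris", 1),  # should match: normalization handles whitespace
    ("London", "Paris", 0),  # should not match: wrong answer
]

for run_out, expected_out, want in cases:
    got = exact_match(
        {"outputs": {"expected_response": run_out}},
        {"outputs": {"expected_response": expected_out}},
    )["exact_match"]
    assert got == want, f"{run_out!r} vs {expected_out!r}: got {got}, want {want}"

print("evaluator behaves as expected on all known cases")
```

Running a check like this catches normalization bugs and metric-shape mistakes before the evaluator starts grading real runs.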

Example Workflow


bash
# 1. Create evaluators file
cat > evaluators.py <<'EOF'
def exact_match(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
EOF

# 2. Upload to LangSmith
python upload_evaluators.py upload evaluators.py \
    --name "Exact Match" \
    --function exact_match \
    --dataset "Skills: Final Response" \
    --replace

# 3. Evaluator runs automatically on new dataset runs

Resources
