
LangSmith Evaluator


Create evaluators to measure agent performance on your datasets. LangSmith supports two types: LLM as Judge (uses LLM to grade outputs) and Custom Code (deterministic logic).

Setup


Environment Variables


bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here          # Required
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                        # For LLM as Judge

Dependencies


bash
pip install langsmith langchain-openai python-dotenv

Evaluator Format


Evaluators support two function signatures:
Method 1: Dict Parameters (For running evaluations locally):
python
def evaluator_name(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Evaluate a single prediction."""
    user_query = inputs.get("query", "")
    agent_response = outputs.get("expected_response", "")
    expected = reference_outputs.get("expected_response", "") if reference_outputs else None

    return {
        "key": "metric_name",    # Metric identifier
        "score": 0.85,           # Number or boolean
        "comment": "Reason..."   # Optional explanation
    }
Method 2: Run/Example Parameters (For uploading to LangSmith):
python
def evaluator_name(run, example):
    """Evaluate using run/example dicts.

    Args:
        run: Dict with run["outputs"] containing agent outputs
        example: Dict with example["outputs"] containing expected outputs
    """
    agent_response = run["outputs"].get("expected_response", "")
    expected = example["outputs"].get("expected_response", "")

    return {
        "metric_name": 0.85,      # Metric name as key directly
        "comment": "Reason..."    # Optional explanation
    }
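The two signatures can be sanity-checked side by side with hand-built dicts. A minimal sketch — the evaluator bodies mirror the templates above, and the sample data is hypothetical:

```python
def correctness_eval(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Method 1: dict-parameter signature, for running evaluations locally."""
    expected = reference_outputs.get("expected_response", "") if reference_outputs else None
    match = outputs.get("expected_response", "") == expected
    return {"key": "correctness", "score": 1 if match else 0}

def correctness_eval_run(run, example):
    """Method 2: run/example signature, for uploading to LangSmith."""
    agent_response = run["outputs"].get("expected_response", "")
    expected = example["outputs"].get("expected_response", "")
    return {"correctness": 1 if agent_response == expected else 0}

# Hypothetical sample data illustrating both call shapes
result1 = correctness_eval(
    {"query": "capital of France?"},
    {"expected_response": "Paris"},
    {"expected_response": "Paris"},
)
result2 = correctness_eval_run(
    {"outputs": {"expected_response": "Paris"}},
    {"outputs": {"expected_response": "Paris"}},
)
print(result1)  # {'key': 'correctness', 'score': 1}
print(result2)  # {'correctness': 1}
```

Note the return shapes differ: Method 1 nests the metric under "key"/"score", while Method 2 uses the metric name directly as the dict key.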

LLM as Judge Evaluators


Use structured output for reliable grading:
python
from typing import TypedDict, Annotated
from langchain_openai import ChatOpenAI

class AccuracyGrade(TypedDict):
    """Structured evaluation output."""
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]
    confidence: Annotated[float, ..., "Confidence 0.0-1.0"]

Configure the model with structured output:

python
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    AccuracyGrade, method="json_schema", strict=True
)

async def accuracy_evaluator(run, example):
    """Evaluate factual accuracy for LangSmith upload."""
    expected = example["outputs"].get("expected_response", "")
    agent_output = run["outputs"].get("expected_response", "")

    prompt = f"""Expected: {expected}
Agent Output: {agent_output}

Evaluate accuracy:"""

    grade = await judge.ainvoke([{"role": "user", "content": prompt}])

    return {
        "accuracy": 1 if grade["is_accurate"] else 0,
        "comment": f"{grade['reasoning']} (confidence: {grade['confidence']})"
    }

**Common Metrics:** Completeness, correctness, helpfulness, professionalism

Custom Code Evaluators


Exact Match


python
def exact_match_evaluator(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()

    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
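A quick local check with hand-built run/example dicts (the definition is repeated from above so the snippet runs standalone; sample values are illustrative):

```python
# exact_match_evaluator repeated from above so the snippet is self-contained
def exact_match_evaluator(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    match = output == expected
    return {"exact_match": 1 if match else 0, "comment": f"Match: {match}"}

# Whitespace and case are normalized, so these two still count as a match
result = exact_match_evaluator(
    {"outputs": {"expected_response": "  Paris "}},
    {"outputs": {"expected_response": "paris"}},
)
print(result["exact_match"])  # 1
```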

Trajectory Validation


python
def trajectory_evaluator(run, example):
    """Evaluate tool call sequence."""
    trajectory = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Exact sequence match
    exact = trajectory == expected

    # All required tools used (order-agnostic)
    all_tools = set(expected).issubset(set(trajectory))

    # Efficiency: count extra steps
    extra_steps = len(trajectory) - len(expected)

    return {
        "trajectory_match": 1 if exact else 0,
        "comment": f"Exact: {exact}, All tools: {all_tools}, Extra: {extra_steps}"
    }
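For example, an agent that uses all the required tools but adds a redundant step scores zero on exact match while still covering every tool (definition repeated from above so the snippet runs standalone; tool names are illustrative):

```python
# trajectory_evaluator repeated from above so the snippet is self-contained
def trajectory_evaluator(run, example):
    """Evaluate tool call sequence."""
    trajectory = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])
    exact = trajectory == expected
    all_tools = set(expected).issubset(set(trajectory))
    extra_steps = len(trajectory) - len(expected)
    return {
        "trajectory_match": 1 if exact else 0,
        "comment": f"Exact: {exact}, All tools: {all_tools}, Extra: {extra_steps}",
    }

# Agent repeated "search" once before answering
result = trajectory_evaluator(
    {"outputs": {"expected_trajectory": ["search", "lookup", "search", "answer"]}},
    {"outputs": {"expected_trajectory": ["search", "lookup", "answer"]}},
)
print(result["trajectory_match"])  # 0
print(result["comment"])           # Exact: False, All tools: True, Extra: 1
```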

Single Step Validation


python
def single_step_evaluator(run, example):
    """Evaluate single node output."""
    output = run["outputs"].get("output", {})
    expected = example["outputs"].get("expected_output", {})
    node_name = run["outputs"].get("node_name", "")

    # For classification nodes
    if "classification" in node_name:
        classification = output.get("classification", "")
        expected_class = expected.get("classification", "")
        match = classification.lower() == expected_class.lower()

        return {
            "classification_correct": 1 if match else 0,
            "comment": f"Output: {classification}, Expected: {expected_class}"
        }

    # For other nodes
    match = output == expected
    return {
        "output_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
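Exercising the classification branch with a hand-built run dict (definition repeated from above so the snippet runs standalone; node and label names are illustrative):

```python
# single_step_evaluator repeated from above so the snippet is self-contained
def single_step_evaluator(run, example):
    """Evaluate single node output."""
    output = run["outputs"].get("output", {})
    expected = example["outputs"].get("expected_output", {})
    node_name = run["outputs"].get("node_name", "")

    if "classification" in node_name:
        classification = output.get("classification", "")
        expected_class = expected.get("classification", "")
        match = classification.lower() == expected_class.lower()
        return {"classification_correct": 1 if match else 0,
                "comment": f"Output: {classification}, Expected: {expected_class}"}

    match = output == expected
    return {"output_match": 1 if match else 0, "comment": f"Match: {match}"}

# Classification node: labels compare case-insensitively
result = single_step_evaluator(
    {"outputs": {"node_name": "classification_node",
                 "output": {"classification": "Billing"}}},
    {"outputs": {"expected_output": {"classification": "billing"}}},
)
print(result["classification_correct"])  # 1
```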

Running Evaluations


python
from langsmith import Client

client = Client()

Define your agent function:

python
def run_agent(inputs: dict) -> dict:
    """Your agent invocation logic."""
    result = your_agent.invoke(inputs)
    return {"expected_response": result}

Run the evaluation:

python
results = await client.aevaluate(
    run_agent,
    data="Skills: Final Response",  # Dataset name
    evaluators=[
        exact_match_evaluator,
        accuracy_evaluator,
        trajectory_evaluator
    ],
    experiment_prefix="skills-eval-v1",
    max_concurrency=4
)

Upload Evaluators to LangSmith


The upload script is a utility for deploying your custom evaluators to LangSmith. Write evaluators specific to your use case, then upload them.
Navigate to skills/langsmith-evaluator/scripts/ to upload evaluators.
**Important:** The LangSmith API requires evaluators to use the (run, example) signature, where:
  • run: dict with run["outputs"] containing agent outputs
  • example: dict with example["outputs"] containing expected outputs
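If you already have a dict-parameter evaluator (Method 1), a small adapter can wrap it into the (run, example) shape before upload. A sketch, assuming the example dict carries dataset inputs under example["inputs"] and the Method-1 return format shown earlier; the length_check evaluator is hypothetical:

```python
def to_run_example(dict_evaluator):
    """Wrap an (inputs, outputs, reference_outputs) evaluator into (run, example) form."""
    def wrapped(run, example):
        result = dict_evaluator(
            example.get("inputs", {}),
            run.get("outputs", {}),
            example.get("outputs", {}),
        )
        # Convert {"key": ..., "score": ...} into the {metric_name: score} shape
        out = {result["key"]: result["score"]}
        if "comment" in result:
            out["comment"] = result["comment"]
        return out
    return wrapped

# Hypothetical Method-1 evaluator being adapted
def length_check(inputs, outputs, reference_outputs=None):
    ok = len(outputs.get("expected_response", "")) > 0
    return {"key": "non_empty", "score": 1 if ok else 0}

adapted = to_run_example(length_check)
print(adapted({"outputs": {"expected_response": "hi"}}, {"inputs": {}, "outputs": {}}))
# {'non_empty': 1}
```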

Create Evaluator File


python
# my_project/evaluators/custom_evals.py

def my_custom_evaluator(run, example):
    """Your custom evaluation logic.

    Args:
        run: Dict with run["outputs"] - agent outputs
        example: Dict with example["outputs"] - expected outputs

    Returns:
        Dict with metric_name as key, score as value, optional comment
    """
    # Extract relevant data
    agent_output = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Your custom logic here
    match = agent_output == expected

    return {
        "my_metric": 1 if match else 0,
        "comment": "Custom reasoning here"
    }

Upload


bash
# List existing evaluators
python upload_evaluators.py list

# Upload evaluator
python upload_evaluators.py upload my_evaluators.py \
    --name "Trajectory Match" \
    --function trajectory_match \
    --dataset "Skills: Trajectory" \
    --replace

# Delete evaluator (will prompt for confirmation)
python upload_evaluators.py delete "Trajectory Match"

# Skip confirmation prompts (use with caution)
python upload_evaluators.py delete "Trajectory Match" --yes
python upload_evaluators.py upload my_evaluators.py \
    --name "Trajectory Match" \
    --function trajectory_match \
    --replace --yes

**Options:**
- `--name` - Display name in LangSmith
- `--function` - Function name to extract
- `--dataset` - Target dataset name
- `--project` - Target project name
- `--sample-rate` - Sampling rate (0.0-1.0)
- `--replace` - Replace if exists (will prompt for confirmation)
- `--yes` - Skip confirmation prompts for replace/delete operations

**IMPORTANT - Safety Prompts:**
- The script prompts for confirmation before any destructive operations (delete, replace)
- **ALWAYS respect these prompts** - wait for user input before proceeding
- **NEVER use `--yes` flag unless the user explicitly requests it**
- The `--yes` flag skips all safety prompts and should only be used in automated workflows when explicitly authorized by the user

Best Practices


  1. Use structured output for LLM judges - More reliable than parsing free-text
  2. Match evaluator to dataset type
    • Final Response → LLM as Judge for quality, Custom Code for format
    • Single Step → Custom Code for exact match
    • Trajectory → Custom Code for sequence/efficiency
  3. Combine multiple evaluators - Run both subjective (LLM) and objective (code)
  4. Use async for LLM judges - Enables parallel evaluation, much faster
  5. Test evaluators independently - Validate on known good/bad examples first
  6. Upload to LangSmith - Automatic evaluation on new runs
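Practice 5 in action: a minimal sketch that validates an evaluator against known good/bad pairs before uploading. The evaluator and test cases below are illustrative:

```python
def exact_match(run, example):
    """Check if output exactly matches expected (case/whitespace normalized)."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    return {"exact_match": 1 if output == expected else 0}

# (run_output, expected_output, expected_score) triples with known answers
cases = [
    ("Paris", "paris", 1),   # should match: normalization handles case
    ("Paris ", "Paris", 1),  # should match: normalization handles whitespace
    ("London", "Paris", 0),  # should not match: wrong answer
]

for run_out, expected_out, want in cases:
    got = exact_match(
        {"outputs": {"expected_response": run_out}},
        {"outputs": {"expected_response": expected_out}},
    )["exact_match"]
    assert got == want, f"{run_out!r} vs {expected_out!r}: got {got}, want {want}"

print("evaluator behaves as expected on all known cases")
```

Running a check like this catches normalization bugs and metric-shape mistakes before the evaluator starts grading real runs.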

Example Workflow


bash
# 1. Create evaluators file
cat > evaluators.py <<'EOF'
def exact_match(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
EOF

# 2. Upload to LangSmith
python upload_evaluators.py upload evaluators.py \
    --name "Exact Match" \
    --function exact_match \
    --dataset "Skills: Final Response" \
    --replace

# 3. Evaluator runs automatically on new dataset runs

Resources
