
# LangSmith Dataset

Auto-generate evaluation datasets from LangSmith traces for testing and validation.

## Setup

### Environment Variables

```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here           # Required
LANGSMITH_PROJECT=your-project-name                   # Optional: default project
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
```
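As a sanity check before running the scripts, the required variable can be validated up front. A minimal sketch (this helper is not part of the skill's scripts, and the `lsv2_` prefix check simply mirrors the example key above):

```python
import os

def check_langsmith_env(env) -> list:
    """Return a list of problems with the LangSmith configuration."""
    problems = []
    key = env.get("LANGSMITH_API_KEY", "")
    if not key:
        problems.append("LANGSMITH_API_KEY is required")
    elif not key.startswith("lsv2_"):
        problems.append("LANGSMITH_API_KEY does not look like an lsv2_* key")
    # LANGSMITH_PROJECT and LANGSMITH_WORKSPACE_ID are optional, so no checks
    return problems

# Usage: check_langsmith_env(os.environ)
```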

### Dependencies

```bash
pip install langsmith click rich python-dotenv
```

## Usage

Navigate to `skills/langsmith-dataset/scripts/` to run commands.

## Scripts

- `generate_datasets.py` - Create evaluation datasets from traces
- `query_datasets.py` - View and inspect datasets

## Common Flags

All dataset generation commands support:

- `--root-run-name <name>` - Filter traces by root run name (e.g., "LangGraph" for DeepAgents)
- `--limit <n>` - Number of traces to process (default: 30)
- `--last-n-minutes <n>` - Only include recent traces
- `--output <path>` - Output file (`.json` or `.csv`)
- `--upload <name>` - Upload to LangSmith under this dataset name
- `--replace` - Overwrite an existing file/dataset (will prompt for confirmation)
- `--yes` - Skip confirmation prompts (use with caution)

**IMPORTANT - Safety Prompts:**

- The script prompts for confirmation before deleting existing datasets with `--replace`
- ALWAYS respect these prompts - wait for user input before proceeding
- NEVER use the `--yes` flag unless the user explicitly requests it
- The `--yes` flag skips all safety prompts and should only be used in automated workflows when the user has explicitly authorized it
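The prompt-then-confirm behavior described above can be pictured as a small gate (illustrative only; the real logic lives in `generate_datasets.py` and may differ in detail):

```python
def confirm_replace(target: str, replace: bool, yes: bool, ask=input) -> bool:
    """Return True only when overwriting `target` has been confirmed."""
    if not replace:
        return False   # without --replace, never overwrite
    if yes:
        return True    # --yes: explicit automation opt-in, skips the prompt
    answer = ask(f"Overwrite existing dataset '{target}'? [y/N] ")
    return answer.strip().lower() == "y"
```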

## Understanding Trace Hierarchy

Traces have depth levels based on parent-child relationships:
Depth 0: Root agent (e.g., "LangGraph")
  ├── Depth 1: Middleware/chains (model, tools, SummarizationMiddleware)
  │     ├── Depth 2: Tool calls (sql_db_query, retriever, etc.)
  │     └── Depth 2: LLM calls (ChatOpenAI, ChatAnthropic)
  └── Depth 3+: Nested subagent calls

Use `--root-run-name` to target specific agent frameworks:

- DeepAgents: `--root-run-name LangGraph`
- Custom agents: use your root node name
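Depth is simply the distance to the root via parent links. A hedged sketch of how such depths could be computed from exported run records (the flat-dict shape and the `parent_run_id` field name are assumptions for illustration, not this skill's API):

```python
def run_depths(runs):
    """Map run id -> depth (0 for roots), given dicts with 'id' and 'parent_run_id'."""
    parent = {r["id"]: r.get("parent_run_id") for r in runs}

    def depth(run_id):
        d = 0
        while parent.get(run_id) is not None:
            run_id = parent[run_id]  # climb one level toward the root
            d += 1
        return d

    return {rid: depth(rid) for rid in parent}
```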

## Dataset Types

### 1. Final Response

Full conversation with expected output - tests complete agent behavior.

```bash
# Basic usage
python generate_datasets.py --type final_response \
  --project my-project \
  --root-run-name LangGraph \
  --limit 30 \
  --output /tmp/final_response.json
```

```bash
# With custom output fields
python generate_datasets.py --type final_response \
  --project my-project \
  --output-fields "answer,result" \
  --output /tmp/final.json
```

```bash
# Messages only (ignore output dict keys)
python generate_datasets.py --type final_response \
  --project my-project \
  --messages-only \
  --output /tmp/final.json
```

**Structure:**

```json
{
  "trace_id": "...",
  "inputs": {"query": "What are the top 3 genres?"},
  "outputs": {
    "expected_response": "The top 3 genres based on the number of tracks are:\n\n1. Rock with 1,297 tracks\n2. Latin with 579 tracks\n3. Metal with 374 tracks"
  }
}
```

**Extraction Priority:**

1. Messages from the root run (AI responses with content)
2. User-specified output fields (`--output-fields`)
3. Common keys (`answer`, `output`)
4. Full output dict

**Important:** The root run is always checked first for the final response, to avoid picking up intermediate tool outputs.
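The priority order reads as a simple fallback chain. An illustrative sketch (the real implementation is in `generate_datasets.py`; the dict shapes here are assumptions):

```python
def extract_final_response(outputs, output_fields=None):
    """Pick the expected response following the documented priority order."""
    # 1. Latest AI message with content from the root run
    for msg in reversed(outputs.get("messages", [])):
        if msg.get("type") == "ai" and msg.get("content"):
            return msg["content"]
    # 2. User-specified output fields (--output-fields)
    for field in output_fields or []:
        if outputs.get(field):
            return outputs[field]
    # 3. Common keys
    for key in ("answer", "output"):
        if outputs.get(key):
            return outputs[key]
    # 4. Fall back to the full output dict
    return outputs
```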

### 2. Single Step

Single node inputs/outputs - tests any specific node's behavior. Supports multiple occurrences per trace to capture conversation evolution.

```bash
# Extract all occurrences (default)
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name model \
  --output /tmp/single_step.json
```

```bash
# Sample 2 occurrences per trace
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name model \
  --sample-per-trace 2 \
  --output /tmp/single_step_sampled.json
```

```bash
# Target a specific tool at depth 2
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name sql_db_query \
  --output /tmp/sql_query.json
```

**Structure:**

```json
{
  "trace_id": "...",
  "run_id": "...",
  "occurrence": 2,
  "inputs": {
    "messages": [
      {"type": "human", "content": "What are the top 3 genres?"},
      {"type": "ai", "content": "", "tool_calls": [...]},
      {"type": "tool", "content": "...results..."},
      ...
    ]
  },
  "outputs": {
    "expected_output": {
      "messages": [
        {"type": "ai", "content": "", "tool_calls": [...]}
      ]
    },
    "node_name": "model"
  }
}
```

**Key Features:**

- The `occurrence` field tracks which invocation this is (1st, 2nd, 3rd, etc.)
- Later occurrences carry more conversation history → tests context handling
- `--sample-per-trace` randomly samples N occurrences per trace
- Use `--run-name` to target any node at any depth

**Common targets:**

- `model` (depth 1) - LLM invocations with growing context
- `tools` (depth 1) - Tool execution chain
- Any custom node name
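The `--sample-per-trace` behavior can be sketched as follows (assumed semantics: at most N randomly chosen occurrences per trace, kept in occurrence order; the actual script may differ):

```python
import random

def sample_per_trace(occurrences, n, seed=None):
    """Occurrences are examples from one trace, each carrying an 'occurrence' field."""
    if len(occurrences) <= n:
        return list(occurrences)
    rng = random.Random(seed)  # seed only for reproducible demos
    picked = rng.sample(occurrences, n)
    return sorted(picked, key=lambda ex: ex["occurrence"])
```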

### 3. Trajectory

Tool call sequence - tests execution path with configurable depth.

```bash
# Include all tool calls (all depths)
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --limit 30 \
  --output /tmp/trajectory_all.json
```

```bash
# Only tool calls up to depth 2
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --depth 2 \
  --output /tmp/trajectory_depth2.json
```

```bash
# Only root-level tool calls (depth 0) - usually empty if tools are at depth 2+
python generate_datasets.py --type trajectory \
  --project my-project \
  --depth 0 \
  --output /tmp/trajectory_root.json
```

**Structure:**

```json
{
  "trace_id": "...",
  "inputs": {"query": "What are the top 3 genres?"},
  "outputs": {
    "expected_trajectory": [
      "sql_db_list_tables",
      "sql_db_schema",
      "sql_db_query_checker",
      "sql_db_query"
    ]
  }
}
```

**Depth Control:**

- Omit `--depth` = all levels (includes subagent tool calls)
- `--depth 2` = root + 2 levels (typically captures all main tools)
- `--depth 1` = often only middleware/chains, no actual tool calls
- `--depth 0` = root only (no tool calls)

**Note:** Tool calls typically sit at depth 2 in the LangGraph/DeepAgents architecture.
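The depth rules above amount to a depth-limited walk over the run tree. A hedged sketch (the nested-dict run tree with a `children` field is an assumption for illustration):

```python
def extract_trajectory(run, max_depth=None, depth=0):
    """Collect tool run names in execution order, down to max_depth (None = all levels)."""
    if max_depth is not None and depth > max_depth:
        return []
    names = []
    if run.get("run_type") == "tool":
        names.append(run["name"])
    for child in run.get("children", []):
        names += extract_trajectory(child, max_depth, depth + 1)
    return names
```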

### 4. RAG

Question/chunks/answer/citations - tests retrieval quality.
```bash
python generate_datasets.py --type rag \
  --project my-project \
  --limit 30 \
  --output /tmp/rag_ds.csv  # Supports .json or .csv
```

**Structure (CSV format):**

```csv
question,retrieved_chunks,answer,cited_chunks
"How do I...","Chunk 1\n\nChunk 2","The answer is...","[\"Chunk 1\"]"
```
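The `cited_chunks` column holds a JSON-encoded list, which is why its quotes appear escaped. A sketch of producing that layout with the standard library (the row shape is an assumption based on the columns shown):

```python
import csv, io, json

def write_rag_csv(rows):
    """Each row: question, retrieved_chunks (list of str), answer, cited_chunks (list)."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["question", "retrieved_chunks", "answer", "cited_chunks"]
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "question": row["question"],
            "retrieved_chunks": "\n\n".join(row["retrieved_chunks"]),
            "answer": row["answer"],
            "cited_chunks": json.dumps(row["cited_chunks"]),  # JSON-encoded list
        })
    return buf.getvalue()
```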

## Output Formats

All dataset types support both JSON and CSV:

```bash
# JSON output (default)
python generate_datasets.py --type trajectory --project my-project --output ds.json

# CSV output (use the .csv extension)
python generate_datasets.py --type trajectory --project my-project --output ds.csv
```

## Upload to LangSmith

```bash
# Generate and upload in one command
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --limit 50 \
  --output /tmp/trajectory_ds.json \
  --upload "Skills: Trajectory"

# Use --replace to overwrite an existing dataset
python generate_datasets.py --type final_response \
  --project my-project \
  --output /tmp/final.json \
  --upload "Skills: Final Response" \
  --replace
```

**Naming Convention:** Use "Skills: <Type>" format for consistency:
- "Skills: Final Response"
- "Skills: Single Step (model)"
- "Skills: Single Step (sql_db_query)"
- "Skills: Trajectory (all depths)"
- "Skills: Trajectory (depth=2)"

## Query Datasets

```bash
# List all datasets
python query_datasets.py list-datasets

# Filter by name pattern
python query_datasets.py list-datasets | grep "Skills:"

# View dataset examples
python query_datasets.py show "Skills: Trajectory" --limit 5

# View a local file
python query_datasets.py view-file /tmp/trajectory_ds.json --limit 3

# Analyze structure
python query_datasets.py structure /tmp/trajectory_ds.json

# Export from LangSmith to a local file
python query_datasets.py export "Skills: Final Response" /tmp/exported.json --limit 100
```

## Tips for Dataset Generation

1. **Always use `--root-run-name`** - Filter for a specific agent framework (e.g., "LangGraph")
2. **Start with successful traces** - Use recent successful runs for baseline datasets
3. **Use time windows** - `--last-n-minutes 1440` covers the last 24 hours of data
4. **Sample for single_step** - Use `--sample-per-trace 2` to capture conversation evolution
5. **Match depth to needs** - `--depth 2` typically captures all main tool calls
6. **Review before upload** - Use `query_datasets.py view-file` to inspect first
7. **Iterative refinement** - Generate small batches (10-20) first, validate, then scale up
8. **Use `--replace` carefully** - Overwrites existing datasets; useful for iteration

## Example Workflow

```bash
# 1. Generate fresh traces (if needed)
python tests/test_agent.py --batch  # Your test agent
```

```bash
# 2. Generate all dataset types from LangGraph traces
python generate_datasets.py --type final_response \
  --project skills --root-run-name LangGraph --limit 10 \
  --output /tmp/final.json --upload "Skills: Final Response" --replace

python generate_datasets.py --type single_step \
  --project skills --root-run-name LangGraph --run-name model \
  --sample-per-trace 2 --limit 10 \
  --output /tmp/model.json --upload "Skills: Single Step (model)" --replace

python generate_datasets.py --type trajectory \
  --project skills --root-run-name LangGraph --limit 10 \
  --output /tmp/traj.json --upload "Skills: Trajectory (all depths)" --replace

python generate_datasets.py --type trajectory \
  --project skills --root-run-name LangGraph --depth 2 --limit 10 \
  --output /tmp/traj_d2.json --upload "Skills: Trajectory (depth=2)" --replace
```

```bash
# 3. Review in the LangSmith UI:
#    visit https://smith.langchain.com → Datasets → filter for "Skills:"

# 4. Query locally if needed
python query_datasets.py show "Skills: Final Response" --limit 3
```

## Troubleshooting

**Empty final_response outputs:**

- Ensure `--root-run-name` matches your agent's root node
- Check that the root run has messages with AI responses
- Use `--messages-only` if the output dict is empty

**No trajectory examples:**

- Tools might sit at a different depth - try removing `--depth`, or use `--depth 2`
- Verify that tool calls exist: `python query_traces.py trace <id> --show-hierarchy`

**Too many single_step examples:**

- Use `--sample-per-trace 2` to limit examples per trace; this reduces dataset size while maintaining diversity

**Dataset upload fails:**

- Check that the dataset doesn't already exist, or use `--replace`
- Verify that LANGSMITH_API_KEY is set

## Related Skills

- Use the langsmith-trace skill to query and export traces
- Use the langsmith-evaluator skill to create evaluators and measure performance