
# LangSmith Dataset

Auto-generate evaluation datasets from LangSmith traces for testing and validation.

## Setup

### Environment Variables

```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here           # Required
LANGSMITH_PROJECT=your-project-name                   # Optional: default project
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
```
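As a sanity check before running the scripts, the required variable can be validated up front. A minimal sketch (this helper is not part of the skill's scripts, and the `lsv2_` prefix check simply mirrors the example key above):

```python
import os

def check_langsmith_env(env) -> list:
    """Return a list of problems with the LangSmith configuration."""
    problems = []
    key = env.get("LANGSMITH_API_KEY", "")
    if not key:
        problems.append("LANGSMITH_API_KEY is required")
    elif not key.startswith("lsv2_"):
        problems.append("LANGSMITH_API_KEY does not look like an lsv2_* key")
    # LANGSMITH_PROJECT and LANGSMITH_WORKSPACE_ID are optional, so no checks
    return problems

# Usage: check_langsmith_env(os.environ)
```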

### Dependencies

```bash
pip install langsmith click rich python-dotenv
```

## Usage

Navigate to `skills/langsmith-dataset/scripts/` to run commands.

## Scripts

- `generate_datasets.py` - Create evaluation datasets from traces
- `query_datasets.py` - View and inspect datasets

## Common Flags

All dataset generation commands support:

- `--root-run-name <name>` - Filter traces by root run name (e.g., "LangGraph" for DeepAgents)
- `--limit <n>` - Number of traces to process (default: 30)
- `--last-n-minutes <n>` - Only include recent traces
- `--output <path>` - Output file (`.json` or `.csv`)
- `--upload <name>` - Upload to LangSmith under this dataset name
- `--replace` - Overwrite an existing file/dataset (will prompt for confirmation)
- `--yes` - Skip confirmation prompts (use with caution)

**IMPORTANT - Safety Prompts:**

- The script prompts for confirmation before deleting existing datasets with `--replace`
- ALWAYS respect these prompts - wait for user input before proceeding
- NEVER use the `--yes` flag unless the user explicitly requests it
- The `--yes` flag skips all safety prompts and should only be used in automated workflows when the user has explicitly authorized it
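The prompt-then-confirm behavior described above can be pictured as a small gate (illustrative only; the real logic lives in `generate_datasets.py` and may differ in detail):

```python
def confirm_replace(target: str, replace: bool, yes: bool, ask=input) -> bool:
    """Return True only when overwriting `target` has been confirmed."""
    if not replace:
        return False   # without --replace, never overwrite
    if yes:
        return True    # --yes: explicit automation opt-in, skips the prompt
    answer = ask(f"Overwrite existing dataset '{target}'? [y/N] ")
    return answer.strip().lower() == "y"
```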

## Understanding Trace Hierarchy

Traces have depth levels based on parent-child relationships:
Depth 0: Root agent (e.g., "LangGraph")
  ├── Depth 1: Middleware/chains (model, tools, SummarizationMiddleware)
  │     ├── Depth 2: Tool calls (sql_db_query, retriever, etc.)
  │     └── Depth 2: LLM calls (ChatOpenAI, ChatAnthropic)
  └── Depth 3+: Nested subagent calls

Use `--root-run-name` to target specific agent frameworks:

- DeepAgents: `--root-run-name LangGraph`
- Custom agents: use your root node name
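Depth is simply the distance to the root via parent links. A hedged sketch of how such depths could be computed from exported run records (the flat-dict shape and the `parent_run_id` field name are assumptions for illustration, not this skill's API):

```python
def run_depths(runs):
    """Map run id -> depth (0 for roots), given dicts with 'id' and 'parent_run_id'."""
    parent = {r["id"]: r.get("parent_run_id") for r in runs}

    def depth(run_id):
        d = 0
        while parent.get(run_id) is not None:
            run_id = parent[run_id]  # climb one level toward the root
            d += 1
        return d

    return {rid: depth(rid) for rid in parent}
```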

## Dataset Types

### 1. Final Response

Full conversation with expected output - tests complete agent behavior.

```bash
# Basic usage
python generate_datasets.py --type final_response \
  --project my-project \
  --root-run-name LangGraph \
  --limit 30 \
  --output /tmp/final_response.json
```

```bash
# With custom output fields
python generate_datasets.py --type final_response \
  --project my-project \
  --output-fields "answer,result" \
  --output /tmp/final.json
```

```bash
# Messages only (ignore output dict keys)
python generate_datasets.py --type final_response \
  --project my-project \
  --messages-only \
  --output /tmp/final.json
```

**Structure:**

```json
{
  "trace_id": "...",
  "inputs": {"query": "What are the top 3 genres?"},
  "outputs": {
    "expected_response": "The top 3 genres based on the number of tracks are:\n\n1. Rock with 1,297 tracks\n2. Latin with 579 tracks\n3. Metal with 374 tracks"
  }
}
```

**Extraction Priority:**

1. Messages from the root run (AI responses with content)
2. User-specified output fields (`--output-fields`)
3. Common keys (`answer`, `output`)
4. Full output dict

**Important:** The root run is always checked first for the final response, to avoid picking up intermediate tool outputs.
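The priority order reads as a simple fallback chain. An illustrative sketch (the real implementation is in `generate_datasets.py`; the dict shapes here are assumptions):

```python
def extract_final_response(outputs, output_fields=None):
    """Pick the expected response following the documented priority order."""
    # 1. Latest AI message with content from the root run
    for msg in reversed(outputs.get("messages", [])):
        if msg.get("type") == "ai" and msg.get("content"):
            return msg["content"]
    # 2. User-specified output fields (--output-fields)
    for field in output_fields or []:
        if outputs.get(field):
            return outputs[field]
    # 3. Common keys
    for key in ("answer", "output"):
        if outputs.get(key):
            return outputs[key]
    # 4. Fall back to the full output dict
    return outputs
```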

### 2. Single Step

Single node inputs/outputs - tests any specific node's behavior. Supports multiple occurrences per trace to capture conversation evolution.

```bash
# Extract all occurrences (default)
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name model \
  --output /tmp/single_step.json
```

```bash
# Sample 2 occurrences per trace
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name model \
  --sample-per-trace 2 \
  --output /tmp/single_step_sampled.json
```

```bash
# Target a specific tool at depth 2
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name sql_db_query \
  --output /tmp/sql_query.json
```

**Structure:**

```json
{
  "trace_id": "...",
  "run_id": "...",
  "occurrence": 2,
  "inputs": {
    "messages": [
      {"type": "human", "content": "What are the top 3 genres?"},
      {"type": "ai", "content": "", "tool_calls": [...]},
      {"type": "tool", "content": "...results..."},
      ...
    ]
  },
  "outputs": {
    "expected_output": {
      "messages": [
        {"type": "ai", "content": "", "tool_calls": [...]}
      ]
    },
    "node_name": "model"
  }
}
```

**Key Features:**

- The `occurrence` field tracks which invocation this is (1st, 2nd, 3rd, etc.)
- Later occurrences carry more conversation history → tests context handling
- `--sample-per-trace` randomly samples N occurrences per trace
- Use `--run-name` to target any node at any depth

**Common targets:**

- `model` (depth 1) - LLM invocations with growing context
- `tools` (depth 1) - Tool execution chain
- Any custom node name
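The `--sample-per-trace` behavior can be sketched as follows (assumed semantics: at most N randomly chosen occurrences per trace, kept in occurrence order; the actual script may differ):

```python
import random

def sample_per_trace(occurrences, n, seed=None):
    """Occurrences are examples from one trace, each carrying an 'occurrence' field."""
    if len(occurrences) <= n:
        return list(occurrences)
    rng = random.Random(seed)  # seed only for reproducible demos
    picked = rng.sample(occurrences, n)
    return sorted(picked, key=lambda ex: ex["occurrence"])
```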

### 3. Trajectory

Tool call sequence - tests execution path with configurable depth.

```bash
# Include all tool calls (all depths)
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --limit 30 \
  --output /tmp/trajectory_all.json
```

```bash
# Only tool calls up to depth 2
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --depth 2 \
  --output /tmp/trajectory_depth2.json
```

```bash
# Only root-level tool calls (depth 0) - usually empty if tools are at depth 2+
python generate_datasets.py --type trajectory \
  --project my-project \
  --depth 0 \
  --output /tmp/trajectory_root.json
```

**Structure:**

```json
{
  "trace_id": "...",
  "inputs": {"query": "What are the top 3 genres?"},
  "outputs": {
    "expected_trajectory": [
      "sql_db_list_tables",
      "sql_db_schema",
      "sql_db_query_checker",
      "sql_db_query"
    ]
  }
}
```

**Depth Control:**

- Omit `--depth` = all levels (includes subagent tool calls)
- `--depth 2` = root + 2 levels (typically captures all main tools)
- `--depth 1` = often only middleware/chains, no actual tool calls
- `--depth 0` = root only (no tool calls)

**Note:** Tool calls typically sit at depth 2 in the LangGraph/DeepAgents architecture.
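The depth rules above amount to a depth-limited walk over the run tree. A hedged sketch (the nested-dict run tree with a `children` field is an assumption for illustration):

```python
def extract_trajectory(run, max_depth=None, depth=0):
    """Collect tool run names in execution order, down to max_depth (None = all levels)."""
    if max_depth is not None and depth > max_depth:
        return []
    names = []
    if run.get("run_type") == "tool":
        names.append(run["name"])
    for child in run.get("children", []):
        names += extract_trajectory(child, max_depth, depth + 1)
    return names
```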

### 4. RAG

Question/chunks/answer/citations - tests retrieval quality.
```bash
python generate_datasets.py --type rag \
  --project my-project \
  --limit 30 \
  --output /tmp/rag_ds.csv  # Supports .json or .csv
```

**Structure (CSV format):**

```csv
question,retrieved_chunks,answer,cited_chunks
"How do I...","Chunk 1\n\nChunk 2","The answer is...","[\"Chunk 1\"]"
```
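The `cited_chunks` column holds a JSON-encoded list, which is why its quotes appear escaped. A sketch of producing that layout with the standard library (the row shape is an assumption based on the columns shown):

```python
import csv, io, json

def write_rag_csv(rows):
    """Each row: question, retrieved_chunks (list of str), answer, cited_chunks (list)."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["question", "retrieved_chunks", "answer", "cited_chunks"]
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "question": row["question"],
            "retrieved_chunks": "\n\n".join(row["retrieved_chunks"]),
            "answer": row["answer"],
            "cited_chunks": json.dumps(row["cited_chunks"]),  # JSON-encoded list
        })
    return buf.getvalue()
```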

## Output Formats

All dataset types support both JSON and CSV:

```bash
# JSON output (default)
python generate_datasets.py --type trajectory --project my-project --output ds.json

# CSV output (use the .csv extension)
python generate_datasets.py --type trajectory --project my-project --output ds.csv
```

## Upload to LangSmith

```bash
# Generate and upload in one command
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --limit 50 \
  --output /tmp/trajectory_ds.json \
  --upload "Skills: Trajectory"

# Use --replace to overwrite an existing dataset
python generate_datasets.py --type final_response \
  --project my-project \
  --output /tmp/final.json \
  --upload "Skills: Final Response" \
  --replace
```

**Naming Convention:** Use "Skills: <Type>" format for consistency:
- "Skills: Final Response"
- "Skills: Single Step (model)"
- "Skills: Single Step (sql_db_query)"
- "Skills: Trajectory (all depths)"
- "Skills: Trajectory (depth=2)"

## Query Datasets

```bash
# List all datasets
python query_datasets.py list-datasets

# Filter by name pattern
python query_datasets.py list-datasets | grep "Skills:"

# View dataset examples
python query_datasets.py show "Skills: Trajectory" --limit 5

# View a local file
python query_datasets.py view-file /tmp/trajectory_ds.json --limit 3

# Analyze structure
python query_datasets.py structure /tmp/trajectory_ds.json

# Export from LangSmith to a local file
python query_datasets.py export "Skills: Final Response" /tmp/exported.json --limit 100
```

## Tips for Dataset Generation

1. **Always use `--root-run-name`** - Filter for a specific agent framework (e.g., "LangGraph")
2. **Start with successful traces** - Use recent successful runs for baseline datasets
3. **Use time windows** - `--last-n-minutes 1440` covers the last 24 hours of data
4. **Sample for single_step** - Use `--sample-per-trace 2` to capture conversation evolution
5. **Match depth to needs** - `--depth 2` typically captures all main tool calls
6. **Review before upload** - Use `query_datasets.py view-file` to inspect first
7. **Iterative refinement** - Generate small batches (10-20) first, validate, then scale up
8. **Use `--replace` carefully** - Overwrites existing datasets; useful for iteration

## Example Workflow

```bash
# 1. Generate fresh traces (if needed)
python tests/test_agent.py --batch  # Your test agent
```

```bash
# 2. Generate all dataset types from LangGraph traces
python generate_datasets.py --type final_response \
  --project skills --root-run-name LangGraph --limit 10 \
  --output /tmp/final.json --upload "Skills: Final Response" --replace

python generate_datasets.py --type single_step \
  --project skills --root-run-name LangGraph --run-name model \
  --sample-per-trace 2 --limit 10 \
  --output /tmp/model.json --upload "Skills: Single Step (model)" --replace

python generate_datasets.py --type trajectory \
  --project skills --root-run-name LangGraph --limit 10 \
  --output /tmp/traj.json --upload "Skills: Trajectory (all depths)" --replace

python generate_datasets.py --type trajectory \
  --project skills --root-run-name LangGraph --depth 2 --limit 10 \
  --output /tmp/traj_d2.json --upload "Skills: Trajectory (depth=2)" --replace
```

```bash
# 3. Review in the LangSmith UI:
#    visit https://smith.langchain.com → Datasets → filter for "Skills:"

# 4. Query locally if needed
python query_datasets.py show "Skills: Final Response" --limit 3
```

## Troubleshooting

**Empty final_response outputs:**

- Ensure `--root-run-name` matches your agent's root node
- Check that the root run has messages with AI responses
- Use `--messages-only` if the output dict is empty

**No trajectory examples:**

- Tools might sit at a different depth - try removing `--depth`, or use `--depth 2`
- Verify that tool calls exist: `python query_traces.py trace <id> --show-hierarchy`

**Too many single_step examples:**

- Use `--sample-per-trace 2` to limit examples per trace; this reduces dataset size while maintaining diversity

**Dataset upload fails:**

- Check that the dataset doesn't already exist, or use `--replace`
- Verify that LANGSMITH_API_KEY is set

## Related Skills

- Use the langsmith-trace skill to query and export traces
- Use the langsmith-evaluator skill to create evaluators and measure performance