# LangSmith Dataset

Auto-generate evaluation datasets from LangSmith traces for testing and validation.
## Setup

### Environment Variables

```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here   # Required
LANGSMITH_PROJECT=your-project-name           # Optional: default project
LANGSMITH_WORKSPACE_ID=your-workspace-id      # Optional: for org-scoped keys
```

### Dependencies

```bash
pip install langsmith click rich python-dotenv
```

## Usage
Navigate to `skills/langsmith-dataset/scripts/` to run commands.

### Scripts

- `generate_datasets.py` - generate datasets from traces
- `query_datasets.py` - query and inspect generated datasets

### Common Flags
All dataset generation commands support:

- `--root-run-name <name>` - Filter traces by root run name (e.g., "LangGraph" for DeepAgents)
- `--limit <n>` - Number of traces to process (default: 30)
- `--last-n-minutes <n>` - Only recent traces
- `--output <path>` - Output file (.json or .csv)
- `--upload <name>` - Upload to LangSmith with this dataset name
- `--replace` - Overwrite existing file/dataset (will prompt for confirmation)
- `--yes` - Skip confirmation prompts (use with caution)

**IMPORTANT - Safety Prompts:**

- The script prompts for confirmation before deleting existing datasets with `--replace`
- ALWAYS respect these prompts - wait for user input before proceeding
- NEVER use the `--yes` flag unless the user explicitly requests it
- The `--yes` flag skips all safety prompts and should only be used in automated workflows when explicitly authorized by the user
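For intuition, the flag handling and safety-prompt behavior described above can be sketched in plain Python. This is a hypothetical reconstruction for illustration only - the actual script is built on `click`, and none of the names below come from its source:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the common flags; names mirror the
    # documented CLI, not the script's real implementation.
    p = argparse.ArgumentParser(description="Generate datasets from LangSmith traces")
    p.add_argument("--root-run-name", help='Filter traces by root run name, e.g. "LangGraph"')
    p.add_argument("--limit", type=int, default=30, help="Number of traces to process")
    p.add_argument("--last-n-minutes", type=int, help="Only traces from the last N minutes")
    p.add_argument("--output", help="Output file (.json or .csv)")
    p.add_argument("--upload", help="Upload to LangSmith under this dataset name")
    p.add_argument("--replace", action="store_true", help="Overwrite existing file/dataset")
    p.add_argument("--yes", action="store_true", help="Skip confirmation prompts")
    return p


def confirm_destructive(args: argparse.Namespace, ask=input) -> bool:
    # --replace triggers an interactive prompt unless --yes was explicitly
    # passed; this is the safety behavior the warnings above describe.
    if args.replace and not args.yes:
        return ask("Dataset exists. Overwrite? [y/N] ").strip().lower() == "y"
    return True
```

The key point is the interaction of the two flags: `--replace` alone still stops for confirmation, and only an explicit `--yes` bypasses the prompt.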
## Understanding Trace Hierarchy

Traces have depth levels based on parent-child relationships:

```
Depth 0: Root agent (e.g., "LangGraph")
├── Depth 1: Middleware/chains (model, tools, SummarizationMiddleware)
│   ├── Depth 2: Tool calls (sql_db_query, retriever, etc.)
│   └── Depth 2: LLM calls (ChatOpenAI, ChatAnthropic)
└── Depth 3+: Nested subagent calls
```

Use `--root-run-name` to target specific agent frameworks:

- DeepAgents: `--root-run-name LangGraph`
- Custom agents: Use your root node name
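The depth of a run is simply the number of parent hops back to the root. A minimal sketch over a hypothetical map of run IDs to parent IDs (the real script reads these relationships from the LangSmith API):

```python
from typing import Optional


def run_depth(run_id: str, parents: dict[str, Optional[str]]) -> int:
    """Depth = number of parent hops from this run up to the root (depth 0)."""
    depth = 0
    parent = parents[run_id]
    while parent is not None:
        depth += 1
        parent = parents[parent]
    return depth


# Hypothetical trace: root agent -> "model" node -> ChatOpenAI call
parents = {"root": None, "model": "root", "chat_openai": "model"}
```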
## Dataset Types

### 1. Final Response

Full conversation with expected output - tests complete agent behavior.

```bash
# Basic usage
python generate_datasets.py --type final_response \
  --project my-project \
  --root-run-name LangGraph \
  --limit 30 \
  --output /tmp/final_response.json
```
```bash
# With custom output fields
python generate_datasets.py --type final_response \
  --project my-project \
  --output-fields "answer,result" \
  --output /tmp/final.json

# Messages only (ignore output dict keys)
python generate_datasets.py --type final_response \
  --project my-project \
  --messages-only \
  --output /tmp/final.json
```
**Structure:**

```json
{
  "trace_id": "...",
  "inputs": {"query": "What are the top 3 genres?"},
  "outputs": {
    "expected_response": "The top 3 genres based on the number of tracks are:\n\n1. Rock with 1,297 tracks\n2. Latin with 579 tracks\n3. Metal with 374 tracks"
  }
}
```

**Extraction Priority:**

1. Messages from the root run (AI responses with content)
2. User-specified output fields (`--output-fields`)
3. Common keys (answer, output)
4. Full output dict

**Important:** The script always checks the root run first for the final response, to avoid picking up intermediate tool outputs.
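The generated JSON file is a plain list of example objects, so it can be inspected with the standard library alone. A minimal sketch, assuming the list-of-objects layout shown above (key names taken from this document):

```python
import json
from pathlib import Path


def load_examples(path: str) -> list[dict]:
    """Load a generated dataset file and sanity-check the expected keys."""
    examples = json.loads(Path(path).read_text())
    for ex in examples:
        # Every example should carry inputs and outputs, whatever the type.
        assert "inputs" in ex and "outputs" in ex, f"malformed example: {ex}"
    return examples
```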
### 2. Single Step

Single node inputs/outputs - tests any specific node's behavior. Supports multiple occurrences per trace to capture conversation evolution.

```bash
# Extract all occurrences (default)
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name model \
  --output /tmp/single_step.json

# Sample 2 occurrences per trace
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name model \
  --sample-per-trace 2 \
  --output /tmp/single_step_sampled.json

# Target a specific tool at depth 2
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name sql_db_query \
  --output /tmp/sql_query.json
```
**Structure:**

```json
{
  "trace_id": "...",
  "run_id": "...",
  "occurrence": 2,
  "inputs": {
    "messages": [
      {"type": "human", "content": "What are the top 3 genres?"},
      {"type": "ai", "content": "", "tool_calls": [...]},
      {"type": "tool", "content": "...results..."},
      ...
    ]
  },
  "outputs": {
    "expected_output": {
      "messages": [
        {"type": "ai", "content": "", "tool_calls": [...]}
      ]
    },
    "node_name": "model"
  }
}
```

**Key Features:**

- The `occurrence` field tracks which invocation this is (1st, 2nd, 3rd, etc.)
- Later occurrences have more conversation history → tests context handling
- `--sample-per-trace` randomly samples N occurrences per trace
- Use `--run-name` to target any node at any depth

**Common targets:**

- `model` (depth 1) - LLM invocations with growing context
- `tools` (depth 1) - Tool execution chain
- Any custom node name
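The per-trace sampling behind `--sample-per-trace` can be pictured as grouping examples by `trace_id` and drawing at most N from each group. A simplified sketch - the script's actual sampling logic may differ:

```python
import random
from collections import defaultdict


def sample_per_trace(examples: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Keep at most n occurrences per trace_id, sampled at random."""
    by_trace: dict[str, list[dict]] = defaultdict(list)
    for ex in examples:
        by_trace[ex["trace_id"]].append(ex)
    rng = random.Random(seed)  # seeded for reproducible sampling
    sampled: list[dict] = []
    for group in by_trace.values():
        sampled.extend(group if len(group) <= n else rng.sample(group, n))
    return sampled
```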
### 3. Trajectory

Tool call sequence - tests the execution path with configurable depth.

```bash
# Include all tool calls (all depths)
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --limit 30 \
  --output /tmp/trajectory_all.json

# Only tool calls up to depth 2
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --depth 2 \
  --output /tmp/trajectory_depth2.json

# Only root-level tool calls (depth 0) - usually empty if tools are at depth 2+
python generate_datasets.py --type trajectory \
  --project my-project \
  --depth 0 \
  --output /tmp/trajectory_root.json
```
**Structure:**

```json
{
  "trace_id": "...",
  "inputs": {"query": "What are the top 3 genres?"},
  "outputs": {
    "expected_trajectory": [
      "sql_db_list_tables",
      "sql_db_schema",
      "sql_db_query_checker",
      "sql_db_query"
    ]
  }
}
```

**Depth Control:**

- Omit `--depth` = all levels (includes subagent tool calls)
- `--depth 2` = root + 2 levels (typical for capturing all main tools)
- `--depth 1` = often only middleware/chains, no actual tool calls
- `--depth 0` = root only (no tool calls)

**Note:** Tool calls are typically at depth 2 in the LangGraph/DeepAgents architecture.
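A dataset like this typically feeds a trajectory evaluator. As a minimal sketch, here are two common matching strategies - strict equality and in-order (subsequence) matching, which tolerates extra tool calls. This is generic illustration code, not tied to any particular evaluation library:

```python
def exact_match(expected: list[str], actual: list[str]) -> bool:
    """Trajectories must be identical, call for call."""
    return expected == actual


def in_order_match(expected: list[str], actual: list[str]) -> bool:
    """True if expected appears within actual in order, allowing extra calls."""
    it = iter(actual)
    # Membership tests consume the iterator, so order is enforced.
    return all(step in it for step in expected)
```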
### 4. RAG

Question/chunks/answer/citations - tests retrieval quality.

```bash
python generate_datasets.py --type rag \
  --project my-project \
  --limit 30 \
  --output /tmp/rag_ds.csv   # Supports .json or .csv
```

**Structure (CSV format):**

```csv
question,retrieved_chunks,answer,cited_chunks
"How do I...","Chunk 1\n\nChunk 2","The answer is...","[\"Chunk 1\"]"
```

## Output Formats

All dataset types support both JSON and CSV:

```bash
# JSON output (default)
python generate_datasets.py --type trajectory --project my-project --output ds.json

# CSV output (use .csv extension)
python generate_datasets.py --type trajectory --project my-project --output ds.csv
```
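Since both formats carry the same records, a reader can dispatch on the file extension. A small sketch, assuming the list-of-objects JSON layout and a header row in the CSV (as in the structures shown earlier):

```python
import csv
import json
from pathlib import Path


def read_dataset(path: str) -> list[dict]:
    """Load a generated dataset from a .json or .csv file into a list of dicts."""
    p = Path(path)
    if p.suffix == ".json":
        return json.loads(p.read_text())
    if p.suffix == ".csv":
        # DictReader uses the header row as keys, matching the CSV structure.
        with p.open(newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"unsupported extension: {p.suffix}")
```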
## Upload to LangSmith

```bash
# Generate and upload in one command
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --limit 50 \
  --output /tmp/trajectory_ds.json \
  --upload "Skills: Trajectory"

# Use --replace to overwrite an existing dataset
python generate_datasets.py --type final_response \
  --project my-project \
  --output /tmp/final.json \
  --upload "Skills: Final Response" \
  --replace
```

**Naming Convention:** Use the "Skills: <Type>" format for consistency:

- "Skills: Final Response"
- "Skills: Single Step (model)"
- "Skills: Single Step (sql_db_query)"
- "Skills: Trajectory (all depths)"
- "Skills: Trajectory (depth=2)"

## Query Datasets
```bash
# List all datasets
python query_datasets.py list-datasets

# Filter by name pattern
python query_datasets.py list-datasets | grep "Skills:"

# View dataset examples
python query_datasets.py show "Skills: Trajectory" --limit 5

# View a local file
python query_datasets.py view-file /tmp/trajectory_ds.json --limit 3

# Analyze structure
python query_datasets.py structure /tmp/trajectory_ds.json

# Export from LangSmith to a local file
python query_datasets.py export "Skills: Final Response" /tmp/exported.json --limit 100
```
## Tips for Dataset Generation

- **Always use `--root-run-name`** - Filter for the specific agent framework (e.g., "LangGraph")
- **Start with successful traces** - Use recent successful runs for baseline datasets
- **Use time windows** - `--last-n-minutes 1440` for the last 24 hours of data
- **Sample for single_step** - Use `--sample-per-trace 2` to capture conversation evolution
- **Match depth to needs** - `--depth 2` typically captures all main tool calls
- **Review before upload** - Use `query_datasets.py view-file` to inspect first
- **Iterative refinement** - Generate small batches (10-20) first, validate, then scale up
- **Use `--replace` carefully** - It overwrites existing datasets; useful for iteration
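The "review before upload" step can be partly automated with a quick schema check. A hypothetical validator for the structures documented above - the key names follow this document's examples, not an official schema:

```python
# Expected top-level output key per dataset type, per the structures above.
REQUIRED_OUTPUT_KEY = {
    "final_response": "expected_response",
    "single_step": "expected_output",
    "trajectory": "expected_trajectory",
}


def validate(examples: list[dict], dataset_type: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    key = REQUIRED_OUTPUT_KEY[dataset_type]
    problems = []
    for i, ex in enumerate(examples):
        if "inputs" not in ex:
            problems.append(f"example {i}: missing inputs")
        if key not in ex.get("outputs", {}):
            problems.append(f"example {i}: outputs missing {key!r}")
    return problems
```

Running this on a small batch before `--upload` catches malformed examples early, before they pollute a shared dataset.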
## Example Workflow

```bash
# 1. Generate fresh traces (if needed)
python tests/test_agent.py --batch   # Your test agent

# 2. Generate all dataset types from LangGraph traces
python generate_datasets.py --type final_response \
  --project skills --root-run-name LangGraph --limit 10 \
  --output /tmp/final.json --upload "Skills: Final Response" --replace

python generate_datasets.py --type single_step \
  --project skills --root-run-name LangGraph --run-name model \
  --sample-per-trace 2 --limit 10 \
  --output /tmp/model.json --upload "Skills: Single Step (model)" --replace

python generate_datasets.py --type trajectory \
  --project skills --root-run-name LangGraph --limit 10 \
  --output /tmp/traj.json --upload "Skills: Trajectory (all depths)" --replace

python generate_datasets.py --type trajectory \
  --project skills --root-run-name LangGraph --depth 2 --limit 10 \
  --output /tmp/traj_d2.json --upload "Skills: Trajectory (depth=2)" --replace

# 3. Review in the LangSmith UI:
#    visit https://smith.langchain.com → Datasets → filter for "Skills:"

# 4. Query locally if needed
python query_datasets.py show "Skills: Final Response" --limit 3
```
## Troubleshooting

**Empty final_response outputs:**

- Ensure `--root-run-name` matches your agent's root node
- Check that the root run has messages with AI responses
- Use `--messages-only` if the output dict is empty

**No trajectory examples:**

- Tools might be at a different depth - try removing `--depth`, or use `--depth 2`
- Verify tool calls exist: `python query_traces.py trace <id> --show-hierarchy`

**Too many single_step examples:**

- Use `--sample-per-trace 2` to limit examples per trace
- This reduces dataset size while maintaining diversity

**Dataset upload fails:**

- Check that the dataset doesn't already exist, or use `--replace`
- Verify that LANGSMITH_API_KEY is set
## Related Skills

- Use the langsmith-trace skill to query and export traces
- Use the langsmith-evaluator skill to create evaluators and measure performance