nemo-gym-pivot-datasets
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseNemo Gym Pivot Datasets
Nemo Gym Pivot数据集
Paper Reference
论文参考
This skill operationalizes PivotRL: create local
single-step pivot datasets from successful trajectories, prefer informative mixed-reward states,
and train with verifier-based local rewards rather than exact trajectory imitation.
本技能基于PivotRL实现:从成功的trajectory创建本地单步pivot数据集,优先选择信息丰富的混合奖励状态,并使用基于验证器的本地奖励进行训练,而非精确的trajectory模仿。
Invocation Check
调用场景判断
Use this skill when the task is to turn existing agent trajectories or rollout artifacts into a
Nemo Gym pivot dataset, or to validate whether a pivot JSONL/config pair can be used for
single-step local RL or evaluation.
Before writing a converter, inspect representative source rows and the target resource server.
Do not assume the source field names are the contract. Convert by reconstructing the semantic
pieces needed by Gym's Responses-style row format.
当需要将现有Agent trajectory或rollout工件转换为Nemo Gym pivot数据集,或验证某一pivot JSONL/配置对是否可用于单步本地强化学习(RL)或评估时,使用本技能。
编写转换器前,请检查代表性源数据行和目标资源服务器。不要假设源字段名称符合契约规范,应通过重构Gym的Responses风格行格式所需的语义信息来完成转换。
Core Workflow
核心工作流程
- Inspect the source data shape and count the candidate assistant decision points.
- Identify the semantic fields needed for each pivot:
- model-call input context before the pivot action
- available tools at that decision point
- expected assistant action
- reward/verifier target if it is separate from the demonstrated action
- optional provenance such as task id, source trajectory id, rollout id, uuid, depth, and original metadata
- Convert each accepted decision point into one pivot JSONL row.
- Generate or update the matching Gym config so the pivot-format JSONL can be used directly.
- Validate with the bundled validator and, when available, the target Gym resource-server models.
- Write metrics that make skipped rows, action types, tool names, depth, and provenance coverage easy to inspect.
- 检查源数据结构,统计候选助手决策点数量。
- 确定每个pivot所需的语义字段:
- pivot动作执行前的模型调用输入上下文
- 该决策点可用的工具
- 预期的助手动作
- 与演示动作分离的奖励/验证器目标(若存在)
- 可选溯源信息,如任务ID、源trajectory ID、rollout ID、UUID、深度及原始元数据
- 将每个被认可的决策点转换为一条pivot JSONL行。
- 生成或更新匹配的Gym配置,使pivot格式的JSONL可直接使用。
- 使用内置验证器(若可用,还需结合目标Gym资源服务器模型)进行验证。
- 编写指标,便于查看跳过的行、动作类型、工具名称、深度及溯源覆盖情况。
Row Shape
行结构
Read references/row-contract.md when implementing or reviewing a
converter. For , the essential row fields are:
single_step_tool_use_with_argument_comparison- : Responses API-style input and tool specs for the model call.
responses_create_params - : one
expected_actionor onefunction_call.message - : row-level agent routing that matches the generated config.
agent_ref
Do not copy optional null fields into ; omit them unless the target
contract explicitly wants them.
responses_create_paramsexpected_action实现或审核转换器时,请阅读references/row-contract.md。对于,核心行字段包括:
single_step_tool_use_with_argument_comparison- : 模型调用所需的Responses API风格输入和工具规格。
responses_create_params - : 一个
expected_action或一条function_call。message - : 与生成的配置匹配的行级Agent路由信息。
agent_ref
不要将可选的空字段复制到中;除非目标契约明确要求,否则应省略这些字段。
responses_create_paramsexpected_actionConversion Patterns
转换模式
Read references/conversion-patterns.md when the source data
is not already in pivot shape. The rule is to normalize by meaning, not by source container.
Useful reference scripts live under . They are copied from real conversions and
may contain dataset-specific paths, assumptions, or older branch behavior, so treat them as examples
to borrow from rather than canonical commands to run unchanged:
scripts/reference/- : generic source rows to pivot rows.
generic_pivot_dataset_reference.py - : chat-completion messages to pivot rows.
chat_messages_to_pivot_dataset_reference.py - : conversational message trajectories to pivot rows with reasoning/provenance handling.
conversational_messages_to_pivot_dataset_reference.py - : message/tool-use style rows to pivot rows.
tool_messages_to_pivot_dataset_reference.py
当源数据尚未采用pivot格式时,请阅读references/conversion-patterns.md。转换原则是按语义进行标准化,而非按源容器格式。
实用的参考脚本位于目录下,这些脚本来自实际转换场景,可能包含特定数据集的路径、假设或旧分支行为,因此应将其视为可借鉴的示例,而非无需修改即可运行的标准命令:
scripts/reference/- : 通用源数据行转pivot数据行。
generic_pivot_dataset_reference.py - : 聊天完成消息转pivot数据行。
chat_messages_to_pivot_dataset_reference.py - : 对话消息轨迹转包含推理/溯源处理的pivot数据行。
conversational_messages_to_pivot_dataset_reference.py - : 消息/工具使用风格行转pivot数据行。
tool_messages_to_pivot_dataset_reference.py
Pivot Selection
Pivot选择
Use clean, positive source trajectories for the demonstrated pivots. When multiple source
trajectories exist for a task, prefer tasks whose source trajectory group has mixed rewards
instead of all success or all failure; this avoids spending data on tasks that were trivial or
impossible for the source model. Treat that source-task filter as preferred, not mandatory, because
the source model and downstream policy may have different capabilities.
When possible, profile candidate pivots with local on-policy rollouts from the downstream or
initial policy. Use at least 8 sampled local rollouts per candidate as the default. Keep candidates
with mixed local rewards, discard all-1 and all-0 reward groups, and if data is abundant, drop the
easiest/high-pass-rate pivots first so training concentrates on hard but learnable states.
使用干净、正向的源trajectory作为演示pivot。当某一任务存在多个源trajectory时,优先选择源trajectory组具有混合奖励的任务,而非全成功或全失败的任务;这样可避免将数据浪费在源模型认为 trivial 或不可能完成的任务上。该源任务过滤规则为优先项而非强制项,因为源模型和下游策略的能力可能存在差异。
若可能,使用下游或初始策略的本地在线rollout对候选pivot进行分析。默认情况下,每个候选pivot至少使用8次采样本地rollout。保留具有混合本地奖励的候选pivot,丢弃全1和全0奖励组;若数据充足,先剔除最容易/通过率最高的pivot,使训练集中在难度适中但可学习的状态上。
Config And Training
配置与训练
Read references/config-training-and-agent-ref.md
when creating the Gym YAML or explaining how to train/evaluate from the dataset.
Key points:
- The pivot JSONL is the training/eval dataset; point the config's train dataset entry directly at it.
- in each row must match the agent block used by the config unless the launcher overrides routing intentionally.
agent_ref.name - is the main string-argument matching knob for the single-step tool-use verifier.
word_count_similarity_threshold - Use for these rows;
tool_choice: "auto"can route some inference engines into structured decoding paths.tool_choice: "required" - Validate configs and datasets together; a valid JSONL file can still be unusable if the agent/resource-server names do not line up.
创建Gym YAML配置或说明如何从数据集进行训练/评估时,请阅读references/config-training-and-agent-ref.md。
核心要点:
- pivot JSONL即为训练/评估数据集;将配置中的训练数据集条目直接指向该文件。
- 除非启动器有意覆盖路由,否则每行中的必须与配置使用的Agent块匹配。
agent_ref.name - 是单步工具使用验证器的主要字符串参数匹配参数。
word_count_similarity_threshold - 对这些行使用;
tool_choice: "auto"可引导部分推理引擎进入结构化解码路径。tool_choice: "required" - 需同时验证配置和数据集;即使JSONL文件有效,若Agent/资源服务器名称不匹配,仍无法使用。
Validation
验证
Run the bundled validator before calling a pivot dataset done:
bash
python scripts/validate_pivot_dataset.py --path /path/to/pivot.jsonl --agent-ref expected_agent_nameWhen the Gym repo is available, also validate against the resource-server Pydantic models:
bash
python scripts/validate_pivot_dataset.py \
--path /path/to/pivot.jsonl \
--agent-ref expected_agent_name \
--gym-repo /path/to/Gym-githubUse and only when a dataset-specific workflow needs extra
provenance checks. Provenance is useful for debugging and filtering, but it is not required by the
resource-server request model.
--require-field--require-any-fieldThe validator accepts both supported expected-action types by default ( and )
and prints an end summary split between tool-call and message pivots.
function_callmessage在完成pivot数据集前,运行内置验证器:
bash
python scripts/validate_pivot_dataset.py --path /path/to/pivot.jsonl --agent-ref expected_agent_name当Gym仓库可用时,还需结合资源服务器的Pydantic模型进行验证:
bash
python scripts/validate_pivot_dataset.py \
--path /path/to/pivot.jsonl \
--agent-ref expected_agent_name \
--gym-repo /path/to/Gym-github仅当特定数据集工作流需要额外溯源检查时,才使用和参数。溯源信息有助于调试和过滤,但并非资源服务器请求模型的必需项。
--require-field--require-any-field验证器默认支持两种预期动作类型(和),并在结束时按工具调用和消息pivot分类打印汇总信息。
function_callmessage