nemo-gym-pivot-datasets

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Nemo Gym Pivot Datasets

Nemo Gym Pivot数据集

Paper Reference

论文参考

This skill operationalizes PivotRL: create local single-step pivot datasets from successful trajectories, prefer informative mixed-reward states, and train with verifier-based local rewards rather than exact trajectory imitation.
本技能基于PivotRL实现:从成功的trajectory创建本地单步pivot数据集,优先选择信息丰富的混合奖励状态,并使用基于验证器的本地奖励进行训练,而非精确的trajectory模仿。

Invocation Check

调用场景判断

Use this skill when the task is to turn existing agent trajectories or rollout artifacts into a Nemo Gym pivot dataset, or to validate whether a pivot JSONL/config pair can be used for single-step local RL or evaluation.
Before writing a converter, inspect representative source rows and the target resource server. Do not assume the source field names are the contract. Convert by reconstructing the semantic pieces needed by Gym's Responses-style row format.
当需要将现有Agent trajectory或rollout工件转换为Nemo Gym pivot数据集,或验证某一pivot JSONL/配置对是否可用于单步本地强化学习(RL)或评估时,使用本技能。
编写转换器前,请检查代表性源数据行和目标资源服务器。不要假设源字段名称符合契约规范,应通过重构Gym的Responses风格行格式所需的语义信息来完成转换。

Core Workflow

核心工作流程

  1. Inspect the source data shape and count the candidate assistant decision points.
  2. Identify the semantic fields needed for each pivot:
  • model-call input context before the pivot action
  • available tools at that decision point
  • expected assistant action
  • reward/verifier target if it is separate from the demonstrated action
  • optional provenance such as task id, source trajectory id, rollout id, uuid, depth, and original metadata
  1. Convert each accepted decision point into one pivot JSONL row.
  2. Generate or update the matching Gym config so the pivot-format JSONL can be used directly.
  3. Validate with the bundled validator and, when available, the target Gym resource-server models.
  4. Write metrics that make skipped rows, action types, tool names, depth, and provenance coverage easy to inspect.
  1. 检查源数据结构,统计候选助手决策点数量。
  2. 确定每个pivot所需的语义字段:
  • pivot动作执行前的模型调用输入上下文
  • 该决策点可用的工具
  • 预期的助手动作
  • 与演示动作分离的奖励/验证器目标(若存在)
  • 可选溯源信息,如任务ID、源trajectory ID、rollout ID、UUID、深度及原始元数据
  1. 将每个被认可的决策点转换为一条pivot JSONL行。
  2. 生成或更新匹配的Gym配置,使pivot格式的JSONL可直接使用。
  3. 使用内置验证器(若可用,还需结合目标Gym资源服务器模型)进行验证。
  4. 编写指标,便于查看跳过的行、动作类型、工具名称、深度及溯源覆盖情况。

Row Shape

行结构

Read references/row-contract.md when implementing or reviewing a converter. For
single_step_tool_use_with_argument_comparison
, the essential row fields are:
  • responses_create_params
    : Responses API-style input and tool specs for the model call.
  • expected_action
    : one
    function_call
    or one
    message
    .
  • agent_ref
    : row-level agent routing that matches the generated config.
Do not copy optional null fields into
responses_create_params
; omit them unless the target contract explicitly wants them.
expected_action
is singular. If a source assistant turn has more than one tool call, filter that turn out of the pivot dataset and keep it only in a skipped-row audit if it needs review.
实现或审核转换器时,请阅读references/row-contract.md。对于
single_step_tool_use_with_argument_comparison
,核心行字段包括:
  • responses_create_params
    : 模型调用所需的Responses API风格输入和工具规格。
  • expected_action
    : 一个
    function_call
    或一条
    message
  • agent_ref
    : 与生成的配置匹配的行级Agent路由信息。
不要将可选的空字段复制到
responses_create_params
中;除非目标契约明确要求,否则应省略这些字段。
expected_action
为单个动作。若源助手回合包含多个工具调用,则将该回合从pivot数据集中过滤掉,仅在需审核时保留在跳过行审计记录中。

Conversion Patterns

转换模式

Read references/conversion-patterns.md when the source data is not already in pivot shape. The rule is to normalize by meaning, not by source container.
Useful reference scripts live under
scripts/reference/
. They are copied from real conversions and may contain dataset-specific paths, assumptions, or older branch behavior, so treat them as examples to borrow from rather than canonical commands to run unchanged:
  • generic_pivot_dataset_reference.py
    : generic source rows to pivot rows.
  • chat_messages_to_pivot_dataset_reference.py
    : chat-completion messages to pivot rows.
  • conversational_messages_to_pivot_dataset_reference.py
    : conversational message trajectories to pivot rows with reasoning/provenance handling.
  • tool_messages_to_pivot_dataset_reference.py
    : message/tool-use style rows to pivot rows.
当源数据尚未采用pivot格式时,请阅读references/conversion-patterns.md。转换原则是按语义进行标准化,而非按源容器格式。
实用的参考脚本位于
scripts/reference/
目录下,这些脚本来自实际转换场景,可能包含特定数据集的路径、假设或旧分支行为,因此应将其视为可借鉴的示例,而非无需修改即可运行的标准命令:
  • generic_pivot_dataset_reference.py
    : 通用源数据行转pivot数据行。
  • chat_messages_to_pivot_dataset_reference.py
    : 聊天完成消息转pivot数据行。
  • conversational_messages_to_pivot_dataset_reference.py
    : 对话消息轨迹转包含推理/溯源处理的pivot数据行。
  • tool_messages_to_pivot_dataset_reference.py
    : 消息/工具使用风格行转pivot数据行。

Pivot Selection

Pivot选择

Use clean, positive source trajectories for the demonstrated pivots. When multiple source trajectories exist for a task, prefer tasks whose source trajectory group has mixed rewards instead of all success or all failure; this avoids spending data on tasks that were trivial or impossible for the source model. Treat that source-task filter as preferred, not mandatory, because the source model and downstream policy may have different capabilities.
When possible, profile candidate pivots with local on-policy rollouts from the downstream or initial policy. Use at least 8 sampled local rollouts per candidate as the default. Keep candidates with mixed local rewards, discard all-1 and all-0 reward groups, and if data is abundant, drop the easiest/high-pass-rate pivots first so training concentrates on hard but learnable states.
使用干净、正向的源trajectory作为演示pivot。当某一任务存在多个源trajectory时,优先选择源trajectory组具有混合奖励的任务,而非全成功或全失败的任务;这样可避免将数据浪费在源模型认为 trivial 或不可能完成的任务上。该源任务过滤规则为优先项而非强制项,因为源模型和下游策略的能力可能存在差异。
若可能,使用下游或初始策略的本地在线rollout对候选pivot进行分析。默认情况下,每个候选pivot至少使用8次采样本地rollout。保留具有混合本地奖励的候选pivot,丢弃全1和全0奖励组;若数据充足,先剔除最容易/通过率最高的pivot,使训练集中在难度适中但可学习的状态上。

Config And Training

配置与训练

Read references/config-training-and-agent-ref.md when creating the Gym YAML or explaining how to train/evaluate from the dataset.
Key points:
  • The pivot JSONL is the training/eval dataset; point the config's train dataset entry directly at it.
  • agent_ref.name
    in each row must match the agent block used by the config unless the launcher overrides routing intentionally.
  • word_count_similarity_threshold
    is the main string-argument matching knob for the single-step tool-use verifier.
  • Use
    tool_choice: "auto"
    for these rows;
    tool_choice: "required"
    can route some inference engines into structured decoding paths.
  • Validate configs and datasets together; a valid JSONL file can still be unusable if the agent/resource-server names do not line up.
创建Gym YAML配置或说明如何从数据集进行训练/评估时,请阅读references/config-training-and-agent-ref.md
核心要点:
  • pivot JSONL即为训练/评估数据集;将配置中的训练数据集条目直接指向该文件。
  • 除非启动器有意覆盖路由,否则每行中的
    agent_ref.name
    必须与配置使用的Agent块匹配。
  • word_count_similarity_threshold
    是单步工具使用验证器的主要字符串参数匹配参数。
  • 对这些行使用
    tool_choice: "auto"
    tool_choice: "required"
    可引导部分推理引擎进入结构化解码路径。
  • 需同时验证配置和数据集;即使JSONL文件有效,若Agent/资源服务器名称不匹配,仍无法使用。

Validation

验证

Run the bundled validator before calling a pivot dataset done:
bash
python scripts/validate_pivot_dataset.py --path /path/to/pivot.jsonl --agent-ref expected_agent_name
When the Gym repo is available, also validate against the resource-server Pydantic models:
bash
python scripts/validate_pivot_dataset.py \
  --path /path/to/pivot.jsonl \
  --agent-ref expected_agent_name \
  --gym-repo /path/to/Gym-github
Use
--require-field
and
--require-any-field
only when a dataset-specific workflow needs extra provenance checks. Provenance is useful for debugging and filtering, but it is not required by the resource-server request model.
The validator accepts both supported expected-action types by default (
function_call
and
message
) and prints an end summary split between tool-call and message pivots.
在完成pivot数据集前,运行内置验证器:
bash
python scripts/validate_pivot_dataset.py --path /path/to/pivot.jsonl --agent-ref expected_agent_name
当Gym仓库可用时,还需结合资源服务器的Pydantic模型进行验证:
bash
python scripts/validate_pivot_dataset.py \
  --path /path/to/pivot.jsonl \
  --agent-ref expected_agent_name \
  --gym-repo /path/to/Gym-github
仅当特定数据集工作流需要额外溯源检查时,才使用
--require-field
--require-any-field
参数。溯源信息有助于调试和过滤,但并非资源服务器请求模型的必需项。
验证器默认支持两种预期动作类型(
function_call
message
),并在结束时按工具调用和消息pivot分类打印汇总信息。