nemo-gym-debugging

Nemo Gym Debugging

Invocation Check

Use this skill when something failed or looks suspicious in a Nemo Gym run. If the task is adding a new env, use the nemo-gym-env-integration skill; if it is changing profiling behavior, use the nemo-gym-reward-profiling skill.
Debug by classification, not by guessing. The first goal is to decide whether the issue is:
  • infra: Slurm, Ray, container, filesystem, network, ports
  • model serving: vLLM startup/readiness/throughput
  • config: wrong config bundle, missing agent, wrong extra args
  • data/schema: JSONL fields do not match verifier/resource server expectations
  • verifier/runtime: resource server exception or malformed verify response
  • cache/resume: stale materialized inputs or partial rollout output
  • throughput/resources: concurrency too high, judge bottleneck, tool/sandbox latency
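As a rough illustration of classifying before guessing, the sketch below maps a log line to one of the layers listed above using keyword patterns. The patterns and the `classify` helper are hypothetical and not part of Nemo Gym; they would need tuning against real logs. The first matching layer wins, so the order of the table matters.

```python
import re

# Hypothetical keyword patterns per layer (assumptions, not Nemo Gym output formats).
# Order matters: the first matching layer is returned.
LAYER_PATTERNS = [
    ("infra", r"\bslurm\b|\bray\b|\bcontainer\b|connection refused|address already in use"),
    ("model serving", r"\bvllm\b"),
    ("config", r"missing agent|unknown config|unrecognized argument"),
    ("data/schema", r"\b422\b|validation error|field required"),
    ("verifier/runtime", r"\bverify\b|traceback"),
    ("cache/resume", r"resume_from_cache|materialized"),
    ("throughput/resources", r"timed out|timeout|too many requests"),
]


def classify(log_line: str) -> str:
    """Return the first layer whose pattern matches, else 'unclassified'."""
    lowered = log_line.lower()
    for layer, pattern in LAYER_PATTERNS:
        if re.search(pattern, lowered):
            return layer
    return "unclassified"


for line in [
    "Slurm job 123 CANCELLED",
    "vLLM engine failed to start",
    "422 Unprocessable Entity: field required",
]:
    print(f"{classify(line):<16} <- {line}")
```

A keyword match is only a first guess at the failing layer; the evidence still has to come from the logs and files themselves.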

Debug Order

  1. Check Slurm/Ray job state and logs.
  2. Check vLLM readiness and /models availability.
  3. Check Gym server readiness: all expected servers started.
  4. Check tool routing if the env uses tools; check sandbox readiness only if a sandbox is configured.
  5. Check materialized inputs and source data timestamps.
  6. Check rollout output and profiling/metrics output counts.
  7. Inspect the first real verifier exception, not shutdown noise.
  8. Compare failing row schema against the resource server request model.
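Steps 6 and 8 can be done read-only. The sketch below counts rows in a JSONL file and reports which required top-level fields a row lacks. The file name and the field names (`prompt`, `answer`) are hypothetical placeholders; the real required fields come from the resource server's request model.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Set


def count_rows(path: Path) -> int:
    """Count non-empty lines in a JSONL file; a missing file counts as 0."""
    if not path.exists():
        return 0
    with path.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def missing_fields(row: dict, required: Iterable[str]) -> Set[str]:
    """Return the required top-level keys that are absent from a row."""
    return set(required) - set(row)


# Demo on a throwaway file; the field names are assumptions for illustration.
with tempfile.TemporaryDirectory() as tmp:
    rollouts = Path(tmp) / "rollouts.jsonl"
    rollouts.write_text(
        '{"prompt": "2+2?", "answer": "4"}\n{"prompt": "3+3?"}\n',
        encoding="utf-8",
    )
    row_count = count_rows(rollouts)
    rows = [
        json.loads(line)
        for line in rollouts.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    gaps = [sorted(missing_fields(row, {"prompt", "answer"})) for row in rows]

print(row_count, gaps)
```

Running the same count on the rollout output and on the profiling/metrics output shows quickly whether rows were dropped between the two stages.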

High-Value Suspects

  • If data changed and resume_from_cache was enabled, stale materialized inputs are a first-class suspect.
  • If rollout output has a few rows and profiling is empty, inspect verifier errors and partial-output cache.
  • If all servers are ready but verifier returns 422/500, inspect request body schema before debugging infra.
  • If tool envs hang or partially work, check tool ownership/loading before changing model settings; check sandbox readiness only when a sandbox is actually part of the env.
  • If tool-call rows fail before generation with vLLM grammar/schema errors, read references/vllm-tool-call-schema-checks.md and run a static tool-schema check before changing Gym wrappers.
  • If logs only show nested "inner server" 500s without the real provider/verifier body, first enable existing request-boundary visibility with ++global_aiohttp_client_request_debug=True. Read references/request-boundary-visibility.md before changing code.
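For the stale-input suspect in the first bullet, one cheap read-only check is to compare modification times of the source data and its materialized copy. This is a minimal sketch under assumptions: the `is_stale` helper and the file names are hypothetical, and mtime is only a heuristic, since copying or syncing files can reset timestamps.

```python
import os
import tempfile
from pathlib import Path


def is_stale(source: Path, materialized: Path) -> bool:
    """True when the materialized copy is missing or older than its source."""
    if not materialized.exists():
        return True
    return materialized.stat().st_mtime < source.stat().st_mtime


# Demo with throwaway files and explicit timestamps.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.jsonl"
    materialized = Path(tmp) / "materialized.jsonl"
    source.write_text("{}\n", encoding="utf-8")
    materialized.write_text("{}\n", encoding="utf-8")

    os.utime(materialized, (1_000, 1_000))  # materialized first
    os.utime(source, (2_000, 2_000))        # source edited afterwards
    stale_after_edit = is_stale(source, materialized)

    os.utime(materialized, (3_000, 3_000))  # re-materialized later
    stale_after_refresh = is_stale(source, materialized)

print(stale_after_edit, stale_after_refresh)
```

If the check reports a stale copy while resume_from_cache is enabled, treat the cached inputs as suspect before looking at the verifier or the model.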

Reference Loading

  • Read references/error-profiles.md to classify the failing layer before changing code or data.
  • Read references/diagnostic-snippets.md when you need copy-paste commands to inspect logs, output counts, materialized inputs, rollout JSONL shape, server readiness, or reward summaries without mutating run state.
  • Read references/vllm-tool-call-schema-checks.md when a tool-call dataset may be rejected by vLLM/Outlines grammar compilation before any meaningful generation happens.
  • Read references/request-boundary-visibility.md when /run 500s hide row identity or nested Gym 500s hide the inner model/verifier/provider error. It covers the existing Gym debug flag, shipped request-boundary markers, empty provider bodies, and vLLM provider-side escalation.

Communication Pattern

When reporting back, state:
  • observed symptom
  • failing layer
  • evidence from logs/files
  • likely cause
  • next concrete action