nemo-gym-debugging
Nemo Gym Debugging
Invocation Check
Use this skill when something failed or looks suspicious in a Nemo Gym run. If the task is adding a new env, use the `nemo-gym-env-integration` skill; if it is changing profiling behavior, use the `nemo-gym-reward-profiling` skill.
Debug by classification, not by guessing. The first goal is to decide whether the issue is:
- infra: Slurm, Ray, container, filesystem, network, ports
- model serving: vLLM startup/readiness/throughput
- config: wrong config bundle, missing agent, wrong extra args
- data/schema: JSONL fields do not match verifier/resource server expectations
- verifier/runtime: resource server exception or malformed verify response
- cache/resume: stale materialized inputs or partial rollout output
- throughput/resources: concurrency too high, judge bottleneck, tool/sandbox latency
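As a rough illustration of classifying before guessing, the sketch below maps a single log line to one of the layers listed above. The keyword patterns are hypothetical and are not taken from Nemo Gym; treat them as a starting point to tune against real logs, not as a definitive rule set.

```python
import re

# Hypothetical keyword patterns per layer (assumptions, not Nemo Gym output).
# Order matters: the first matching layer wins.
LAYER_PATTERNS = [
    ("infra", r"\bslurm|\bray\b|container|no such file|connection refused|address already in use"),
    ("model serving", r"\bvllm\b"),
    ("config", r"\bconfig|missing agent|unrecognized argument"),
    ("data/schema", r"\b422\b|validation error|field required"),
    ("verifier/runtime", r"\bverify\b|resource server|traceback"),
    ("cache/resume", r"resume_from_cache|materialized"),
    ("throughput/resources", r"timed out|timeout|too many requests|\bjudge\b"),
]


def classify_log_line(line: str) -> str:
    """Return the first layer whose pattern matches, or 'unclassified'."""
    lowered = line.lower()
    for layer, pattern in LAYER_PATTERNS:
        if re.search(pattern, lowered):
            return layer
    return "unclassified"
```

A line that matches nothing comes back as `unclassified`, which is a signal to read the surrounding log by hand and not to force it into a layer.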
Debug Order
- Check Slurm/Ray job state and logs.
- Check vLLM readiness and `/models` availability.
- Check Gym server readiness: all expected servers started.
- Check tool routing if the env uses tools; check sandbox readiness only if a sandbox is configured.
- Check materialized inputs and source data timestamps.
- Check rollout output and profiling/metrics output counts.
- Inspect the first real verifier exception, not shutdown noise.
- Compare failing row schema against the resource server request model.
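The output-count and row-schema steps above can be sketched with a few lines of read-only Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, rows are assumed to be JSON objects, and the required field set must come from the actual resource server request model, which is not reproduced here.

```python
from __future__ import annotations

import json
from pathlib import Path


def count_jsonl_rows(path: Path) -> int:
    """Count non-empty lines in a JSONL file; a missing file counts as zero."""
    if not path.exists():
        return 0
    with path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def first_schema_mismatch(path: Path, required: set[str]) -> tuple[int, set[str]] | None:
    """Return (line_number, missing_fields) for the first row lacking a required field.

    Returns None when every row carries all required top-level fields.
    """
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines without counting them as rows
            row = json.loads(line)
            if not isinstance(row, dict):
                # A non-object row cannot satisfy any field requirement.
                return line_number, set(required)
            missing = required - set(row)
            if missing:
                return line_number, missing
    return None
```

Comparing `count_jsonl_rows` on the rollout output against the same count on the input shows quickly whether rows were dropped; neither function writes to the run directory.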
High-Value Suspects
- If data changed and `resume_from_cache` was enabled, stale materialized inputs are a first-class suspect.
- If rollout output has a few rows and profiling is empty, inspect verifier errors and partial-output cache.
- If all servers are ready but verifier returns 422/500, inspect request body schema before debugging infra.
- If tool envs hang or partially work, check tool ownership/loading before changing model settings; check sandbox readiness only when a sandbox is actually part of the env.
- If tool-call rows fail before generation with vLLM grammar/schema errors, read `references/vllm-tool-call-schema-checks.md` and run a static tool-schema check before changing Gym wrappers.
- If logs only show nested "inner server" 500s without the real provider/verifier body, first enable existing request-boundary visibility with `++global_aiohttp_client_request_debug=True`. Read `references/request-boundary-visibility.md` before changing code.
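For the stale-input suspect, one read-only check is to compare modification times of the source data and its materialized copy. The sketch below assumes each materialized input is a single file derived from a single source file; where Nemo Gym actually stores materialized inputs is not stated here, so both paths are supplied by the caller.

```python
from pathlib import Path


def is_stale(materialized: Path, source: Path) -> bool:
    """True when the materialized copy is missing or older than its source."""
    if not materialized.exists():
        return True
    return source.stat().st_mtime > materialized.stat().st_mtime
```

Modification times are only a heuristic: a copy or sync step can reset them, so a `False` result does not prove the cache matches the current data.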
Reference Loading
- Read `references/error-profiles.md` to classify the failing layer before changing code or data.
- Read `references/diagnostic-snippets.md` when you need copy-paste commands to inspect logs, output counts, materialized inputs, rollout JSONL shape, server readiness, or reward summaries without mutating run state.
- Read `references/vllm-tool-call-schema-checks.md` when a tool-call dataset may be rejected by vLLM/Outlines grammar compilation before any meaningful generation happens.
- Read `references/request-boundary-visibility.md` when `/run` 500s hide row identity or nested Gym 500s hide the inner model/verifier/provider error. It covers the existing Gym debug flag, shipped request-boundary markers, empty provider bodies, and vLLM provider-side escalation.
Communication Pattern
When reporting back, state:
- observed symptom
- failing layer
- evidence from logs/files
- likely cause
- next concrete action
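If reports are assembled programmatically, the five items above map naturally onto a small record type. The class and field names below are hypothetical, chosen only to mirror the list; nothing in the skill requires this shape.

```python
from dataclasses import dataclass


@dataclass
class DebugReport:
    """One finding, with the five fields from the communication pattern."""

    symptom: str
    failing_layer: str
    evidence: str
    likely_cause: str
    next_action: str

    def render(self) -> str:
        """Format the report as five labeled lines, in the order listed above."""
        return "\n".join(
            [
                f"Observed symptom: {self.symptom}",
                f"Failing layer: {self.failing_layer}",
                f"Evidence: {self.evidence}",
                f"Likely cause: {self.likely_cause}",
                f"Next action: {self.next_action}",
            ]
        )
```

Making every field required means a report cannot be built without naming the failing layer and the evidence behind it.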