nemo-gym-debugging

Use this skill when something failed or looks suspicious in a Nemo Gym run. If the task is adding a new env, use the `nemo-gym-env-integration` skill; if it is changing profiling behavior, use the `nemo-gym-reward-profiling` skill.

Debug by classification, not by guessing. The first goal is to decide whether the issue is:

infra: Slurm, Ray, container, filesystem, network, ports
model serving: vLLM startup/readiness/throughput
config: wrong config bundle, missing agent, wrong extra args
data/schema: JSONL fields do not match verifier/resource server expectations
verifier/runtime: resource server exception or malformed verify response
cache/resume: stale materialized inputs or partial rollout output
throughput/resources: concurrency too high, judge bottleneck, tool/sandbox latency
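
As a rough illustration of a first-pass triage over the categories above, the sketch below tags a log line with the first layer whose keywords match. The keyword patterns and the `classify` helper are hypothetical, not part of Nemo Gym; adjust them to the messages your logs actually contain, and treat the result as a hint about where to look, not a verdict.

```python
import re

# Hypothetical keyword patterns per layer, checked in order.
# They are illustrative only; tune them to your own logs.
LAYER_PATTERNS = [
    ("infra", r"slurm|\bray\b|container|no space left|connection refused|address already in use"),
    ("model serving", r"vllm"),
    ("data/schema", r"\b422\b|validation error|missing field"),
    ("verifier/runtime", r"verif|traceback"),
]


def classify(line: str) -> str:
    """Return the first layer whose pattern matches the line, else 'unclassified'."""
    lowered = line.lower()
    for layer, pattern in LAYER_PATTERNS:
        if re.search(pattern, lowered):
            return layer
    return "unclassified"


if __name__ == "__main__":
    for sample in [
        "vLLM server not ready",
        "verifier returned 422 Unprocessable Entity",
        "all good",
    ]:
        print(f"{classify(sample):>16}  {sample}")
```

The patterns are ordered so that a 422 from the verifier lands in data/schema rather than verifier/runtime, which matches the advice to inspect the request body schema first.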

Check Slurm/Ray job state and logs.
Check vLLM readiness and `/models` availability.
Check Gym server readiness: all expected servers started.
Check tool routing if the env uses tools; check sandbox readiness only if a sandbox is configured.
Check materialized inputs and source data timestamps.
Check rollout output and profiling/metrics output counts.
Inspect the first real verifier exception, not shutdown noise.
Compare failing row schema against the resource server request model.
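
The input-timestamp and output-count checks above can be sketched as a small read-only script. This is a minimal sketch under stated assumptions: it assumes inputs and rollout output are JSONL files, and the three paths passed to `summarize` are hypothetical placeholders for wherever your run keeps its source data, materialized inputs, and rollout output.

```python
from pathlib import Path


def count_jsonl_rows(path: Path) -> int:
    """Count non-empty lines in a JSONL file without modifying it."""
    with path.open("r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def is_stale(source: Path, materialized: Path) -> bool:
    """True when the source data was modified after the materialized copy."""
    return source.stat().st_mtime > materialized.stat().st_mtime


def summarize(source: Path, materialized: Path, rollout: Path) -> dict:
    """Collect the counts and staleness flag needed for a first look."""
    return {
        "materialized_rows": count_jsonl_rows(materialized),
        "rollout_rows": count_jsonl_rows(rollout),
        "stale_inputs": is_stale(source, materialized),
    }
```

A rollout row count far below the materialized row count, or `stale_inputs` being true, points at the cache/resume and verifier layers before anything else.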

If data changed and `resume_from_cache` was enabled, stale materialized inputs are a first-class suspect.
If rollout output has a few rows and profiling is empty, inspect verifier errors and partial-output cache.
If all servers are ready but verifier returns 422/500, inspect request body schema before debugging infra.
If tool envs hang or partially work, check tool ownership/loading before changing model settings; check sandbox readiness only when a sandbox is actually part of the env.
If tool-call rows fail before generation with vLLM grammar/schema errors, read `references/vllm-tool-call-schema-checks.md` and run a static tool-schema check before changing Gym wrappers.
If logs only show nested "inner server" 500s without the real provider/verifier body, first enable existing request-boundary visibility with `++global_aiohttp_client_request_debug=True`. Read `references/request-boundary-visibility.md` before changing code.
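
For the 422/500 case, a quick way to inspect the request body schema is to scan the input JSONL for the first row that lacks a field the resource server expects. The sketch below assumes a flat set of required top-level keys; the field names in `REQUIRED_FIELDS` are hypothetical and should be replaced with the ones from the actual resource server request model.

```python
import json

# Hypothetical required fields; copy the real ones from the
# resource server's request model before relying on this check.
REQUIRED_FIELDS = {"prompt", "expected_answer"}


def missing_fields(row: dict, required: set) -> list:
    """Return the required top-level keys that the row does not contain."""
    return sorted(required - set(row))


def first_bad_row(jsonl_text: str, required: set = REQUIRED_FIELDS):
    """Return (line_number, missing_keys) for the first incomplete row, or None."""
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines rather than failing on them
        row = json.loads(line)
        missing = missing_fields(row, required)
        if missing:
            return line_number, missing
    return None
```

This only checks for the presence of top-level keys. Nested structure and value types still need to be compared against the request model by hand or with the server's own validation.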

Read `references/error-profiles.md` to classify the failing layer before changing code or data.
Read `references/diagnostic-snippets.md` when you need copy-paste commands to inspect logs, output counts, materialized inputs, rollout JSONL shape, server readiness, or reward summaries without mutating run state.
Read `references/vllm-tool-call-schema-checks.md` when a tool-call dataset may be rejected by vLLM/Outlines grammar compilation before any meaningful generation happens.
Read `references/request-boundary-visibility.md` when `/run` 500s hide row identity or nested Gym 500s hide the inner model/verifier/provider error. It covers the existing Gym debug flag, shipped request-boundary markers, empty provider bodies, and vLLM provider-side escalation.

When reporting back, state:
