tao-train-foundation-stereo

Stereo depth estimation using FoundationStereo. Predicts disparity maps from stereo image pairs for 3D reconstruction. Use when training, evaluating, exporting, or running inference for a TAO FoundationStereo model. Trigger phrases include "train stereo depth", "FoundationStereo", "stereo disparity estimation", "3D reconstruction from stereo".

NPX Install

npx skill4agent add promptingcompany/nv-skills tao-train-foundation-stereo

Depth Net Stereo

Stereo depth estimation using FoundationStereo architecture. Predicts disparity maps from stereo image pairs for 3D reconstruction.
Uses pretrained Depth Anything v2 and EdgeNeXt encoders. Set model.stereo_backbone.depth_anything_v2_pretrained_path and model.stereo_backbone.edgenext_pretrained_path.
The mono and stereo skills both invoke the unified TAO depth_net CLI inside the container; the mono/stereo family is selected via model.model_type (e.g., FoundationStereo).
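As a rough illustration of the settings named above, the model section of a spec might look like the fragment below. The nesting simply follows the dotted key paths used in this document, and the checkpoint paths are placeholders rather than real locations.

```yaml
# Sketch only: nesting inferred from the dotted key names in this document.
model:
  model_type: FoundationStereo
  stereo_backbone:
    depth_anything_v2_pretrained_path: <path to Depth Anything v2 checkpoint>
    edgenext_pretrained_path: <path to EdgeNeXt checkpoint>
```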
For TAO Deploy TensorRT actions (gen_trt_engine, TensorRT evaluate, and TensorRT inference), read references/tao-deploy-foundation-stereo.md first. The deploy spec template lives in this skill's references/spec_template_deploy.yaml.

Train Action Policy

This model is AutoML-enabled at the model layer. Before handling any train-stage request, read references/skill_info.yaml and resolve the run override from either an explicit automl_policy value or the user's workflow request. Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as automl_policy: off for this run only; otherwise default to auto.
When automl_policy: auto, automl_enabled: true, and both schemas/train.schema.json and references/spec_template_train.yaml are packaged, route the train action through tao-skill-bank:tao-run-automl by default with this model's skill_dir. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and automl_policy. Use direct model training only when automl_policy: off or the packaged train schema/template is missing; in the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.
Non-train actions such as evaluate, inference, export, and deploy flows stay in this model skill. The per-run automl_policy override does not change model metadata.

Workflow

Prerequisites — data accessibility

Your dataset (left + right images + GT disparity) must be reachable from inside the container:
  • SDK runner: place files at the S3 paths the runner resolves (the S3_TRAIN / S3_EVAL placeholders shown in Typical Spec Overrides). The runner handles S3 → container-path mounting transparently.
  • Direct docker run (e.g. local testing): mount the host dataset root read-only at the same in-container path:
docker run ... -v <host_data_root>:<host_data_root>:ro <container> ...
The same accessibility requirement applies to the <output_dir> written by all actions, which must be mounted writable (without :ro).

Step 1 — Annotation file

Per-line annotation file referenced by data_sources[*].data_file:

| Columns | Format | Use |
|---|---|---|
| 2 | <left> <right> | Stereo inference (no GT) |
| 3 | <left> <right> <disparity> | Stereo with GT |
| 4 | <left> <right> <disparity> <occlusion_mask> | Stereo with GT and occlusion mask |

If you already have one, point to it. Otherwise generate via depth_net convert:
depth_net convert -e <convert_spec.yaml>
convert_spec.yaml template (stereo):
data_root: <directory whose immediate children are scene folders that contain your image+depth files; convert walks data_root recursively but expects per-scene subdirectories at one level below>
image_dir_pattern: [<substring matching left image paths>]
right_dir_pattern: [<substring matching right image paths>]
depth_dir_pattern: [<substring matching GT disparity paths>]
nocc_dir_pattern: []                 # optional, occlusion mask paths
image_extension: '.png'  # always include the leading dot
depth_extension: '.png'  # form must match image_extension (the swap is a substring replace)
nocc_extension: ''
split_ratio: 0.0        # 0.0/1.0 = test-only; 0.8 = 80/20 train+val
depth_net convert walks data_root recursively and selects every path whose path string contains all substrings in image_dir_pattern (AND-filter). It then derives the right / depth / mask paths by replacing image_dir_pattern[0] with the first element of the corresponding pattern and swapping the extension. Inspect your dataset's directory layout and identify the substrings distinguishing left, right, and GT (e.g. im0 vs im1 vs disp0GT for Middlebury).
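To make the substitution rule concrete, here is a hedged sketch of a filled-in convert spec for a Middlebury-style layout. Only the im0 / im1 / disp0GT substrings come from this document; the PNG extension for GT disparity, the split ratio, and the example paths in the comments are assumptions for illustration.

```yaml
# Illustrative only: adapt patterns and extensions to your own layout.
data_root: <dataset root containing one sub-directory per scene>
image_dir_pattern: ['im0']       # left images: paths containing "im0"
right_dir_pattern: ['im1']       # right path = left path with "im0" replaced by "im1"
depth_dir_pattern: ['disp0GT']   # GT path = left path with "im0" replaced by "disp0GT"
nocc_dir_pattern: []             # no occlusion masks in this sketch
image_extension: '.png'
depth_extension: '.png'          # assumes GT disparity is stored as PNG
nocc_extension: ''
split_ratio: 0.8                 # 80/20 train+val
# Under the rule described above, <scene>/im0.png would be paired with
# <scene>/im1.png and <scene>/disp0GT.png.
```

Because the derivation is a plain substring replace, it is worth choosing a pattern that appears only once in each left-image path.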

Step 2 — Pair model_type and dataset_name based on your data

Prefer the dataset-specific class when your layout matches a supported one: it applies class-specific path conventions, evaluation crops, and (where applicable) occlusion-mask handling. Fall back to GenericDataset only for layouts that do not match any registered class.
| Data category | model_type | dataset_name |
|---|---|---|
| Middlebury data | FoundationStereo | Middlebury |
| KITTI data | FoundationStereo | Kitti |
| ETH3D data | FoundationStereo | Eth3d |
| FSD synthetic data | FoundationStereo | FSD |
| IsaacReal synthetic data | FoundationStereo | IsaacRealDataset |
| Crestereo synthetic data | FoundationStereo | Crestereo |
| Other / non-canonical layout | FoundationStereo | GenericDataset |
See Training Requirements → Formats for the full registered-class list. The same dataset_name value applies across train and evaluate actions (both of which use 3-column or 4-column annotations with GT disparity). The deploy-side evaluate action follows the same rule; see references/tao-deploy-foundation-stereo.md.
For inference with 2-column annotations (left + right, no GT), use dataset_name: GenericDataset regardless of data layout. The dataset-specific classes (Middlebury / Kitti / Eth3d / FSD / IsaacRealDataset / Crestereo) require 3-column input and reject 2-column annotations at the dataloader level. For inference with 3-column annotations (left + right + GT), the dataset-specific class is fine.
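The pairing rule can be sketched as follows for a Middlebury layout. The two dataset blocks are shown side by side only for comparison; a real spec would normally carry the block for the action being run, and the file paths are placeholders.

```yaml
dataset:
  # Evaluate: 3-column annotations (left, right, GT disparity),
  # so the dataset-specific class applies.
  test_dataset:
    data_sources:
      - data_file: <path to 3-column annotations.txt>
        dataset_name: Middlebury
  # Inference without GT: 2-column annotations would be rejected by the
  # dataset-specific class, so GenericDataset is used instead.
  infer_dataset:
    data_sources:
      - data_file: <path to 2-column annotations.txt>
        dataset_name: GenericDataset
```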

Step 3 — Write spec yaml from Typical Spec Overrides

Copy the action block from references/foundation-stereo-spec-overrides.md (per-action spec_overrides, mandatory data sources). Replace:
  • model.model_type with the value from Step 2 (typically FoundationStereo)
  • dataset.<...>.data_sources[*].dataset_name with the value from Step 2
  • dataset.<...>.data_sources[*].data_file with the path from Step 1
  • For deploy-side evaluate: enforce dataset.test_dataset.batch_size: 1 (see references/tao-deploy-foundation-stereo.md).
Shape consistency: dataset.test_dataset.augmentation.crop_size should match export.input_height / export.input_width so the trained-model evaluator and the deploy-side TensorRT evaluator operate at the same shape; see references/foundation-stereo-troubleshooting.md.
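A minimal sketch of the shape-consistency rule, using the 576×960 shape this document quotes for the NGC release. The [height, width] list form of crop_size and the reading of 576×960 as height × width are assumptions; check the parameter glossary for the exact form.

```yaml
dataset:
  test_dataset:
    batch_size: 1               # required for deploy-side evaluate
    augmentation:
      crop_size: [576, 960]     # assumption: [height, width] list form
export:
  input_height: 576             # keep equal to the crop height
  input_width: 960              # keep equal to the crop width
```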

Step 4 — Run

docker run --gpus 'device=0' --shm-size 16G --ipc=host \
  --user $(id -u):$(id -g) \
  -v <data_root>:<data_root>:ro \
  -v <output_dir>:<output_dir> \
  <container> \
  depth_net <action> -e <spec.yaml>
Without --user $(id -u):$(id -g) the container writes outputs as nobody:nogroup, blocking host-side cleanup / retry.

Step 5 — Verify

  • Container exit code 0
  • status.json kpi block populated
  • For train: inspect per-step train_loss directly (the entrypoint reports Execution status: PASS even when loss is NaN)
  • For evaluate: rely on epe / bp1 / bp2 / bp3 / d1 / rmse (the evaluator also emits abs_rel / sq_rel / rmse_log, which are not meaningful for stereo; see references/foundation-stereo-parameters.md)
  • For inference: artifacts under results_dir
For TAO Deploy TensorRT actions (gen_trt_engine, TensorRT evaluate, and TensorRT inference), read references/tao-deploy-foundation-stereo.md first. Deploy spec templates live in this skill's references/ folder and follow the spec_template_deploy_*.yaml naming pattern.

Training Requirements

  • Valid dataset_name values for stereo data_sources (case-insensitive): FSD, IsaacRealDataset, Crestereo, Middlebury, Eth3d, Kitti, GenericDataset
  • Monitoring metric: val/loss

Per-Action Dataset Requirements

| Action | Spec Key | Source | Files | List? |
|---|---|---|---|---|
| evaluate | dataset.test_dataset.data_sources | eval_dataset | data_file: annotations.txt + dataset_name | Yes |
| inference | dataset.infer_dataset.data_sources | inference_dataset | data_file: annotations.txt + dataset_name | Yes |
| quantize | dataset.train_dataset.data_sources | train_datasets | data_file: annotations.txt + dataset_name | Yes |
| quantize | dataset.val_dataset.data_sources | eval_dataset | data_file: annotations.txt + dataset_name | Yes |
| quantize | dataset.quant_calibration_dataset.images_dir | train_datasets | images.tar.gz | No |
| train | dataset.train_dataset.data_sources | train_datasets | data_file: annotations.txt + dataset_name | Yes |
| train | dataset.val_dataset.data_sources | eval_dataset | data_file: annotations.txt + dataset_name | Yes |

Typical Spec Overrides

Data source overrides are mandatory for every action. The agent MUST construct data source paths from the Per-Action Dataset Requirements table above and include them in spec_overrides. Each data_sources entry is a dict with two mandatory fields: data_file and dataset_name.
See references/foundation-stereo-spec-overrides.md for the full per-action spec_overrides blocks (train, evaluate, export, gen_trt_engine, inference, quantize) with S3_TRAIN / S3_EVAL placeholders.
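For orientation, a train-action data source override might be shaped like the fragment below. This is a sketch built from the Per-Action Dataset Requirements table, not a copy of the referenced overrides file; the dataset_name value and the path placeholders are illustrative.

```yaml
dataset:
  train_dataset:
    data_sources:               # list of dicts, one per source
      - data_file: <annotations.txt under the S3_TRAIN location>
        dataset_name: FSD
  val_dataset:
    data_sources:
      - data_file: <annotations.txt under the S3_EVAL location>
        dataset_name: FSD
```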

Eval Dataset

Optional. The validation dataset is configured via dataset.val_dataset.data_sources (each entry needs data_file and dataset_name).

Important Parameters

Key defaults:
  • model.model_type = FoundationStereo (only selectable type)
  • model.encoder (top-level, not under stereo_backbone): the schema default is vitl, but the FS small NGC checkpoint requires vits, so override it explicitly
  • model.max_disparity: default 416
  • train.optim.lr: default 1e-4
  • train.precision: fp32 (recommended) or fp16 (no bf16)
  • export.batch_size: default -1
The worker-count field is named workers, not num_workers.
See references/foundation-stereo-parameters.md for the full parameter glossary (all model.*, dataset.*, train.*, export.* fields with defaults and ranges) and the Evaluation Metrics reference (which of epe / bp* / d1 / rmse to trust and why abs_rel / sq_rel / rmse_log are not meaningful for stereo).
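The key defaults listed in this section can be written out as explicit overrides, roughly as below. The nesting follows the dotted key names; the encoder override only applies when starting from the FS small NGC checkpoint.

```yaml
model:
  encoder: vits          # directly under model; schema default is vitl
  max_disparity: 416     # default
train:
  optim:
    lr: 0.0001           # default (1e-4)
  precision: fp32        # recommended; fp16 is the only alternative
export:
  batch_size: -1         # default
```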

Multi-GPU / Multi-Node

Launch method: Lightning-managed (single python process, Lightning spawns workers).
| Spec Key | Description | Default |
|---|---|---|
| train.num_gpus | Number of GPUs | 1 |
| train.gpu_ids | GPU device indices | [0] |
| train.num_nodes | Number of nodes | 1 |
| train.distributed_strategy | ddp or fsdp | ddp |
Same DDP/FSDP behavior as depth-net-mono. Multi-node requires the WORLD_SIZE, NODE_RANK, MASTER_ADDR, and MASTER_PORT env vars.
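As an example of the keys above, a single-node, two-GPU run could be requested with a fragment like this one. The GPU count and indices are illustrative values.

```yaml
train:
  num_gpus: 2
  gpu_ids: [0, 1]
  num_nodes: 1
  distributed_strategy: ddp
```

When launching through docker run, the container also needs to see both devices, so the --gpus argument from Step 4 would have to be widened accordingly.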

Export / TRT Defaults

Supported TensorRT data types: FP32 and FP16. Static-shape ONNX (export.batch_size: 1) and batch-only dynamic ONNX (export.batch_size: -1) both support fp16. Height and width are always pinned to the trace shape (H/W-dynamic engines are not supported; build separate engines per (H, W)). For the NGC release (576×960), set export.batch_size: 1, export.opset_version: 17, export.on_cpu: True.
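Put together, the export settings quoted for the NGC release would look roughly like this. Reading 576×960 as height × width is an assumption, as is placing input_height and input_width directly under export.

```yaml
export:
  batch_size: 1        # static-shape ONNX; -1 gives batch-only dynamic ONNX
  opset_version: 17
  on_cpu: True
  input_height: 576    # assumption: 576×960 read as height × width
  input_width: 960
```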
See references/foundation-stereo-export-trt-hardware.md for the full export / TRT defaults (the opset-vs-on_cpu pairing rules, determinism notes, on_cpu GPU-memory thresholds) and the Hardware requirements. See references/tao-deploy-foundation-stereo.md for the three supported deploy paths and the validation table.
Full TAO Deploy reference: tao-deploy-foundation-stereo.

Error Patterns

Common issues:
  • Disparity overflow (reduce model.max_disparity)
  • Missing pretrained paths (set both model.stereo_backbone.depth_anything_v2_pretrained_path and model.stereo_backbone.edgenext_pretrained_path)
  • Key 'encoder' not in 'StereoBackBone' (encoder is top-level model.encoder)
  • Key 'dataset_name' is not in struct (each data_sources entry needs both data_file and dataset_name)
  • bash: exec: depth_net_stereo: not found (entrypoint is depth_net, no suffix)
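The two key-placement errors above are easier to spot in a spec fragment. The sketch below marks the layout this document describes as correct; the commented-out line shows the misplacement behind the StereoBackBone error, and all values are placeholders.

```yaml
model:
  encoder: vits          # correct placement: directly under model
  stereo_backbone:
    # encoder: vits      # misplaced: leads to "Key 'encoder' not in 'StereoBackBone'"
    edgenext_pretrained_path: <path to EdgeNeXt checkpoint>
dataset:
  train_dataset:
    data_sources:
      - data_file: <path to annotations.txt>   # each entry is a dict
        dataset_name: GenericDataset           # with both fields present
```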
See references/foundation-stereo-troubleshooting.md for the full error patterns, the pyt-vs-deploy crop_size discussion (the pyt evaluate path runs at native image resolution and ignores crop_size; the discussion includes Middlebury resolution guidance), and the Shape consistency rule.

Spec Param / Parent Model Inference

Model-specific inference mappings belong in MD, not in config.json. Generated runners read these mappings and apply them with SDK helpers before create_job() (mirrors the old microservices infer_params.py flow). For parent_model / parent_model_folder, pass the upstream train/export/AutoML child job id as parent_job_id; the SDK lists the parent result folder, filters checkpoint artifacts, and returns the selected model file or folder. Do not add these mappings back to config.json and do not patch generated runner scripts to guess checkpoint paths.
See references/foundation-stereo-spec-param-inference.md for the full per-action inference-mapping table (train / evaluate / inference / export / gen_trt_engine / quantize, including the train pretrained-path link/destination and resume-checkpoint mappings).