Depth Net Mono
Monocular depth estimation using Metric Depth Anything v2 or Relative Depth Anything architectures. Predicts per-pixel depth from single RGB images.
Pretrained checkpoint loading varies by model variant and use case — see the
Pretrained checkpoint loading — use case matrix in
.
The mono and stereo skills both invoke the unified TAO
CLI inside the container; the mono/stereo family is selected via
(full parameter glossary in
).
For TAO Deploy TensorRT actions (
, TensorRT
, and TensorRT
), read
references/tao-deploy-depth-anything-v2.md
first. The deploy spec template lives in this skill's
references/spec_template_deploy.yaml
.
Train Action Policy
This model is AutoML-enabled at the model layer. Before handling any train-stage request, read
references/skill_info.yaml
and resolve the run override from either an explicit
value or the user's workflow request. Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as
for this run only; otherwise default to
. When
,
, and both
schemas/train.schema.json
and
references/spec_template_train.yaml
are packaged, route the train action through
tao-skill-bank:tao-run-automl
by default with this model's
. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and
. Use direct model training only when
or the packaged train schema/template is missing; in the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.
Non-train actions such as
,
,
, and deploy flows stay in this model skill. The per-run
override does not change model metadata.
Workflow
Prerequisites — data accessibility
Your dataset (RGB images + GT depth files) must be reachable from inside the container:
- SDK runner: place files at the S3 paths the runner resolves (the / placeholders shown in the spec overrides). The runner handles S3 → container-path mounting transparently.
- Direct (e.g. local testing): mount the host dataset root read-only at the same in-container path:
docker run ... -v <host_data_root>:<host_data_root>:ro <container> ...
The same accessibility requirement applies to the
written by all actions.
Step 1 — Annotation file
Per-line annotation file referenced by
data_sources[*].data_file
:
| Columns | Format | Use |
|---|---|---|
| 1 | | Mono inference (no GT) |
| 2 | | Mono with GT |
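The column rule in the table can be checked before a run with a few lines of Python. This is a minimal sketch and not part of TAO: it assumes the columns are whitespace-separated paths without spaces (the Format column is not visible here), and `check_annotation_lines` is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple


def check_annotation_lines(lines: Iterable[str], expect_gt: bool) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for lines with the wrong column count.

    Assumption: columns are separated by whitespace and paths contain no spaces.
    """
    expected = 2 if expect_gt else 1  # 2 columns with GT, 1 column for inference
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        columns = line.split()
        if len(columns) != expected:
            problems.append((number, f"expected {expected} column(s), found {len(columns)}"))
    return problems


# Hypothetical sample: the second line is missing its GT depth column.
sample = ["scene1/rgb/0001.jpg scene1/depth/0001.png", "scene1/rgb/0002.jpg"]
print(check_annotation_lines(sample, expect_gt=True))
```

To check a real file, pass an open file handle as `lines`; extending the check to test that each path exists inside the container is straightforward.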
If you already have one, point to it. Otherwise generate via
:
depth_net convert -e <convert_spec.yaml>
Convert spec (YAML):
data_root: <directory whose immediate children are scene/sample folders that contain your image+depth files; convert walks data_root recursively but expects per-scene subdirectories at one level below>
image_dir_pattern: [<substring matching left/RGB image paths>]
depth_dir_pattern: [<substring matching GT depth paths>]
image_extension: '' # optional .endswith filter, e.g. '.jpg'
depth_extension: '' # optional, swapped during depth derivation, e.g. '.png'
split_ratio: 0.0 # 0.0/1.0 = test-only; 0.8 = 80/20 train+val
`convert` walks `data_root` recursively, selects paths whose path-string contains all substrings in `image_dir_pattern` (AND-filter), then derives the depth path by replacing `image_dir_pattern` with `depth_dir_pattern` and `image_extension` with `depth_extension`. Inspect your dataset's directory layout and identify the substring distinguishing RGB images from depth files.
`data_root` must point at the parent that contains the per-scene subdirectories. For NYU eval, for example, do not use `/data/nyu_v2/eval/test/bathroom`: that limits the walk to a single scene. Always include the leading dot in `image_extension` / `depth_extension` (e.g. `.png`, not `png`); the substring swap is form-sensitive and a mismatch silently corrupts derived paths.
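The selection and path-derivation behaviour described above can be illustrated with a small Python sketch. It is not the TAO implementation: the function names are hypothetical, the pairing of image and depth substrings by position is an assumption, and the sample paths are made up. It does show why a missing leading dot in the extension corrupts the derived path without raising an error.

```python
from typing import List


def select_image(path: str, image_patterns: List[str], image_extension: str = "") -> bool:
    """AND-filter: keep a path only if it contains every substring (and the extension, if set)."""
    if image_extension and not path.endswith(image_extension):
        return False
    return all(pattern in path for pattern in image_patterns)


def derive_depth_path(path: str, image_patterns: List[str], depth_patterns: List[str],
                      image_extension: str = "", depth_extension: str = "") -> str:
    """Assumed derivation: swap each image substring for its depth counterpart, then the extension."""
    for image_sub, depth_sub in zip(image_patterns, depth_patterns):
        path = path.replace(image_sub, depth_sub)
    if image_extension and depth_extension and path.endswith(image_extension):
        path = path[: -len(image_extension)] + depth_extension
    return path


rgb = "scene1/rgb/0001.jpg"  # hypothetical layout
print(select_image(rgb, ["rgb"], ".jpg"))
print(derive_depth_path(rgb, ["rgb"], ["depth"], ".jpg", ".png"))
# Missing leading dot on the depth extension: the result is a path that does not exist.
print(derive_depth_path(rgb, ["rgb"], ["depth"], ".jpg", "png"))
```

Running a dry derivation like this over a handful of real paths, and checking that each derived file exists, is a cheap way to validate the patterns before generating the full annotation file.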
Step 2 — Pair and based on your data
Default — generic class for each task:
| Data category | | |
|---|---|---|
| Disparity-encoded data (pixels) | | |
| Metric depth (meters) | | |
| Mono inference (no GT, any image) | matches train choice | or |
Dataset-specific class — switch when the data needs preprocessing the generic class does not perform:
| Special case | | | What the class adds |
|---|---|---|---|
| NYU (raw uint16 millimetres) — relative | | | mm→m unit conversion + Eigen evaluation crop |
| NYU (raw uint16 millimetres) — metric | | | same |
Using a generic class on data that requires unit conversion (e.g. raw NYU uint16 PNGs) results in an empty valid mask and silent
. Match the class to your data's encoding.
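The empty-mask failure can be reproduced with plain numbers. The sketch below assumes the valid mask is a simple range check against a metric depth window; the bounds and the `valid_mask` helper are hypothetical and chosen only to illustrate the unit mismatch.

```python
from typing import List

MIN_DEPTH_M = 0.001  # hypothetical lower bound, in meters
MAX_DEPTH_M = 10.0   # hypothetical upper bound, in meters


def valid_mask(depth_values: List[float]) -> List[bool]:
    """Mark values that fall inside the assumed valid metric range."""
    return [MIN_DEPTH_M <= d <= MAX_DEPTH_M for d in depth_values]


raw_mm = [1500, 2500, 4200]               # raw uint16 millimetres, as stored in the PNG
as_meters = [v / 1000.0 for v in raw_mm]  # the mm -> m conversion a dataset-specific class applies

print(valid_mask(raw_mm))     # every pixel rejected: values read as thousands of meters
print(valid_mask(as_meters))  # every pixel accepted after conversion
```

With every pixel masked out there is nothing left to average the loss over, which is why the problem tends to surface only as a degenerate loss value rather than as an explicit error.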
Step 3 — Write spec yaml from the spec overrides
Copy the action block from
references/spec-overrides.md
. Replace:
- from Step 2
dataset.<...>.data_sources[*].dataset_name
from Step 2
data_sources[*].data_file
with the path from Step 1 (S3 path under SDK runner, host path for direct docker)
- For metric finetune: additionally apply the Metric Variant Finetuning Recipe in .
For mono training set
(recommended) or
(Ampere SM80+, alternative).
Step 4 — Run
docker run --gpus 'device=0' --shm-size 16G --ipc=host \
--user $(id -u):$(id -g) \
-v <data_root>:<data_root>:ro \
-v <output_dir>:<output_dir> \
<container> \
depth_net <action> -e <spec.yaml>
Without `--user`, the container writes outputs as root, blocking host-side cleanup and retry.
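When the run is launched from a script rather than typed by hand, the same command can be assembled as an argument list. This is a sketch under assumptions: `build_docker_cmd` is a hypothetical helper, the container name and paths are placeholders, and `os.getuid()` / `os.getgid()` are available on Unix hosts only.

```python
import os
import shlex
from typing import List


def build_docker_cmd(container: str, action: str, spec: str,
                     data_root: str, output_dir: str, gpu: int = 0) -> List[str]:
    """Mirror the Step 4 command as a list suitable for subprocess.run()."""
    return [
        "docker", "run",
        "--gpus", f"device={gpu}",
        "--shm-size", "16G", "--ipc=host",
        # Run as the calling user so output files stay owned by that user on the host.
        "--user", f"{os.getuid()}:{os.getgid()}",
        "-v", f"{data_root}:{data_root}:ro",   # dataset mounted read-only at the same path
        "-v", f"{output_dir}:{output_dir}",    # outputs mounted read-write
        container,
        "depth_net", action, "-e", spec,
    ]


# Placeholder values for illustration only.
cmd = build_docker_cmd("<container>", "evaluate", "/results/spec.yaml",
                       data_root="/data/depth", output_dir="/results")
print(shlex.join(cmd))
```

Passing a list to `subprocess.run(cmd, check=True)` avoids shell quoting issues and raises on a non-zero exit code, which covers the first check in Step 5.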
Step 5 — Verify
- Container exit code 0
- block populated
- For : inspect per-step directly — the entrypoint reports even when (see the Sanity-run PASS criteria in )
- For / : artifacts under
For TAO Deploy TensorRT actions (
, TensorRT
, and TensorRT
), read
references/tao-deploy-depth-anything-v2.md
first. Deploy spec templates live in this skill's
folder with the
spec_template_deploy_*.yaml
prefix.
Training Requirements
- Valid values for mono (case-insensitive): , , , , , , , , , . carries metric depth GT (meters) — pair with ; is the same data with relative-depth conventions — pair with .
- Monitoring metric: val/loss
Per-Action Dataset Requirements
| Action | Spec Key | Source | Files | List? |
|---|---|---|---|---|
| evaluate | dataset.test_dataset.data_sources | eval_dataset | data_file: annotations.txt + dataset_name | Yes |
| inference | dataset.infer_dataset.data_sources | inference_dataset | data_file: annotations.txt + dataset_name | Yes |
| quantize | dataset.train_dataset.data_sources | train_datasets | data_file: annotations.txt + dataset_name | Yes |
| quantize | dataset.val_dataset.data_sources | eval_dataset | data_file: annotations.txt + dataset_name | Yes |
| quantize | dataset.quant_calibration_dataset.images_dir | train_datasets | images.tar.gz | No |
| train | dataset.train_dataset.data_sources | train_datasets | data_file: annotations.txt + dataset_name | Yes |
| train | dataset.val_dataset.data_sources | eval_dataset | data_file: annotations.txt + dataset_name | Yes |
Spec Overrides
Data source overrides are
mandatory for every action — construct the data source paths from the Per-Action Dataset Requirements table above and include them in
; each
entry is a dict with the two mandatory fields `data_file` and `dataset_name`. See
references/spec-overrides.md
for the full per-action
/
/
/
/
override blocks and the precision recommendations.
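As an illustration of the shape these overrides take, the sketch below builds the list-valued `data_sources` entries for each action from the Per-Action Dataset Requirements table. The helper names and the flat dotted-key layout of the result are assumptions, the `dataset_name` value is a placeholder, and the non-list `dataset.quant_calibration_dataset.images_dir` key for quantize is deliberately left out.

```python
from typing import Dict, List

# List-valued spec keys per action, copied from the Per-Action Dataset Requirements table.
DATA_SOURCE_KEYS: Dict[str, List[str]] = {
    "train": ["dataset.train_dataset.data_sources", "dataset.val_dataset.data_sources"],
    "evaluate": ["dataset.test_dataset.data_sources"],
    "inference": ["dataset.infer_dataset.data_sources"],
    "quantize": ["dataset.train_dataset.data_sources", "dataset.val_dataset.data_sources"],
}


def data_source_entry(data_file: str, dataset_name: str) -> Dict[str, str]:
    """One data_sources entry; both fields are mandatory."""
    if not data_file or not dataset_name:
        raise ValueError("data_file and dataset_name are both required")
    return {"data_file": data_file, "dataset_name": dataset_name}


def build_overrides(action: str, files_by_key: Dict[str, str],
                    dataset_name: str) -> Dict[str, List[Dict[str, str]]]:
    """Map every required spec key for the action to a one-entry data_sources list."""
    required = DATA_SOURCE_KEYS[action]
    missing = [key for key in required if key not in files_by_key]
    if missing:
        raise ValueError(f"missing annotation file for: {missing}")
    return {key: [data_source_entry(files_by_key[key], dataset_name)] for key in required}


overrides = build_overrides(
    "evaluate",
    {"dataset.test_dataset.data_sources": "/data/depth/annotations.txt"},  # placeholder path
    dataset_name="<dataset_name from Step 2>",
)
print(overrides)
```

Failing early on a missing key is the point of the sketch: a train request that supplies only the training annotation file is rejected before a container is started.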
Eval Dataset
Optional. The validation dataset is configured via `dataset.val_dataset.data_sources` (each entry needs `data_file` and `dataset_name`).
Important Parameters
Full parameter glossary (
,
,
,
,
fields with options, defaults, and sources) plus the
Pretrained checkpoint loading — use case matrix live in
. Key starting points:
(default
),
(default
),
(default 1e-4, AdamW),
(
recommended),
dataset.{train,val,test,infer}_dataset.augmentation.crop_size
(default
).
Finetuning Recipes
Relative and Metric variant finetuning recipes — including required spec keys, the metric
dataset.{normalize_depth, min_depth, max_depth}
block required in both train AND export specs, trainer-enforced defaults (
,
,
), sanity-run overrides, and the
Sanity-run PASS criteria for catching silent
— are in
. Both recipes use
with
(the AdamW default
is too aggressive when finetuning from a converged/pretrained backbone).
Multi-GPU / Multi-Node
Launch method: Lightning-managed (single
process, Lightning spawns workers).
| Spec Key | Description | Default |
|---|---|---|
| | Number of GPUs | 1 |
| | GPU device indices | [0] |
| | Number of nodes | 1 |
| train.distributed_strategy | or | |
- with activation checkpointing:
find_unused_parameters=False
- without:
find_unused_parameters=True
- forces precision to FP16
Multi-node env vars (set by orchestrator):
,
,
,
,
.
Export / TRT Defaults
- TRT data types: FP32, BF16 (Ampere SM80+). FP16 is not supported for the ViT-L mono backbone.
- Recommended TRT precision: BF16. Use FP32 if BF16 hardware is unavailable.
Full TAO Deploy reference: tao-deploy-depth-anything-v2.
Hardware
Minimum 1 GPU, 2 GPUs recommended, with 24 GB+ of VRAM per GPU. The ViT-Large encoder is memory-intensive. Use
(recommended) or
(Ampere SM80+, alternative) for training. Activation checkpointing is available for larger inputs.
Error Patterns
Common failure signatures and fixes — depth range mismatch, missing pretrained weights,
Key 'encoder' not in 'MonoBackBone'
, missing
,
depth_net_mono: not found
, metric variant hyperparameter sourcing, and the export refuse-to-overwrite ONNX error — are documented in
references/troubleshooting.md
.
Spec Param / Parent Model Inference
Model-specific inference mappings (the full
depth_net_mono.config.json
per-action spec-field → inference-function table, plus
/
resolution guidance) are in
references/spec-param-inference.md
. These mappings belong in MD, not in
; generated runners should read that reference and apply the mappings with SDK helpers before
.