tao-train-dino


DINO


DINO (DETR with Improved DeNoising Anchor Boxes) is a Transformer-based detector for 2D object detection, with denoising training, multi-scale features, and optional distillation support.
Training uses pretrained backbone weights (e.g. ResNet-50 ImageNet). Set model.pretrained_backbone_path to load backbone-only weights, or train.pretrained_model_path to load a full pretrained model.
For TAO Deploy TensorRT actions (gen_trt_engine, TensorRT evaluate, and TensorRT inference), read references/tao-deploy-dino.md first. Deploy spec templates live in this skill's references/ folder with the spec_template_deploy_*.yaml prefix.
Generated TAO Core schemas are packaged in schemas/<action>.schema.json (with schemas/manifest.json listing actions); each schema emits a matching references/spec_template_<action>.yaml. See references/sdk_orchestration.md for the full dataclass-schema, spec-template, data-sources, and parent-model inference details used by SDK orchestration.

Train Action Policy


This model is AutoML-enabled at the model layer. Before handling any train-stage request, read references/skill_info.yaml and resolve the run override from either an explicit automl_policy value or the user's workflow request. Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as automl_policy: off for this run only; otherwise default to auto. When automl_policy: auto, automl_enabled: true, and both schemas/train.schema.json and references/spec_template_train.yaml are packaged, route the train action through tao-skill-bank:tao-run-automl by default with this model's skill_dir. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and automl_policy. Use direct model training only when automl_policy: off or the packaged train schema/template is missing; in the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.
Non-train actions such as evaluate, inference, export, and deploy flows stay in this model skill. The per-run automl_policy override does not change model metadata.
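The routing rules above can be sketched as a small decision function. This is a minimal illustration, not the skill bank's implementation: the function names, the return shape, and the simple substring matching of user phrases are all assumptions; only the file paths, the policy values, and the tao-skill-bank:tao-run-automl target come from the text above.

```python
from pathlib import Path

# Phrases the policy text treats as a per-run request to disable AutoML.
OFF_PHRASES = ("turn off automl", "disable automl", "no hpo", "plain training")


def resolve_automl_policy(explicit_policy, user_request):
    """Return "off" or "auto" for this run only (hypothetical helper)."""
    if explicit_policy in ("auto", "off"):
        return explicit_policy
    text = (user_request or "").lower()
    return "off" if any(phrase in text for phrase in OFF_PHRASES) else "auto"


def route_train_action(policy, skill_dir, automl_enabled=True):
    """Return (target, note) for a train-stage request (hypothetical helper)."""
    skill_dir = Path(skill_dir)
    packaged = (
        (skill_dir / "schemas" / "train.schema.json").is_file()
        and (skill_dir / "references" / "spec_template_train.yaml").is_file()
    )
    # Assumption: a model without automl_enabled is treated like policy "off".
    if policy == "off" or not automl_enabled:
        return "direct", None
    if not packaged:
        note = ("AutoML is enabled but not runnable for this model "
                "until schemas are generated.")
        return "direct", note
    return "tao-skill-bank:tao-run-automl", None
```

Keeping policy resolution separate from routing mirrors the rule that the per-run override never changes model metadata: the first function only reads the request, and the second only reads what is packaged in the skill directory.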

Training Requirements


The agent MUST read this section before generating any training or AutoML script for DINO.
  • Dataset type: object_detection
  • Formats: coco, coco_raw
  • Accepted dataset intents: training, evaluation, testing, calibration
  • Monitoring metric: val_mAP50
Required datasets (MUST resolve both):
| Dataset | Required | Why |
| --- | --- | --- |
| Train dataset URI | Yes | Training data (COCO format) |
| Validation dataset URI | Yes, ALWAYS | DINO unconditionally builds a val dataloader. Omitting val_data_sources causes FileNotFoundError at startup regardless of the metric or workflow. If the user has no separate eval split, reuse the train URI. |
Required inputs before generating any training spec:
  1. Train dataset URI: S3 path to COCO-format training data
  2. Validation dataset URI: S3 path to COCO-format val data (can be same as train)
  3. num_classes: how many object classes? Default 91 (COCO). Must be >= max(category_id) + 1. A value that is too low fails with "CUDA error: device-side assert triggered".
Resolve these from the user request or the default profile below. Prompt only for values that are still missing after applying the profile rules.
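The num_classes bound can be checked against the annotation file before a spec is generated. The sketch below assumes a standard COCO JSON with a categories list (each entry carrying an id) and an annotations list (each entry carrying a category_id); the helper names are hypothetical and are not part of TAO.

```python
import json


def min_num_classes(annotation_path):
    """Smallest valid num_classes for a COCO file: max(category_id) + 1."""
    with open(annotation_path, encoding="utf-8") as f:
        coco = json.load(f)
    ids = [cat["id"] for cat in coco.get("categories", [])]
    ids += [ann["category_id"] for ann in coco.get("annotations", [])]
    if not ids:
        raise ValueError(f"no category ids found in {annotation_path}")
    return max(ids) + 1


def check_num_classes(num_classes, annotation_path):
    """Raise early instead of hitting a device-side assert during training."""
    required = min_num_classes(annotation_path)
    if num_classes < required:
        raise ValueError(
            f"num_classes={num_classes} is too low; need >= {required}"
        )
    return num_classes
```

For a dataset whose category ids run from 1 to 4, min_num_classes returns 5, which matches the pairing of 4 object classes with a class count of 5 in the default profile.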
Bankable local default profile for DINO AutoML smoke runs:
Use this profile only when the user asks to run DINO AutoML and does not provide dataset or class-count inputs. This profile is intentionally small and local to this skill bank; it is for smoke/iteration runs, not a production benchmark. Do not search previous runners, logs, session state, shell history, or the home directory to recover these values.
python
DINO_AUTOML_PROFILE = {
    "train_dataset_uri": "s3://nvcf-storage-handling/data/tao_od_synthetic_subset_train_no_convert",
    "validation_dataset_uri": "s3://nvcf-storage-handling/data/tao_od_synthetic_subset_val_no_convert",
    "object_classes": 4,
    "dataset_num_classes": 5,
    "image_archive": "images.tar.gz",
    "annotation_file": "annotations.json",
    "max_recommendations": 10,
    "train_num_epochs": 10,
    "train_checkpoint_interval": 10,
    "train_validation_interval": 1,
    "train_num_gpus": 1,
}
If the user supplies any dataset URI or class-count value, prefer the user value and ask for any remaining required DINO value. Do not partially mix a user's custom dataset with this profile's class count unless the user confirms it.
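One way to express the profile rules is a resolver that returns both the resolved values and the list of values still to prompt for. This is a sketch under stated assumptions: the function name and return shape are hypothetical, the profile is trimmed to the three required inputs, and mapping num_classes to the profile's dataset_num_classes entry is an assumption rather than something the profile states.

```python
# Subset of DINO_AUTOML_PROFILE relevant to the three required inputs.
DINO_AUTOML_PROFILE = {
    "train_dataset_uri": "s3://nvcf-storage-handling/data/tao_od_synthetic_subset_train_no_convert",
    "validation_dataset_uri": "s3://nvcf-storage-handling/data/tao_od_synthetic_subset_val_no_convert",
    "dataset_num_classes": 5,
}

REQUIRED = ("train_dataset_uri", "validation_dataset_uri", "num_classes")


def resolve_training_inputs(user_values, profile=DINO_AUTOML_PROFILE):
    """Return (resolved, missing) without mixing user and profile values."""
    supplied = {
        key: value
        for key, value in user_values.items()
        if key in REQUIRED and value is not None
    }
    if not supplied:
        # No dataset or class-count input at all: use the smoke-run profile.
        resolved = {
            "train_dataset_uri": profile["train_dataset_uri"],
            "validation_dataset_uri": profile["validation_dataset_uri"],
            # Assumption: num_classes maps to dataset_num_classes.
            "num_classes": profile["dataset_num_classes"],
        }
        return resolved, []
    # Any user value present: never backfill from the profile; prompt instead.
    missing = [key for key in REQUIRED if key not in supplied]
    return supplied, missing
```

A caller would prompt for every name in missing, and would only fall back to a profile value for a custom dataset after the user confirms it.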
Do not prompt for image layout for the standard DINO dataset. The standard TAO DINO dataset artifact is images.tar.gz plus annotations.json. Use images.tar.gz in the remote image_dir spec override. The SDK downloads the archive and rewrites the runtime spec to the extracted folder named after the archive stem (images.tar.gz -> images). Only deviate if the user explicitly provides a different image artifact name.
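The archive-stem rule can be illustrated with a small helper. This only mirrors the images.tar.gz -> images mapping described above; the helper is hypothetical, and the handling of suffixes other than .tar.gz is an assumption, not documented SDK behavior.

```python
# Suffixes are checked longest-first so ".tar.gz" wins over ".gz"-style cuts.
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar", ".zip")


def archive_stem(archive_name):
    """Return the folder name an archive is assumed to extract to."""
    for suffix in ARCHIVE_SUFFIXES:
        if archive_name.endswith(suffix):
            return archive_name[: -len(suffix)]
    # Not a recognized archive: treat the name as an already-extracted folder.
    return archive_name
```

Note that pathlib's Path("images.tar.gz").stem would give "images.tar", which is why the sketch strips the compound suffix explicitly.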

Spec Overrides, Datasets, and Parameters


Data source overrides are mandatory for every action: DINO's config.json has empty data_sources because the runner cannot auto-resolve array-of-objects spec keys. The agent MUST build data source paths and include them in spec_overrides.
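As a rough illustration of building such paths for the train action, the sketch below assembles a nested override dictionary from the two dataset URIs. The key names train_data_sources and json_file, the list shape of both entries, and the nesting under dataset are assumptions made for this example (only val_data_sources and image_dir appear in this document); references/spec_overrides.md remains the authority for the mandatory blocks.

```python
def build_data_source_overrides(
    train_uri,
    val_uri,
    image_archive="images.tar.gz",
    annotation_file="annotations.json",
):
    """Build a train-action spec_overrides fragment (hypothetical layout)."""

    def source(uri):
        base = uri.rstrip("/")
        return {
            # The remote image_dir points at the archive, not a folder.
            "image_dir": f"{base}/{image_archive}",
            # Assumed key name for the COCO annotation file.
            "json_file": f"{base}/{annotation_file}",
        }

    return {
        "dataset": {
            "train_data_sources": [source(train_uri)],
            "val_data_sources": [source(val_uri)],
        }
    }
```

When the user has no separate eval split, passing the train URI for both arguments follows the reuse rule from the training requirements.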
See references/spec_overrides.md for:
  • the per-action dataset requirements table
  • the mandatory spec_overrides blocks for train, evaluate, export, gen_trt_engine, inference, quantize, and distill
  • checkpoint resolution via parent_model inference and the results_dir/train/dino_model_latest.pth fallback
  • the COCO dataset format and images.tar.gz archive-stem rules
  • per-action data-source layouts
  • the full Important Parameters list (num_classes, backbone and its supported values, lr/lr_backbone, num_epochs, lr_steps, num_queries, batch_size)
  • Default Values (num_epochs 10, batch_size 4, learning_rate 2e-4, lr_backbone 2e-5, num_classes 91, backbone resnet_50)
  • Evaluate Defaults
  • Export Defaults (input 640x640, opset 17, trt_data_types [FP32, FP16, INT8], trt_workspace_size_mb 1024)
  • Hardware guidance (1 GPU minimum, 4 recommended, 24GB+ A100)
Full TAO Deploy reference: tao-deploy-dino.
When generating an evaluate spec, carry forward the winning AutoML recommendation's structural model settings (model.backbone, model.num_queries, model.dropout_ratio, dataset.num_classes) so the checkpoint shape matches the evaluation model.
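The carry-forward step can be sketched as copying those four dotted keys from the winning recommendation's spec into the evaluate spec. The function names and the nested-dict representation of a spec are assumptions for illustration; how the winning spec is obtained is outside this sketch.

```python
import copy

# Structural settings named in the text above.
STRUCTURAL_KEYS = (
    "model.backbone",
    "model.num_queries",
    "model.dropout_ratio",
    "dataset.num_classes",
)

_MISSING = object()


def _get(cfg, dotted):
    """Read a dotted key from nested dicts, or _MISSING if absent."""
    node = cfg
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def _set(cfg, dotted, value):
    """Write a dotted key into nested dicts, creating parents as needed."""
    *parents, leaf = dotted.split(".")
    node = cfg
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def carry_forward_structure(winning_spec, evaluate_spec):
    """Return a copy of evaluate_spec with the structural keys overwritten."""
    merged = copy.deepcopy(evaluate_spec)
    for key in STRUCTURAL_KEYS:
        value = _get(winning_spec, key)
        if value is not _MISSING:
            _set(merged, key, copy.deepcopy(value))
    return merged
```

Only the listed keys are copied, so training-only settings such as epoch counts do not leak into the evaluate spec, and the input spec is left unmodified.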

Error Patterns


Common failures include CUDA OOM, num_select < num_queries * num_classes, spec/schema merge errors, dataset-smaller-than-batch, return_interm_indices vs num_feature_levels mismatch, FileNotFoundError on images or missing val data, CUDA device-side assert from low num_classes, S3 inputs not downloaded, and evaluate checkpoint not found at the result root. See references/troubleshooting.md for each error pattern and its fix.

AutoML / HPO Notes


AutoML runs training, so all Training Requirements above apply, and the no-input case uses DINO_AUTOML_PROFILE. Do not inspect previous AutoML runs to infer dataset URIs, num_classes, recommendation count, or interval settings. Use explicit metric="mAP50" with direction="maximize" and a custom metric_extractor reading Validation mAP50 rather than metric="kpi". See references/automl.md for the recommended metric extractor, hyperparameter list, custom_param_ranges, the train.optim.weight_decay note, and the backbone-constraint guidance.
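A custom extractor of this kind might look like the sketch below. It is not the recommended extractor from references/automl.md: the log line format (the label followed by an optional colon or equals sign and a number), the choice to return the last reported value, and the function signature taking raw log text are all assumptions. How the settings are passed to the AutoML runner is not shown here.

```python
import re

# Matches e.g. "Validation mAP50: 0.4213" or "Validation mAP50 = 4.2e-1".
# A suffix such as "mAP50-95" is not matched because a number must follow
# the label directly (after an optional separator).
_VAL_MAP50 = re.compile(
    r"Validation mAP50\s*[:=]?\s*([0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)"
)


def extract_val_map50(log_text):
    """Return the last Validation mAP50 value in the log, or None."""
    matches = _VAL_MAP50.findall(log_text)
    return float(matches[-1]) if matches else None


# Settings named in the text above, grouped in a plain dict for illustration.
AUTOML_METRIC_SETTINGS = {
    "metric": "mAP50",
    "direction": "maximize",
    "metric_extractor": extract_val_map50,
}
```

Returning None when no value is found lets the caller distinguish a run that never reached validation from one that scored zero.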

Optional: running via the TAO SDK


The SDK script_runner orchestration, S3 I/O wrapping, AutoML internals, spec-template generation, the data-sources gap, the config.json inputs declarations, and the full per-action spec-param / parent-model inference mapping table are documented in references/sdk_orchestration.md. Skip this when running locally with docker run.