CLIP vision-language model for image-text retrieval, zero-shot classification, embedding extraction, ONNX export, and TensorRT deployment. Use when fine-tuning or training CLIP, running zero-shot classification, computing image embeddings, or deploying CLIP to ONNX/TensorRT.
npx skill4agent add promptingcompany/nv-skills tao-finetune-clip

Checkpoint and model sources: train.pretrained_model_path, evaluate.checkpoint, inference.checkpoint, export.checkpoint, torch.hub

Actions: train, evaluate, inference, export, gen_trt_engine

AutoML: references/skill_info.yaml, automl_policy, automl_policy: off, auto, automl_policy: auto, automl_enabled: true, schemas/train.schema.json, references/spec_template_train.yaml, tao-skill-bank:tao-run-automl, skill_dir

Spec and config files: defaults.json, config.json, references/spec_template.yaml, references/model_info.yaml, spec_overrides

Deployment references: references/tao-deploy-clip.md, references/spec_template_deploy_*.yaml

Model names:
siglip2-so400m-patch16-256, siglip2-so400m-patch14-224, siglip2-so400m-patch14-384, siglip2-so400m-patch16-384, siglip2-so400m-patch16-512, siglip2-so400m-patch16-naflex
c-radio_v3-b, c-radio_v3-l, c-radio_v3-h, c-radio_v3-g
ViT-L-14-SigLIP-CLIPA-224, ViT-L-14-SigLIP-CLIPA-336, ViT-H-14-SigLIP-CLIPA-224, ViT-H-14-SigLIP-CLIPA-336, ViT-H-14-SigLIP-CLIPA-574

model.adaptor_name: siglip, clip

| Action | Spec Key | Source | Files | List? |
|---|---|---|---|---|
| train | dataset.train.datasets | train_datasets | image_dir: images.tar.gz, image_list_file: image_list.txt, caption_dir: captions.tar.gz | Yes |
| train | dataset.train.wds.root_dir | train_wds_dataset | root directory containing the .tar shards | No |
| train | dataset.train.wds.shard_list_file | train_wds_dataset | shards.txt listing shard paths | No |
| train | dataset.val.datasets | eval_dataset | image_dir: images.tar.gz, image_list_file: image_list.txt, caption_dir: captions.tar.gz | Yes |
| evaluate | dataset.val.datasets | eval_dataset | image_dir: images.tar.gz, image_list_file: image_list.txt, caption_dir: captions.tar.gz | Yes |
| inference | inference.datasets | inference_dataset | image_dir: images.tar.gz | Yes |
| inference | inference.text_file | inference_dataset | prompts.txt | No |
| export | export.checkpoint | parent train job or explicit checkpoint | checkpoint .pth, optional for pretrained export | No |
| gen_trt_engine | gen_trt_engine.onnx_file | parent export job or explicit ONNX | clip_model.onnx | No |
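
The Spec Key column above uses dotted paths such as dataset.train.wds.root_dir. As a rough illustration of how such flat keys relate to a nested spec, the sketch below expands a dictionary of dotted overrides into nested dictionaries. The helper name expand_dotted_keys is hypothetical, and whether the TAO tooling merges overrides this way is an assumption; the source does not show that logic.

```python
from typing import Any


def expand_dotted_keys(overrides: dict[str, Any]) -> dict[str, Any]:
    """Expand {"a.b.c": 1} into {"a": {"b": {"c": 1}}} (hypothetical helper)."""
    nested: dict[str, Any] = {}
    for dotted_key, value in overrides.items():
        node = nested
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            # Reuse the existing sub-dictionary or create a new one.
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"{dotted_key!r} conflicts with a scalar key")
            node = child
        node[leaf] = value
    return nested


# Example overrides using keys from the table above; the S3 paths are placeholders.
overrides = {
    "train.num_epochs": 10,
    "dataset.train.type": "wds",
    "dataset.train.wds.root_dir": "s3://bucket/data/wds",
    "dataset.train.wds.shard_list_file": "s3://bucket/data/wds/shards.txt",
}
spec = expand_dotted_keys(overrides)
print(spec["dataset"]["train"]["wds"]["root_dir"])
```

Keeping overrides flat makes them easy to diff and log, while the nested form mirrors how a YAML spec file is usually laid out.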
Custom datasets (dataset.train.type: custom): dataset.train.datasets, caption_file_suffix, .txt, image_list_file

WebDataset shards (dataset.train.type: wds): dataset.train.wds.root_dir, dataset.train.wds.shard_list_file, root_dir, .tar, shard_list_file

Other dataset keys: dataset.val.datasets, spec_overrides, inference.datasets, inference.text_file

S3_TRAIN = "s3://bucket/data/train"
S3_WDS = "s3://bucket/data/wds"
S3_EVAL = "s3://bucket/data/eval"
S3_INFER = "s3://bucket/data/infer"

{
"train.num_epochs": 10,
"dataset.train.type": "custom",
"dataset.train.datasets": [{"image_dir": f"{S3_TRAIN}/images.tar.gz", "image_list_file": f"{S3_TRAIN}/image_list.txt", "caption_dir": f"{S3_TRAIN}/captions.tar.gz"}],
"dataset.val.datasets": [{"image_dir": f"{S3_EVAL}/images.tar.gz", "image_list_file": f"{S3_EVAL}/image_list.txt", "caption_dir": f"{S3_EVAL}/captions.tar.gz"}],
}

{
"train.num_epochs": 10,
"dataset.train.type": "wds",
"dataset.train.wds.root_dir": f"{S3_WDS}",
"dataset.train.wds.shard_list_file": f"{S3_WDS}/shards.txt",
"dataset.train.wds.samples_per_shard": 10000,
"dataset.val.datasets": [{"image_dir": f"{S3_EVAL}/images.tar.gz", "image_list_file": f"{S3_EVAL}/image_list.txt", "caption_dir": f"{S3_EVAL}/captions.tar.gz"}],
}

{
"dataset.val.datasets": [{"image_dir": f"{S3_EVAL}/images.tar.gz", "image_list_file": f"{S3_EVAL}/image_list.txt", "caption_dir": f"{S3_EVAL}/captions.tar.gz"}],
}

Evaluation model keys: evaluate.checkpoint, evaluate.trt_engine

{
"inference.datasets": [{"image_dir": f"{S3_INFER}/images.tar.gz"}],
"inference.text_file": f"{S3_INFER}/prompts.txt",
}

Inference outputs: image_embeddings.h5, text_embeddings.h5, results_dir

{
"export.onnx_file": "${results_dir}/export/clip_model.onnx",
"export.encoder_type": "combined",
"export.batch_size": -1,
}

Separate-encoder export (export.encoder_type: separate): _vision.onnx, _text.onnx, export.onnx_file

{
"gen_trt_engine.onnx_file": "${results_dir}/export/clip_model.onnx",
"gen_trt_engine.trt_engine": "${results_dir}/deploy/clip_model.engine",
"gen_trt_engine.batch_size": -1,
"gen_trt_engine.tensorrt.data_type": "fp16",
"gen_trt_engine.tensorrt.min_batch_size": 1,
"gen_trt_engine.tensorrt.opt_batch_size": 1,
"gen_trt_engine.tensorrt.max_batch_size": 16,
}

Deployment actions: evaluate, gen_trt_engine, model_info["actions"]["gen_trt_engine"]

Commands:
clip gen_trt_engine -e {config_path}
tao deploy clip gen_trt_engine -e /path/to/spec.yaml
tao deploy clip evaluate
tao deploy clip inference

Engine keys and suffixes: _vision, _text, gen_trt_engine.onnx_file, gen_trt_engine.trt_engine, evaluate.trt_engine, inference.trt_engine

Option values: siglip, clip, combined, separate, fp16, fp32

Tuning keys: dataset.train.batch_size, dataset.val.batch_size, export.input_height, export.input_width, train.optim.vision_lr, train.optim.text_lr, train.optim.warmup_steps, batch_size * num_gpus

model.adaptor_name: siglip, clip

Other references: siglip2-so400m-patch16-naflex, siglip2-so400m-patch16-384, gen_trt_engine, attention_mask, num_epochs, num_gpus, train.*, model.image_size, export.input_height, export.input_width, config.json, create_job(), infer_params.py, clip.config.json

| Action | Spec Field | Inference Function | Meaning |
|---|---|---|---|
| evaluate | | | encryption key |
| evaluate | | | model file inferred from the parent job results folder |
| evaluate | | | model file inferred from the parent job results folder |
| evaluate | | | current job results directory |
| export | | | encryption key |
| export | | | model file inferred from the parent job results folder |
| export | | | output ONNX path |
| export | | | current job results directory |
| gen_trt_engine | | | encryption key |
| gen_trt_engine | | | model file inferred from the parent job results folder |
| gen_trt_engine | | | output TensorRT engine path |
| gen_trt_engine | | | current job results directory |
| inference | | | encryption key |
| inference | | | model file inferred from the parent job results folder |
| inference | | | model file inferred from the parent job results folder |
| inference | | | current job results directory |
| train | | | encryption key |
| train | | | current job results directory |
| train | | | PTM when no resume checkpoint exists |
| train | | | model file inferred from the current job results folder |
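
Several rows above describe a model file "inferred from the parent job results folder". The functions that do this are not shown in the source, so the following is only a minimal sketch of one plausible strategy: pick the most recently modified .pth file under a results directory. The name find_latest_checkpoint and the newest-by-modification-time rule are assumptions for illustration, not the behavior of infer_params.py.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(results_dir: str, suffix: str = ".pth") -> Optional[Path]:
    """Return the newest file ending in `suffix` under results_dir, or None."""
    candidates = [p for p in Path(results_dir).rglob(f"*{suffix}") if p.is_file()]
    if not candidates:
        return None
    # Assumption: the newest file by modification time is the one to reuse.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demo with a throwaway directory standing in for a parent job's results folder.
with tempfile.TemporaryDirectory() as parent_results:
    train_dir = Path(parent_results) / "train"
    train_dir.mkdir()
    for name, mtime in [("epoch_1.pth", 1_000), ("epoch_2.pth", 2_000)]:
        ckpt = train_dir / name
        ckpt.write_bytes(b"")
        os.utime(ckpt, (mtime, mtime))  # force a deterministic ordering
    latest = find_latest_checkpoint(parent_results)
    latest_name = latest.name if latest else None
    missing = find_latest_checkpoint(parent_results, suffix=".onnx")

print(latest_name)
```

Returning None instead of raising lets a caller fall back to an explicit checkpoint path when the parent folder holds no matching file.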
parent_model, parent_model_folder, parent_job_id, config.json
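
Returning to the gen_trt_engine overrides shown earlier: TensorRT optimization profiles require the minimum, optimal, and maximum batch sizes to satisfy min <= opt <= max. A small pre-flight check along these lines could catch a bad profile before an engine build is launched. The function name check_batch_profile is hypothetical, and the source does not say whether TAO performs a similar validation itself.

```python
def check_batch_profile(spec: dict) -> list[str]:
    """Return problems found in the TensorRT batch-size keys (empty if none)."""
    prefix = "gen_trt_engine.tensorrt."
    problems: list[str] = []
    sizes: dict[str, int] = {}
    for name in ("min_batch_size", "opt_batch_size", "max_batch_size"):
        value = spec.get(prefix + name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            problems.append(f"{prefix}{name} must be a positive integer, got {value!r}")
        else:
            sizes[name] = value
    if len(sizes) == 3 and not (
        sizes["min_batch_size"] <= sizes["opt_batch_size"] <= sizes["max_batch_size"]
    ):
        problems.append("expected min_batch_size <= opt_batch_size <= max_batch_size")
    return problems


# Values taken from the gen_trt_engine example in this document.
good_spec = {
    "gen_trt_engine.tensorrt.min_batch_size": 1,
    "gen_trt_engine.tensorrt.opt_batch_size": 1,
    "gen_trt_engine.tensorrt.max_batch_size": 16,
}
# Same spec with an optimal batch size above the maximum.
bad_spec = dict(good_spec, **{"gen_trt_engine.tensorrt.opt_batch_size": 32})

print(check_batch_profile(good_spec))
print(check_batch_profile(bad_spec))
```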