Cosmos-Embed1 video-text embedding for text-to-video retrieval, video-to-video search, semantic deduplication, and fine-tuning. Use when the user asks to "fine-tune Cosmos-Embed1", "run cosmos-embed inference", "export Cosmos-Embed1", "embed videos", or "search videos with text".
npx skill4agent add nvidia/skills tao-finetune-cosmos-embed

`cosmos-embed1`, `train`, `evaluate`, `inference`, `export`, `references/skill_info.yaml`, `references/spec_template_*.yaml`, `references/skill_info.yaml`, `automl_policy`, `automl_policy: off`, `auto`, `automl_policy: auto`, `automl_enabled: true`, `schemas/train.schema.json`, `references/spec_template_train.yaml`, `tao-skill-bank:tao-run-automl`, `skill_dir`, `automl_policy`, `automl_policy: off`, `evaluate`, `inference`, `export`, `automl_policy`, `references/skill_info.yaml`, `versions.yaml`

TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
--skill-bank "$TAO_SKILL_BANK_PATH" \
--model cosmos-embed \
--action train \
--format json |
python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
docker pull "$COSMOS_EMBED_IMAGE"

workspace/
├── data/
│   ├── msrvtt_test_1k.json
│   └── video/
│       ├── video7020.mp4
│       └── ...
├── model/
│   └── Cosmos-Embed1-224p/   # optional if using HF repo id
├── specs/
│   ├── train.yaml
│   ├── evaluate.yaml
│   ├── inference.yaml
│   ├── export_onnx.yaml
│   └── export_hf.yaml
└── results/

TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
--skill-bank "$TAO_SKILL_BANK_PATH" \
--model cosmos-embed \
--action train \
--format json |
python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
RUN_ROOT="${RUN_ROOT:-$PWD}"
DOCKER_COMMON=(
--rm --gpus all --ipc=host --network=host
--shm-size=64g
--ulimit memlock=-1
--ulimit stack=67108864
-e HF_TOKEN
-v "$RUN_ROOT/data:/data:ro"
-v "$RUN_ROOT/model:/model"
-v "$RUN_ROOT/specs:/specs:ro"
-v "$RUN_ROOT/results:/results"
)

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 train -e /specs/train.yaml results_dir=/results

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 evaluate -e /specs/evaluate.yaml results_dir=/results

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 inference -e /specs/inference.yaml \
  'inference.query.input_texts=["a man is singing on stage"]' \
  inference.k=5 \
  results_dir=/results

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_onnx.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.onnx_file=/results/export/cosmos_embed1_combined.onnx \
  results_dir=/results

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_hf.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.hf_output_dir=/results/export_hf/cosmos_embed1_hf \
  results_dir=/results

train.max_iter=1
train.validation_iter=2
train.checkpoint_iter=1
train.optim.optim=adamw
dataset.train_dataset.batch_size=1
dataset.val_dataset.batch_size=1
dataset.train_dataset.workers=0
dataset.val_dataset.workers=0

model.pretrained_model_path=null

evaluate.callbacks.embedding_visualization=false
evaluate.callbacks.max_eval_samples=8
dataset.test_dataset.batch_size=1
dataset.test_dataset.workers=0
inference.k=2
dataset.inference_dataset.batch_size=1
dataset.inference_dataset.workers=0

dataset:
  train_dataset:
    dataset_type: msrvtt
    mp4_urls: /data/video/*.mp4
    metadata: /data/msrvtt_test_1k.json

`video`, `caption`

{"video_id": "video7020", "video": "video7020.mp4", "caption": "a woman creating a fondant baby and flower"}

`.mp4`, `mp4_urls`, `video/`, `model`

model.pretrained_model_path=/model/Cosmos-Embed1-224p

model.pretrained_model_path=nvidia/Cosmos-Embed1-224p

`HF_TOKEN`, `/results/train/cosmos_embed1_model_latest.pth`

| Variant | Resolution | Frames | Embedding dim |
|---|---|---|---|
| | 224 x 224 | 8 | 256 |
| | 336 x 336 | 8 | 768 |
| | 448 x 448 | 8 | 768 |
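
The table above can be turned into a small lookup for sanity-checking a spec before launching a run. This is only a sketch: the numbers are copied from the table, while `VARIANTS`, `expected_embed_dim`, and `check_spec` are hypothetical helper names for illustration and are not part of TAO or Cosmos-Embed1.

```python
# Values copied from the variant table, keyed by square input side length (pixels).
# All names below are illustrative assumptions, not a TAO API.
VARIANTS = {
    224: {"frames": 8, "embed_dim": 256},
    336: {"frames": 8, "embed_dim": 768},
    448: {"frames": 8, "embed_dim": 768},
}


def expected_embed_dim(resolution: int) -> int:
    """Return the embedding dimension the table lists for a resolution."""
    if resolution not in VARIANTS:
        raise ValueError(
            f"unknown resolution {resolution}; expected one of {sorted(VARIANTS)}"
        )
    return VARIANTS[resolution]["embed_dim"]


def check_spec(resolution: int, embed_dim: int) -> None:
    """Raise ValueError if a resolution is paired with the wrong embedding dim."""
    expected = expected_embed_dim(resolution)
    if embed_dim != expected:
        raise ValueError(
            f"{resolution}x{resolution} expects embed_dim={expected}, got {embed_dim}"
        )
```

A check like this is cheap to run on the host before the container starts, which may be preferable to finding a resolution and embedding-dimension mismatch only when the checkpoint loads.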
`model.network.embed_dim`, `model.input_hw`, `model.network.spatial_resolution`

| Parameter | Notes |
|---|---|
| | |
| | Main training length. Use |
| | |
| | Enables LoRA. Set |
| | LoRA rank. Start with |
| | LoRA scaling factor. Start with |
| | LoRA dropout. Start with |
| | Bias policy: |
| | Optional LoRA variants. Enable one at a time and record the setting with the checkpoint. |
| | Optional module-name patterns for LoRA injection. Leave empty for the default ViT + Q-Former attention/MLP targets. |
| | Optional modules to keep fully trainable alongside LoRA. Leave empty unless preserving a task-specific head. |
| | Cache evaluation embeddings. |
| | Cache the search database for repeated retrieval. |
| | |
| | Recommended for export to avoid device mismatch issues. |
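
To make the LoRA rank and scaling-factor rows more concrete, here is a minimal sketch based on the standard LoRA formulation, in which a frozen weight of shape `d_out x d_in` receives a low-rank update `B @ A` scaled by `alpha / rank`. Whether TAO applies exactly this scaling is an assumption, the function names are hypothetical, and the 1024 x 1024 layer size is an arbitrary illustration rather than a Cosmos-Embed1 dimension.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Count parameters in the LoRA factors A (rank x d_in) and B (d_out x rank)."""
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Ratio of LoRA parameters to the parameters of the full weight matrix."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)


def lora_scale(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update in standard LoRA (alpha / rank)."""
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return alpha / rank


if __name__ == "__main__":
    # Illustrative 1024 x 1024 layer with rank 8 and alpha 16.
    print(lora_trainable_params(1024, 1024, 8))  # 16384
    print(trainable_fraction(1024, 1024, 8))     # 0.015625
    print(lora_scale(16, 8))                     # 2.0
```

Under these assumptions, doubling the rank doubles the adapter size and halves the effective scale for a fixed alpha, so the two values are usually worth tuning together.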
`model.lora.enabled=true`, `model.network.visual_encoder.transformer_engine=false`, `target_modules`, `modules_to_save`, `s3://.../*.mp4`, `/data/video/*.mp4`

s3://bucket/path/cosmos-embed/msrvtt-subset/
├── msrvtt_test_1k.json
└── video/
    ├── video7020.mp4
    └── ...

`data/`

results/
├── train/
│   ├── cosmos_embed1_model_latest.pth
│   ├── cosmos_embed1_model_<iter>.pth
│   └── experiment.yaml
├── evaluate/
│   ├── metrics.json
│   └── experiment.yaml
├── inference/
│   ├── results.json
│   └── experiment.yaml
├── export/
│   ├── cosmos_embed1_combined.onnx
│   └── export_config.yaml
└── export_hf/
    └── cosmos_embed1_hf/

| Symptom | Cause | Fix |
|---|---|---|
| | | Mount data into the container and set |
| HF download/auth failure | Missing or invalid | Accept the model terms and pass |
| LoRA injection failure | Transformer Engine visual encoder is enabled. | Set |
| ONNX/HF export complains about missing components | Export checkpoint is partial or adapter-only. | Use a full checkpoint or configure pretrained visual/text sources before export. |
| CUDA OOM | Batch/resolution too high for the GPU. | Reduce batch size, use 224p, enable LoRA, or use more GPUs. |
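
Data-path problems like the one addressed by the "mount data into the container" fix can often be caught on the host before any container starts. The sketch below checks that every record in the metadata file points at a video that exists. It assumes the metadata is either a JSON array or one JSON object per line, with the `video_id`, `video`, and `caption` fields shown in the sample record earlier. The function names are hypothetical and not part of TAO.

```python
import json
from pathlib import Path


def load_records(metadata_path):
    """Load metadata records. Assumption: a JSON array, or one JSON object per line."""
    text = Path(metadata_path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: parse each non-empty line separately.
        data = [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def find_missing_videos(metadata_path, video_dir):
    """Return the video_id (or file name) of every record whose video file is absent."""
    video_dir = Path(video_dir)
    missing = []
    for record in load_records(metadata_path):
        name = record.get("video")
        if not name or not (video_dir / name).is_file():
            missing.append(record.get("video_id", name))
    return missing
```

With the workspace layout used in this document, a call such as `find_missing_videos("data/msrvtt_test_1k.json", "data/video")` from the run root should return an empty list before training is started.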