tao-finetune-cosmos-embed
Cosmos-Embed
Cosmos-Embed1 is a joint video-text embedder for text-to-video retrieval, video-to-video search, zero-shot/kNN classification, and semantic deduplication. The packaged CLI is `cosmos-embed1` and supports `train`, `evaluate`, `inference`, and `export`.
Container image and per-action commands are in `references/skill_info.yaml`. Compact starting specs are in `references/spec_template_*.yaml`.
Train Action Policy
This model is AutoML-enabled at the model layer. Before handling any train-stage request, read `references/skill_info.yaml` and resolve the run override from either an explicit `automl_policy` value or the user's workflow request. Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as `automl_policy: off` for this run only; otherwise default to `auto`. When `automl_policy: auto`, `automl_enabled: true`, and both `schemas/train.schema.json` and `references/spec_template_train.yaml` are packaged, route the train action through `tao-skill-bank:tao-run-automl` by default with this model's `skill_dir`. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and `automl_policy`. Use direct model training only when `automl_policy: off` or the packaged train schema/template is missing; in the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.
Non-train actions such as `evaluate`, `inference`, `export`, and deploy flows stay in this model skill. The per-run `automl_policy` override does not change model metadata.
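The routing rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: `SkillInfo`, `resolve_policy`, and `route_train` are hypothetical names, the returned strings are placeholders, and treating `automl_enabled: false` as direct training is an assumption the policy text does not spell out.

```python
from dataclasses import dataclass
from typing import Optional

# Phrases the policy text says should disable AutoML for a single run.
OFF_PHRASES = ("turn off automl", "disable automl", "no hpo", "plain training")


@dataclass
class SkillInfo:
    """Hypothetical view of what is read from references/skill_info.yaml."""
    automl_enabled: bool
    has_train_schema: bool    # schemas/train.schema.json is packaged
    has_train_template: bool  # references/spec_template_train.yaml is packaged


def resolve_policy(request_text: str, explicit: Optional[str] = None) -> str:
    """Per-run automl_policy: an explicit value wins, then request phrases."""
    if explicit is not None:
        return explicit
    text = request_text.lower()
    return "off" if any(phrase in text for phrase in OFF_PHRASES) else "auto"


def route_train(info: SkillInfo, policy: str) -> str:
    """Pick a train route; the return strings are illustrative labels."""
    # Assumption: a model without automl_enabled also trains directly.
    if policy == "off" or not info.automl_enabled:
        return "direct"
    if info.has_train_schema and info.has_train_template:
        return "tao-skill-bank:tao-run-automl"
    # AutoML is enabled but not runnable until schemas are generated.
    return "direct (report: AutoML enabled, schemas missing)"
```

The override is resolved per request, so nothing here writes back to the model metadata.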
Quick Start
Use the published Cosmos-Embed container declared by `references/skill_info.yaml` and resolved through `versions.yaml`. Do not build from the private Cosmos-Embed1 source tree for normal skill use; build from source only when developing the container itself.
bash
TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
--skill-bank "$TAO_SKILL_BANK_PATH" \
--model cosmos-embed \
--action train \
--format json |
python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
docker pull "$COSMOS_EMBED_IMAGE"
Expected local workspace layout:
text
workspace/
├── data/
│   ├── msrvtt_test_1k.json
│   └── video/
│       ├── video7020.mp4
│       └── ...
├── model/
│   └── Cosmos-Embed1-224p/   # optional if using HF repo id
├── specs/
│   ├── train.yaml
│   ├── evaluate.yaml
│   ├── inference.yaml
│   ├── export_onnx.yaml
│   └── export_hf.yaml
└── results/
Use these Docker options for all actions unless the local Docker/platform skill gives a stricter environment-specific command:
bash
TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
--skill-bank "$TAO_SKILL_BANK_PATH" \
--model cosmos-embed \
--action train \
--format json |
python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
RUN_ROOT="${RUN_ROOT:-$PWD}"
DOCKER_COMMON=(
--rm --gpus all --ipc=host --network=host
--shm-size=64g
--ulimit memlock=-1
--ulimit stack=67108864
-e HF_TOKEN
-v "$RUN_ROOT/data:/data:ro"
-v "$RUN_ROOT/model:/model"
-v "$RUN_ROOT/specs:/specs:ro"
-v "$RUN_ROOT/results:/results"
)
Train:
bash
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 train -e /specs/train.yaml results_dir=/results
Evaluate:
bash
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 evaluate -e /specs/evaluate.yaml results_dir=/results
Inference:
bash
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
cosmos-embed1 inference -e /specs/inference.yaml \
'inference.query.input_texts=["a man is singing on stage"]' \
inference.k=5 \
  results_dir=/results
Export ONNX:
bash
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
cosmos-embed1 export -e /specs/export_onnx.yaml \
export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
export.onnx_file=/results/export/cosmos_embed1_combined.onnx \
  results_dir=/results
Export HuggingFace format:
bash
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
cosmos-embed1 export -e /specs/export_hf.yaml \
export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
export.hf_output_dir=/results/export_hf/cosmos_embed1_hf \
  results_dir=/results
Smoke Overrides
For a small functional check, keep the same specs and override the expensive knobs:
bash
train.max_iter=1
train.validation_iter=2
train.checkpoint_iter=1
train.optim.optim=adamw
dataset.train_dataset.batch_size=1
dataset.val_dataset.batch_size=1
dataset.train_dataset.workers=0
dataset.val_dataset.workers=0
If no local Cosmos-Embed1 pretrained checkpoint or HuggingFace token is available, set `model.pretrained_model_path=null` for a plumbing-only smoke train. Model quality is meaningless in that mode, but the train/evaluate/inference/export action paths can still be exercised.
For evaluation and inference smoke tests on a tiny subset:
bash
evaluate.callbacks.embedding_visualization=false
evaluate.callbacks.max_eval_samples=8
dataset.test_dataset.batch_size=1
dataset.test_dataset.workers=0
inference.k=2
dataset.inference_dataset.batch_size=1
dataset.inference_dataset.workers=0
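The smoke overrides are plain `key=value` arguments appended to the same CLI call used in Quick Start. A minimal sketch of assembling that in-container command follows; `build_cli_args` is a hypothetical helper, and only integer and string values are handled (booleans would need lowercasing to `true`/`false`).

```python
import shlex

# Train smoke overrides, copied from the list in this section.
SMOKE_TRAIN_OVERRIDES = {
    "train.max_iter": 1,
    "train.validation_iter": 2,
    "train.checkpoint_iter": 1,
    "train.optim.optim": "adamw",
    "dataset.train_dataset.batch_size": 1,
    "dataset.val_dataset.batch_size": 1,
    "dataset.train_dataset.workers": 0,
    "dataset.val_dataset.workers": 0,
}


def build_cli_args(action, spec_path, overrides, results_dir="/results"):
    """Build argv: cosmos-embed1 <action> -e <spec> key=value ... results_dir=..."""
    args = ["cosmos-embed1", action, "-e", spec_path]
    args += [f"{key}={value}" for key, value in overrides.items()]
    args.append(f"results_dir={results_dir}")
    return args


if __name__ == "__main__":
    argv = build_cli_args("train", "/specs/train.yaml", SMOKE_TRAIN_OVERRIDES)
    # Print a shell-quoted line that can follow `docker run ... "$COSMOS_EMBED_IMAGE"`.
    print(shlex.join(argv))
```

Keeping the overrides in a dict makes it easy to record exactly which knobs a smoke run changed relative to the spec file.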
Data Format
The MSR-VTT path expects a local video glob and a JSON metadata file:
yaml
dataset:
  train_dataset:
    dataset_type: msrvtt
    mp4_urls: /data/video/*.mp4
    metadata: /data/msrvtt_test_1k.json
List-format metadata rows must include at least `video` and `caption`:
json
{"video_id": "video7020", "video": "video7020.mp4", "caption": "a woman creating a fondant baby and flower"}
The dataset loader derives the video id from the local `.mp4` filename and filters to videos present in the metadata. If a run finds zero videos, check that `mp4_urls` points to a container-local glob and that metadata `video` names match the filenames.
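When a run reports zero videos, the glob and the metadata can be compared before launching the container. The sketch below is not the loader's own code: `check_msrvtt_inputs` is a hypothetical helper, and it assumes the metadata file is a JSON list of row objects like the example above.

```python
import glob
import json
import os


def check_msrvtt_inputs(mp4_glob, metadata_path):
    """Compare local video files against list-format metadata rows.

    Returns (matched_ids, rows_missing_keys, ids_without_file).
    """
    # Derive ids from filenames: /data/video/video7020.mp4 -> video7020
    file_ids = {
        os.path.splitext(os.path.basename(path))[0]
        for path in glob.glob(mp4_glob)
    }
    with open(metadata_path, encoding="utf-8") as handle:
        rows = json.load(handle)

    # Rows must carry at least `video` and `caption`.
    rows_missing_keys = [
        row for row in rows if "video" not in row or "caption" not in row
    ]
    meta_ids = {
        os.path.splitext(row["video"])[0] for row in rows if "video" in row
    }
    matched = sorted(file_ids & meta_ids)
    ids_without_file = sorted(meta_ids - file_ids)
    return matched, rows_missing_keys, ids_without_file
```

An empty `matched` list usually means the glob is a host path that does not exist inside the container, or the `video` names differ from the filenames.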
Model Weights
- Local HF directory: mount it under `/model` and set `model.pretrained_model_path=/model/Cosmos-Embed1-224p`.
- HuggingFace repo: set `model.pretrained_model_path=nvidia/Cosmos-Embed1-224p` and pass `HF_TOKEN` if access is gated.
- Fine-tuned checkpoint: downstream actions default to `/results/train/cosmos_embed1_model_latest.pth`.
Variants:
| Variant | Resolution | Frames | Embedding dim |
|---|---|---|---|
| `Cosmos-Embed1-224p` | 224 x 224 | 8 | 256 |
| `Cosmos-Embed1-336p` | 336 x 336 | 8 | 768 |
| `Cosmos-Embed1-448p` | 448 x 448 | 8 | 768 |
Keep `model.network.embed_dim`, `model.input_hw`, and `model.network.spatial_resolution` aligned with the selected variant.
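A quick consistency check against the variants table might look like the sketch below. The helper name `variant_problems` is hypothetical, and it assumes `model.input_hw` is a (height, width) pair; the real spec may encode the value differently.

```python
# Square input side in pixels -> embedding dim, copied from the variants table.
EMBED_DIM_BY_SIDE = {224: 256, 336: 768, 448: 768}


def variant_problems(input_hw, embed_dim):
    """Return human-readable mismatches; an empty list means aligned.

    Assumption: input_hw is a (height, width) pair.
    """
    height, width = input_hw
    if height != width or height not in EMBED_DIM_BY_SIDE:
        return [f"unsupported input size {height}x{width}"]
    expected = EMBED_DIM_BY_SIDE[height]
    if embed_dim != expected:
        return [f"embed_dim {embed_dim} != {expected} for {height}p"]
    return []
```

For example, pairing a 336 x 336 input with an embedding dim of 256 is flagged, since 256 belongs only to the 224p row.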
Important Parameters
| Parameter | Notes |
|---|---|
| `train.max_iter` | Main training length. Use `1` only for smoke tests. |
| `model.lora.enabled` | Enables LoRA. Set `model.network.visual_encoder.transformer_engine=false` when LoRA is enabled. |
| | LoRA rank. |
| | LoRA scaling factor. |
| | LoRA dropout. |
| | Bias policy. |
| | Optional LoRA variants. Enable one at a time and record the setting with the checkpoint. |
| `target_modules` | Optional module-name patterns for LoRA injection. Leave empty for the default ViT + Q-Former attention/MLP targets. |
| `modules_to_save` | Optional modules to keep fully trainable alongside LoRA. Leave empty unless preserving a task-specific head. |
| | Cache evaluation embeddings. |
| | Cache the search database for repeated retrieval. |
| | Recommended for export to avoid device mismatch issues. |
LoRA and AutoML Notes
For parameter-efficient fine-tuning, set `model.lora.enabled=true` and keep `model.network.visual_encoder.transformer_engine=false`; TAO Core's Cosmos-Embed1 config notes that PEFT cannot inject adapters into Transformer Engine layers. Treat the LoRA fields above as the first candidate parameters for manual tuning or AutoML-style search before unfreezing larger model blocks. Avoid changing `target_modules` or `modules_to_save` unless the user explicitly needs custom adapter placement.
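The LoRA and Transformer Engine constraint can be checked before a run is launched. The sketch below works on a flat dict of `key=value` overrides; `lora_problems` is a hypothetical helper, and requiring an explicit `false` when the key is absent is a conservative assumption, since the spec default is not stated here.

```python
TE_KEY = "model.network.visual_encoder.transformer_engine"


def _as_bool(value):
    """Interpret CLI-style values such as "true", "False", or real booleans."""
    return str(value).strip().lower() == "true"


def lora_problems(overrides):
    """Return problems for a flat dict of override key -> value."""
    if not _as_bool(overrides.get("model.lora.enabled", False)):
        return []  # LoRA is off, nothing to check
    if TE_KEY not in overrides:
        # Assumption: the default is unknown, so ask for an explicit value.
        return [f"set {TE_KEY}=false explicitly when LoRA is enabled"]
    if _as_bool(overrides[TE_KEY]):
        return [f"{TE_KEY} must be false when LoRA is enabled"]
    return []
```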
S3 Staging
The Cosmos-Embed1 CLI consumes local paths and Python globs, not raw `s3://.../*.mp4` URIs. For S3-backed runs, first stage a subset or the full dataset to the execution host/container filesystem, then use local paths such as `/data/video/*.mp4` in the spec.
Recommended S3 layout for staged MSR-VTT data:
text
s3://bucket/path/cosmos-embed/msrvtt-subset/
├── msrvtt_test_1k.json
└── video/
    ├── video7020.mp4
    └── ...
After downloading/syncing that prefix into the mounted `data/` directory, use the same Docker commands above.
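One way to stage the prefix is the AWS CLI's `aws s3 sync`. This document does not prescribe a staging tool, so the sketch below is an assumption: `s3_sync_command` is a hypothetical helper that only builds the command, and it presumes the AWS CLI is installed and credentialed on the execution host.

```python
def s3_sync_command(s3_prefix, run_root):
    """Build an `aws s3 sync` argv that stages an S3 prefix into <run_root>/data/."""
    if not s3_prefix.startswith("s3://"):
        raise ValueError("expected an s3:// prefix")
    source = s3_prefix.rstrip("/") + "/"
    target = run_root.rstrip("/") + "/data/"
    return ["aws", "s3", "sync", source, target]
```

The resulting list can be passed to `subprocess.run(..., check=True)`; the synced directory is the one mounted read-only at `/data` by the Docker options in Quick Start.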
Outputs
text
results/
├── train/
│   ├── cosmos_embed1_model_latest.pth
│   ├── cosmos_embed1_model_<iter>.pth
│   └── experiment.yaml
├── evaluate/
│   ├── metrics.json
│   └── experiment.yaml
├── inference/
│   ├── results.json
│   └── experiment.yaml
├── export/
│   ├── cosmos_embed1_combined.onnx
│   └── export_config.yaml
└── export_hf/
    └── cosmos_embed1_hf/
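After an action finishes, the expected artifacts from the tree above can be checked with a few lines of Python. This is a sketch with a hypothetical helper name, `missing_outputs`; it covers only the fixed filenames and skips the per-iteration checkpoints and the HuggingFace export directory.

```python
from pathlib import Path

# Fixed-name artifacts per action, taken from the results/ tree.
EXPECTED_OUTPUTS = {
    "train": ["train/cosmos_embed1_model_latest.pth", "train/experiment.yaml"],
    "evaluate": ["evaluate/metrics.json", "evaluate/experiment.yaml"],
    "inference": ["inference/results.json", "inference/experiment.yaml"],
    "export": ["export/cosmos_embed1_combined.onnx", "export/export_config.yaml"],
}


def missing_outputs(results_dir, action):
    """Return the expected relative paths that do not exist under results_dir."""
    root = Path(results_dir)
    return [rel for rel in EXPECTED_OUTPUTS[action] if not (root / rel).exists()]
```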
Known Pitfalls
| Symptom | Cause | Fix |
|---|---|---|
| Run finds zero videos | `mp4_urls` does not point to a container-local glob, or metadata names do not match the filenames. | Mount data into the container and set `mp4_urls` to a container-local glob such as `/data/video/*.mp4`. |
| HF download/auth failure | Missing or invalid `HF_TOKEN`. | Accept the model terms and pass `HF_TOKEN`. |
| LoRA injection failure | Transformer Engine visual encoder is enabled. | Set `model.network.visual_encoder.transformer_engine=false`. |
| ONNX/HF export complains about missing components | Export checkpoint is partial or adapter-only. | Use a full checkpoint or configure pretrained visual/text sources before export. |
| CUDA OOM | Batch/resolution too high for the GPU. | Reduce batch size, use 224p, enable LoRA, or use more GPUs. |