tao-finetune-cosmos-embed

Cosmos-Embed

Cosmos-Embed1 is a joint video-text embedder for text-to-video retrieval, video-to-video search, zero-shot/kNN classification, and semantic deduplication. The packaged CLI is `cosmos-embed1` and supports `train`, `evaluate`, `inference`, and `export`.
Container image and per-action commands are in `references/skill_info.yaml`. Compact starting specs are in `references/spec_template_*.yaml`.

Train Action Policy

This model is AutoML-enabled at the model layer. Before handling any train-stage request, read `references/skill_info.yaml` and resolve the run override from either an explicit `automl_policy` value or the user's workflow request. Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as `automl_policy: off` for this run only; otherwise default to `auto`. When `automl_policy: auto`, `automl_enabled: true`, and both `schemas/train.schema.json` and `references/spec_template_train.yaml` are packaged, route the train action through `tao-skill-bank:tao-run-automl` by default with this model's `skill_dir`. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and `automl_policy`. Use direct model training only when `automl_policy: off` or the packaged train schema/template is missing; in the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.
Non-train actions such as `evaluate`, `inference`, `export`, and deploy flows stay in this model skill. The per-run `automl_policy` override does not change model metadata.
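The routing rules above can be sketched as two small pure functions. This is only an illustration of the decision order: the names `resolve_automl_policy` and `route_train` are hypothetical, and the phrase matching is a plain substring check assumed here, not the skill bank's actual logic.

```python
# Phrases the policy text treats as a request to disable AutoML for this run.
OFF_PHRASES = ("turn off automl", "disable automl", "no hpo", "plain training")


def resolve_automl_policy(explicit=None, request_text=""):
    """Return 'off' or 'auto' for one run; model metadata is never modified."""
    if explicit is not None:
        return explicit  # an explicit automl_policy value wins
    text = request_text.lower()
    return "off" if any(phrase in text for phrase in OFF_PHRASES) else "auto"


def route_train(policy, automl_enabled, schema_packaged, template_packaged):
    """Return (route, message) for the train action.

    schema_packaged / template_packaged say whether schemas/train.schema.json
    and references/spec_template_train.yaml exist in the model's skill_dir.
    """
    packaged = schema_packaged and template_packaged
    if policy == "auto" and automl_enabled and packaged:
        return "tao-skill-bank:tao-run-automl", None
    if policy == "auto" and automl_enabled:
        # Missing schema/template: fall back to direct training and say why.
        return "direct", (
            "AutoML is enabled but not runnable for this model "
            "until schemas are generated."
        )
    return "direct", None
```

Keeping the file checks outside the function makes the routing easy to exercise without a real `skill_dir`; a caller would compute the two booleans with `Path.is_file()`.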

Quick Start

Use the published Cosmos-Embed container declared by `references/skill_info.yaml` and resolved through `versions.yaml`. Do not build from the private Cosmos-Embed1 source tree for normal skill use; build from source only when developing the container itself.
bash
TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
  python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
    --skill-bank "$TAO_SKILL_BANK_PATH" \
    --model cosmos-embed \
    --action train \
    --format json |
  python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
docker pull "$COSMOS_EMBED_IMAGE"
Expected local workspace layout:
text
workspace/
├── data/
│   ├── msrvtt_test_1k.json
│   └── video/
│       ├── video7020.mp4
│       └── ...
├── model/
│   └── Cosmos-Embed1-224p/        # optional if using HF repo id
├── specs/
│   ├── train.yaml
│   ├── evaluate.yaml
│   ├── inference.yaml
│   ├── export_onnx.yaml
│   └── export_hf.yaml
└── results/
Use these Docker options for all actions unless the local Docker/platform skill gives a stricter environment-specific command:
bash
TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
  python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
    --skill-bank "$TAO_SKILL_BANK_PATH" \
    --model cosmos-embed \
    --action train \
    --format json |
  python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
RUN_ROOT="${RUN_ROOT:-$PWD}"
DOCKER_COMMON=(
  --rm --gpus all --ipc=host --network=host
  --shm-size=64g
  --ulimit memlock=-1
  --ulimit stack=67108864
  -e HF_TOKEN
  -v "$RUN_ROOT/data:/data:ro"
  -v "$RUN_ROOT/model:/model"
  -v "$RUN_ROOT/specs:/specs:ro"
  -v "$RUN_ROOT/results:/results"
)
Train:
bash
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 train -e /specs/train.yaml results_dir=/results
Evaluate:
bash
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 evaluate -e /specs/evaluate.yaml results_dir=/results
Inference:
bash
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 inference -e /specs/inference.yaml \
  'inference.query.input_texts=["a man is singing on stage"]' \
  inference.k=5 \
  results_dir=/results
Export ONNX:
bash
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_onnx.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.onnx_file=/results/export/cosmos_embed1_combined.onnx \
  results_dir=/results
Export HuggingFace format:
bash
docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_hf.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.hf_output_dir=/results/export_hf/cosmos_embed1_hf \
  results_dir=/results

Smoke Overrides

For a small functional check, keep the same specs and override the expensive knobs:
bash
train.max_iter=1
train.validation_iter=2
train.checkpoint_iter=1
train.optim.optim=adamw
dataset.train_dataset.batch_size=1
dataset.val_dataset.batch_size=1
dataset.train_dataset.workers=0
dataset.val_dataset.workers=0
If no local Cosmos-Embed1 pretrained checkpoint or HuggingFace token is available, set `model.pretrained_model_path=null` for a plumbing-only smoke train. The model quality is meaningless in that mode, but the train/evaluate/inference/export action paths can still be exercised.
For evaluation and inference smoke tests on a tiny subset:
bash
evaluate.callbacks.embedding_visualization=false
evaluate.callbacks.max_eval_samples=8
dataset.test_dataset.batch_size=1
dataset.test_dataset.workers=0
inference.k=2
dataset.inference_dataset.batch_size=1
dataset.inference_dataset.workers=0
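As a rough sketch of how these overrides attach to a command, the helper below appends `key=value` pairs to the in-container CLI call in the same position the Quick Start inference example uses. The helper names `cli_args` and `as_shell` are illustrative and not part of the packaged CLI.

```python
import shlex

# Train smoke overrides copied from the list above.
SMOKE_TRAIN = [
    "train.max_iter=1",
    "train.validation_iter=2",
    "train.checkpoint_iter=1",
    "train.optim.optim=adamw",
    "dataset.train_dataset.batch_size=1",
    "dataset.val_dataset.batch_size=1",
    "dataset.train_dataset.workers=0",
    "dataset.val_dataset.workers=0",
]


def cli_args(action, spec, overrides, results_dir="/results"):
    """Build the in-container argv: cosmos-embed1 <action> -e <spec> k=v ..."""
    return ["cosmos-embed1", action, "-e", spec, *overrides, f"results_dir={results_dir}"]


def as_shell(argv):
    """Quote each argument so the line can be pasted after `docker run ... IMAGE`."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Plumbing-only smoke train with no pretrained weights.
    argv = cli_args("train", "/specs/train.yaml",
                    SMOKE_TRAIN + ["model.pretrained_model_path=null"])
    print(as_shell(argv))
```

Building the argument list in one place keeps the smoke run and the real run on the same spec file, which is the point of overriding only the expensive knobs.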

Data Format

The MSR-VTT path expects a local video glob and a JSON metadata file:
yaml
dataset:
  train_dataset:
    dataset_type: msrvtt
    mp4_urls: /data/video/*.mp4
    metadata: /data/msrvtt_test_1k.json
List-format metadata rows must include at least `video` and `caption`:
json
{"video_id": "video7020", "video": "video7020.mp4", "caption": "a woman creating a fondant baby and flower"}
The dataset loader derives the video id from the local `.mp4` filename and filters to videos present in the metadata. If a run finds zero videos, check that `mp4_urls` points to a container-local glob and that metadata `video` names match the filenames.
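Before launching a run, the filename/metadata match described above can be checked on the host. The sketch below assumes the metadata file is a JSON list of rows and that matching is done on the filename stem; the loader's exact matching rule is not spelled out here, so treat both as assumptions. The function names are hypothetical.

```python
import json
from pathlib import Path


def match_videos(mp4_paths, metadata_rows):
    """Return (usable_ids, ids_missing_on_disk).

    Video ids come from the .mp4 filename stem; metadata ids come from the
    stem of each row's `video` field (an assumed matching rule).
    """
    wanted = {Path(row["video"]).stem for row in metadata_rows}
    found = {Path(p).stem for p in mp4_paths}
    return sorted(found & wanted), sorted(wanted - found)


def check_dataset(video_dir, metadata_file):
    """Filesystem wrapper: load the JSON list and glob the video directory."""
    rows = json.loads(Path(metadata_file).read_text())
    bad = [i for i, row in enumerate(rows) if "video" not in row or "caption" not in row]
    if bad:
        raise ValueError(f"rows missing video/caption (first few indexes): {bad[:5]}")
    return match_videos(Path(video_dir).glob("*.mp4"), rows)
```

An empty first list from `check_dataset("data/video", "data/msrvtt_test_1k.json")` would correspond to the zero-videos case.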

Model Weights

  • Local HF directory: mount it under `/model` and set `model.pretrained_model_path=/model/Cosmos-Embed1-224p`.
  • HuggingFace repo: set `model.pretrained_model_path=nvidia/Cosmos-Embed1-224p` and pass `HF_TOKEN` if access is gated.
  • Fine-tuned checkpoint: downstream actions default to `/results/train/cosmos_embed1_model_latest.pth`.
Variants:
| Variant | Resolution | Frames | Embedding dim |
|---|---|---|---|
| Cosmos-Embed1-224p | 224 x 224 | 8 | 256 |
| Cosmos-Embed1-336p | 336 x 336 | 8 | 768 |
| Cosmos-Embed1-448p | 448 x 448 | 8 | 768 |
Keep `model.network.embed_dim`, `model.input_hw`, and `model.network.spatial_resolution` aligned with the selected variant.
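A small lookup can catch a spec that mixes values from two variants. The sketch below copies the numbers from the variants table and assumes `model.input_hw` is a `[height, width]` pair; the value format of `model.network.spatial_resolution` is not stated here, so it is deliberately left unchecked. `variant_mismatches` is a hypothetical helper.

```python
# Values copied from the variants table; helper names are illustrative.
VARIANTS = {
    "Cosmos-Embed1-224p": {"resolution": (224, 224), "frames": 8, "embed_dim": 256},
    "Cosmos-Embed1-336p": {"resolution": (336, 336), "frames": 8, "embed_dim": 768},
    "Cosmos-Embed1-448p": {"resolution": (448, 448), "frames": 8, "embed_dim": 768},
}


def variant_mismatches(variant, embed_dim, input_hw):
    """Return human-readable mismatches between spec values and the table."""
    expected = VARIANTS[variant]
    problems = []
    if embed_dim != expected["embed_dim"]:
        problems.append(
            f"model.network.embed_dim={embed_dim}, expected {expected['embed_dim']}"
        )
    # Assumes input_hw is given as [height, width].
    if tuple(input_hw) != expected["resolution"]:
        problems.append(
            f"model.input_hw={list(input_hw)}, expected {list(expected['resolution'])}"
        )
    return problems
```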

Important Parameters

| Parameter | Notes |
|---|---|
| `train.num_gpus` | `1` for single GPU, `>1` auto-launches `torchrun`, `-1` auto-detects visible GPUs. |
| `train.max_iter` | Main training length. Use `1` only for smoke testing. |
| `train.optim.optim` | `fused_adamw` is faster when available; `adamw` is safer for smoke and portability. |
| `model.lora.enabled` | Enables LoRA. Set `model.network.visual_encoder.transformer_engine=false` when LoRA is on. |
| `model.lora.lora_rank` | LoRA rank. Start with `8`; try `4`, `8`, or `16` for manual or AutoML-style sweeps. |
| `model.lora.lora_alpha` | LoRA scaling factor. Start with `16`; keep near `2 * lora_rank` unless experiments show otherwise. |
| `model.lora.lora_dropout` | LoRA dropout. Start with `0.1`; sweep `0.0`, `0.05`, and `0.1` for small datasets. |
| `model.lora.bias` | Bias policy: `none`, `all`, or `lora_only`. Keep `none` unless intentionally training biases. |
| `model.lora.use_rslora` / `use_dora` | Optional LoRA variants. Enable one at a time and record the setting with the checkpoint. |
| `model.lora.target_modules` | Optional module-name patterns for LoRA injection. Leave empty for the default ViT + Q-Former attention/MLP targets. |
| `model.lora.modules_to_save` | Optional modules to keep fully trainable alongside LoRA. Leave empty unless preserving a task-specific head. |
| `evaluate.load_dataset_pkl` / `save_dataset_pkl` | Cache evaluation embeddings. |
| `inference.load_dataset_pkl` / `save_dataset_pkl` | Cache the search database for repeated retrieval. |
| `export.mode` | `video`, `text`, `combined`, or `huggingface`. |
| `export.on_cpu` | Recommended for export to avoid device mismatch issues. |

LoRA and AutoML Notes

For parameter-efficient fine-tuning, set `model.lora.enabled=true` and keep `model.network.visual_encoder.transformer_engine=false`; TAO Core's Cosmos-Embed1 config notes that PEFT cannot inject adapters into Transformer Engine layers. Treat the LoRA fields above as the first candidate parameters for manual tuning or AutoML-style search before unfreezing larger model blocks. Avoid changing `target_modules` or `modules_to_save` unless the user explicitly needs custom adapter placement.
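For a manual sweep, the suggested starting values can be expanded into one list of CLI overrides per candidate. This is a minimal sketch: the grid (ranks 4/8/16, dropouts 0.0/0.05/0.1, alpha fixed at twice the rank) follows the guidance in the parameter notes, while the generator name `lora_sweep` and the choice of a full Cartesian product are assumptions made for illustration.

```python
from itertools import product

RANKS = (4, 8, 16)
DROPOUTS = (0.0, 0.05, 0.1)


def lora_sweep():
    """Yield one list of key=value overrides per (rank, dropout) candidate."""
    for rank, dropout in product(RANKS, DROPOUTS):
        yield [
            "model.lora.enabled=true",
            # PEFT cannot inject adapters into Transformer Engine layers.
            "model.network.visual_encoder.transformer_engine=false",
            f"model.lora.lora_rank={rank}",
            f"model.lora.lora_alpha={2 * rank}",  # keep alpha near 2 * rank
            f"model.lora.lora_dropout={dropout}",
        ]
```

Each yielded list can be appended to a `cosmos-embed1 train -e /specs/train.yaml ...` call; writing each candidate to its own `results_dir` keeps the checkpoints from overwriting one another.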

S3 Staging

The Cosmos-Embed1 CLI consumes local paths and Python globs, not raw `s3://.../*.mp4` URIs. For S3-backed runs, first stage a subset or full dataset to the execution host/container filesystem, then use local paths such as `/data/video/*.mp4` in the spec.
Recommended S3 layout for staged MSR-VTT data:
text
s3://bucket/path/cosmos-embed/msrvtt-subset/
├── msrvtt_test_1k.json
└── video/
    ├── video7020.mp4
    └── ...
After downloading/syncing that prefix into the mounted `data/` directory, use the same Docker commands above.
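One possible way to stage the prefix is the AWS CLI's `aws s3 sync`, driven from Python. The sketch below only builds the argument list and the matching container-local spec values. It assumes the AWS CLI is installed with working credentials, and both helper names are hypothetical.

```python
def s3_sync_argv(s3_prefix, run_root):
    """argv for `aws s3 sync` that copies the staged prefix into <run_root>/data."""
    src = s3_prefix.rstrip("/") + "/"
    dest = run_root.rstrip("/") + "/data"
    return ["aws", "s3", "sync", src, dest]


def container_paths(metadata_name="msrvtt_test_1k.json"):
    """Spec values that match the /data mount used by the Docker commands."""
    return {
        "mp4_urls": "/data/video/*.mp4",
        "metadata": f"/data/{metadata_name}",
    }
```

The list can be run with `subprocess.run(s3_sync_argv(prefix, run_root), check=True)`; the spec should then point at the container paths rather than the `s3://` URI.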

Outputs

text
results/
├── train/
│   ├── cosmos_embed1_model_latest.pth
│   ├── cosmos_embed1_model_<iter>.pth
│   └── experiment.yaml
├── evaluate/
│   ├── metrics.json
│   └── experiment.yaml
├── inference/
│   ├── results.json
│   └── experiment.yaml
├── export/
│   ├── cosmos_embed1_combined.onnx
│   └── export_config.yaml
└── export_hf/
    └── cosmos_embed1_hf/
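A quick post-run check can compare what an action left behind against the tree above. This sketch hard-codes the file names from that tree, so the ONNX entry only applies when `export.onnx_file` keeps the path used in the Quick Start command. `EXPECTED` and `missing_outputs` are illustrative names.

```python
from pathlib import PurePosixPath

# Files each action is expected to leave under results/, per the tree above.
EXPECTED = {
    "train": ["train/cosmos_embed1_model_latest.pth", "train/experiment.yaml"],
    "evaluate": ["evaluate/metrics.json", "evaluate/experiment.yaml"],
    "inference": ["inference/results.json", "inference/experiment.yaml"],
    "export": ["export/cosmos_embed1_combined.onnx", "export/export_config.yaml"],
}


def missing_outputs(action, present):
    """Return expected relative paths for `action` that are absent from `present`."""
    have = {str(PurePosixPath(p)) for p in present}
    return [p for p in EXPECTED[action] if p not in have]
```

`present` can be gathered with `[p.relative_to(root).as_posix() for p in Path(root).rglob("*") if p.is_file()]`, where `root` is the host-side `results/` directory.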

Known Pitfalls

| Symptom | Cause | Fix |
|---|---|---|
| `MSRVTTDataset: 0 videos found` | `mp4_urls` is not a local glob or metadata filenames do not match videos. | Mount data into the container and set `mp4_urls=/data/video/*.mp4`. |
| HF download/auth failure | Missing or invalid `HF_TOKEN`, or model agreement not accepted. | Accept the model terms and pass `-e HF_TOKEN`. |
| LoRA injection failure | Transformer Engine visual encoder is enabled. | Set `model.network.visual_encoder.transformer_engine=false`. |
| ONNX/HF export complains about missing components | Export checkpoint is partial or adapter-only. | Use a full checkpoint or configure pretrained visual/text sources before export. |
| CUDA OOM | Batch/resolution too high for the GPU. | Reduce batch size, use 224p, enable LoRA, or use more GPUs. |