tao-finetune-cosmos-embed

Cosmos-Embed

Cosmos-Embed1 is a joint video-text embedder for text-to-video retrieval, video-to-video search, zero-shot/kNN classification, and semantic deduplication. The packaged CLI is `cosmos-embed1` and supports `train`, `evaluate`, `inference`, and `export`. Container image and per-action commands are in `references/skill_info.yaml`. Compact starting specs are in `references/spec_template_*.yaml`.

Train Action Policy

This model is AutoML-enabled at the model layer. Before handling any train-stage request, read `references/skill_info.yaml` and resolve the run override from either an explicit `automl_policy` value or the user's workflow request. Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as `automl_policy: off` for this run only; otherwise default to `auto`. When `automl_policy: auto`, `automl_enabled: true`, and both `schemas/train.schema.json` and `references/spec_template_train.yaml` are packaged, route the train action through `tao-skill-bank:tao-run-automl` by default with this model's `skill_dir`. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and `automl_policy`. Use direct model training only when `automl_policy: off` or the packaged train schema/template is missing; in the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.

Non-train actions such as `evaluate`, `inference`, `export`, and deploy flows stay in this model skill. The per-run `automl_policy` override does not change model metadata.
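
The routing rules above can be sketched as two small shell functions. This is an illustrative sketch only: the function names `resolve_automl_policy` and `select_train_route`, the route labels they print, and the choice to fall back to direct training when `automl_enabled` is not `true` are assumptions made for the example, not part of the skill bank.

```bash
# Hypothetical helper: map a free-form user request to a per-run automl_policy.
resolve_automl_policy() {
  # Lowercase the request so phrase matching is case-insensitive.
  request=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$request" in
    *"turn off automl"*|*"disable automl"*|*"no hpo"*|*"plain training"*)
      echo off ;;
    *)
      echo auto ;;
  esac
}

# Hypothetical helper: pick the train route.
# Args: automl_policy, automl_enabled (true/false), skill_dir.
select_train_route() {
  policy=$1
  enabled=$2
  skill_dir=$3
  if [ "$policy" = off ] || [ "$enabled" != true ]; then
    # Assumption: a model without automl_enabled trains directly.
    echo direct
  elif [ -f "$skill_dir/schemas/train.schema.json" ] &&
       [ -f "$skill_dir/references/spec_template_train.yaml" ]; then
    echo automl
  else
    # AutoML is enabled but not runnable until schemas are generated.
    echo direct-missing-schema
  fi
}
```

A caller would report the `direct-missing-schema` case to the user before falling back to direct training.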

该模型在模型层支持AutoML。处理任何训练阶段的请求前，请先阅读

references/skill_info.yaml

，并通过显式的

automl_policy

值或用户的工作流请求确定运行覆盖规则。将「关闭AutoML」、「禁用AutoML」、「无需HPO」或「普通训练」这类表述视为本次运行的

automl_policy: off

；否则默认设为

auto

。当

automl_policy: auto

、

automl_enabled: true

，且

schemas/train.schema.json

和

references/spec_template_train.yaml

均已打包时，默认将训练操作通过

tao-skill-bank:tao-run-automl

路由，并传入该模型的

skill_dir

。保留数据集、配置文件、输出目录、GPU/平台设置、父检查点以及

automl_policy

的工作流/应用覆盖规则。仅当

automl_policy: off

或打包的训练schema/模板缺失时，才使用直接模型训练；若schema缺失，需告知用户AutoML已启用，但该模型暂无法运行AutoML，需先生成schema。

非训练操作（如

evaluate

、

inference

、

export

及部署流程）仍在本模型技能中执行。单次运行的

automl_policy

覆盖规则不会修改模型元数据。

Quick Start

Use the published Cosmos-Embed container declared by `references/skill_info.yaml` and resolved through `versions.yaml`. Do not build from the private Cosmos-Embed1 source tree for normal skill use; build from source only when developing the container itself.

bash

TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
  python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
    --skill-bank "$TAO_SKILL_BANK_PATH" \
    --model cosmos-embed \
    --action train \
    --format json |
  python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
docker pull "$COSMOS_EMBED_IMAGE"

Expected local workspace layout:

text

workspace/
├── data/
│   ├── msrvtt_test_1k.json
│   └── video/
│       ├── video7020.mp4
│       └── ...
├── model/
│   └── Cosmos-Embed1-224p/        # optional if using HF repo id
├── specs/
│   ├── train.yaml
│   ├── evaluate.yaml
│   ├── inference.yaml
│   ├── export_onnx.yaml
│   └── export_hf.yaml
└── results/
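
The directory skeleton above can be created up front so that every bind mount has an existing source directory. A minimal sketch, assuming a hypothetical helper named `init_workspace`; it only creates empty directories and does not download data, weights, or specs.

```bash
# Hypothetical helper: create the expected workspace skeleton under a root directory.
init_workspace() {
  ws_root=${1:-$PWD}
  mkdir -p \
    "$ws_root/data/video" \
    "$ws_root/model" \
    "$ws_root/specs" \
    "$ws_root/results"
}

# Example: init_workspace "$PWD/workspace"
```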

Use these Docker options for all actions unless the local Docker/platform skill gives a stricter environment-specific command:

bash

TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
  python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
    --skill-bank "$TAO_SKILL_BANK_PATH" \
    --model cosmos-embed \
    --action train \
    --format json |
  python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
RUN_ROOT="${RUN_ROOT:-$PWD}"
DOCKER_COMMON=(
  --rm --gpus all --ipc=host --network=host
  --shm-size=64g
  --ulimit memlock=-1
  --ulimit stack=67108864
  -e HF_TOKEN
  -v "$RUN_ROOT/data:/data:ro"
  -v "$RUN_ROOT/model:/model"
  -v "$RUN_ROOT/specs:/specs:ro"
  -v "$RUN_ROOT/results:/results"
)

Train:

bash

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 train -e /specs/train.yaml results_dir=/results

Evaluate:

bash

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 evaluate -e /specs/evaluate.yaml results_dir=/results

Inference:

bash

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 inference -e /specs/inference.yaml \
  'inference.query.input_texts=["a man is singing on stage"]' \
  inference.k=5 \
  results_dir=/results

Export ONNX:

bash

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_onnx.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.onnx_file=/results/export/cosmos_embed1_combined.onnx \
  results_dir=/results

Export HuggingFace format:

bash

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_hf.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.hf_output_dir=/results/export_hf/cosmos_embed1_hf \
  results_dir=/results

Smoke Overrides

For a small functional check, keep the same specs and override the expensive knobs:

bash

train.max_iter=1
train.validation_iter=2
train.checkpoint_iter=1
train.optim.optim=adamw
dataset.train_dataset.batch_size=1
dataset.val_dataset.batch_size=1
dataset.train_dataset.workers=0
dataset.val_dataset.workers=0

If no local Cosmos-Embed1 pretrained checkpoint or HuggingFace token is available, set `model.pretrained_model_path=null` for a plumbing-only smoke train. Model quality is meaningless in that mode, but the train/evaluate/inference/export action paths can still be exercised.

For evaluation and inference smoke tests on a tiny subset:

bash

evaluate.callbacks.embedding_visualization=false
evaluate.callbacks.max_eval_samples=8
dataset.test_dataset.batch_size=1
dataset.test_dataset.workers=0
inference.k=2
dataset.inference_dataset.batch_size=1
dataset.inference_dataset.workers=0
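
The overrides are appended to the normal command line as extra `key=value` arguments, in the same way as `results_dir=/results` in the Quick Start commands. The sketch below assembles the smoke-train arguments and prints them as a dry run; the helper name `smoke_train_cmd` is hypothetical, and the override values are the ones listed above.

```bash
# Hypothetical helper: print the train command with the smoke overrides appended.
# Any extra arguments (for example model.pretrained_model_path=null) go last.
smoke_train_cmd() {
  set -- cosmos-embed1 train -e /specs/train.yaml results_dir=/results \
    train.max_iter=1 \
    train.validation_iter=2 \
    train.checkpoint_iter=1 \
    train.optim.optim=adamw \
    dataset.train_dataset.batch_size=1 \
    dataset.val_dataset.batch_size=1 \
    dataset.train_dataset.workers=0 \
    dataset.val_dataset.workers=0 \
    "$@"
  printf '%s\n' "$*"
}

# To run for real, pass the printed words to the container (unquoted on purpose,
# so the shell splits them back into separate arguments):
#   docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" $(smoke_train_cmd)
```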

Data Format

The MSR-VTT path expects a local video glob and a JSON metadata file:

yaml

dataset:
  train_dataset:
    dataset_type: msrvtt
    mp4_urls: /data/video/*.mp4
    metadata: /data/msrvtt_test_1k.json

List-format metadata rows must include at least `video` and `caption`:

json

{"video_id": "video7020", "video": "video7020.mp4", "caption": "a woman creating a fondant baby and flower"}

The dataset loader derives the video id from the local `.mp4` filename and filters to videos present in the metadata. If a run finds zero videos, check that `mp4_urls` points to a container-local glob and that metadata `video` names match the filenames.
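
Before launching a run, the filename match can be checked on the host. The sketch below is a rough text-level check, not a JSON parser: it assumes each `video` value is a plain string without escaped quotes, and the helper name `missing_videos` is hypothetical. It prints every metadata `video` entry that has no matching file in the video directory.

```bash
# Hypothetical helper: list metadata "video" names with no file in the video directory.
# Args: metadata JSON path, video directory.
missing_videos() {
  metadata=$1
  video_dir=$2
  # Extract each "video": "<name>" pair ("video_id" does not match this pattern),
  # strip it down to <name>, then test for the file.
  grep -o '"video"[[:space:]]*:[[:space:]]*"[^"]*"' "$metadata" |
    sed 's/.*"\([^"]*\)"$/\1/' |
    while IFS= read -r name; do
      [ -f "$video_dir/$name" ] || printf '%s\n' "$name"
    done
}

# Example: missing_videos data/msrvtt_test_1k.json data/video
```

Empty output means every metadata row has a local video file.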

Model Weights

Local HF directory: mount it under `/model` and set `model.pretrained_model_path=/model/Cosmos-Embed1-224p`.

HuggingFace repo: set `model.pretrained_model_path=nvidia/Cosmos-Embed1-224p` and pass `HF_TOKEN` if access is gated.

Fine-tuned checkpoint: downstream actions default to `/results/train/cosmos_embed1_model_latest.pth`.

Variants:

Variant	Resolution	Frames	Embedding dim
`Cosmos-Embed1-224p`	224 x 224	8	256
`Cosmos-Embed1-336p`	336 x 336	8	768
`Cosmos-Embed1-448p`	448 x 448	8	768

Keep `model.network.embed_dim`, `model.input_hw`, and `model.network.spatial_resolution` aligned with the selected variant.
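
One way to avoid mismatches is to derive the values from the variant name instead of typing them by hand. The sketch below encodes only the table above; the helper names are hypothetical, and writing `embed_dim` as a bare integer override is an assumption. The value formats of `model.input_hw` and `model.network.spatial_resolution` are not stated here, so take those from the packaged spec template.

```bash
# Hypothetical helper: print "<resolution> <embed_dim>" for a Cosmos-Embed1 variant.
variant_dims() {
  case "$1" in
    Cosmos-Embed1-224p) echo "224 256" ;;
    Cosmos-Embed1-336p) echo "336 768" ;;
    Cosmos-Embed1-448p) echo "448 768" ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print the embed_dim override for a variant.
variant_embed_dim_override() {
  dims=$(variant_dims "$1") || return 1
  # Keep the part after the space, which is the embedding dimension.
  echo "model.network.embed_dim=${dims#* }"
}
```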

Important Parameters

Parameter	Notes
`train.num_gpus`	`1` for single GPU, `>1` auto-launches `torchrun`, `-1` auto-detects visible GPUs.
`train.max_iter`	Main training length. Use `1` only for smoke testing.
`train.optim.optim`	`fused_adamw` is faster when available; `adamw` is safer for smoke and portability.
`model.lora.enabled`	Enables LoRA. Set `model.network.visual_encoder.transformer_engine=false` when LoRA is on.
`model.lora.lora_rank`	LoRA rank. Start with `8`; try `4`, `8`, or `16` for manual or AutoML-style sweeps.
`model.lora.lora_alpha`	LoRA scaling factor. Start with `16`; keep near `2 * lora_rank` unless experiments show otherwise.
`model.lora.lora_dropout`	LoRA dropout. Start with `0.1`; sweep `0.0`, `0.05`, and `0.1` for small datasets.
`model.lora.bias`	Bias policy: `none`, `all`, or `lora_only`. Keep `none` unless intentionally training biases.
`model.lora.use_rslora` / `use_dora`	Optional LoRA variants. Enable one at a time and record the setting with the checkpoint.
`model.lora.target_modules`	Optional module-name patterns for LoRA injection. Leave empty for the default ViT + Q-Former attention/MLP targets.
`model.lora.modules_to_save`	Optional modules to keep fully trainable alongside LoRA. Leave empty unless preserving a task-specific head.
`evaluate.load_dataset_pkl` / `save_dataset_pkl`	Cache evaluation embeddings.
`inference.load_dataset_pkl` / `save_dataset_pkl`	Cache the search database for repeated retrieval.
`export.mode`	`video`, `text`, `combined`, or `huggingface`.
`export.on_cpu`	Recommended for export to avoid device mismatch issues.

LoRA and AutoML Notes

For parameter-efficient fine-tuning, set `model.lora.enabled=true` and keep `model.network.visual_encoder.transformer_engine=false`; TAO Core's Cosmos-Embed1 config notes that PEFT cannot inject adapters into Transformer Engine layers. Treat the LoRA fields above as the first candidate parameters for manual tuning or AutoML-style search before unfreezing larger model blocks. Avoid changing `target_modules` or `modules_to_save` unless the user explicitly needs custom adapter placement.
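
As a starting point for a manual sweep, the LoRA settings can be generated from the rank alone. This is a sketch using the starting values from the parameter table (`lora_alpha` at `2 * lora_rank`, dropout `0.1`, bias `none`); the helper name `lora_overrides` is hypothetical, and the lines it prints are meant to be appended to a `cosmos-embed1 train` command as overrides.

```bash
# Hypothetical helper: print LoRA train overrides for a given rank (default 8).
lora_overrides() {
  rank=${1:-8}
  printf '%s\n' \
    model.lora.enabled=true \
    model.network.visual_encoder.transformer_engine=false \
    "model.lora.lora_rank=$rank" \
    "model.lora.lora_alpha=$((2 * rank))" \
    model.lora.lora_dropout=0.1 \
    model.lora.bias=none
}

# Example sweep over the ranks suggested in the parameter table:
#   for rank in 4 8 16; do lora_overrides "$rank"; done
```

`target_modules` and `modules_to_save` are deliberately left out so the defaults apply.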

S3 Staging

The Cosmos-Embed1 CLI consumes local paths and Python globs, not raw `s3://.../*.mp4` URIs. For S3-backed runs, first stage a subset or full dataset to the execution host/container filesystem, then use local paths such as `/data/video/*.mp4` in the spec.

Recommended S3 layout for staged MSR-VTT data:

text

s3://bucket/path/cosmos-embed/msrvtt-subset/
├── msrvtt_test_1k.json
└── video/
    ├── video7020.mp4
    └── ...

After downloading/syncing that prefix into the mounted `data/` directory, use the same Docker commands above.
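
The staging step might look like the sketch below. It assumes the AWS CLI (`aws s3 sync`) is installed and configured on the execution host; the source does not prescribe a specific tool, and the helper name `stage_msrvtt_from_s3` and the `DRY_RUN` switch are hypothetical.

```bash
# Hypothetical helper: sync an S3 prefix into the local data directory.
# Args: S3 prefix (s3://...), local data directory.
# Set DRY_RUN=1 to print the command instead of running it.
stage_msrvtt_from_s3() {
  s3_prefix=$1
  data_dir=$2
  case "$s3_prefix" in
    s3://*) ;;
    *) echo "expected an s3:// prefix, got: $s3_prefix" >&2; return 1 ;;
  esac
  mkdir -p "$data_dir"
  # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty.
  ${DRY_RUN:+echo} aws s3 sync "$s3_prefix" "$data_dir"
}

# Example:
#   stage_msrvtt_from_s3 s3://bucket/path/cosmos-embed/msrvtt-subset/ "$RUN_ROOT/data"
```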

Outputs

text

results/
├── train/
│   ├── cosmos_embed1_model_latest.pth
│   ├── cosmos_embed1_model_<iter>.pth
│   └── experiment.yaml
├── evaluate/
│   ├── metrics.json
│   └── experiment.yaml
├── inference/
│   ├── results.json
│   └── experiment.yaml
├── export/
│   ├── cosmos_embed1_combined.onnx
│   └── export_config.yaml
└── export_hf/
    └── cosmos_embed1_hf/
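
After a full train/evaluate/inference/export pass, the key artifacts from the tree above can be checked in one step. A minimal sketch with a hypothetical helper named `missing_outputs`; it checks only the fixed filenames shown above and skips the `<iter>` checkpoints and the HuggingFace export directory.

```bash
# Hypothetical helper: print expected artifacts that are missing under a results directory.
missing_outputs() {
  results_dir=$1
  for rel in \
    train/cosmos_embed1_model_latest.pth \
    evaluate/metrics.json \
    inference/results.json \
    export/cosmos_embed1_combined.onnx
  do
    [ -e "$results_dir/$rel" ] || printf '%s\n' "$rel"
  done
}

# Example: missing_outputs "$RUN_ROOT/results"
```

Empty output means all four artifacts are present.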

Known Pitfalls

Symptom	Cause	Fix
`MSRVTTDataset: 0 videos found`	`mp4_urls` is not a local glob or metadata filenames do not match videos.	Mount data into the container and set `mp4_urls=/data/video/*.mp4`.
HF download/auth failure	Missing or invalid `HF_TOKEN`, or model agreement not accepted.	Accept the model terms and pass `-e HF_TOKEN`.
LoRA injection failure	Transformer Engine visual encoder is enabled.	Set `model.network.visual_encoder.transformer_engine=false`.
ONNX/HF export complains about missing components	Export checkpoint is partial or adapter-only.	Use a full checkpoint or configure pretrained visual/text sources before export.
CUDA OOM	Batch/resolution too high for the GPU.	Reduce batch size, use 224p, enable LoRA, or use more GPUs.
