tao-generate-video-reasoning-annotations

Video Reasoning Annotation Pipeline

Generate Chain-of-Thought training datasets from videos by producing multi-level captions, structured descriptions, and QA pairs (MCQ, binary, open-ended) with step-by-step reasoning traces. Domain-agnostic by default — customize prompts for any video domain.

Purpose

Transform raw videos into CoT Q&A training data for video understanding models. VLMs (e.g., Gemini, Qwen) act as "teacher" annotators: Steps 0–1 require the model to see the video (VLM calls); Steps 2–3 are text-to-text (cheaper LLM calls).

Pipeline architecture

Step 0:  [Optional] Filter & classify videos  → Keep domain-relevant, classify anomaly vs normal
Step 1a: Global + dense captions               → VLM: narrative summary + timestamped events
Step 1b: Chunk captions                         → VLM: fixed-duration segment micro-captions
Step 1c: [Optional, anomaly only] Highlight     → LLM extracts anomaly timestamp, VLM captions clip
Step 2:  Description synthesis                  → LLM: synthesize captions into structured narrative
Step 3:  QA generation                          → LLM: MCQ, binary, open-ended with reasoning
Step 4:  Parse outputs                          → Per-task `tao-vl-reason-v1.0` JSON files

Steps are individually selectable via `workflow.steps`. The pipeline has built-in resume: each step skips already-processed videos, so re-running after a prompt tweak is safe.
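
The resume behavior can be pictured with a short sketch. This is not the pipeline's actual implementation: it assumes each step appends one JSON object per video to a JSONL file and that every record carries a `video_path` key. The helper names `load_done` and `run_step` are hypothetical.

```python
import json
from pathlib import Path


def load_done(jsonl_path):
    """Return the set of video paths already recorded in a step's JSONL output."""
    path = Path(jsonl_path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                done.add(json.loads(line)["video_path"])  # assumed key name
    return done


def run_step(videos, jsonl_path, process):
    """Process only videos with no record yet, appending results as they finish."""
    done = load_done(jsonl_path)
    pending = [v for v in videos if v not in done]
    with Path(jsonl_path).open("a", encoding="utf-8") as fh:
        for video in pending:
            record = {"video_path": video, **process(video)}
            fh.write(json.dumps(record) + "\n")
    return pending
```

Under this sketch, records that already exist are never regenerated, so outputs for those videos still reflect the prompts that were active when they were written.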

Initial consultation

When the user invokes this skill, walk through these questions in order. Don't skip — getting domain and VLM access right up front prevents wasted runs.

1. Videos

- Path to the video directory and/or a JSONL with `{"video_path": "..."}` per line.
- Confirm format (`.mp4` preferred; `.avi`, `.mov`, `.mkv` also walked).
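
To confirm the format mix before a run, a quick count per extension is usually enough. The following is a minimal sketch using only the standard library; `count_formats` is a hypothetical helper and is not part of the `auto_label` CLI.

```python
from collections import Counter
from pathlib import Path

PREFERRED = ".mp4"
ALSO_WALKED = {".avi", ".mov", ".mkv"}  # other extensions named above


def count_formats(video_dir):
    """Count files per extension, split into supported formats and everything else."""
    counts = Counter(
        p.suffix.lower() for p in Path(video_dir).rglob("*") if p.is_file()
    )
    supported = {
        ext: n for ext, n in counts.items()
        if ext == PREFERRED or ext in ALSO_WALKED
    }
    other = {ext: n for ext, n in counts.items() if ext not in supported}
    return supported, other
```

A large `other` bucket, or a directory with no `.mp4` files at all, is worth raising with the user before spending API calls.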

2. Domain — drives prompt selection

Ask the user: "What domain are these videos from?" Choose one of the following branches:

| Domain | What to do |
| --- | --- |
| general | Use the default prompts. Set `prompts_module: ""` (or omit). The built-in `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts` covers domain-agnostic content. |
| traffic (CCTV intersections, highways; dashcam excluded) | Use the reference module. Set `prompts_module: "nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts_traffic"`, or copy `references/prompts_traffic.py` into the user's project and tune for their specific camera angles, then point `prompts_module` at the copy. |
| warehouse (industrial site CCTV: safety, operations, security) | Same pattern. Set `prompts_module: "nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts_warehouse"`, or copy `references/prompts_warehouse.py` and tune. |
| custom (any other domain) | Run the workshop in references/domain_adaptation.md. It walks through: Phase 1, question types the user wants the model to answer; Phase 2, caption-requirements checklist; Phase 3, fill the `[PLACEHOLDER]` markers in `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompt_template`. The two reference modules above are working examples to model after. Do this before any pipeline runs. |
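
The branch choice reduces to a small lookup. The sketch below only restates the table; `prompts_module_for` and its `custom_module` argument are hypothetical names, and the custom branch assumes the workshop has already produced an importable module.

```python
PROMPTS_PACKAGE = "nvidia_tao_ds.auto_label.video_reasoning_annotation"

DOMAIN_TO_PROMPTS_MODULE = {
    "general": "",  # empty string selects the built-in prompts
    "traffic": f"{PROMPTS_PACKAGE}.prompts_traffic",
    "warehouse": f"{PROMPTS_PACKAGE}.prompts_warehouse",
}


def prompts_module_for(domain, custom_module=None):
    """Return the prompts_module value for a consultation answer."""
    if domain == "custom":
        if not custom_module:
            raise ValueError("run the domain adaptation workshop before any pipeline run")
        return custom_module
    if domain not in DOMAIN_TO_PROMPTS_MODULE:
        raise ValueError(f"unknown domain: {domain!r}")
    # A tuned copy of a reference module takes precedence when one is supplied.
    return custom_module or DOMAIN_TO_PROMPTS_MODULE[domain]
```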

3. Anomaly / normal / mixed

- Mixed dataset → `workflow.mode: "auto"` (Step 0 classifies each video).
- Pre-split anomaly only → `workflow.mode: "anomaly"`, drop Step 0.
- Pre-split normal only → `workflow.mode: "normal"`, drop Steps 0 and 1c.
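
The three cases can be expressed as a step list per mode. This is a sketch of the rule as stated here, starting from the default `workflow.steps` value; `steps_for_mode` is a hypothetical helper.

```python
ALL_STEPS = ["0", "1a", "1b", "1c", "2", "3", "4"]  # default workflow.steps

# Steps to drop for each workflow.mode, per the three cases above.
DROPPED = {
    "auto": set(),
    "anomaly": {"0"},
    "normal": {"0", "1c"},
}


def steps_for_mode(mode):
    """Return the workflow.steps list that matches a workflow.mode choice."""
    if mode not in DROPPED:
        raise ValueError(f"workflow.mode must be one of {sorted(DROPPED)}")
    return [step for step in ALL_STEPS if step not in DROPPED[mode]]
```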

4. VLM / LLM endpoint — confirm access before running

- Gemini (default for both `vlm.backend` and `llm.backend`): user needs `GOOGLE_API_KEY` set, or to put the key in the YAML.
- OpenAI-compatible (Qwen via vLLM, NIM endpoint, etc.): user provides `base_url`, `model_name`, and `api_key`.
- Steps 2–3 are text-only, so a smaller/cheaper LLM is fine for `llm.backend` even when `vlm.backend` is a frontier video model.

If the user has no endpoint at all and wants to self-host, point them at the `skills/applications/tao-run-inference-service` skill, a workflow that stands up a network-specific TAO inference microservice locally and exposes an OpenAI-compatible endpoint. It should support Cosmos, Qwen, and Gemma. Check `skills/applications/tao-run-inference-service/references/service.yaml` for the current `valid_network_arch_config_basenames` list before relying on a specific model.

If the user doesn't have endpoint access ready and isn't ready to set one up, stop here and help them figure it out first.
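
A small preflight check makes the "confirm access before running" step concrete. This is a sketch only: `check_backend` is a hypothetical helper, the `settings` dict stands in for whatever the user put in the YAML, and it verifies that credentials are present, not that the endpoint actually responds.

```python
import os


def check_backend(backend, settings, env=None):
    """Return the settings still missing for one backend; an empty list means ready."""
    env = os.environ if env is None else env
    if backend == "gemini":
        has_key = bool(settings.get("api_key") or env.get("GOOGLE_API_KEY"))
        return [] if has_key else ["GOOGLE_API_KEY (or api_key in the YAML)"]
    if backend == "openai":
        required = ("base_url", "model_name", "api_key")
        return [name for name in required if not settings.get(name)]
    raise ValueError(f"unknown backend: {backend!r}")
```

Running it once for `vlm.backend` and once for `llm.backend` covers the case where the two use different endpoints.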

5. Pilot vs full run

- Recommend a 5–10 video pilot when domain is `custom`, when any prompt was edited, or when this is the user's first run.
- A full run is fine for `general` / `traffic` / `warehouse` once the user has previously verified output quality on the same data type.
- The pipeline has built-in resume, so a pilot followed by a full run does not re-process the pilot videos.
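
One way to choose the pilot videos is a seeded random sample, so the same subset can be reproduced later. This is a sketch under that assumption; `pick_pilot` is a hypothetical helper and the pipeline does not require random selection.

```python
import random


def pick_pilot(video_paths, size=10, seed=0):
    """Pick a reproducible pilot subset; return everything if the pool is small."""
    pool = sorted(video_paths)  # sort first so input order does not affect the sample
    if len(pool) <= size:
        return pool
    return sorted(random.Random(seed).sample(pool, size))
```

The selected paths can then be written one `{"video_path": "..."}` object per line to a JSONL file and passed to the pipeline as its input list.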

Quick start

The pipeline runs inside the TAO Toolkit container via the `auto_label` CLI:

```bash
auto_label generate -e /path/to/spec.yaml \
    results_dir=/results \
    video_reasoning_annotation.data.video_root=/videos \
    video_reasoning_annotation.vlm.gemini.api_key=$GOOGLE_API_KEY \
    video_reasoning_annotation.workflow.mode=auto
```

Generate a default spec to start from:

```bash
auto_label default_specs results_dir=/results module_name=auto_label
```

Then set `autolabel_type: "video_reasoning_annotation"` in the generated spec.

All fields support Hydra dot-notation overrides on the command line. For the full YAML reference (every field, model/endpoint setup, error patterns), see [references/configuration.md](references/configuration.md).
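
To see how a dot-notation override lands in the nested spec, the following sketch applies `key.path=value` strings to a plain dictionary. It is an illustration of the idea, not Hydra's implementation: Hydra also parses value types and validates keys, which `apply_override` (a hypothetical helper) does not.

```python
def apply_override(config, override):
    """Apply one 'a.b.c=value' override to a nested dict, in place."""
    dotted_key, _, value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})  # create intermediate levels as needed
    node[leaf] = value  # kept as a string in this sketch
    return config
```

For example, `video_reasoning_annotation.workflow.mode=normal` sets the `mode` key inside the `workflow` block of the `video_reasoning_annotation` section.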


Pilot workflow

Use this when running a 5–10 video pilot:

1. Run the pipeline on the pilot subset with the chosen `prompts_module` and `workflow.mode`.
2. Inspect `results_dir/step_1a_caption/captions.jsonl`: are the captions accurate, and do they capture the right level of detail?
3. Inspect `results_dir/step_3_qa/qa_output.jsonl`: are the questions meaningful, the answers correct, and the reasoning logical?
4. If quality is insufficient, adjust the prompts (in `prompts_module` if domain-customized, or fall back to `general` if a domain module is over-tuned) and re-run. The pipeline auto-skips already-processed videos.
5. Once satisfied, scale to the full dataset by pointing `data.video_root` (or `data.input_jsonl_files`) at the full set and re-running with the same `results_dir` (resume) or a fresh one (full re-run).

Quality compounds downstream — bad captions produce bad descriptions which produce bad QA. Focus iteration on Step 1a/1b output first; descriptions and QA usually improve once captions are right.
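
For the inspection steps, reading the first few records of a JSONL file is often enough to judge quality. This sketch makes no assumption about the record fields, since the exact schema of the intermediate files is not given here; `peek_jsonl` is a hypothetical helper.

```python
import json
from itertools import islice


def peek_jsonl(path, n=3):
    """Return the first n records of a JSONL file for manual review."""
    with open(path, encoding="utf-8") as fh:
        non_empty = (line for line in fh if line.strip())
        return [json.loads(line) for line in islice(non_empty, n)]
```

Pointing it at `results_dir/step_1a_caption/captions.jsonl` first, then at `results_dir/step_3_qa/qa_output.jsonl`, follows the caption-first review order recommended above.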

Configuration summary

Key fields (full reference in references/configuration.md):

| Field | Default | Description |
| --- | --- | --- |
| `workflow.steps` | `["0","1a","1b","1c","2","3","4"]` | Which pipeline steps to execute |
| `workflow.mode` | `"auto"` | `"auto"`, `"anomaly"`, or `"normal"` |
| `vlm.backend` | `"gemini"` | `"gemini"` or `"openai"` (OpenAI-compatible) |
| `llm.backend` | `"gemini"` | Same options; text-only, cheaper model works |
| `workflow.max_workers` | `4` | Parallel threads per step (watch API rate limits) |
| `license` | `""` | Optional: written to `metadata.license` in step 4 outputs (e.g. `"CC-BY-4.0"`) |
| `description_extra` | `""` | Optional: extra text appended to per-task descriptions in step 4 metadata |
| `prompts_module` | `""` | Dotted import path to custom prompts module |
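
The allowed values in the table can be checked before launching a run. The sketch below works on a flat mapping keyed by the field names as written in the table, which is an assumption made for brevity; the real spec is nested YAML and the pipeline performs its own validation. `validate` is a hypothetical helper.

```python
VALID_MODES = {"auto", "anomaly", "normal"}
VALID_BACKENDS = {"gemini", "openai"}
KNOWN_STEPS = ["0", "1a", "1b", "1c", "2", "3", "4"]


def validate(cfg):
    """Return a list of problems found in a flat {field_name: value} mapping."""
    problems = []
    if cfg.get("workflow.mode", "auto") not in VALID_MODES:
        problems.append("workflow.mode must be auto, anomaly, or normal")
    for key in ("vlm.backend", "llm.backend"):
        if cfg.get(key, "gemini") not in VALID_BACKENDS:
            problems.append(f"{key} must be gemini or openai")
    unknown = [s for s in cfg.get("workflow.steps", KNOWN_STEPS) if s not in KNOWN_STEPS]
    if unknown:
        problems.append(f"unknown workflow.steps entries: {unknown}")
    workers = cfg.get("workflow.max_workers", 4)
    if not isinstance(workers, int) or workers < 1:
        problems.append("workflow.max_workers must be a positive integer")
    return problems
```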

Prompts

- Built-in (general): `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts` (domain-agnostic, used by default).
- Template: `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompt_template`, which has the same 26 keys with `[PLACEHOLDER]` markers for domain customization.
- Reference modules (working examples for the consultation's `traffic` / `warehouse` branches): references/prompts_traffic.py, references/prompts_warehouse.py.
- Custom domains: see references/domain_adaptation.md for the full workshop and placeholder reference.
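
Because `prompts_module` is a dotted import path, a custom module can be sanity-checked with the standard `importlib` machinery before a run. The sketch below makes two assumptions that are not confirmed here: that prompts are module-level string attributes, and that unfilled markers contain the literal text `[PLACEHOLDER]`. Both helper names are hypothetical.

```python
import importlib

DEFAULT_PROMPTS = "nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts"


def load_prompts(prompts_module="", default=DEFAULT_PROMPTS):
    """Import the custom prompts module, or the default when the field is empty."""
    return importlib.import_module(prompts_module or default)


def placeholder_keys(module):
    """List public string attributes that still contain a [PLACEHOLDER] marker."""
    return sorted(
        name
        for name, value in vars(module).items()
        if isinstance(value, str)
        and not name.startswith("_")
        and "[PLACEHOLDER]" in value
    )
```

An import error here means the copy is not on the Python path inside the container; a non-empty `placeholder_keys` result means the workshop's Phase 3 is not finished.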

Inputs

- `video_root`: Directory of videos (walked recursively for `.mp4`, `.avi`, `.mov`, `.mkv`).
- `input_jsonl_files`: List of JSONL files with `{"video_path": "..."}` per line. The `video` key is also accepted; extra fields are allowed.
- `filter_field`: Optional boolean field to filter JSONL entries.

Provide `video_root`, `input_jsonl_files`, or both (lists merge).
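
The merge of the two input sources can be sketched as follows. This is an approximation of the described behavior, not the pipeline's code: it assumes `filter_field` keeps entries whose field is truthy and that duplicates are dropped, neither of which is stated here. `collect_videos` is a hypothetical helper.

```python
import json
from pathlib import Path

VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv"}


def collect_videos(video_root=None, input_jsonl_files=(), filter_field=None):
    """Merge videos found under video_root with entries listed in JSONL files."""
    videos = []
    if video_root:
        videos += sorted(
            str(p)
            for p in Path(video_root).rglob("*")
            if p.is_file() and p.suffix.lower() in VIDEO_EXTS
        )
    for jsonl in input_jsonl_files:
        with open(jsonl, encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                entry = json.loads(line)
                if filter_field and not entry.get(filter_field):
                    continue  # assumed semantics: keep entries where the field is true
                videos.append(entry.get("video_path") or entry["video"])
    return list(dict.fromkeys(videos))  # drop duplicates, keep order (an assumption)
```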

Outputs

All outputs go to `results_dir/` with per-step subdirectories (`step_0_filter/`, `step_1a_caption/`, …, `step_4_output/`):

- Steps 0–3: JSONL, one JSON object per video per line.
- Step 4: One `<task>.json` per non-empty task type, in the `tao-vl-reason-v1.0` envelope. Up to 10 files: `mcq.json`, `mcq_openended.json`, `bcq.json`, `bcq_openended.json`, `open_qa.json`, `causal_linkage.json`, `temporal_localization.json`, `temporal_description.json`, `scene_description.json`, `video_summarization.json`.

Each step 4 file looks like:

```json
{
  "format": "tao-vl-reason-v1.0",
  "metadata": {"type": "annotation", "task": "<task>", "date": "YYYY-MM-DD",
               "description": "<per-task + description_extra>", "license": "<from config>"},
  "media_root": "<data.video_root>" | null,
  "items": [{"video_id": "...", "question": "...", "answer": "...", "reasoning": "..."}, ...]
}
```

`media_root` mirrors `data.video_root` (or `null` when unset); each item's `video_id` is the entry's video path with the `video_root` prefix stripped. Set `license` and `description_extra` in the spec to populate the metadata.
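
The relationship between `media_root`, `video_id`, and the metadata fields can be sketched as below. It mirrors the envelope shown above but is not the pipeline's Step 4 code: the helper names are hypothetical, paths are treated as POSIX paths, and a path outside `video_root` is left unchanged, which is an assumption.

```python
from datetime import date
from pathlib import PurePosixPath


def make_video_id(video_path, video_root=None):
    """Strip the video_root prefix from a path; keep the path as-is when root is unset."""
    if not video_root:
        return video_path
    try:
        return str(PurePosixPath(video_path).relative_to(video_root))
    except ValueError:
        return video_path  # not under video_root; left unchanged in this sketch


def build_envelope(task, items, video_root=None, license_text="", description=""):
    """Assemble a tao-vl-reason-v1.0 style dictionary for one task type."""
    return {
        "format": "tao-vl-reason-v1.0",
        "metadata": {
            "type": "annotation",
            "task": task,
            "date": date.today().isoformat(),  # YYYY-MM-DD
            "description": description,
            "license": license_text,
        },
        "media_root": video_root,  # None serializes to JSON null
        "items": items,
    }
```

A consumer can recover the full video path by joining `media_root` and `video_id` when `media_root` is not null.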

Prerequisites

- Container: `tao_toolkit.pyt` (resolves to `nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt` via `versions.yaml`).
- ffmpeg / ffprobe: required for chunk captioning (Step 1b) and highlight extraction (Step 1c).
- VLM endpoint: at least one, either a Gemini API key or an OpenAI-compatible endpoint.
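
The ffmpeg / ffprobe requirement is easy to verify up front with the standard library. This is a minimal sketch; `missing_tools` is a hypothetical helper, and the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe"), which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list means Steps 1b and 1c have the binaries they need; anything else should be fixed inside the container before the run starts.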
