tao-generate-video-reasoning-annotations

Video Reasoning Annotation Pipeline

Generate Chain-of-Thought training datasets from videos by producing multi-level captions, structured descriptions, and QA pairs (MCQ, binary, open-ended) with step-by-step reasoning traces. Domain-agnostic by default — customize prompts for any video domain.

Purpose

Transform raw videos into CoT Q&A training data for video understanding models. VLMs (e.g., Gemini, Qwen) act as "teacher" annotators: Steps 0–1 require the model to see the video (VLM calls); Steps 2–3 are text-to-text (cheaper LLM calls).

Pipeline architecture

Step 0:  [Optional] Filter & classify videos  → Keep domain-relevant, classify anomaly vs normal
Step 1a: Global + dense captions               → VLM: narrative summary + timestamped events
Step 1b: Chunk captions                         → VLM: fixed-duration segment micro-captions
Step 1c: [Optional, anomaly only] Highlight     → LLM extracts anomaly timestamp, VLM captions clip
Step 2:  Description synthesis                  → LLM: synthesize captions into structured narrative
Step 3:  QA generation                          → LLM: MCQ, binary, open-ended with reasoning
Step 4:  Parse outputs                          → Per-task `tao-vl-reason-v1.0` JSON files
Steps are individually selectable via `workflow.steps`. The pipeline has built-in resume: each step skips already-processed videos, so re-running after a prompt tweak is safe.
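The resume behavior can be pictured with a minimal sketch. This is not the pipeline's actual implementation; the function name and the `key` field are hypothetical, and it only assumes that a step's JSONL output holds one record per processed video.

```python
import json
from pathlib import Path


def pending_videos(video_paths, step_output_jsonl, key="video_path"):
    """Return the videos that have no record yet in a step's JSONL output.

    `key` is a hypothetical field name; adjust it to whatever the step writes.
    """
    done = set()
    out = Path(step_output_jsonl)
    if out.exists():
        with out.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # tolerate blank lines
                    done.add(json.loads(line).get(key))
    # Preserve the input order so runs stay reproducible.
    return [v for v in video_paths if v not in done]
```

With this shape, a missing output file simply means every video is still pending, which matches a first run.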

Initial consultation

When the user invokes this skill, walk through these questions in order. Don't skip — getting domain and VLM access right up front prevents wasted runs.

1. Videos

  • Path to the video directory and/or a JSONL with `{"video_path": "..."}` per line.
  • Confirm format (`.mp4` preferred; `.avi`, `.mov`, `.mkv` are also picked up when the directory is walked).
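If the user has a directory but would rather hand the pipeline a JSONL, a manifest can be produced with a short helper. This is a sketch only: `write_manifest` is a hypothetical name, and matching extensions case-insensitively is an assumption that the pipeline's own directory walk may not share.

```python
import json
from pathlib import Path

VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv"}


def write_manifest(video_dir, manifest_path):
    """Write one {"video_path": ...} object per line for each video found."""
    videos = sorted(
        p for p in Path(video_dir).rglob("*")
        # Assumption: extensions are compared case-insensitively.
        if p.is_file() and p.suffix.lower() in VIDEO_EXTS
    )
    with open(manifest_path, "w", encoding="utf-8") as fh:
        for p in videos:
            fh.write(json.dumps({"video_path": str(p)}) + "\n")
    return len(videos)
```

The return value is the number of videos written, which is a quick sanity check against the count the user expects.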

2. Domain — drives prompt selection

Ask the user: "What domain are these videos from?" Choose one of the following branches:

| Domain | What to do |
| --- | --- |
| general | Use the default prompts. Set `prompts_module: ""` (or omit it). The built-in `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts` covers domain-agnostic content. |
| traffic (CCTV intersections, highways; dashcam excluded) | Use the reference module. Set `prompts_module: "nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts_traffic"`, or copy `references/prompts_traffic.py` into the user's project, tune it for their specific camera angles, then point `prompts_module` at the copy. |
| warehouse (industrial site CCTV: safety, operations, security) | Same pattern. Set `prompts_module: "nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts_warehouse"`, or copy `references/prompts_warehouse.py` and tune. |
| custom (any other domain) | Run the workshop in references/domain_adaptation.md. It walks through three phases: Phase 1, question types the user wants the model to answer; Phase 2, a caption-requirements checklist; Phase 3, filling the `[PLACEHOLDER]` markers in `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompt_template`. The two reference modules above are working examples to model after. Do this before any pipeline runs. |
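The branch table above boils down to a small lookup. The sketch below is illustrative: `prompts_module_for` and the `custom_module` argument are hypothetical names, and only the three module paths come from the table.

```python
PROMPTS_PACKAGE = "nvidia_tao_ds.auto_label.video_reasoning_annotation"

# Values are what goes into `prompts_module`; "" selects the built-in prompts.
DOMAIN_TO_PROMPTS_MODULE = {
    "general": "",
    "traffic": f"{PROMPTS_PACKAGE}.prompts_traffic",
    "warehouse": f"{PROMPTS_PACKAGE}.prompts_warehouse",
}


def prompts_module_for(domain, custom_module=None):
    """Map the user's domain answer to a `prompts_module` value."""
    domain = domain.strip().lower()
    if domain == "custom":
        if not custom_module:
            raise ValueError(
                "custom domain needs the dotted path of a filled-in prompts module"
            )
        return custom_module
    if domain not in DOMAIN_TO_PROMPTS_MODULE:
        raise ValueError(f"unknown domain: {domain!r}")
    return DOMAIN_TO_PROMPTS_MODULE[domain]
```

A user who copied and tuned a reference module would take the `custom` path with their own dotted import path.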

3. Anomaly / normal / mixed

  • Mixed dataset → `workflow.mode: "auto"` (Step 0 classifies each video).
  • Pre-split anomaly only → `workflow.mode: "anomaly"`, drop Step 0.
  • Pre-split normal only → `workflow.mode: "normal"`, drop Steps 0 and 1c.
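The three cases translate directly into a `workflow.steps` list. A minimal sketch, assuming the default step identifiers listed in the configuration summary; `steps_for_mode` is a hypothetical helper, not part of the CLI.

```python
ALL_STEPS = ["0", "1a", "1b", "1c", "2", "3", "4"]


def steps_for_mode(mode):
    """Return the steps to run for a given workflow.mode."""
    if mode == "auto":
        return list(ALL_STEPS)  # Step 0 classifies each video
    if mode == "anomaly":
        return [s for s in ALL_STEPS if s != "0"]  # data is pre-split
    if mode == "normal":
        return [s for s in ALL_STEPS if s not in {"0", "1c"}]  # no highlights
    raise ValueError(f"unknown workflow.mode: {mode!r}")
```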

4. VLM / LLM endpoint — confirm access before running

  • Gemini (default for both `vlm.backend` and `llm.backend`): the user needs `GOOGLE_API_KEY` set, or the key placed in the YAML.
  • OpenAI-compatible (Qwen via vLLM, NIM endpoint, etc.): the user provides `base_url`, `model_name`, and `api_key`.
  • Steps 2–3 are text-only, so a smaller/cheaper LLM is fine for `llm.backend` even when `vlm.backend` is a frontier video model.
If the user has no endpoint at all and wants to self-host, point them at the `skills/applications/tao-run-inference-service` skill, a workflow that stands up a network-specific TAO inference microservice locally and exposes an OpenAI-compatible endpoint. It should support Cosmos, Qwen, and Gemma; check `skills/applications/tao-run-inference-service/references/service.yaml` for the current `valid_network_arch_config_basenames` list before relying on a specific model.
If the user has no endpoint access yet and isn't prepared to set one up, stop here and help them sort that out first.
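Before any run, the presence of the required settings can be checked locally. The sketch below only verifies that values exist; it does not contact the endpoint, and `endpoint_problems` plus the flat `settings` dict are hypothetical stand-ins for however the spec is loaded.

```python
import os


def endpoint_problems(backend, settings, env=None):
    """List missing credentials/fields for a backend ("gemini" or "openai")."""
    env = os.environ if env is None else env
    problems = []
    if backend == "gemini":
        # Either the YAML carries the key or GOOGLE_API_KEY is exported.
        if not (settings.get("api_key") or env.get("GOOGLE_API_KEY")):
            problems.append("Gemini needs GOOGLE_API_KEY or an api_key in the YAML")
    elif backend == "openai":
        for field in ("base_url", "model_name", "api_key"):
            if not settings.get(field):
                problems.append(f"OpenAI-compatible backend is missing {field}")
    else:
        problems.append(f"unknown backend: {backend!r}")
    return problems
```

An empty list means the consultation can move on; anything else is the cue to stop and resolve access first.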

5. Pilot vs full run

  • Recommend a 5–10 video pilot when the domain is `custom`, when any prompt was edited, or when this is the user's first run.
  • A full run is fine for `general` / `traffic` / `warehouse` once the user has previously verified output quality on the same data type.
  • The pipeline has built-in resume, so a pilot followed by a full run does not re-process the pilot videos.
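One way to carve out a pilot subset is to sample a handful of videos into their own JSONL. This is a sketch under stated assumptions: `write_pilot_manifest` is a hypothetical helper, and the fixed seed is only there so the same pilot can be reproduced.

```python
import json
import random


def write_pilot_manifest(video_paths, manifest_path, n=8, seed=0):
    """Sample up to n videos and write them as {"video_path": ...} lines."""
    rng = random.Random(seed)  # seeded for a reproducible pilot
    paths = sorted(video_paths)
    pilot = rng.sample(paths, min(n, len(paths)))
    with open(manifest_path, "w", encoding="utf-8") as fh:
        for p in pilot:
            fh.write(json.dumps({"video_path": p}) + "\n")
    return pilot
```

Random sampling tends to give a more representative pilot than taking the first few files, which are often from the same camera or session.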

Quick start

The pipeline runs inside the TAO Toolkit container via the `auto_label` CLI:

```bash
auto_label generate -e /path/to/spec.yaml \
    results_dir=/results \
    video_reasoning_annotation.data.video_root=/videos \
    video_reasoning_annotation.vlm.gemini.api_key=$GOOGLE_API_KEY \
    video_reasoning_annotation.workflow.mode=auto
```

Generate a default spec to start from:

```bash
auto_label default_specs results_dir=/results module_name=auto_label
# then set: autolabel_type: "video_reasoning_annotation"
```

All fields support Hydra dot-notation overrides on the command line. For the full YAML reference (every field, model/endpoint setup, error patterns), see [references/configuration.md](references/configuration.md).
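When many overrides are needed, it can help to build them from a nested dict rather than typing each dotted key. The helper below is a hypothetical convenience, not part of `auto_label`; it handles plain scalar values only and does no Hydra quoting for lists or strings with spaces.

```python
def hydra_overrides(settings, prefix=""):
    """Flatten a nested dict into Hydra-style key=value override strings."""
    out = []
    for key, value in settings.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.extend(hydra_overrides(value, dotted))  # recurse into sections
        else:
            out.append(f"{dotted}={value}")
    return out


overrides = hydra_overrides({
    "results_dir": "/results",
    "video_reasoning_annotation": {
        "data": {"video_root": "/videos"},
        "workflow": {"mode": "auto"},
    },
})
print(" ".join(overrides))
```

The resulting strings can be appended after `auto_label generate -e /path/to/spec.yaml`.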

Pilot workflow

Use this when running a 5–10 video pilot:
  1. Run the pipeline on the pilot subset with the chosen `prompts_module` and `workflow.mode`.
  2. Inspect `results_dir/step_1a_caption/captions.jsonl`: are the captions accurate, and do they capture the right level of detail?
  3. Inspect `results_dir/step_3_qa/qa_output.jsonl`: are the questions meaningful, the answers correct, and the reasoning logical?
  4. If quality is insufficient, adjust the prompts (in `prompts_module` if domain-customized, or fall back to `general` if a domain module is over-tuned) and re-run. The pipeline auto-skips already-processed videos.
  5. Once satisfied, scale to the full dataset by pointing `data.video_root` (or `data.input_jsonl_files`) at the full set and re-running with the same `results_dir` (resume) or a fresh one (full re-run).
Quality compounds downstream: bad captions produce bad descriptions, which in turn produce bad QA. Focus iteration on Step 1a/1b output first; descriptions and QA usually improve once captions are right.
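For the inspection steps, a small previewer makes it easier to eyeball the first few records of a step's JSONL. This is a sketch: `preview_jsonl` is a hypothetical helper, and it prints whatever keys each record has because the per-step schema is not specified here.

```python
import json
from itertools import islice


def preview_jsonl(path, limit=3, width=200):
    """Print the first few records of a step output, truncating long values."""
    records = []
    with open(path, encoding="utf-8") as fh:
        non_blank = (line for line in fh if line.strip())
        for line in islice(non_blank, limit):
            record = json.loads(line)
            records.append(record)
            for key, value in record.items():
                text = value if isinstance(value, str) else json.dumps(
                    value, ensure_ascii=False
                )
                print(f"{key}: {text[:width]}")
            print("-" * 40)
    return records


# Example call (path taken from the pilot steps above):
# preview_jsonl("/results/step_1a_caption/captions.jsonl")
```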

Configuration summary

Key fields (full reference in references/configuration.md):

| Field | Default | Description |
| --- | --- | --- |
| `workflow.steps` | `["0","1a","1b","1c","2","3","4"]` | Which pipeline steps to execute |
| `workflow.mode` | `"auto"` | `"auto"`, `"anomaly"`, or `"normal"` |
| `vlm.backend` | `"gemini"` | `"gemini"` or `"openai"` (OpenAI-compatible) |
| `llm.backend` | `"gemini"` | Same options; text-only, cheaper model works |
| `workflow.max_workers` | `4` | Parallel threads per step (watch API rate limits) |
| `license` | `""` | Optional: written to `metadata.license` in step 4 outputs (e.g. `"CC-BY-4.0"`) |
| `description_extra` | `""` | Optional: extra text appended to per-task descriptions in step 4 metadata |
| `prompts_module` | `""` | Dotted import path to custom prompts module |

Prompts

  • Built-in (general): `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts`, domain-agnostic and used by default.
  • Template: `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompt_template`, the same 26 keys with `[PLACEHOLDER]` markers for domain customization.
  • Reference modules (working examples for the consultation's `traffic` / `warehouse` branches): references/prompts_traffic.py, references/prompts_warehouse.py.
  • Custom domains: see references/domain_adaptation.md for the full workshop and placeholder reference.
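After filling in a copy of the template, it is worth checking that no marker was left behind. The sketch below assumes prompts are module-level strings or dicts of strings and that unfilled markers contain the literal text `[PLACEHOLDER]`; both are assumptions about the template's layout, and the prompt names in the demo are made up.

```python
import types


def unfilled_prompts(module, marker="[PLACEHOLDER]"):
    """Return names of module-level prompts that still contain `marker`."""
    hits = []
    for name, value in vars(module).items():
        if name.startswith("_"):
            continue  # skip dunder/private attributes
        if isinstance(value, str) and marker in value:
            hits.append(name)
        elif isinstance(value, dict):
            hits.extend(
                f"{name}[{k!r}]" for k, v in value.items()
                if isinstance(v, str) and marker in v
            )
    return sorted(hits)


# Demo with a stand-in module; real use would import the user's prompts module.
demo = types.ModuleType("demo_prompts")
demo.CAPTION_PROMPT = "Describe the [PLACEHOLDER] in this video."
demo.QA_PROMPT = "Write three questions about the forklift activity."
print(unfilled_prompts(demo))  # ['CAPTION_PROMPT']
```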

Inputs

  • `video_root`: Directory of videos (walked recursively for `.mp4`, `.avi`, `.mov`, `.mkv`).
  • `input_jsonl_files`: List of JSONL files with `{"video_path": "..."}` per line. The `video` key is also accepted; extra fields are allowed.
  • `filter_field`: Optional boolean field to filter JSONL entries.
Provide `video_root`, `input_jsonl_files`, or both (lists merge).
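The JSONL rules above can be illustrated with a small reader. This is a sketch, not the pipeline's loader: `load_video_entries` is hypothetical, and it assumes `filter_field` keeps entries whose field is true, which the description does not spell out.

```python
import json


def load_video_entries(jsonl_files, filter_field=None):
    """Collect video paths from JSONL files, accepting `video_path` or `video`."""
    paths = []
    for jsonl in jsonl_files:
        with open(jsonl, encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                entry = json.loads(line)
                # Assumption: entries are kept when the boolean field is true.
                if filter_field and not entry.get(filter_field, False):
                    continue
                path = entry.get("video_path") or entry.get("video")
                if path:
                    paths.append(path)  # extra fields are simply ignored
    return paths
```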

Outputs

All outputs go to `results_dir/` with per-step subdirectories (`step_0_filter/`, `step_1a_caption/`, …, `step_4_output/`):
  • Steps 0–3: JSONL, one JSON object per video per line.
  • Step 4: One `<task>.json` per non-empty task type, in the `tao-vl-reason-v1.0` envelope. Up to 10 files: `mcq.json`, `mcq_openended.json`, `bcq.json`, `bcq_openended.json`, `open_qa.json`, `causal_linkage.json`, `temporal_localization.json`, `temporal_description.json`, `scene_description.json`, `video_summarization.json`.
Each step 4 file looks like:

```json
{
  "format": "tao-vl-reason-v1.0",
  "metadata": {"type": "annotation", "task": "<task>", "date": "YYYY-MM-DD",
               "description": "<per-task + description_extra>", "license": "<from config>"},
  "media_root": "<data.video_root>" | null,
  "items": [{"video_id": "...", "question": "...", "answer": "...", "reasoning": "..."}, ...]
}
```

`media_root` mirrors `data.video_root` (or `null` when unset); each item's `video_id` is the entry's video path with the `video_root` prefix stripped. Set `license` and `description_extra` in the spec to populate the metadata.
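A consumer of the step 4 files might check the envelope and rebuild full video paths along these lines. The helper names are hypothetical, and whether `video_id` keeps a leading slash after the prefix is stripped is not stated, so the sketch tolerates both.

```python
import json
import posixpath

REQUIRED_TOP_LEVEL = ("format", "metadata", "media_root", "items")


def load_task_file(path):
    """Load a step 4 <task>.json and check the envelope shown above."""
    with open(path, encoding="utf-8") as fh:
        doc = json.load(fh)
    missing = [k for k in REQUIRED_TOP_LEVEL if k not in doc]
    if missing:
        raise ValueError(f"{path}: missing keys {missing}")
    if doc["format"] != "tao-vl-reason-v1.0":
        raise ValueError(f"{path}: unexpected format {doc['format']!r}")
    return doc


def resolve_video(doc, item):
    """Join media_root and video_id; use video_id alone when media_root is null."""
    root = doc["media_root"]
    if not root:
        return item["video_id"]
    # Assumption: a leading slash on video_id, if any, is not meaningful.
    return posixpath.join(root, item["video_id"].lstrip("/"))
```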

Prerequisites

  • Container: `tao_toolkit.pyt` (resolves to `nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt` via `versions.yaml`).
  • ffmpeg / ffprobe: required for chunk captioning (Step 1b) and highlight extraction (Step 1c).
  • VLM endpoint: at least one of a Gemini API key or an OpenAI-compatible endpoint.
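The ffmpeg / ffprobe prerequisite can be checked from Python before launching Steps 1b or 1c. A minimal sketch using the standard library; `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the tools that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Missing on PATH:", ", ".join(absent))
    else:
        print("ffmpeg and ffprobe found")
```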