tao-generate-video-reasoning-annotations
Video Reasoning Annotation Pipeline
Generate Chain-of-Thought training datasets from videos by producing multi-level captions, structured descriptions, and QA pairs (MCQ, binary, open-ended) with step-by-step reasoning traces. Domain-agnostic by default — customize prompts for any video domain.
Purpose
Transform raw videos into CoT Q&A training data for video understanding models. VLMs (e.g., Gemini, Qwen) act as "teacher" annotators: Steps 0–1 require the model to see the video (VLM calls); Steps 2–3 are text-to-text (cheaper LLM calls).
Pipeline architecture
Step 0: [Optional] Filter & classify videos → Keep domain-relevant, classify anomaly vs normal
Step 1a: Global + dense captions → VLM: narrative summary + timestamped events
Step 1b: Chunk captions → VLM: fixed-duration segment micro-captions
Step 1c: [Optional, anomaly only] Highlight → LLM extracts anomaly timestamp, VLM captions clip
Step 2: Description synthesis → LLM: synthesize captions into structured narrative
Step 3: QA generation → LLM: MCQ, binary, open-ended with reasoning
Step 4: Parse outputs → Per-task `tao-vl-reason-v1.0` JSON files

Steps are individually selectable via `workflow.steps`. The pipeline has built-in resume: each step skips already-processed videos, so re-running after a prompt tweak is safe.
Initial consultation
When the user invokes this skill, walk through these questions in order. Don't skip — getting domain and VLM access right up front prevents wasted runs.
1. Videos
- Path to the video directory and/or a JSONL with `{"video_path": "..."}` per line.
- Confirm format (`.mp4` preferred; `.avi`, `.mov`, `.mkv` also walked).
2. Domain — drives prompt selection
Ask the user: "What domain are these videos from?" Choose one of the following branches:
| Domain | What to do |
|---|---|
| general | Use the default prompts. Set |
| traffic (CCTV intersections, highways; dashcam excluded) | Use the reference module. Set |
| warehouse (industrial site CCTV — safety, operations, security) | Same pattern. Set |
| custom (any other domain) | Run the workshop in references/domain_adaptation.md. It walks through: Phase 1 — question types the user wants the model to answer; Phase 2 — caption-requirements checklist; Phase 3 — fill the |
3. Anomaly / normal / mixed
- Mixed dataset → `workflow.mode: "auto"` (Step 0 classifies each video).
- Pre-split anomaly only → `workflow.mode: "anomaly"`, drop Step 0.
- Pre-split normal only → `workflow.mode: "normal"`, drop Steps 0 and 1c.
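The mode choices above can be summarized as a small lookup. This is only a sketch: the step labels (`"0"`, `"1a"`, ...) mirror the pipeline diagram and are hypothetical, so the actual values accepted by `workflow.steps` may be spelled differently in the spec.

```python
# Hypothetical step labels taken from the pipeline diagram; the real
# `workflow.steps` values may differ in the spec.
ALL_STEPS = ["0", "1a", "1b", "1c", "2", "3", "4"]


def steps_for_mode(mode: str) -> list:
    """Return the steps that apply to a given workflow.mode."""
    if mode == "auto":
        # Mixed dataset: Step 0 classifies each video.
        return list(ALL_STEPS)
    if mode == "anomaly":
        # Pre-split anomaly data: classification is unnecessary.
        return [s for s in ALL_STEPS if s != "0"]
    if mode == "normal":
        # Pre-split normal data: no classification, no anomaly highlight.
        return [s for s in ALL_STEPS if s not in ("0", "1c")]
    raise ValueError(f"unknown workflow.mode: {mode!r}")


print(steps_for_mode("normal"))  # ['1a', '1b', '2', '3', '4']
```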
4. VLM / LLM endpoint — confirm access before running
- Gemini (default for both `vlm.backend` and `llm.backend`): user needs `GOOGLE_API_KEY` set, or to put the key in the YAML.
- OpenAI-compatible (Qwen via vLLM, NIM endpoint, etc.): user provides `base_url`, `model_name`, and `api_key`.
- Steps 2–3 are text-only, so a smaller/cheaper LLM is fine for `llm.backend` even when `vlm.backend` is a frontier video model.
If the user has no endpoint at all and wants to self-host, point them at the `skills/applications/tao-run-inference-service` skill, a workflow that stands up a network-specific TAO inference microservice locally and exposes an OpenAI-compatible endpoint. It should support Cosmos, Qwen, and Gemma. Check `valid_network_arch_config_basenames` in `skills/applications/tao-run-inference-service/references/service.yaml` for the current list before relying on a specific model.
If the user doesn't have endpoint access ready and isn't ready to set one up, stop here and help them figure it out first.
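A pre-flight check along these lines could catch missing credentials before any video is sent. It is a sketch under assumptions: the backend labels `"gemini"` and `"openai"` and the flat `settings` dict are illustrative, not the spec's actual schema.

```python
import os


def missing_endpoint_settings(backend, settings, env=None):
    """List what is still missing before a backend can be called.

    `backend` labels and the `settings` layout are illustrative
    assumptions, not the real spec schema.
    """
    env = os.environ if env is None else env
    if backend == "gemini":
        # Either the environment variable or a key in the YAML is enough.
        if env.get("GOOGLE_API_KEY") or settings.get("api_key"):
            return []
        return ["GOOGLE_API_KEY"]
    if backend == "openai":
        required = ("base_url", "model_name", "api_key")
        return [name for name in required if not settings.get(name)]
    raise ValueError(f"unknown backend: {backend!r}")
```

An empty list means the consultation can move on; anything else is what to ask the user for next.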
5. Pilot vs full run
- Recommend a 5–10 video pilot when domain is `custom`, when any prompt was edited, or when this is the user's first run.
- A full run is fine for `general` / `traffic` / `warehouse` once the user has previously verified output quality on the same data type.
- The pipeline has built-in resume, so a pilot followed by a full run does not re-process the pilot videos.
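One way to carve out a pilot subset is to sample a handful of paths and write them as a JSONL in the `{"video_path": "..."}` shape the pipeline accepts. The helper name, the default of 8 videos, and the fixed seed are illustrative choices, not part of the tool.

```python
import json
import random


def write_pilot_jsonl(video_paths, out_path, n=8, seed=0):
    """Write a reproducible sample of up to n videos as a pilot JSONL."""
    paths = list(video_paths)
    rng = random.Random(seed)  # fixed seed keeps the pilot set stable
    picked = sorted(rng.sample(paths, min(n, len(paths))))
    with open(out_path, "w", encoding="utf-8") as f:
        for path in picked:
            f.write(json.dumps({"video_path": str(path)}) + "\n")
    return picked
```

Keeping the sample deterministic means a re-run after a prompt edit looks at the same videos, which makes before/after comparison easier.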
Quick start
The pipeline runs inside the TAO Toolkit container via the `auto_label` CLI:

```bash
auto_label generate -e /path/to/spec.yaml \
  results_dir=/results \
  video_reasoning_annotation.data.video_root=/videos \
  video_reasoning_annotation.vlm.gemini.api_key=$GOOGLE_API_KEY \
  video_reasoning_annotation.workflow.mode=auto
```

Generate a default spec to start from:

```bash
auto_label default_specs results_dir=/results module_name=auto_label
```

Then set `autolabel_type: "video_reasoning_annotation"`.
All fields support Hydra dot-notation overrides on the command line. For the full YAML reference (every field, model/endpoint setup, error patterns), see [references/configuration.md](references/configuration.md).
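When the command is launched from a script, the dot-notation overrides can be assembled from a dictionary. This sketch only reuses the flags and keys shown in the quick start; the helper name is hypothetical.

```python
import shlex


def build_generate_command(spec_path, overrides):
    """Assemble `auto_label generate` with Hydra dot-notation overrides."""
    args = ["auto_label", "generate", "-e", spec_path]
    # Each override becomes one `key=value` argument.
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


cmd = build_generate_command(
    "/path/to/spec.yaml",
    {
        "results_dir": "/results",
        "video_reasoning_annotation.data.video_root": "/videos",
        "video_reasoning_annotation.workflow.mode": "auto",
    },
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run` inside the container, which avoids shell quoting issues with paths that contain spaces.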
Pilot workflow
Use this when running a 5–10 video pilot:
- Run the pipeline on the pilot subset with the chosen `prompts_module` and `workflow.mode`.
- Inspect `results_dir/step_1a_caption/captions.jsonl`: are the captions accurate, and do they capture the right level of detail?
- Inspect `results_dir/step_3_qa/qa_output.jsonl`: are the questions meaningful, the answers correct, and the reasoning logical?
- If quality is insufficient: adjust the prompts (in `prompts_module` if domain-customized, or fall back to `general` if a domain module is over-tuned), and re-run. The pipeline auto-skips already-processed videos.
- Once satisfied, scale to the full dataset by pointing `data.video_root` (or `data.input_jsonl_files`) at the full set and re-running with the same `results_dir` (resume) or a fresh one (full re-run).
Quality compounds downstream — bad captions produce bad descriptions which produce bad QA. Focus iteration on Step 1a/1b output first; descriptions and QA usually improve once captions are right.
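For the inspection steps, a generic JSONL summary is often enough to spot empty or malformed records before reading samples by hand. The sketch below makes no assumption about the record fields; it only counts records and tallies whichever top-level keys appear.

```python
import json
from collections import Counter


def summarize_jsonl(path, preview=2):
    """Count records, tally top-level keys, and return a few samples."""
    key_counts = Counter()
    samples = []
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            key_counts.update(record.keys())
            if len(samples) < preview:
                samples.append(record)
    return {"records": total, "keys": dict(key_counts), "samples": samples}
```

Calling it on `results_dir/step_1a_caption/captions.jsonl` and then on `results_dir/step_3_qa/qa_output.jsonl` gives a quick check that every pilot video produced a record at both stages.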
Configuration summary
Key fields (full reference in references/configuration.md):
| Field | Default | Description |
|---|---|---|
| | Which pipeline steps to execute |
| | |
| | |
| | Same options; text-only, cheaper model works |
| | Parallel threads per step (watch API rate limits) |
| | Optional: written to |
| | Optional: extra text appended to per-task descriptions in step 4 metadata |
| | Dotted import path to custom prompts module |
Prompts
- Built-in (general): `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts`, domain-agnostic and used by default.
- Template: `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompt_template`, with the same 26 keys and `[PLACEHOLDER]` markers for domain customization.
- Reference modules (working examples for the consultation's `traffic` / `warehouse` branches): references/prompts_traffic.py, references/prompts_warehouse.py.
- Custom domains: see references/domain_adaptation.md for the full workshop and placeholder reference.
Inputs
- `video_root`: Directory of videos (walked recursively for `.mp4`, `.avi`, `.mov`, `.mkv`).
- `input_jsonl_files`: List of JSONL files with `{"video_path": "..."}` per line. The `video` key is also accepted; extra fields are allowed.
- `filter_field`: Optional boolean field to filter JSONL entries.
Provide `video_root`, `input_jsonl_files`, or both (lists merge).
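The input rules above can be approximated in a few lines, which is handy for previewing what a run would pick up. This is a sketch, not the pipeline's own loader: the function name is hypothetical, and two behaviors are assumptions, namely that `filter_field` keeps entries whose flag is truthy and that duplicates are dropped when the lists merge.

```python
import json
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}


def collect_videos(video_root=None, input_jsonl_files=(), filter_field=None):
    """Merge videos found under video_root with those listed in JSONL files."""
    videos = []
    if video_root is not None:
        videos += sorted(
            str(p) for p in Path(video_root).rglob("*")
            if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
        )
    for jsonl_path in input_jsonl_files:
        with open(jsonl_path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                entry = json.loads(line)
                # Assumed semantics: keep an entry only when the flag is truthy.
                if filter_field is not None and not entry.get(filter_field):
                    continue
                path = entry.get("video_path") or entry.get("video")
                if path:
                    videos.append(path)
    # Assumed merge behavior: drop duplicates, keep first-seen order.
    return list(dict.fromkeys(videos))
```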
Outputs
All outputs go to `results_dir/` with per-step subdirectories (`step_0_filter/`, `step_1a_caption/`, …, `step_4_output/`):
- Steps 0–3: JSONL, one JSON object per video per line.
- Step 4: One `<task>.json` per non-empty task type, in the `tao-vl-reason-v1.0` envelope. Up to 10 files: `mcq.json`, `mcq_openended.json`, `bcq.json`, `bcq_openended.json`, `open_qa.json`, `causal_linkage.json`, `temporal_localization.json`, `temporal_description.json`, `scene_description.json`, `video_summarization.json`.
Each step 4 file looks like:

```json
{
  "format": "tao-vl-reason-v1.0",
  "metadata": {"type": "annotation", "task": "<task>", "date": "YYYY-MM-DD",
               "description": "<per-task + description_extra>", "license": "<from config>"},
  "media_root": "<data.video_root>" | null,
  "items": [{"video_id": "...", "question": "...", "answer": "...", "reasoning": "..."}, ...]
}
```
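A light structural check on a Step 4 file can be written directly from the envelope shown above. It is a sketch: the function names are hypothetical, and it assumes every item carries the four keys from the example (`video_id`, `question`, `answer`, `reasoning`), which may not hold for every task type.

```python
import json

EXPECTED_FORMAT = "tao-vl-reason-v1.0"
# Assumed from the example item; some task types may use other keys.
ITEM_KEYS = {"video_id", "question", "answer", "reasoning"}


def check_envelope(doc):
    """Return a list of problems found in one Step 4 document."""
    problems = []
    if doc.get("format") != EXPECTED_FORMAT:
        problems.append(f"unexpected format: {doc.get('format')!r}")
    for name in ("type", "task", "date", "description", "license"):
        if name not in doc.get("metadata", {}):
            problems.append(f"metadata.{name} missing")
    if "media_root" not in doc:
        problems.append("media_root missing (null is allowed, absence is not)")
    for i, item in enumerate(doc.get("items", [])):
        missing = ITEM_KEYS - set(item)
        if missing:
            problems.append(f"items[{i}] missing {sorted(missing)}")
    return problems


def check_file(path):
    """Load one <task>.json file and check it."""
    with open(path, encoding="utf-8") as f:
        return check_envelope(json.load(f))
```

An empty list from `check_file` means the file matches the envelope shape; it says nothing about whether the questions or reasoning are any good, which still needs manual review.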
Prerequisites
- Container: `tao_toolkit.pyt` (resolves to `nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt` via `versions.yaml`).
- ffmpeg / ffprobe: required for chunk captioning (Step 1b) and highlight extraction (Step 1c).
- VLM endpoint: at least one of a Gemini API key or an OpenAI-compatible endpoint.