tao-mine-aoi-images
DEFT Mining and Embedding Skill
You are the operator of the DEFT embed-then-mine workflow for VCN AOI. Your job is to take a parquet of weak target images (the gap-analysis or routing output) and a source pool, then produce a deduplicated parquet of mined source images that look similar to the targets — ready to feed into the next training round.
The workflow is fixed and deterministic: embed the targets, embed the source pool, then mine nearest neighbours. The two embedding parquets produced by the first two steps are the inputs to the mining step. There is no iterative search, no clustering pass, and no human-in-the-loop selection; depth comes from picking the right encoder and the right `topn`, not from a multi-phase investigation.

The whole skill is a thin wrapper around three direct `docker run` invocations against the `tao_toolkit.data_services` image declared in `versions.yaml` (resolved at runtime; see Setup). The container's entrypoint takes `<category> <action> -e <spec.yaml> [hydra overrides...]`: `embedding image_embeddings -e <embedding_spec.yaml> …` for embedding and `tmm nearest_neighbors -e <mining_spec.yaml> …` for mining. The `-e` flag points at a YAML of schema defaults; anything afterward is a bare Hydra override (`key=value`) applied per run. There is no `dataset` keyword inside the container: that is the TAO launcher's pillar prefix and is dropped here. Schema keys can be renamed between data-services releases, so when in doubt introspect once per image with `docker run --rm "$DS_IMAGE" embedding image_embeddings --cfg=job` and `... tmm nearest_neighbors --cfg=job`. See `references/invocation.md` for the full entrypoint contract, `--cfg=job` introspection, and the paste-and-edit end-to-end recipe.
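The entrypoint shape described above can be sketched as a small argv builder. This is only an illustration of the `<category> <action> -e <spec.yaml> [hydra overrides...]` contract: the helper name `entrypoint_args` and the parquet file names are hypothetical, and the `docker run` prefix is left out.

```python
def entrypoint_args(category, action, spec_path, **overrides):
    """Build `<category> <action> -e <spec> key=value ...` as an argv list."""
    args = [category, action, "-e", str(spec_path)]
    # Bare Hydra overrides: plain key=value, with no `dataset` prefix.
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


# Hypothetical file names, used only to show the resulting shape.
embed_args = entrypoint_args(
    "embedding", "image_embeddings", "embedding_spec.yaml",
    input_parquet="targets.parquet",
    output_parquet="target_embeddings.parquet",
)
print(" ".join(embed_args))
```

Keeping the arguments as a list rather than a single string avoids quoting problems if a path ever contains spaces.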
Inputs
- Target parquet: the gap-analysis output, typically `mining_gaps.parquet` from `tao-route-visual-changenet-samples` (or `gaps.parquet` from `tao-analyze-gaps-visual-changenet` if routing was skipped). Required column: `filepath`. If `label` is also present, label-aware filtering during mining is available; otherwise the mining task silently no-ops the filter.
- Source pool: a parquet of candidate images to mine against, with a `filepath` column. If the user only has a CSV, convert it to a parquet with the same columns before Step 2. For label-aware filtering, the pool must also carry a `label` column.
- Embedding spec file: a YAML containing `model`, `model_path`, `batch_size`, and (only when `model_path` is a TAO `.pth` / `.ckpt`) `model_config_path`. Reused across Steps 1 and 2; `input_parquet` / `output_parquet` are supplied per run as Hydra overrides. The same spec MUST drive both embedding steps: embeddings from different encoders are not comparable, and mismatched encoders are the most common cause of "the mined images look unrelated" reports.
- Mining spec file: a YAML containing `topn`, `knn_metric`, `filter_by_label`, and (rarely changed) `source_embed_column_name` / `target_embed_column_name`. `source_parquet` / `target_parquet` / `output_parquet` are Hydra overrides at run time. SigLIP and CLIP embeddings should use `knn_metric: cosine`. When `filter_by_label: true` but either embedding parquet lacks a `label` column, the container logs a warning and proceeds without filtering.
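The column requirements above can be checked before any container run. The sketch below inspects the header of a CSV source pool using only the standard library; the helper name `check_pool_columns` and its return convention are hypothetical, and a parquet input would need a parquet reader instead.

```python
import csv


def check_pool_columns(csv_path):
    """Return True if label-aware filtering is possible for this CSV pool.

    Raises ValueError when the required `filepath` column is missing.
    """
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    if "filepath" not in header:
        raise ValueError(f"{csv_path}: required column 'filepath' not found")
    # `label` is optional; without it the label filter cannot apply.
    return "label" in header
```

A `False` result is not an error: it only means that mining would run without label filtering for this pool.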
Setup
The mining and embedding tasks live inside the `tao_toolkit.data_services` image declared in `versions.yaml`. Resolve the concrete URI once at the top of the run, then confirm Docker, the NVIDIA container toolkit, and a GPU are present before anything else:

```bash
# Resolve tao_toolkit.data_services → concrete nvcr.io/... URI from versions.yaml
DS_IMAGE=$(python3 -c "import yaml,os; print(yaml.safe_load(open(os.environ['TAO_SKILL_BANK_PATH']+'/versions.yaml'))['images']['tao_toolkit']['data_services'])")
echo "DS_IMAGE=$DS_IMAGE"
docker info > /dev/null && echo "OK: docker"
nvidia-smi > /dev/null && echo "OK: GPU"
# Pull the image only if it is not already present locally
docker image inspect "$DS_IMAGE" > /dev/null || docker pull "$DS_IMAGE"
```

`TAO_SKILL_BANK_PATH` is exported by the plugin's `session_start` hook. If it is unset (e.g. running outside the Claude Code plugin), point it at the skill-bank repo root before resolving. A GPU is required for both the encoder forward pass and the cuML/cuDF k-NN search; both steps will fail without CUDA.
**Path mounting.** Every host path the container reads or writes — input parquets, output dirs, and the source-pool image root — must be bind-mounted. The simplest, most predictable approach mounts the workspace root with **identical paths** inside and outside the container so absolute paths in the parquet args resolve the same way on both sides:
```bash
WORKSPACE=<absolute path that contains all parquets, outputs, and the source-pool images>
DOCKER="docker run --gpus all --rm --ipc=host --user $(id -u):$(id -g) -v $WORKSPACE:$WORKSPACE -w $WORKSPACE $DS_IMAGE"
```

Reuse `$DOCKER` for the three invocations below.

**CSV source pool.** If the source pool is provided only as a CSV, convert it to a parquet up front with `pd.read_csv(...).to_parquet(..., index=False)`, preserving the `filepath` column verbatim (and `label` if present). Do not add a path prefix: the container reads input parquets as-is and the `$WORKSPACE` mount keeps host and container paths identical.

**Author the two spec files once per iteration.** Both files live under `$WORKSPACE` so the `-e` argument resolves on both sides of the mount. Per-run values stay out of the spec and are passed as Hydra overrides at invocation time. The defaults are `model: SigLIP`, `model_path: google/siglip-base-patch16-224`, `batch_size: 64` for embedding, and `topn: 5`, `knn_metric: cosine`, `filter_by_label: "false"` (quoted, because the schema reads it as a string) for mining. Use `cosine` for SigLIP/CLIP, `euclidean` / `manhattan` otherwise; add `model_config_path` only when `model_path` is a TAO checkpoint. Any field can still be overridden inline at the CLI (e.g. `topn=10`); Hydra applies CLI overrides on top of the spec.

See `references/invocation.md` for the verbatim spec-file templates, the CSV conversion snippet, and the full mounting and image-resolution detail.
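As a rough sketch of the two spec files described above, the snippet below writes them with the stated defaults. The key names and values come from this section; the flat top-level YAML layout, the `write_specs` helper, and its directory argument are assumptions for illustration, so check them against the `--cfg=job` output and the templates in `references/invocation.md` before relying on them.

```python
from pathlib import Path

# Defaults quoted in the Setup text; the flat layout is an assumption.
EMBEDDING_SPEC = """\
model: SigLIP
model_path: google/siglip-base-patch16-224
batch_size: 64
"""

# filter_by_label stays quoted because the schema reads it as a string.
MINING_SPEC = """\
topn: 5
knn_metric: cosine
filter_by_label: "false"
"""


def write_specs(spec_dir):
    """Write both spec files into spec_dir and return their paths."""
    spec_dir = Path(spec_dir)
    spec_dir.mkdir(parents=True, exist_ok=True)
    embedding_path = spec_dir / "embedding_spec.yaml"
    mining_path = spec_dir / "mining_spec.yaml"
    embedding_path.write_text(EMBEDDING_SPEC)
    mining_path.write_text(MINING_SPEC)
    return embedding_path, mining_path
```

Pointing `spec_dir` at a folder under `$WORKSPACE` keeps the `-e` path valid on both sides of the mount.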
Method
Three commands, in order. Steps 1 and 2 each write an embeddings parquet, and Step 3 consumes both. Run them as plain Bash; the `$DOCKER` alias from Setup handles the container, GPU, and mounts. Every invocation follows the same shape: `-e <spec>` for the baked-in defaults, then a handful of Hydra overrides for run-specific paths.

Step 1: Embed the target images
```bash
$DOCKER embedding image_embeddings -e <embedding_spec.yaml> \
  input_parquet=<target_parquet> output_parquet=<target_embeddings_parquet>
```

Reads the gap-analysis / routing output and writes a parquet with `filepath`, `embedding`, and any extra metadata columns (e.g. `label`, `siamese_score`, `weakness`) carried forward verbatim. Print the output schema (`pd.read_parquet(...).columns`) to stdout so the script-check hook can confirm the embedding column exists. To override `model` / `model_path` / `batch_size` for one run without editing the spec, append them as Hydra overrides.
Step 2: Embed the source pool
```bash
$DOCKER embedding image_embeddings -e <embedding_spec.yaml> \
  input_parquet=<source_pool_parquet> output_parquet=<source_embeddings_parquet>
```

Same command shape as Step 1, applied to the source pool. Use the identical `embedding_spec.yaml` as Step 1, and do not override `model` / `model_path` / `batch_size` differently here: mismatched encoder configs across the two steps produce non-comparable embeddings.
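One way to guard against the mismatch described above is to compare the encoder-related overrides of the two embedding runs before launching Step 2. This is a small sketch: the `ENCODER_KEYS` tuple mirrors the fields named in this section, while the function name and the dict-of-overrides representation are assumptions.

```python
ENCODER_KEYS = ("model", "model_path", "batch_size")


def encoder_mismatches(step1_overrides, step2_overrides):
    """Return encoder keys whose override differs between the two steps.

    A key set in one step but absent in the other counts as a mismatch,
    since an override must apply to both steps or neither.
    """
    return [
        key for key in ENCODER_KEYS
        if step1_overrides.get(key) != step2_overrides.get(key)
    ]
```

Run-specific keys such as `input_parquet` and `output_parquet` are deliberately ignored, because they are expected to differ between the two steps.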
Step 3: Mine nearest neighbours
```bash
$DOCKER tmm nearest_neighbors -e <mining_spec.yaml> \
  source_parquet=<source_embeddings_parquet> \
  target_parquet=<target_embeddings_parquet> output_parquet=<mined_parquet>
```

For each target embedding, finds the `topn` closest source embeddings under the chosen metric, deduplicates across targets, and writes a single-column (`filepath`) parquet of unique mined source paths. The container also drops a `mining_summary.txt` next to the output parquet with: query count, neighbour count, duplicates removed, and (when label filtering is on) kept-vs-dropped pair counts. Tweak `topn`, `knn_metric`, or `filter_by_label` via inline Hydra override when sweeping; there is no need to rewrite the spec. When `filter_by_label=true` but one embedding parquet is missing the `label` column, the container logs a warning and proceeds without filtering; if the mined output looks too large or contains cross-label pairs, scan the docker log for that warning first.

See `references/invocation.md` for the complete paste-and-edit recipe that runs all three steps as one streamed Bash block with row-count sanity prints.
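To make the mining semantics concrete, here is a minimal pure-Python sketch of top-n cosine nearest-neighbour mining with deduplication across targets. It is not the container's implementation (which, per Setup, relies on a GPU cuML/cuDF k-NN search); the function names and toy vectors are hypothetical and only illustrate why the mined count can be smaller than `topn` × N_targets.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mine_nearest(targets, sources, topn):
    """Return unique source filepaths among each target's topn neighbours.

    targets: list of embedding vectors
    sources: list of (filepath, embedding) pairs
    """
    mined = []
    seen = set()
    for target in targets:
        ranked = sorted(sources, key=lambda s: cosine_distance(target, s[1]))
        for filepath, _ in ranked[:topn]:
            if filepath not in seen:  # deduplicate across targets
                seen.add(filepath)
                mined.append(filepath)
    return mined


# Toy data (hypothetical): two near-identical targets share neighbours.
sources = [
    ("a.png", [1.0, 0.0]),
    ("b.png", [0.9, 0.1]),
    ("c.png", [0.0, 1.0]),
]
targets = [[1.0, 0.05], [0.95, 0.0]]
mined = mine_nearest(targets, sources, topn=2)
print(mined)  # ['a.png', 'b.png']
```

Both targets pick the same two neighbours, so four target-source pairs collapse to two unique paths.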
Outputs
Write everything into a timestamped folder under the experiment / iteration directory. The packaging hook will add `mining_config/` and `claude_session.jsonl` automatically when `Mining_Report.md` is written.

```
<output_dir>/mining_results/YYYY-MM-DD_HHMMSS/
├── Mining_Report.md              # Full mining report
├── embedding_spec.yaml           # The -e spec used for Steps 1 and 2
├── mining_spec.yaml              # The -e spec used for Step 3
├── target_embeddings.parquet     # Step 1 output (filepath, embedding, + carried metadata)
├── source_embeddings.parquet     # Step 2 output (filepath, embedding, + carried metadata)
├── mined.parquet                 # Step 3 output: unique mined source filepaths
├── mining_summary.txt            # Auto-emitted next to mined.parquet by the container
├── mining_config/                # Auto-copied by hook
└── claude_session.jsonl          # Auto-copied by hook
```

At the start of the run, get the real timestamp by running `date +%Y-%m-%d_%H%M%S` in Bash. Do NOT hardcode or guess. If the user specifies a custom output path, use it directly but maintain the same internal layout.

The mined parquet is the artifact downstream training consumes. The two embedding parquets are intermediate but worth retaining: they are reusable across multiple mining runs against the same source pool, and they are the only place to look when a "looks unrelated" report needs encoder-level debugging.
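If the run is driven from a Python script rather than plain Bash, the same layout can be sketched as follows. `time.strftime` takes the same format string as the `date` command above and reads the real clock; the `make_run_dir` helper name is hypothetical.

```python
import time
from pathlib import Path


def make_run_dir(output_dir):
    """Create <output_dir>/mining_results/<timestamp>/ and return it."""
    # Same format string as `date +%Y-%m-%d_%H%M%S`, read from the clock.
    timestamp = time.strftime("%Y-%m-%d_%H%M%S")
    run_dir = Path(output_dir) / "mining_results" / timestamp
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

The returned path is where the spec files, parquets, and `Mining_Report.md` would be written.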
Common pitfalls
The single most common cause of garbage output is mismatched encoders: both embedding steps must consume the same `embedding_spec.yaml`, and any `model` / `model_path` / `batch_size` override must apply to both steps or neither. Other frequent issues: skipping an embedding step, a missing `label` column under `filter_by_label=true` (silent no-op), spec files outside `$WORKSPACE`, unresolved `???` sentinels, TAO checkpoints without `model_config_path`, CSV pools not converted to parquet, host/container path mismatches, no GPU, the wrong image tag, and `topn` × N_targets exceeding the source size (expected, not a bug; report the actual mined count).

See `references/troubleshooting.md` for the full diagnosis and fix for each of these.
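Two of the pitfalls above lend themselves to a quick pre-flight check: unresolved `???` sentinels in a spec file, and the upper bound on the mined count. The sketch below uses only the standard library; both helper names are hypothetical, and the comment stripping is deliberately naive (it would misread a `#` inside a quoted value).

```python
from pathlib import Path


def find_unresolved_sentinels(spec_path):
    """Return (line_number, text) for spec lines still holding '???'."""
    hits = []
    lines = Path(spec_path).read_text().splitlines()
    for number, line in enumerate(lines, 1):
        content = line.split("#", 1)[0]  # ignore trailing YAML comments
        if "???" in content:
            hits.append((number, line.strip()))
    return hits


def max_mined(topn, n_targets, n_sources):
    """Upper bound on unique mined rows: capped by the source pool size."""
    return min(topn * n_targets, n_sources)
```

A mined count below `max_mined(...)` is normal once neighbours repeat across targets, so the report should state the actual count rather than this bound.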
Report Structure
Keep the report tight (600–1200 words). Mining is a deterministic pipeline; the value lies in making the encoder choice, the row counts, and any silent filter no-ops auditable, not in narrative. The report has seven sections: Verdict, Inputs, Encoder Consistency, Mining Run, Per-Label Breakdown (skipped if the target parquet has no `label` column), Output Sanity, and Recommended Actions.

See `references/reporting_spec.md` for the complete fill-in report template with every section and field.
Execution Order
1. Resolve `DS_IMAGE` from `versions.yaml` (`images.tao_toolkit.data_services`), then run `docker info`, `nvidia-smi`, and `docker image inspect "$DS_IMAGE"` (pulling if missing) once to confirm the environment. Abort with a clear message if any fail.
2. Run `date +%Y-%m-%d_%H%M%S` to get the timestamp; create `<output_dir>/mining_results/<timestamp>/`.
3. Write `embedding_spec.yaml` and `mining_spec.yaml` into the timestamped dir, filling in the encoder choice and mining knobs. Keep these under `$WORKSPACE` so the `-e` path resolves inside the container.
4. If the source pool is a CSV, convert to parquet first (preserve `filepath` and `label`).
5. Run Step 1 (embed targets) via `docker run … embedding image_embeddings -e embedding_spec.yaml input_parquet=… output_parquet=…`. Print the output parquet's row count and columns to stdout.
6. Run Step 2 (embed source pool) with the identical `embedding_spec.yaml` as Step 1. Print output row count and columns.
7. Run Step 3 (mine nearest neighbours) via `docker run … tmm nearest_neighbors -e mining_spec.yaml source_parquet=… target_parquet=… output_parquet=…`. Confirm `mining_summary.txt` was written next to `mined.parquet`.
8. Compute the per-label breakdown (Section 5) by joining the target embeddings parquet with the mined output on `filepath`, if both carry `label`.
9. Write `Mining_Report.md` last; writing it triggers the packaging hook, which copies session logs and skill config alongside.
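The first item of the list above can be partly scripted. The sketch below only checks that the `docker` and `nvidia-smi` executables are on `PATH`; it does not confirm that the Docker daemon is running, that a GPU is usable, or that the image is present, so the commands from Setup are still needed. The helper name and the injectable `which` parameter are assumptions made for testability.

```python
import shutil

REQUIRED_TOOLS = ("docker", "nvidia-smi")


def missing_tools(which=shutil.which):
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]
```

An empty list means both tools were found; a non-empty list is a reason to abort early with a message that names the missing tools.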