tao-mine-aoi-images

DEFT Mining and Embedding Skill

You are the operator of the DEFT embed-then-mine workflow for VCN AOI. Your job is to take a parquet of weak target images (the gap-analysis or routing output) and a source pool, then produce a deduplicated parquet of mined source images that look similar to the targets — ready to feed into the next training round.

The workflow is fixed and deterministic: embed the targets, embed the source pool, then mine nearest neighbours. Each step's output parquet is the next step's input. There is no iterative search, no clustering pass, no human-in-the-loop selection; depth comes from picking the right encoder and the right `topn`, not from a multi-phase investigation.

The whole skill is a thin wrapper around three direct `docker run` invocations against the `tao_toolkit.data_services` image declared in `versions.yaml` (resolved at runtime; see Setup). The container's entrypoint takes `<category> <action> -e <spec.yaml> [hydra overrides...]`: `embedding image_embeddings -e <embedding_spec.yaml> …` for embedding and `tmm nearest_neighbors -e <mining_spec.yaml> …` for mining. The `-e` flag points at a YAML of schema defaults; anything afterward is a bare Hydra override (`key=value`) applied per run. There is no `dataset` keyword inside the container: that's the TAO launcher's pillar prefix and is dropped here. Schema keys can rename between data-services releases, so when in doubt introspect once per image with `docker run --rm "$DS_IMAGE" embedding image_embeddings --cfg=job` and `... tmm nearest_neighbors --cfg=job`. See `references/invocation.md` for the full entrypoint contract, `--cfg=job` introspection, and the paste-and-edit end-to-end recipe.

Inputs

- **Target parquet**: the gap-analysis output, typically `mining_gaps.parquet` from `tao-route-visual-changenet-samples` (or `gaps.parquet` from `tao-analyze-gaps-visual-changenet` if routing was skipped). Required column: `filepath`. If `label` is also present, label-aware filtering during mining is available; otherwise the mining task skips the filter without raising an error (it logs only a warning).
- **Source pool**: a parquet of candidate images to mine against, with a `filepath` column. If the user only has a CSV, convert it to a parquet with the same columns before Step 2. For label-aware filtering, the pool must also carry a `label` column.
- **Embedding spec file**: a YAML containing `model`, `model_path`, `batch_size`, and (only when `model_path` is a TAO `.pth` / `.ckpt`) `model_config_path`. Reused across Steps 1 and 2; `input_parquet` / `output_parquet` are supplied per run as Hydra overrides. The same spec MUST drive both embedding steps: embeddings from different encoders are not comparable, and mismatched encoders are the most common cause of "the mined images look unrelated" reports.
- **Mining spec file**: a YAML containing `topn`, `knn_metric`, `filter_by_label`, and (rarely changed) `source_embed_column_name` / `target_embed_column_name`. `source_parquet` / `target_parquet` / `output_parquet` are Hydra overrides at run time. SigLIP and CLIP embeddings should use `knn_metric: cosine`. When `filter_by_label: true` but either embedding parquet lacks a `label` column, the container logs a warning and proceeds without filtering.

Setup

The mining and embedding tasks live inside the `tao_toolkit.data_services` image declared in `versions.yaml`. Resolve the concrete URI once at the top of the run, then confirm Docker, the NVIDIA container toolkit, and a GPU are present before anything else:

```bash
# Resolve tao_toolkit.data_services → concrete nvcr.io/... URI from versions.yaml
DS_IMAGE=$(python3 -c "import yaml,os; print(yaml.safe_load(open(os.environ['TAO_SKILL_BANK_PATH']+'/versions.yaml'))['images']['tao_toolkit']['data_services'])")
echo "DS_IMAGE=$DS_IMAGE"

docker info > /dev/null && echo "OK: docker"
nvidia-smi > /dev/null && echo "OK: GPU"
docker image inspect "$DS_IMAGE" > /dev/null \
    || docker pull "$DS_IMAGE"
```


`TAO_SKILL_BANK_PATH` is exported by the plugin's `session_start` hook. If it is unset (e.g. running outside the Claude Code plugin), point it at the skill-bank repo root before resolving. A GPU is required for both the encoder forward pass and the cuML/cuDF k-NN search; both steps will fail without CUDA.
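
The unset-variable case and the environment checks can be made explicit with a small guard. What follows is a minimal sketch, not part of the skill: `require_env` and `require_cmd` are hypothetical helper names, and the error messages are illustrative.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the skill): fail early with a clear message.

# require_env NAME: succeed only if the variable NAME is set and non-empty.
# NAME must be a trusted identifier, since it is passed through eval.
require_env() {
    eval "val=\${$1:-}"
    if [ -z "$val" ]; then
        echo "ERROR: $1 is not set" >&2
        return 1
    fi
}

# require_cmd NAME: succeed only if NAME resolves to a command on PATH.
require_cmd() {
    if ! command -v "$1" > /dev/null 2>&1; then
        echo "ERROR: required command '$1' not found on PATH" >&2
        return 1
    fi
}
```

A run script could then start with `require_env TAO_SKILL_BANK_PATH && require_cmd docker && require_cmd nvidia-smi || exit 1` before resolving `DS_IMAGE`. These checks only confirm that the tools are installed; `docker info` and `nvidia-smi` still have to succeed for the environment to be usable.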

**Path mounting.** Every host path the container reads or writes — input parquets, output dirs, and the source-pool image root — must be bind-mounted. The simplest, most predictable approach mounts the workspace root with **identical paths** inside and outside the container so absolute paths in the parquet args resolve the same way on both sides:

```bash
WORKSPACE=<absolute path that contains all parquets, outputs, and the source-pool images>
DOCKER="docker run --gpus all --rm --ipc=host --user $(id -u):$(id -g) -v $WORKSPACE:$WORKSPACE -w $WORKSPACE $DS_IMAGE"
```

Reuse `$DOCKER` for the three invocations below.

**CSV source pool.** If the source pool is provided only as a CSV, convert it to a parquet up front with `pd.read_csv(...).to_parquet(..., index=False)`, preserving the `filepath` column verbatim (and `label` if present). Do not add a path prefix: the container reads input parquets as-is and the `$WORKSPACE` mount keeps host and container paths identical.
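
A conversion wrapper can refuse to run when the CSV has no `filepath` column, which catches the problem before Step 2 rather than inside the container. The sketch below is an assumption-laden illustration: `csv_has_column` and `csv_to_parquet` are hypothetical names, the header check assumes a plain comma-separated header with no quoted fields, and the conversion needs `python3` with pandas and a parquet engine such as pyarrow on the host.

```shell
#!/bin/sh
# csv_has_column FILE COLUMN: succeed if COLUMN appears in the CSV header row.
# Assumes a simple comma-separated header without quoted or embedded commas.
csv_has_column() {
    head -n 1 "$1" | tr -d '\r' | tr ',' '\n' | grep -Fqx "$2"
}

# csv_to_parquet IN.csv OUT.parquet: convert only after the header check passes.
# Columns are written unchanged; no path prefix is added to filepath.
csv_to_parquet() {
    csv_has_column "$1" filepath || {
        echo "ERROR: $1 has no 'filepath' column" >&2
        return 1
    }
    python3 -c 'import sys, pandas as pd; pd.read_csv(sys.argv[1]).to_parquet(sys.argv[2], index=False)' "$1" "$2"
}
```

The same `csv_has_column` check can be used to decide whether label-aware filtering is even possible for the pool, for example `csv_has_column pool.csv label || echo "no label column"`.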

Author the two spec files once per iteration. Both files live under `$WORKSPACE` so the `-e` argument resolves on both sides of the mount. Per-run values stay out of the spec and are passed as Hydra overrides at invocation time. The defaults are `model: SigLIP`, `model_path: google/siglip-base-patch16-224`, and `batch_size: 64` for embedding, and `topn: 5`, `knn_metric: cosine`, and `filter_by_label: "false"` (quoted, because the schema reads it as a string) for mining. Use `cosine` for SigLIP/CLIP, `euclidean` or `manhattan` otherwise; add `model_config_path` only when `model_path` is a TAO checkpoint. Any field can still be overridden inline at the CLI (e.g. `topn=10`); Hydra applies CLI overrides on top of the spec.
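
As an illustration of the defaults listed above, the two spec files could be written with heredocs. This is a sketch under assumptions: `SPEC_DIR` is a hypothetical variable (in a real run it would be a directory under `$WORKSPACE`), and the flat key layout simply mirrors the defaults quoted in this section; the actual schema may expect additional or differently nested keys.

```shell
#!/bin/sh
# Sketch only: SPEC_DIR is a hypothetical variable. It falls back to a temp
# directory here so the snippet can run anywhere without touching real data.
SPEC_DIR=${SPEC_DIR:-$(mktemp -d)}

# Embedding spec: shared by Steps 1 and 2.
cat > "$SPEC_DIR/embedding_spec.yaml" <<'EOF'
model: SigLIP
model_path: google/siglip-base-patch16-224
batch_size: 64
EOF

# Mining spec: filter_by_label is quoted because the schema reads it as a string.
cat > "$SPEC_DIR/mining_spec.yaml" <<'EOF'
topn: 5
knn_metric: cosine
filter_by_label: "false"
EOF

echo "specs written to $SPEC_DIR"
```

Running `--cfg=job` against the image is the way to confirm the key names before relying on a hand-written spec like this one.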

See `references/invocation.md` for the verbatim spec-file templates, the CSV conversion snippet, and the full mounting and image-resolution detail.

Method

Three commands, in order. Each command's output parquet is the next command's input. Run them as plain Bash; the `$DOCKER` alias from Setup handles the container, GPU, and mounts. Every invocation follows the same shape: `-e <spec>` for the baked-in defaults, then a handful of Hydra overrides for run-specific paths.

Step 1 — Embed the target images

```bash
$DOCKER embedding image_embeddings -e <embedding_spec.yaml> \
    input_parquet=<target_parquet> output_parquet=<target_embeddings_parquet>
```

Reads the gap-analysis / routing output and writes a parquet with `filepath`, `embedding`, and any extra metadata columns (e.g. `label`, `siamese_score`, `weakness`) carried forward verbatim. Print the output schema (`pd.read_parquet(...).columns`) to stdout so the script-check hook can confirm the embedding column exists. To override `model`, `model_path`, or `batch_size` for one run without editing the spec, append them as Hydra overrides.

Step 2 — Embed the source pool

```bash
$DOCKER embedding image_embeddings -e <embedding_spec.yaml> \
    input_parquet=<source_pool_parquet> output_parquet=<source_embeddings_parquet>
```

Same command shape as Step 1, applied to the source pool. Use the identical `embedding_spec.yaml` as Step 1, and do not override `model`, `model_path`, or `batch_size` differently here; mismatched encoder configs across the two steps produce non-comparable embeddings.
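
One lightweight way to notice an accidental spec edit between Step 1 and Step 2 is to record a checksum of the spec file and compare it before the second run. The helpers below are hypothetical (`spec_snapshot` and `spec_unchanged` are not part of the skill) and use the POSIX `cksum` utility; they are a sketch of the idea, not an enforcement mechanism.

```shell
#!/bin/sh
# Hypothetical helpers: record a checksum of the embedding spec before Step 1
# and verify it is unchanged before Step 2.

# spec_snapshot SPEC: write SPEC's checksum to SPEC.cksum.
spec_snapshot() {
    cksum < "$1" > "$1.cksum"
}

# spec_unchanged SPEC: succeed only if SPEC still matches its recorded checksum.
spec_unchanged() {
    [ -f "$1.cksum" ] && [ "$(cksum < "$1")" = "$(cat "$1.cksum")" ]
}
```

Call `spec_snapshot embedding_spec.yaml` just before Step 1 and `spec_unchanged embedding_spec.yaml || exit 1` just before Step 2. The check covers only the spec file: inline Hydra overrides passed to one step but not the other are not detected and still have to be kept identical by hand.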

Step 3 — Mine nearest neighbours

```bash
$DOCKER tmm nearest_neighbors -e <mining_spec.yaml> \
    source_parquet=<source_embeddings_parquet> \
    target_parquet=<target_embeddings_parquet> output_parquet=<mined_parquet>
```

For each target embedding, finds the `topn` closest source embeddings under the chosen metric, deduplicates across targets, and writes a single-column (`filepath`) parquet of unique mined source paths. The container also drops a `mining_summary.txt` next to the output parquet with: query count, neighbour count, duplicates removed, and (when label filtering is on) kept-vs-dropped pair counts. Tweak `topn`, `knn_metric`, or `filter_by_label` via inline Hydra override when sweeping; there is no need to rewrite the spec. When `filter_by_label=true` but one embedding parquet is missing the `label` column, the container logs a warning and proceeds without filtering; if the mined output looks too large or contains cross-label pairs, scan the docker log for that warning first.

See `references/invocation.md` for the complete paste-and-edit recipe that runs all three steps as one streamed Bash block with row-count sanity prints.
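
A `topn` sweep via inline overrides might look like the dry-run sketch below, which prints one mining command per value instead of executing it. The function name `mine_sweep_cmds`, the `topn` values, and the `<sweep_dir>/topn_N/` layout are illustrative assumptions; the `<...>` placeholders are left for the operator to fill in.

```shell
#!/bin/sh
# Dry-run sketch: print one mining command per topn value instead of running it.
# Each run gets its own output directory so that the mining_summary.txt written
# next to each mined.parquet is not shared between runs.
mine_sweep_cmds() {
    for n in "$@"; do
        printf '%s\n' "\$DOCKER tmm nearest_neighbors -e <mining_spec.yaml> source_parquet=<source_embeddings_parquet> target_parquet=<target_embeddings_parquet> output_parquet=<sweep_dir>/topn_${n}/mined.parquet topn=${n}"
    done
}

mine_sweep_cmds 5 10 20
```

Whether the container creates missing output directories is not confirmed here, so creating each `topn_N` directory beforehand is the safer choice. The embedding parquets are reused unchanged across the sweep; only Step 3 is repeated.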

Outputs

Write everything into a timestamped folder under the experiment / iteration directory. The packaging hook will add `mining_config/` and `claude_session.jsonl` automatically when `Mining_Report.md` is written.

```
<output_dir>/mining_results/YYYY-MM-DD_HHMMSS/
├── Mining_Report.md            # Full mining report
├── embedding_spec.yaml         # The -e spec used for Steps 1 and 2
├── mining_spec.yaml            # The -e spec used for Step 3
├── target_embeddings.parquet   # Step 1 output (filepath, embedding, + carried metadata)
├── source_embeddings.parquet   # Step 2 output (filepath, embedding, + carried metadata)
├── mined.parquet               # Step 3 output: unique mined source filepaths
├── mining_summary.txt          # Auto-emitted next to mined.parquet by the container
├── mining_config/              # Auto-copied by hook
└── claude_session.jsonl        # Auto-copied by hook
```

At the start of the run, get the real timestamp by running `date +%Y-%m-%d_%H%M%S` in Bash. Do NOT hardcode or guess. If the user specifies a custom output path, use it directly but maintain the same internal layout.
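
Creating the run directory from a single clock read could be sketched as follows. `OUTPUT_DIR`, `TIMESTAMP`, and `RUN_DIR` are hypothetical variable names, and the temp-directory fallback exists only so the snippet runs without a real experiment directory.

```shell
#!/bin/sh
# OUTPUT_DIR is a placeholder for the experiment / iteration directory.
OUTPUT_DIR=${OUTPUT_DIR:-$(mktemp -d)}

# Read the clock once and reuse the value, so every artifact lands in one folder.
TIMESTAMP=$(date +%Y-%m-%d_%H%M%S)
RUN_DIR="$OUTPUT_DIR/mining_results/$TIMESTAMP"
mkdir -p "$RUN_DIR"
echo "RUN_DIR=$RUN_DIR"
```

Storing the timestamp in a variable matters because calling `date` again later in the run would yield a different folder name.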

The mined parquet is the artifact downstream training consumes. The two embedding parquets are intermediate but worth retaining: they are reusable across multiple mining runs against the same source pool, and they are the only place to look when a "looks unrelated" report needs encoder-level debugging.

Common pitfalls

The single most common cause of garbage output is mismatched encoders: both embedding steps must consume the same `embedding_spec.yaml`, and any `model`, `model_path`, or `batch_size` override must apply to both steps or neither. Other frequent issues: skipping an embedding step, a missing `label` column under `filter_by_label=true` (silent no-op), spec files outside `$WORKSPACE`, unresolved `???` sentinels, TAO checkpoints without `model_config_path`, CSV pools not converted to parquet, host/container path mismatches, no GPU, the wrong image tag, and `topn` × N_targets exceeding the source size (expected, not a bug; report the actual mined count).

See `references/troubleshooting.md` for the full diagnosis and fix for each of these.
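
The last pitfall is simple arithmetic: the mined output can never hold more unique paths than the smaller of `topn` × N_targets and the source-pool size. A small sketch, with `mined_upper_bound` as a hypothetical helper name and made-up example numbers:

```shell
#!/bin/sh
# mined_upper_bound TOPN N_TARGETS N_SOURCE
# Prints min(TOPN * N_TARGETS, N_SOURCE): the largest possible number of unique
# mined paths. Deduplication and label filtering can only lower the real count.
mined_upper_bound() {
    pairs=$(( $1 * $2 ))
    if [ "$pairs" -gt "$3" ]; then
        echo "$3"
    else
        echo "$pairs"
    fi
}

# 5 neighbours x 400 targets = 2000 candidate pairs, but only 1500 source images.
mined_upper_bound 5 400 1500
```

Comparing this bound with the row count of `mined.parquet` gives a quick sanity check for the report: a count above the bound points to a real problem, while a count below it is normal.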

Report Structure

Keep the report tight (600–1200 words). Mining is a deterministic pipeline; the value is making the encoder choice, the row counts, and any silent filter no-ops auditable, not narrative. The report has seven sections: Verdict, Inputs, Encoder Consistency, Mining Run, Per-Label Breakdown (skipped if the target parquet has no `label` column), Output Sanity, and Recommended Actions.

See `references/reporting_spec.md` for the complete fill-in report template with every section and field.

Execution Order

1. Resolve `DS_IMAGE` from `versions.yaml` (`images.tao_toolkit.data_services`), then run `docker info`, `nvidia-smi`, and `docker image inspect "$DS_IMAGE"` (pulling if missing) once to confirm the environment. Abort with a clear message if any fail.
2. Run `date +%Y-%m-%d_%H%M%S` to get the timestamp; create `<output_dir>/mining_results/<timestamp>/`.
3. Write `embedding_spec.yaml` and `mining_spec.yaml` into the timestamped dir, filling in the encoder choice and mining knobs. Keep these under `$WORKSPACE` so the `-e` path resolves inside the container.
4. If the source pool is a CSV, convert to parquet first (preserve `filepath` and `label`).
5. Run Step 1 (embed targets) via `docker run … embedding image_embeddings -e embedding_spec.yaml input_parquet=… output_parquet=…`. Print the output parquet's row count and columns to stdout.
6. Run Step 2 (embed source pool) with the identical `embedding_spec.yaml` as Step 1. Print output row count and columns.
7. Run Step 3 (mine nearest neighbours) via `docker run … tmm nearest_neighbors -e mining_spec.yaml source_parquet=… target_parquet=… output_parquet=…`. Confirm `mining_summary.txt` was written next to `mined.parquet`.
8. Compute the per-label breakdown (Section 5) by joining the target embeddings parquet with the mined output on `filepath`, if both carry `label`.
9. Write `Mining_Report.md` last; writing it triggers the packaging hook, which copies session logs and skill config alongside.
