tao-mine-aoi-images

DEFT Mining and Embedding Skill

You are the operator of the DEFT embed-then-mine workflow for VCN AOI. Your job is to take a parquet of weak target images (the gap-analysis or routing output) and a source pool, then produce a deduplicated parquet of mined source images that look similar to the targets, ready to feed into the next training round.

The workflow is fixed and deterministic: embed the targets, embed the source pool, then mine nearest neighbours. Each step's output parquet is the next step's input. There is no iterative search, no clustering pass, and no human-in-the-loop selection; depth comes from picking the right encoder and the right `topn`, not from a multi-phase investigation.

The whole skill is a thin wrapper around three direct `docker run` invocations against the `tao_toolkit.data_services` image declared in `versions.yaml` (resolved at runtime; see Setup). The container's entrypoint takes `<category> <action> -e <spec.yaml> [hydra overrides...]`: `embedding image_embeddings -e <embedding_spec.yaml> …` for embedding and `tmm nearest_neighbors -e <mining_spec.yaml> …` for mining. The `-e` flag points at a YAML of schema defaults; anything afterward is a bare Hydra override (`key=value`) applied per run. There is no `dataset` keyword inside the container: that is the TAO launcher's pillar prefix and is dropped here. Schema keys can be renamed between data-services releases, so when in doubt introspect once per image with `docker run --rm "$DS_IMAGE" embedding image_embeddings --cfg=job` and `... tmm nearest_neighbors --cfg=job`. See `references/invocation.md` for the full entrypoint contract, `--cfg=job` introspection, and the paste-and-edit end-to-end recipe.
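
The entrypoint shape described above can be sketched as a small argument builder. This is only an illustration: `build_container_args` is a hypothetical helper, and the `/ws/...` paths are placeholders rather than paths the skill prescribes.

```python
import shlex


def build_container_args(category, action, spec_path, **overrides):
    """Return the arguments that follow the image name in `docker run`."""
    args = [category, action, "-e", spec_path]
    # Hydra overrides are bare key=value tokens appended after the spec.
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


# Placeholder paths, for illustration only.
embed_args = build_container_args(
    "embedding", "image_embeddings", "/ws/embedding_spec.yaml",
    input_parquet="/ws/mining_gaps.parquet",
    output_parquet="/ws/target_embeddings.parquet",
)
mine_args = build_container_args(
    "tmm", "nearest_neighbors", "/ws/mining_spec.yaml",
    source_parquet="/ws/source_embeddings.parquet",
    target_parquet="/ws/target_embeddings.parquet",
    output_parquet="/ws/mined.parquet",
    topn=10,
)

print(shlex.join(embed_args))
print(shlex.join(mine_args))
```

Note that neither argument list contains a `dataset` token; the category (`embedding` or `tmm`) comes first.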

Inputs

  1. Target parquet: the gap-analysis output, typically `mining_gaps.parquet` from `tao-route-visual-changenet-samples` (or `gaps.parquet` from `tao-analyze-gaps-visual-changenet` if routing was skipped). Required column: `filepath`. If `label` is also present, label-aware filtering during mining is available; otherwise the mining task silently no-ops the filter.
  2. Source pool: a parquet of candidate images to mine against, with a `filepath` column. If the user only has a CSV, convert it to a parquet with the same columns before Step 2. For label-aware filtering, the pool must also carry a `label` column.
  3. Embedding spec file: a YAML containing `model`, `model_path`, `batch_size`, and (only when `model_path` is a TAO `.pth`/`.ckpt`) `model_config_path`. Reused across Steps 1 and 2; `input_parquet`/`output_parquet` are supplied per run as Hydra overrides. The same spec MUST drive both embedding steps: embeddings from different encoders are not comparable, and mismatched encoders are the most common cause of "the mined images look unrelated" reports.
  4. Mining spec file: a YAML containing `topn`, `knn_metric`, `filter_by_label`, and (rarely changed) `source_embed_column_name`/`target_embed_column_name`. `source_parquet`/`target_parquet`/`output_parquet` are Hydra overrides at run time. SigLIP and CLIP embeddings should use `knn_metric: cosine`. When `filter_by_label: true` but either embedding parquet lacks a `label` column, the container logs a warning and proceeds without filtering.
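
The interaction between `filter_by_label` and the `label` column in the inputs above can be checked before any container run. The sketch below is a hypothetical pre-check, not part of the data-services image; it takes plain lists of column names, which could come from `pd.read_parquet(...).columns`.

```python
def label_filter_status(target_columns, source_columns, filter_by_label):
    """Classify how label filtering will behave for a target/source pair.

    Returns "off", "active", or "no-op". "no-op" is the case where the filter
    is requested but one side lacks `label`, so mining proceeds unfiltered.
    """
    if "filepath" not in target_columns or "filepath" not in source_columns:
        raise ValueError("both parquets need a 'filepath' column")
    if not filter_by_label:
        return "off"
    if "label" in target_columns and "label" in source_columns:
        return "active"
    return "no-op"


print(label_filter_status(["filepath", "label"], ["filepath"], True))
```

Surfacing the "no-op" case early makes it easier to record in the report instead of discovering it in the docker log afterwards.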

Setup

The mining and embedding tasks live inside the `tao_toolkit.data_services` image declared in `versions.yaml`. Resolve the concrete URI once at the top of the run, then confirm Docker, the NVIDIA container toolkit, and a GPU are present before anything else:

```bash
# Resolve tao_toolkit.data_services → concrete nvcr.io/... URI from versions.yaml
DS_IMAGE=$(python3 -c "import yaml,os; print(yaml.safe_load(open(os.environ['TAO_SKILL_BANK_PATH']+'/versions.yaml'))['images']['tao_toolkit']['data_services'])")
echo "DS_IMAGE=$DS_IMAGE"

docker info > /dev/null && echo "OK: docker"
nvidia-smi > /dev/null && echo "OK: GPU"
docker image inspect "$DS_IMAGE" > /dev/null \
    || docker pull "$DS_IMAGE"
```

`TAO_SKILL_BANK_PATH` is exported by the plugin's `session_start` hook. If it is unset (e.g. running outside the Claude Code plugin), point it at the skill-bank repo root before resolving. A GPU is required for both the encoder forward pass and the cuML/cuDF k-NN search; both steps will fail without CUDA.

**Path mounting.** Every host path the container reads or writes — input parquets, output dirs, and the source-pool image root — must be bind-mounted. The simplest, most predictable approach mounts the workspace root with **identical paths** inside and outside the container so absolute paths in the parquet args resolve the same way on both sides:

```bash
WORKSPACE=<absolute path that contains all parquets, outputs, and the source-pool images>
DOCKER="docker run --gpus all --rm --ipc=host --user $(id -u):$(id -g) -v $WORKSPACE:$WORKSPACE -w $WORKSPACE $DS_IMAGE"
```

Reuse `$DOCKER` for the three invocations below.

**CSV source pool.** If the source pool is provided only as a CSV, convert it to a parquet up front with `pd.read_csv(...).to_parquet(..., index=False)`, preserving the `filepath` column verbatim (and `label` if present). Do not add a path prefix: the container reads input parquets as-is and the `$WORKSPACE` mount keeps host and container paths identical.

**Author the two spec files once per iteration.** Both files live under `$WORKSPACE` so the `-e` argument resolves on both sides of the mount. Per-run values stay out of the spec and are passed as Hydra overrides at invocation time. The defaults are `model: SigLIP`, `model_path: google/siglip-base-patch16-224`, `batch_size: 64` for embedding, and `topn: 5`, `knn_metric: cosine`, `filter_by_label: "false"` (quoted, because the schema reads it as a string) for mining. Use `cosine` for SigLIP/CLIP, `euclidean`/`manhattan` otherwise; add `model_config_path` only when `model_path` is a TAO checkpoint. Any field can still be overridden inline at the CLI (e.g. `topn=10`); Hydra applies CLI overrides on top of the spec.

See `references/invocation.md` for the verbatim spec-file templates, the CSV conversion snippet, and the full mounting and image-resolution detail.
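
As a rough illustration of the defaults listed in Setup, the two spec files could be written as below. The flat key layout is an assumption made for this sketch; the verbatim templates live in `references/invocation.md`, and the real schema may nest or rename keys (check with `--cfg=job`). `write_specs` is a hypothetical helper.

```python
from pathlib import Path

# ASSUMPTION: flat top-level keys. Only the key names and default values are
# taken from the text; the exact layout should come from the reference templates.
EMBEDDING_SPEC = """\
model: SigLIP
model_path: google/siglip-base-patch16-224
batch_size: 64
"""

MINING_SPEC = """\
topn: 5
knn_metric: cosine
filter_by_label: "false"
"""


def write_specs(spec_dir):
    """Write both spec files into spec_dir and return their paths."""
    spec_dir = Path(spec_dir)
    spec_dir.mkdir(parents=True, exist_ok=True)
    embedding_path = spec_dir / "embedding_spec.yaml"
    mining_path = spec_dir / "mining_spec.yaml"
    embedding_path.write_text(EMBEDDING_SPEC)
    mining_path.write_text(MINING_SPEC)
    return embedding_path, mining_path
```

Calling `write_specs` with a directory under `$WORKSPACE` keeps the `-e` paths resolvable inside the container. `filter_by_label` stays quoted so it is written as a string.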

Method

Three commands, in order. Each command's output parquet is the next command's input. Run them as plain Bash; the `$DOCKER` variable from Setup handles the container, GPU, and mounts. Every invocation follows the same shape: `-e <spec>` for the baked-in defaults, then a handful of Hydra overrides for run-specific paths.

Step 1 — Embed the target images

```bash
$DOCKER embedding image_embeddings -e <embedding_spec.yaml> \
    input_parquet=<target_parquet> output_parquet=<target_embeddings_parquet>
```

Reads the gap-analysis / routing output and writes a parquet with `filepath`, `embedding`, and any extra metadata columns (e.g. `label`, `siamese_score`, `weakness`) carried forward verbatim. Print the output schema (`pd.read_parquet(...).columns`) to stdout so the script-check hook can confirm the embedding column exists. To override `model`/`model_path`/`batch_size` for one run without editing the spec, append them as Hydra overrides.
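
The schema print mentioned above might look like the following sketch. `missing_columns` and `print_schema` are hypothetical helpers, and the column name `embedding` is taken from the description here; if `source_embed_column_name`/`target_embed_column_name` were changed in the mining spec, the required names would differ.

```python
def missing_columns(columns, required=("filepath", "embedding")):
    """Return the required column names that are absent, in declared order."""
    present = set(columns)
    return [name for name in required if name not in present]


def print_schema(parquet_path):
    """Print row count and columns of an embeddings parquet, then validate."""
    # pandas (plus a parquet engine such as pyarrow) is imported lazily so the
    # pure check above stays usable without it.
    import pandas as pd

    frame = pd.read_parquet(parquet_path)
    columns = list(frame.columns)
    print(f"rows={len(frame)} columns={columns}")
    missing = missing_columns(columns)
    if missing:
        raise SystemExit(f"{parquet_path}: missing columns {missing}")
```

The same check applies unchanged to the Step 2 output.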

Step 2 — Embed the source pool

```bash
$DOCKER embedding image_embeddings -e <embedding_spec.yaml> \
    input_parquet=<source_pool_parquet> output_parquet=<source_embeddings_parquet>
```

Same command shape as Step 1, applied to the source pool. Use the identical `embedding_spec.yaml` as Step 1, and do not override `model`/`model_path`/`batch_size` differently here; mismatched encoder configs across the two steps produce non-comparable embeddings.

Step 3 — Mine nearest neighbours

```bash
$DOCKER tmm nearest_neighbors -e <mining_spec.yaml> \
    source_parquet=<source_embeddings_parquet> \
    target_parquet=<target_embeddings_parquet> output_parquet=<mined_parquet>
```

For each target embedding, finds the `topn` closest source embeddings under the chosen metric, deduplicates across targets, and writes a single-column (`filepath`) parquet of unique mined source paths. The container also drops a `mining_summary.txt` next to the output parquet with: query count, neighbour count, duplicates removed, and (when label filtering is on) kept-vs-dropped pair counts. Tweak `topn`, `knn_metric`, or `filter_by_label` via inline Hydra override when sweeping; there is no need to rewrite the spec. When `filter_by_label=true` but one embedding parquet is missing the `label` column, the container logs a warning and proceeds without filtering; if the mined output looks too large or contains cross-label pairs, scan the docker log for that warning first.

See `references/invocation.md` for the complete paste-and-edit recipe that runs all three steps as one streamed Bash block with row-count sanity prints.
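
To make the mining semantics concrete, here is a small pure-Python sketch of top-`topn` cosine nearest neighbours followed by deduplication across targets. It is not the container's implementation (which runs on cuML/cuDF), it ignores label filtering, and the output ordering is an arbitrary choice of this sketch. The file names and vectors are made up.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mine(targets, sources, topn):
    """targets/sources are lists of (filepath, embedding) pairs.

    Returns unique source filepaths, in first-seen order.
    """
    mined, seen = [], set()
    for _, target_vec in targets:
        ranked = sorted(sources, key=lambda s: cosine_distance(target_vec, s[1]))
        for path, _ in ranked[:topn]:
            if path not in seen:  # dedupe across targets
                seen.add(path)
                mined.append(path)
    return mined


targets = [("t1.png", [1.0, 0.0]), ("t2.png", [0.9, 0.1])]
sources = [
    ("s1.png", [1.0, 0.05]),
    ("s2.png", [0.0, 1.0]),
    ("s3.png", [0.8, 0.2]),
]
print(mine(targets, sources, topn=2))  # → ['s1.png', 's3.png']
```

Both targets pick the same two neighbours, so four (target, neighbour) pairs collapse to two unique paths. This is why the mined row count is usually smaller than `topn` × the number of targets.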

Outputs

Write everything into a timestamped folder under the experiment / iteration directory. The packaging hook will add `mining_config/` and `claude_session.jsonl` automatically when `Mining_Report.md` is written.

```
<output_dir>/mining_results/YYYY-MM-DD_HHMMSS/
├── Mining_Report.md            # Full mining report
├── embedding_spec.yaml         # The -e spec used for Steps 1 and 2
├── mining_spec.yaml            # The -e spec used for Step 3
├── target_embeddings.parquet   # Step 1 output (filepath, embedding, + carried metadata)
├── source_embeddings.parquet   # Step 2 output (filepath, embedding, + carried metadata)
├── mined.parquet               # Step 3 output: unique mined source filepaths
├── mining_summary.txt          # Auto-emitted next to mined.parquet by the container
├── mining_config/              # Auto-copied by hook
└── claude_session.jsonl        # Auto-copied by hook
```

At the start of the run, get the real timestamp by running `date +%Y-%m-%d_%H%M%S` in Bash. Do NOT hardcode or guess. If the user specifies a custom output path, use it directly but maintain the same internal layout.

The mined parquet is the artifact downstream training consumes. The two embedding parquets are intermediate but worth retaining: they are reusable across multiple mining runs against the same source pool, and they are the only place to look when a "looks unrelated" report needs encoder-level debugging.
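
If the run is driven from a script rather than an interactive shell, the timestamped directory could be created as in this sketch. It assumes Python's `strftime` with the same format string is an acceptable stand-in for the `date` command; `make_run_dir` is a hypothetical helper, and its `now` parameter exists only so the function can be tested.

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(output_dir, now=None):
    """Create <output_dir>/mining_results/YYYY-MM-DD_HHMMSS/ and return it."""
    # Same format string as `date +%Y-%m-%d_%H%M%S`; read from the system clock
    # unless a datetime is injected for testing.
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H%M%S")
    run_dir = Path(output_dir) / "mining_results" / stamp
    # exist_ok=False: a collision means two runs share a timestamp, which
    # should be surfaced rather than silently merged.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```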

Common pitfalls

The single most common cause of garbage output is mismatched encoders: both embedding steps must consume the same `embedding_spec.yaml`, and any `model`/`model_path`/`batch_size` override must apply to both steps or neither. Other frequent issues: skipping an embedding step, a missing `label` column under `filter_by_label=true` (silent no-op), spec files outside `$WORKSPACE`, unresolved `???` sentinels, TAO checkpoints without `model_config_path`, CSV pools not converted to parquet, host/container path mismatches, no GPU, the wrong image tag, and `topn` × N_targets exceeding the source size (expected, not a bug; report the actual mined count).

See `references/troubleshooting.md` for the full diagnosis and fix for each of these.
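
The `topn` × N_targets point is simple arithmetic, sketched below. The bounds assume unique source filepaths and no label filtering (with filtering on, the lower bound can drop to zero); `mined_count_bounds` is a hypothetical helper for sanity-checking a report.

```python
def mined_count_bounds(topn, n_targets, n_source):
    """Return (lower, upper) bounds on the number of unique mined paths."""
    if n_targets == 0 or n_source == 0:
        return (0, 0)
    lower = min(topn, n_source)              # every target picks the same neighbours
    upper = min(topn * n_targets, n_source)  # no overlap, capped by the pool size
    return (lower, upper)


# 400 targets with topn=5 request 2000 pairs, but a 1500-image pool caps the
# unique output at 1500.
print(mined_count_bounds(5, 400, 1500))  # → (5, 1500)
```

A mined count anywhere inside these bounds is plausible; a count outside them points at a wrong input parquet or an unexpected filter.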

Report Structure

Keep the report tight (600–1200 words). Mining is a deterministic pipeline; the value is in making the encoder choice, the row counts, and any silent filter no-ops auditable, not in narrative. The report has seven sections: Verdict, Inputs, Encoder Consistency, Mining Run, Per-Label Breakdown (skipped if the target parquet has no `label` column), Output Sanity, and Recommended Actions.

See `references/reporting_spec.md` for the complete fill-in report template with every section and field.

Execution Order

  1. Resolve `DS_IMAGE` from `versions.yaml` (`images.tao_toolkit.data_services`), then run `docker info`, `nvidia-smi`, and `docker image inspect "$DS_IMAGE"` (pulling if missing) once to confirm the environment. Abort with a clear message if any fail.
  2. Run `date +%Y-%m-%d_%H%M%S` to get the timestamp; create `<output_dir>/mining_results/<timestamp>/`.
  3. Write `embedding_spec.yaml` and `mining_spec.yaml` into the timestamped dir, filling in the encoder choice and mining knobs. Keep these under `$WORKSPACE` so the `-e` path resolves inside the container.
  4. If the source pool is a CSV, convert to parquet first (preserve `filepath` and `label`).
  5. Run Step 1 (embed targets) via `docker run … embedding image_embeddings -e embedding_spec.yaml input_parquet=… output_parquet=…`. Print the output parquet's row count and columns to stdout.
  6. Run Step 2 (embed source pool) with the identical `embedding_spec.yaml` as Step 1. Print output row count and columns.
  7. Run Step 3 (mine nearest neighbours) via `docker run … tmm nearest_neighbors -e mining_spec.yaml source_parquet=… target_parquet=… output_parquet=…`. Confirm `mining_summary.txt` was written next to `mined.parquet`.
  8. Compute the per-label breakdown (Section 5) by joining the target embeddings parquet with the mined output on `filepath`, if both carry `label`.
  9. Write `Mining_Report.md` last; writing it triggers the packaging hook, which copies session logs and skill config alongside.
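
The "abort with a clear message" requirement in item 1 could be sketched as follows. This is a hypothetical helper, not part of the skill; it only covers `docker info` and `nvidia-smi`, and the `run` parameter is injectable so the logic can be exercised without Docker or a GPU.

```python
import subprocess

# Environment checks from item 1 that need no image URI.
CHECKS = [
    ("docker", ["docker", "info"]),
    ("GPU", ["nvidia-smi"]),
]


def failed_checks(checks, run=subprocess.run):
    """Run each command quietly and return the names of the checks that failed."""
    failed = []
    for name, command in checks:
        try:
            result = run(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            ok = result.returncode == 0
        except FileNotFoundError:  # binary is not installed at all
            ok = False
        if not ok:
            failed.append(name)
    return failed


def preflight(checks=CHECKS):
    """Exit with a readable message if any environment check fails."""
    failed = failed_checks(checks)
    if failed:
        raise SystemExit("environment check failed: " + ", ".join(failed))
```

Calling `preflight()` before creating the timestamped directory keeps a failed environment check from leaving an empty results folder behind.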