nv-generate-vae-finetune

NV-Generate-VAE-Finetune

Purpose

  • Used for finetuning the NV-Generate-CTMR MAISI VAE/autoencoder from user-supplied CT or MRI NIfTI training volumes.
  • Not for clinical interpretation, regulatory use, or approving synthetic data for production training.
  • Upstream currently documents VAE training in `train_vae_tutorial.ipynb` and provides configs/helpers, but not a `scripts.train_vae` CLI. This skill does not execute the notebook; it stages the required config/datalist glue locally and uses upstream helper APIs.
  • Manifest I/O: inputs are `datalist` and `data_base_dir`; outputs are `autoencoder_checkpoint`, `discriminator_checkpoint`, and `result_json`.
  • The underlying training contract is the upstream config/env JSON (`config_maisi_vae_train.json` + `environment_maisi_vae_train.json`, as used in `train_vae_tutorial.ipynb`). The wrapper stages those JSON files for you and exposes the most commonly tuned fields as CLI flags; the sections below document the fields, their defaults, and how to monitor/tune a run.

Instructions

  • Read `skill_manifest.yaml` before changing arguments, side effects, or validation gates.
  • Run `scripts/run_vae_finetune.py` from the Medical AI Skills repo root.
  • If a host agent exposes `run_script`, use `run_script("scripts/run_vae_finetune.py", args=[...])`; otherwise run the Bash/Python command below.
  • Use `--preflight` first when checking a new datalist; remove `--preflight` only when the user explicitly wants to launch GPU finetuning.
  • For a staged preflight input bundle directory, use `BUNDLE/preflight_datalist.json` as the datalist and `BUNDLE/preflight_dataset` as `--data-base-dir` when those files are present.
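
The preflight-first rule can be encoded in a small helper that assembles the argument list. This is a minimal sketch and not part of the skill: `build_args` and `bundle_args` are hypothetical names, and only flags documented in this file are used.

```python
from pathlib import Path


def build_args(datalist, data_base_dir, output_dir, launch_training=False):
    """Build the argument list for scripts/run_vae_finetune.py.

    Hypothetical helper: --preflight is kept unless the caller explicitly
    opts in to GPU finetuning with launch_training=True.
    """
    args = [
        str(datalist),
        "--data-base-dir", str(data_base_dir),
        "--output-dir", str(output_dir),
    ]
    if not launch_training:
        args.append("--preflight")
    return args


def bundle_args(bundle_dir, output_dir):
    """Preflight arguments for a staged input bundle, if its files exist."""
    bundle = Path(bundle_dir)
    datalist = bundle / "preflight_datalist.json"
    dataset = bundle / "preflight_dataset"
    if not (datalist.is_file() and dataset.is_dir()):
        raise FileNotFoundError(f"{bundle} is not a staged preflight bundle")
    return build_args(datalist, dataset, output_dir)
```

A host agent could then call `run_script("scripts/run_vae_finetune.py", args=build_args(...))`, so that dropping `--preflight` is always a deliberate choice.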

Examples

Validate and stage a preflight finetune check from an input bundle (the recommended first step: no GPU, no training). This is the single canonical command; replace `INPUT_BUNDLE` and `OUT_DIR` with your paths:

```bash
export NV_GENERATE_ROOT="${NV_GENERATE_ROOT:-.workbench_data/upstreams/NV-Generate-CTMR}" && \
python skills/nv-generate-vae-finetune/scripts/run_vae_finetune.py \
  INPUT_BUNDLE/preflight_datalist.json \
  --data-base-dir INPUT_BUNDLE/preflight_dataset \
  --output-dir OUT_DIR \
  --modality mri \
  --preflight
```

For real GPU finetuning and other variations, see Usage below.

Available Scripts

| Script | Purpose | Arguments |
|---|---|---|
| `scripts/run_vae_finetune.py` | Primary entrypoint declared by `skill_manifest.yaml`. | `DATALIST.json --data-base-dir DATA_DIR --output-dir OUT_DIR [--epochs N] [--modality mri] [--patch-size 64,64,64] [--preflight]` |

Prerequisites

  • `NV_GENERATE_ROOT` should point to a current checkout of https://github.com/NVIDIA-Medtech/NV-Generate-CTMR containing `configs/config_maisi_vae_train.json`, `scripts/transforms.py`, and `scripts/utils.py`.
  • If `NV_GENERATE_ROOT` is unset, the wrapper searches `.workbench_data/upstreams/NV-Generate-CTMR`.
  • `CUDA_VISIBLE_DEVICES` is optional and can be used to select the GPU for real training.
  • Runtime requirements: NVIDIA CUDA GPU for real training, Python packages from the upstream `requirements.txt`, `lpips`, and downloaded VAE weights unless using `--train-from-scratch`.
  • Side effects: writes staged configs, checkpoints, TensorBoard logs, and run summaries under the caller-provided `--output-dir`; may write model caches under the upstream checkout, `~/.cache/huggingface/`, and `~/.cache/torch/`; may contact https://huggingface.co, https://github.com, and https://download.pytorch.org.
  • The datalist is a MONAI-style JSON object with a non-empty `training[]` and a non-empty `validation[]` or `testing[]`. Each entry has an `image` path relative to `--data-base-dir` and an optional `class` or `modality` of `ct` or `mri`.
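
The datalist rules can be checked before invoking the wrapper. The sketch below is illustrative only: `check_datalist` is a hypothetical helper, the file names are made up, and the wrapper's own validation may be stricter or differ in detail.

```python
def check_datalist(datalist, default_modality="mri"):
    """Validate a MONAI-style datalist dict and fill in missing classes.

    Hypothetical helper that mirrors the documented rules only.
    """
    if not datalist.get("training"):
        raise ValueError("datalist must include non-empty training[]")
    if not (datalist.get("validation") or datalist.get("testing")):
        raise ValueError(
            "datalist must include non-empty validation[] or testing[]"
        )
    for split in ("training", "validation", "testing"):
        for entry in datalist.get(split, []):
            if "image" not in entry:
                raise ValueError(f"{split} entry is missing 'image'")
            # Prefer an explicit class, then modality, then the default.
            label = entry.get("class") or entry.get("modality") or default_modality
            if label not in ("ct", "mri"):
                raise ValueError(f"unsupported modality: {label!r}")
            entry.setdefault("class", label)
    return datalist


# Illustrative datalist; image paths are relative to --data-base-dir.
example = {
    "training": [{"image": "train/case_001.nii.gz", "class": "mri"}],
    "validation": [{"image": "val/case_101.nii.gz"}],
}
check_datalist(example)
```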

1. Config and environment JSON (adapt to your data)

The wrapper copies the upstream VAE config/env JSON from `$NV_GENERATE_ROOT/configs`, rewrites the fields below, and writes the staged copies under `OUT_DIR/workflow/configs/`. You normally only set your datalist and data root; the listed CLI flags override individual fields when you need to.

Environment JSON (`environment_maisi_vae_train.json`):

| Field | Set from | Notes |
|---|---|---|
| `model_dir` | `--output-dir` | Where `autoencoder.pt` / `discriminator.pt` and best checkpoints are saved. |
| `tfevent_path` | `--output-dir` | TensorBoard event directory. |
| `finetune` | `--train-from-scratch` | `true` (default) loads `trained_autoencoder_path`; the flag sets it `false`. |
| `trained_autoencoder_path` | upstream weights / `--trained-autoencoder-path` | Starting VAE checkpoint when finetuning. |

Training fields (`config_maisi_vae_train.json`):

| Field | Flag | Type | Default | Notes |
|---|---|---|---|---|
| `autoencoder_train.n_epochs` | `--epochs` | int | `1` | |
| `autoencoder_train.batch_size` | `--batch-size` | int | `1` | Per-GPU (single-GPU runner). |
| `autoencoder_train.patch_size` | `--patch-size` | int,int,int | `64,64,64` | Training crop. |
| `autoencoder_train.val_batch_size` | `--val-batch-size` | int | `1` | |
| `autoencoder_train.val_sliding_window_patch_size` | `--val-sliding-window-patch-size` | int,int,int | `96,96,64` | Sliding-window validation ROI. |
| `autoencoder_train.lr` | `--lr` | float | `1e-4` | |
| `autoencoder_train.perceptual_weight` | `--perceptual-weight` | float | `0.3` | LPIPS term. |
| `autoencoder_train.kl_weight` | `--kl-weight` | float | `1e-7` | KL term. |
| `autoencoder_train.adv_weight` | `--adv-weight` | float | `0.1` | Adversarial term. |
| `autoencoder_train.recon_loss` | `--recon-loss` | `l1` \| `l2` | `l1` | |
| `autoencoder_train.val_interval` | `--val-interval` | int | `1` | Epochs between validation passes. |
| `autoencoder_train.cache` | `--cache-rate` | float | `0.0` | MONAI `CacheDataset` fraction. |
| `autoencoder_train.amp` | `--no-amp` | flag | on | Mixed precision; flag disables it. |
| `data_option.random_aug` | `--no-random-aug` | flag | on | Random augmentation; flag disables it. |
| `data_option.spacing_type` | `--spacing-type` | `original` \| `fixed` \| `rand_zoom` | `original` | |
| `data_option.spacing` | `--spacing` | float,float,float | unset | Required when `spacing_type` is `fixed` / `rand_zoom`. |
| `data_option.select_channel` | `--select-channel` | int | `0` | Channel for multi-channel inputs. |

`--modality` (`ct` or `mri`, default `mri`) fills the per-entry `class` for datalist items missing one. Validation/testing entries are required because the training loop runs a validation pass.

For an end-to-end reference including example data download, see the upstream tutorial `train_vae_tutorial.ipynb`.
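
To make the flag-to-field mapping concrete, here is a minimal sketch of applying dotted-key overrides to a config dict. It is an assumption about how such staging could work, not the wrapper's actual code; `set_by_path`, `parse_triplet`, and `apply_overrides` are hypothetical names.

```python
import json


def set_by_path(config, dotted_key, value):
    """Set a nested field such as 'autoencoder_train.lr' in a config dict."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def parse_triplet(text, cast=int):
    """Parse a comma-separated triplet such as '64,64,64'."""
    parts = [cast(p) for p in text.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected three values, got {text!r}")
    return parts


def apply_overrides(config, overrides):
    """Apply dotted-key overrides, then check the documented spacing rule."""
    for key, value in overrides.items():
        set_by_path(config, key, value)
    data_option = config.get("data_option", {})
    needs_spacing = data_option.get("spacing_type") in ("fixed", "rand_zoom")
    if needs_spacing and not data_option.get("spacing"):
        raise ValueError("--spacing is required for fixed/rand_zoom spacing")
    return config


# Trimmed stand-in for config_maisi_vae_train.json (not the full upstream file).
config = {
    "autoencoder_train": {"n_epochs": 1, "lr": 1e-4},
    "data_option": {"spacing_type": "original"},
}
apply_overrides(config, {
    "autoencoder_train.patch_size": parse_triplet("64,64,64"),
    "autoencoder_train.lr": 5e-5,
})
print(json.dumps(config, indent=2))
```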

2. Usage (one-line training)

Preflight only:

```bash
export NV_GENERATE_ROOT="${NV_GENERATE_ROOT:-.workbench_data/upstreams/NV-Generate-CTMR}" && \
python skills/nv-generate-vae-finetune/scripts/run_vae_finetune.py \
  PATH_TO_DATALIST.json \
  --data-base-dir PATH_TO_DATA_ROOT \
  --output-dir runs/nv_generate_vae_finetune_preflight \
  --preflight
```

Preflight bundle input:

```bash
export NV_GENERATE_ROOT="${NV_GENERATE_ROOT:-.workbench_data/upstreams/NV-Generate-CTMR}" && \
python skills/nv-generate-vae-finetune/scripts/run_vae_finetune.py \
  PATH_TO_INPUT_BUNDLE/preflight_datalist.json \
  --data-base-dir PATH_TO_INPUT_BUNDLE/preflight_dataset \
  --output-dir runs/nv_generate_vae_finetune_preflight \
  --preflight
```

GPU finetuning:

```bash
export NV_GENERATE_ROOT="${NV_GENERATE_ROOT:-.workbench_data/upstreams/NV-Generate-CTMR}" && \
python -m pip install -r "$NV_GENERATE_ROOT/requirements.txt" && \
python -m pip install lpips tensorboard && \
python skills/nv-generate-vae-finetune/scripts/run_vae_finetune.py \
  PATH_TO_DATALIST.json \
  --data-base-dir PATH_TO_DATA_ROOT \
  --output-dir runs/nv_generate_vae_finetune \
  --epochs 1 \
  --modality mri \
  --patch-size 64,64,64 \
  --download-model-data
```

Replace `PATH_TO_DATALIST.json` and `PATH_TO_DATA_ROOT` with the user's actual paths. Do not use the fixture datalist for real training; it is a preflight-only placeholder.
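
When the wrapper is launched from Python rather than a shell, the environment handling in the commands above can be reproduced as follows. This is a sketch under stated assumptions: `training_env` is a hypothetical helper, and it only mirrors the `NV_GENERATE_ROOT` fallback and the optional `CUDA_VISIBLE_DEVICES` selection described in this document.

```python
import os

DEFAULT_ROOT = ".workbench_data/upstreams/NV-Generate-CTMR"


def training_env(base_env=None, gpu_index=None):
    """Return an environment dict for launching run_vae_finetune.py.

    Mirrors the shell form ${NV_GENERATE_ROOT:-default}: the default is used
    when the variable is unset or empty.
    """
    env = dict(os.environ if base_env is None else base_env)
    if not env.get("NV_GENERATE_ROOT"):
        env["NV_GENERATE_ROOT"] = DEFAULT_ROOT
    if gpu_index is not None:
        # The runner is single-GPU, so expose exactly one device.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env
```

The result can be passed as `env=training_env(gpu_index=0)` to `subprocess.run([...], check=True)` together with the same arguments shown in the shell commands.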

3. Monitor training (TensorBoard)

The runner writes TensorBoard scalars (per-iteration and per-epoch `recons_loss`, `kl_loss`, `p_loss`, adversarial/real/fake losses, and a validation `scale_factor`) under `OUT_DIR/artifacts/tfevent/autoencoder`. Launch TensorBoard against the output directory:

```bash
python -m pip install tensorboard && \
tensorboard --logdir runs/nv_generate_vae_finetune/artifacts/tfevent
```

The same per-epoch loss history is also captured in `OUT_DIR/artifacts/workflow_summary.json` and echoed in the JSON the wrapper prints to stdout (`loss_history`, best-checkpoint paths, `exit_code`, `stderr_tail`).

4. Hyperparameter tuning and common pitfalls

  • Reconstructions blurry: raise `--perceptual-weight` (default `0.3`); try `--recon-loss l2` if edges look washed out.
  • Posterior collapse / over-regularized latents: `--kl-weight` is intentionally tiny (`1e-7`); increasing it too much degrades reconstruction.
  • Adversarial training unstable: lower `--adv-weight` (default `0.1`) or `--lr`; a warmup schedule already ramps the LR over the first 20 epochs.
  • Out-of-memory: reduce `--patch-size` (e.g. `48,48,48`) and `--val-sliding-window-patch-size`, keep `--batch-size 1`, and lower `--cache-rate`.
  • `datalist must include non-empty validation[] or testing[]`: the validation loop is mandatory; add `validation[]` (or `testing[]`) entries.
  • Single-GPU only: the runner asserts exactly one CUDA GPU; set `CUDA_VISIBLE_DEVICES` to pick which one.
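
To see why the weights interact the way the tips above suggest, the sketch below combines the four loss terms as a weighted sum. The weighted-sum form is an assumption (it is the usual shape of a VAE-GAN generator objective, not confirmed by this document), and the raw loss values are invented for illustration.

```python
def weighted_terms(recon, kl, perceptual, adversarial,
                   kl_weight=1e-7, perceptual_weight=0.3, adv_weight=0.1):
    """Return each term's contribution to an assumed weighted-sum loss."""
    return {
        "recon": recon,
        "kl": kl_weight * kl,
        "perceptual": perceptual_weight * perceptual,
        "adversarial": adv_weight * adversarial,
    }


def generator_loss(*args, **kwargs):
    """Total generator loss under the weighted-sum assumption."""
    return sum(weighted_terms(*args, **kwargs).values())


# Invented raw loss values, chosen only to show relative scale.
raw = dict(recon=0.05, kl=2000.0, perceptual=0.2, adversarial=0.5)

default = weighted_terms(**raw)
heavy_kl = weighted_terms(**raw, kl_weight=1e-4)

print(default)   # KL contributes far less than the reconstruction term
print(heavy_kl)  # KL now outweighs reconstruction
```

With the default weights the KL contribution is a small fraction of the reconstruction term, while raising `--kl-weight` by three orders of magnitude makes it the largest term, which is consistent with the warning that a large KL weight degrades reconstruction.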

5. Evaluate the finetuned VAE

Validation reconstruction loss is tracked automatically, and the autoencoder from the epoch with the lowest `val_weighted_loss` is saved as `autoencoder_epochN.pt` under `OUT_DIR/artifacts/models`. To evaluate downstream:
  • Compare validation `recons_loss` / `p_loss` curves across runs in TensorBoard, and
  • Plug the finetuned autoencoder into a diffusion finetune/generation run (e.g. `nv-generate-mr-brain-finetune` via `--trained-autoencoder-path`) to confirm latents still decode to usable volumes.
This skill gates file accounting and reconstruction bookkeeping only; image quality and downstream utility must be judged by a domain expert.
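
The best-checkpoint rule can be reproduced offline from a recorded loss history. The sketch below assumes each record is a dict with `epoch` and `val_weighted_loss` keys and that `N` in `autoencoder_epochN.pt` equals the `epoch` value; the actual layout of `loss_history` in `workflow_summary.json` is not specified here, so treat both as assumptions.

```python
def best_checkpoint(loss_history, models_dir="OUT_DIR/artifacts/models"):
    """Return the checkpoint path for the lowest-val_weighted_loss epoch.

    Assumed record shape: {"epoch": int, "val_weighted_loss": float}.
    Epochs without a validation pass are skipped.
    """
    validated = [
        record for record in loss_history
        if record.get("val_weighted_loss") is not None
    ]
    if not validated:
        raise ValueError("loss history contains no validation records")
    best = min(validated, key=lambda record: record["val_weighted_loss"])
    return f"{models_dir}/autoencoder_epoch{best['epoch']}.pt"


# Invented history for illustration.
history = [
    {"epoch": 1, "val_weighted_loss": 0.40},
    {"epoch": 2, "val_weighted_loss": 0.31},
    {"epoch": 3, "val_weighted_loss": 0.35},
]
print(best_checkpoint(history))
```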

Limitations

  • Requires a current upstream `NV-Generate-CTMR` checkout with VAE configs and helper APIs. The skill owns the runner glue and does not depend on the notebook.
  • Full training can be expensive and is not deterministic across hardware, CUDA, and package versions.
  • The wrapper gates file accounting and command provenance, not anatomical realism, reconstruction quality, or downstream model utility.
  • Not for clinical deployment, clinical interpretation, autonomous diagnosis, regulatory submission, or production training-data approval.

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| `VAE configs/helpers were not found` | `NV_GENERATE_ROOT` does not point at a current NV-Generate-CTMR checkout. | Clone or update https://github.com/NVIDIA-Medtech/NV-Generate-CTMR and set `NV_GENERATE_ROOT`. |
| `datalist must include non-empty validation[] or testing[]` | VAE training requires validation data for the configured validation loop. | Add `validation[]` or `testing[]` entries with relative image paths. |
| CUDA, MONAI, or LPIPS import failure | Runtime environment lacks upstream dependencies. | Install `"$NV_GENERATE_ROOT/requirements.txt"` plus `lpips tensorboard` in the selected environment. |