nv-generate-vae-finetune
NV-Generate-VAE-Finetune
Purpose
- Used for finetuning the NV-Generate-CTMR MAISI VAE/autoencoder from user-supplied CT or MRI NIfTI training volumes.
- Not for clinical interpretation, regulatory use, or approving synthetic data for production training.
- Upstream currently documents VAE training in `train_vae_tutorial.ipynb` and provides configs/helpers, but not a `scripts.train_vae` CLI. This skill does not execute the notebook; it stages the required config/datalist glue locally and uses upstream helper APIs.
- Manifest I/O: inputs are `datalist` and `data_base_dir`; outputs are `autoencoder_checkpoint`, `discriminator_checkpoint`, and `result_json`.
- The underlying training contract is the upstream config/env JSON (`config_maisi_vae_train.json` + `environment_maisi_vae_train.json`, as used in `train_vae_tutorial.ipynb`). The wrapper stages those JSON files for you and exposes the most-tuned fields as CLI flags; the sections below document the fields, their defaults, and how to monitor/tune a run.
Instructions
- Read `skill_manifest.yaml` before changing arguments, side effects, or validation gates.
- Run `scripts/run_vae_finetune.py` from the Medical AI Skills repo root.
- If a host agent exposes `run_script`, use `run_script("scripts/run_vae_finetune.py", args=[...])`; otherwise run the Bash/Python command below.
- Use `--preflight` first when checking a new datalist; remove `--preflight` only when the user explicitly wants to launch GPU finetuning.
- For a staged preflight input bundle directory, use `BUNDLE/preflight_datalist.json` as the datalist and `BUNDLE/preflight_dataset` as `--data-base-dir` when those files are present.
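
The preflight invocation for a staged bundle can also be assembled programmatically. The following is a minimal sketch, assuming the bundle layout described in the list above; the helper name `build_preflight_args` is hypothetical and not part of the skill.

```python
from pathlib import Path


def build_preflight_args(bundle_dir: str, output_dir: str) -> list[str]:
    """Build the argument list for a preflight-only run (hypothetical helper)."""
    bundle = Path(bundle_dir)
    return [
        (bundle / "preflight_datalist.json").as_posix(),  # positional datalist
        "--data-base-dir", (bundle / "preflight_dataset").as_posix(),
        "--output-dir", output_dir,
        "--preflight",  # validate and stage only; no GPU training
    ]


args = build_preflight_args("INPUT_BUNDLE", "OUT_DIR")
print(args)
```

The resulting list could be passed as `args` to `run_script` where a host agent exposes it, or appended to the `python ... run_vae_finetune.py` command.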
Examples
Validate and stage a preflight finetune check from an input bundle (the recommended first step: no GPU, no training). This is the single canonical command; replace `INPUT_BUNDLE` and `OUT_DIR` with your paths:

```bash
export NV_GENERATE_ROOT="${NV_GENERATE_ROOT:-.workbench_data/upstreams/NV-Generate-CTMR}" && \
python skills/nv-generate-vae-finetune/scripts/run_vae_finetune.py \
  INPUT_BUNDLE/preflight_datalist.json \
  --data-base-dir INPUT_BUNDLE/preflight_dataset \
  --output-dir OUT_DIR \
  --modality mri \
  --preflight
```

For real GPU finetuning and other variations, see Usage below.
Available Scripts
| Script | Purpose | Arguments |
|---|---|---|
| `scripts/run_vae_finetune.py` | Primary entrypoint declared by `skill_manifest.yaml` | |
Prerequisites
- `NV_GENERATE_ROOT` may point to a current checkout of `https://github.com/NVIDIA-Medtech/NV-Generate-CTMR` containing `configs/config_maisi_vae_train.json`, `scripts/transforms.py`, and `scripts/utils.py`.
- If `NV_GENERATE_ROOT` is unset, the wrapper searches `.workbench_data/upstreams/NV-Generate-CTMR`.
- `CUDA_VISIBLE_DEVICES` is optional and can be used to select the GPU for real training.
- Runtime requirements: NVIDIA CUDA GPU for real training, Python packages from the upstream `requirements.txt`, `lpips`, and downloaded VAE weights unless using `--train-from-scratch`.
- Side effects: writes staged configs, checkpoints, TensorBoard logs, and run summaries under the caller-provided `--output-dir`; may write model caches under the upstream checkout, `~/.cache/huggingface/`, and `~/.cache/torch/`; may contact `https://huggingface.co`, `https://github.com`, and `https://download.pytorch.org`.
- The datalist is a MONAI-style JSON object with non-empty `training[]` and `validation[]` or `testing[]`. Each entry has an `image` path relative to `--data-base-dir` and an optional `class` or `modality` of `ct` or `mri`.
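
The datalist contract in the last item can be checked before running the wrapper. This is a minimal sketch of such a check, not the wrapper's own validation; the function name `check_datalist` and the example image paths are hypothetical, and only the keys named above are inspected.

```python
import json


def check_datalist(datalist: dict) -> list[str]:
    """Return a list of problems found; an empty list means the basic contract holds."""
    problems = []
    if not datalist.get("training"):
        problems.append("training[] is missing or empty")
    if not (datalist.get("validation") or datalist.get("testing")):
        problems.append("validation[] or testing[] is missing or empty")
    for split in ("training", "validation", "testing"):
        for i, entry in enumerate(datalist.get(split) or []):
            if not isinstance(entry, dict) or not entry.get("image"):
                problems.append(f"{split}[{i}] has no image path")
                continue
            for key in ("class", "modality"):
                if key in entry and entry[key] not in ("ct", "mri"):
                    problems.append(f"{split}[{i}].{key} must be 'ct' or 'mri'")
    return problems


# Hypothetical example; image paths are relative to --data-base-dir.
example = {
    "training": [{"image": "sub-01/t1.nii.gz", "modality": "mri"}],
    "validation": [{"image": "sub-02/t1.nii.gz", "modality": "mri"}],
}
print(json.dumps(example, indent=2))
print(check_datalist(example))
```

Running the wrapper with `--preflight` remains the authoritative check; a sketch like this only catches structural mistakes early.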
1. Config and environment JSON (adapt to your data)
The wrapper copies the upstream VAE config/env JSON from `$NV_GENERATE_ROOT/configs`, rewrites the fields below, and writes the staged copies under `OUT_DIR/workflow/configs/`. You normally only set your datalist and data root; the listed CLI flags override individual fields when you need to.

Environment JSON (`environment_maisi_vae_train.json`):

| Field | Set from | Notes |
|---|---|---|
| | | Where |
| | | TensorBoard event directory. |
| | | |
| | upstream weights / | Starting VAE checkpoint when finetuning. |

Training fields (`config_maisi_vae_train.json`):

| Field | Flag | Type | Default | Notes |
|---|---|---|---|---|
| | | int | | |
| | `--batch-size` | int | | Per-GPU batch size (single-GPU runner). |
| | `--patch-size` | int,int,int | | Training crop. |
| | | int | | |
| | `--val-sliding-window-patch-size` | int,int,int | | Sliding-window validation ROI. |
| | | float | | |
| | `--perceptual-weight` | float | `0.3` | LPIPS term. |
| | `--kl-weight` | float | `1e-7` | KL term. |
| | `--adv-weight` | float | `0.1` | Adversarial term. |
| | | | | |
| | | int | | Epochs between validation passes. |
| | `--cache-rate` | float | | MONAI |
| | | flag | on | Mixed precision; flag disables it. |
| | | flag | on | Random augmentation; flag disables it. |
| | | | | |
| | | float,float,float | unset | Required when |
| | | int | | Channel for multi-channel inputs. |

For an end-to-end reference including example data download, see the upstream tutorial `train_vae_tutorial.ipynb`.
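
The copy, rewrite, and stage pattern described in this section can be illustrated in a few lines. This is only a sketch of the general idea, not the wrapper's implementation: the function `stage_config`, the file names, and the flat key layout are all assumptions, and the real upstream config may nest or name its fields differently.

```python
import json
import tempfile
from pathlib import Path


def stage_config(src: Path, dst: Path, overrides: dict) -> dict:
    """Copy a JSON config to dst, replacing only the top-level keys in overrides."""
    config = json.loads(src.read_text())
    config.update(overrides)
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(json.dumps(config, indent=2))
    return config


with tempfile.TemporaryDirectory() as tmp:
    upstream = Path(tmp) / "upstream.json"
    # Hypothetical keys for illustration only.
    upstream.write_text(json.dumps({"perceptual_weight": 0.3, "kl_weight": 1e-7}))
    staged_path = Path(tmp) / "workflow" / "configs" / "staged.json"
    staged = stage_config(upstream, staged_path, {"perceptual_weight": 0.5})
    print(staged)
```

Keeping the upstream file untouched and writing a staged copy means each run directory records exactly the settings it was launched with.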
2. Usage (one-line training)
Preflight only:

```bash
export NV_GENERATE_ROOT="${NV_GENERATE_ROOT:-.workbench_data/upstreams/NV-Generate-CTMR}" && \
python skills/nv-generate-vae-finetune/scripts/run_vae_finetune.py \
  PATH_TO_DATALIST.json \
  --data-base-dir PATH_TO_DATA_ROOT \
  --output-dir runs/nv_generate_vae_finetune_preflight \
  --preflight
```

Preflight bundle input:

```bash
export NV_GENERATE_ROOT="${NV_GENERATE_ROOT:-.workbench_data/upstreams/NV-Generate-CTMR}" && \
python skills/nv-generate-vae-finetune/scripts/run_vae_finetune.py \
  PATH_TO_INPUT_BUNDLE/preflight_datalist.json \
  --data-base-dir PATH_TO_INPUT_BUNDLE/preflight_dataset \
  --output-dir runs/nv_generate_vae_finetune_preflight \
  --preflight
```

GPU finetuning:

```bash
export NV_GENERATE_ROOT="${NV_GENERATE_ROOT:-.workbench_data/upstreams/NV-Generate-CTMR}" && \
python -m pip install -r "$NV_GENERATE_ROOT/requirements.txt" && \
python -m pip install lpips tensorboard && \
python skills/nv-generate-vae-finetune/scripts/run_vae_finetune.py \
  PATH_TO_DATALIST.json \
  --data-base-dir PATH_TO_DATA_ROOT \
  --output-dir runs/nv_generate_vae_finetune \
  --epochs 1 \
  --modality mri \
  --patch-size 64,64,64 \
  --download-model-data
```

Replace `PATH_TO_DATALIST.json` and `PATH_TO_DATA_ROOT` with the user's actual paths. Do not use the fixture datalist for real training; it is a preflight-only placeholder.
3. Monitor training (TensorBoard)
The runner writes TensorBoard scalars (per-iteration and per-epoch `recons_loss`, `kl_loss`, `p_loss`, adversarial/real/fake losses, and a validation `scale_factor`) under `OUT_DIR/artifacts/tfevent/autoencoder`. Launch TensorBoard against the output directory:

```bash
python -m pip install tensorboard && \
tensorboard --logdir runs/nv_generate_vae_finetune/artifacts/tfevent
```

The same per-epoch loss history is also captured in `OUT_DIR/artifacts/workflow_summary.json` and echoed in the JSON the wrapper prints to stdout (`loss_history`, best-checkpoint paths, `exit_code`, `stderr_tail`).
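
For a quick check without TensorBoard, the summary JSON can be inspected directly. The sketch below assumes that `loss_history` is a list with one entry per epoch and that `exit_code` is an integer; the document does not specify either layout, so treat both as assumptions. The helper `summarize_run` and the inline sample data are hypothetical.

```python
import json


def summarize_run(summary: dict) -> str:
    """Produce a one- or two-line status from a workflow summary dict."""
    history = summary.get("loss_history") or []  # assumed: one entry per epoch
    exit_code = summary.get("exit_code")
    lines = [f"exit_code={exit_code}, epochs logged={len(history)}"]
    if exit_code not in (0, None):
        # Only surface the stderr tail when the run reported a failure.
        lines.append("stderr tail: " + str(summary.get("stderr_tail", "")))
    return "\n".join(lines)


# Synthetic stand-in for the contents of OUT_DIR/artifacts/workflow_summary.json.
summary = json.loads('{"exit_code": 0, "loss_history": [{"epoch": 1}, {"epoch": 2}]}')
report = summarize_run(summary)
print(report)
```

For a real run, replace the inline sample with `json.loads(Path(...).read_text())` pointed at the summary file in your output directory.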
4. Hyperparameter tuning and common pitfalls
- Reconstructions blurry: raise `--perceptual-weight` (default `0.3`); try `--recon-loss l2` if edges look washed out.
- Posterior collapse / over-regularized latents: `--kl-weight` is intentionally tiny (`1e-7`); increasing it too much degrades reconstruction.
- Adversarial training unstable: lower `--adv-weight` (default `0.1`) or `--lr`; a warmup schedule already ramps the LR over the first 20 epochs.
- Out-of-memory: reduce `--patch-size` (e.g. `48,48,48`) and `--val-sliding-window-patch-size`, keep `--batch-size 1`, and lower `--cache-rate`.
- `datalist must include non-empty validation[] or testing[]`: the validation loop is mandatory; add `validation[]` (or `testing[]`) entries.
- Single-GPU only: the runner asserts exactly one CUDA GPU; set `CUDA_VISIBLE_DEVICES` to pick which one.
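
To get a feel for how much the out-of-memory advice helps, compare voxel counts per training crop. This is a rough estimate only, assuming activation memory grows about linearly with the number of voxels in the patch; actual savings depend on the network and framework. The helper `voxels` is hypothetical.

```python
from math import prod


def voxels(patch: str) -> int:
    """Number of voxels in a comma-separated patch size such as '64,64,64'."""
    return prod(int(v) for v in patch.split(","))


baseline = voxels("64,64,64")  # 262144 voxels
reduced = voxels("48,48,48")   # 110592 voxels
ratio = reduced / baseline
print(f"{ratio:.2f}")  # 0.42
```

Under that assumption, dropping from `64,64,64` to `48,48,48` leaves roughly 42% of the per-patch voxels to process.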
5. Evaluate the finetuned VAE
Validation reconstruction loss is tracked automatically (lowest-`val_weighted_loss` epoch), and the best autoencoder is saved as `autoencoder_epochN.pt` under `OUT_DIR/artifacts/models`. To evaluate downstream:

- Compare validation `recons_loss` / `p_loss` curves across runs in TensorBoard, and
- Plug the finetuned autoencoder into a diffusion finetune/generation run (e.g. `nv-generate-mr-brain-finetune` via `--trained-autoencoder-path`) to confirm latents still decode to usable volumes.

This skill gates file accounting and reconstruction bookkeeping only; image quality and downstream utility must be judged by a domain expert.
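
When several `autoencoder_epochN.pt` files are present, a small helper can pick one to hand to a downstream run. The sketch below selects the file with the largest `N`, on the assumption that `N` is an integer epoch index and that the most recently saved best checkpoint has the highest one; confirm against the best-checkpoint paths the wrapper reports. The function `latest_autoencoder` is hypothetical.

```python
import re
import tempfile
from pathlib import Path


def latest_autoencoder(models_dir: Path):
    """Return the autoencoder_epochN.pt path with the largest N, or None if absent."""
    pattern = re.compile(r"autoencoder_epoch(\d+)\.pt$")
    candidates = []
    for path in models_dir.glob("autoencoder_epoch*.pt"):
        match = pattern.search(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare numerically so that epoch10 sorts after epoch2.
    return max(candidates, key=lambda item: item[0])[1]


with tempfile.TemporaryDirectory() as tmp:
    for n in (2, 10):
        (Path(tmp) / f"autoencoder_epoch{n}.pt").touch()  # empty placeholder files
    best = latest_autoencoder(Path(tmp))
    print(best.name)  # autoencoder_epoch10.pt
```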
Limitations
- Requires a current upstream `NV-Generate-CTMR` checkout with VAE configs and helper APIs. The skill owns the runner glue and does not depend on the notebook.
- Full training can be expensive and is not deterministic across hardware, CUDA, and package versions.
- The wrapper gates file accounting and command provenance, not anatomical realism, reconstruction quality, or downstream model utility.
- Not for clinical deployment, clinical interpretation, autonomous diagnosis, regulatory submission, or production training-data approval.
Troubleshooting
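| Error | Cause | Fix |
|---|---|---|
| | | Clone or update `NV-Generate-CTMR`. |
| `datalist must include non-empty validation[] or testing[]` | VAE training requires validation data for the configured validation loop. | Add `validation[]` or `testing[]` entries with relative image paths. |
| CUDA, MONAI, or LPIPS import failure | Runtime environment lacks upstream dependencies. | Install the upstream `requirements.txt` packages and `lpips` in the selected environment. |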