PyHealth
PyHealth (https://pyhealth.dev/) is a Python toolkit for clinical deep learning. It provides a unified, modular pipeline across electronic health records (EHR), physiological signals, and medical imaging.
The library is built around a 5-stage pipeline, `Dataset → Task → Model → Trainer → Metrics`, where each stage is replaceable and the interfaces between stages are stable. Code that follows this pipeline shape composes well; code that bypasses it usually fights the library.
When to use this skill
Use this skill whenever the user is doing clinical/healthcare ML and any of the following are true:
- They mention PyHealth, MIMIC-III/IV, eICU, OMOP-CDM, EHRShot, SleepEDF, SHHS, ISRUC, COVID19-CXR, ChestX-ray14, TUEV/TUAB.
- They want to predict mortality, readmission, length of stay, drug recommendations, sleep stages, ICD codes, EEG events, or de-identification.
- They need to look up or cross-map medical codes (ICD-9-CM, ICD-10-CM, ATC, NDC, RxNorm, CCS).
- They have EHR-shaped data and want to train a clinical model without writing the plumbing themselves.
PyHealth is the right tool when the workflow fits its 5 stages. If the user just wants generic PyTorch on tabular data, this skill is not necessary.
Installation (uv)
PyHealth 2.0 requires Python ≥ 3.12, < 3.14. Use `uv` for environment management; it is fast and gives reproducible environments.

```bash
# Create a project with the right Python
uv init my-pyhealth-project
cd my-pyhealth-project
uv python pin 3.12

# Add PyHealth (this also pulls in PyTorch and friends)
uv add pyhealth

# Run scripts inside the env
uv run python train.py
```

For a one-off script without a project, use `uv run --with pyhealth python script.py`. For the legacy 1.x line (Python 3.9+), `uv add pyhealth==1.16`. Detailed install notes, MIMIC access, and GPU/CPU device tips are in `references/installation.md`.
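Before running anything heavier, it can help to confirm that the interpreter falls inside the supported range. The sketch below is one way to do that with the standard library only. The helper names `python_is_supported` and `installed_pyhealth_version` are hypothetical and are not part of PyHealth or uv.

```python
import sys
from importlib import metadata

# Supported range taken from the text above: Python >= 3.12 and < 3.14.
MIN_PYTHON = (3, 12)
MAX_PYTHON_EXCLUSIVE = (3, 14)


def python_is_supported(version=None):
    """Return True if (major, minor) lies in [3.12, 3.14)."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor < MAX_PYTHON_EXCLUSIVE


def installed_pyhealth_version():
    """Return the installed pyhealth version string, or None if it is absent."""
    try:
        return metadata.version("pyhealth")
    except metadata.PackageNotFoundError:
        return None


print("Python supported:", python_is_supported())
print("pyhealth version:", installed_pyhealth_version())
```

Saved under a name of your choosing (say, `check_env.py`), it can be run inside the project with `uv run python check_env.py`.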
The 5-stage pipeline
A complete pipeline is typically roughly 20 lines. This is the canonical shape; start here and modify pieces:

```python
from pyhealth.datasets import MIMIC3Dataset, split_by_patient, get_dataloader
from pyhealth.tasks import MortalityPredictionMIMIC3
from pyhealth.models import Transformer
from pyhealth.trainer import Trainer
from pyhealth.metrics.binary import binary_metrics_fn

# 1. Dataset: raw patient registry
base = MIMIC3Dataset(
    root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
)

# 2. Task: converts patients into supervised samples
samples = base.set_task(MortalityPredictionMIMIC3())

# 3. Split + DataLoaders (split by patient to avoid leakage)
train_ds, val_ds, test_ds = split_by_patient(samples, [0.8, 0.1, 0.1])
train_loader = get_dataloader(train_ds, batch_size=32, shuffle=True)
val_loader = get_dataloader(val_ds, batch_size=32, shuffle=False)
test_loader = get_dataloader(test_ds, batch_size=32, shuffle=False)

# 4. Model: must be passed the SampleDataset, not the BaseDataset
model = Transformer(dataset=samples)

# 5. Train + evaluate
trainer = Trainer(model=model)
trainer.train(
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    epochs=50,
    monitor="pr_auc",
)
y_true, y_prob, _ = trainer.inference(test_loader)
print(binary_metrics_fn(y_true, y_prob, metrics=["pr_auc", "roc_auc"]))
```

A copy-pasteable starter is in `assets/starter_pipeline.py`.
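To see why step 3 splits by patient, here is a dependency-free sketch of the idea. It is not PyHealth's `split_by_patient` implementation: the function `split_ids_by_patient`, the toy sample dicts, and the `patient_id` key are illustrative assumptions.

```python
import random


def split_ids_by_patient(samples, ratios, seed=0):
    """Split samples so that every patient lands in exactly one partition."""
    # Shuffle the unique patient IDs, not the samples themselves.
    patient_ids = sorted({s["patient_id"] for s in samples})
    random.Random(seed).shuffle(patient_ids)

    # Convert the first two ratios into cut points over the patient list;
    # the last partition takes whatever remains.
    n = len(patient_ids)
    train_end = int(n * ratios[0])
    val_end = train_end + int(n * ratios[1])
    groups = (
        set(patient_ids[:train_end]),
        set(patient_ids[train_end:val_end]),
        set(patient_ids[val_end:]),
    )

    # Route each sample to the partition that owns its patient.
    return tuple([s for s in samples if s["patient_id"] in g] for g in groups)


# Toy data: 10 patients with 3 visits each.
samples = [
    {"patient_id": f"p{i}", "visit_id": f"p{i}-v{j}"}
    for i in range(10)
    for j in range(3)
]
train, val, test = split_ids_by_patient(samples, [0.8, 0.1, 0.1])

train_patients = {s["patient_id"] for s in train}
test_patients = {s["patient_id"] for s in test}
print(len(train), len(val), len(test))  # 24 3 3
print(train_patients & test_patients)   # set()
```

Because the shuffle happens over patient IDs, all visits of a patient stay together, so no patient contributes samples to both the training and test partitions.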
Critical things to get right
These are the mistakes that PyHealth code most commonly trips on. Internalize them before writing pipelines:
- Models take a `SampleDataset`, not a `BaseDataset`. `MIMIC3Dataset(...)` returns a `BaseDataset` (a queryable patient registry). Only after `.set_task(task)` do you get a `SampleDataset`, which is what models, splitters, and DataLoaders expect. If you pass `base` to a model, it will fail or behave wrong.
- Always split by patient (or visit), not by sample. Random sample-level splits leak information across train/test because the same patient can appear in both. Use `split_by_patient` for patient-level prediction, `split_by_visit` only when visits are independent.
- Match the task to the dataset. Tasks are dataset-specific: `MortalityPredictionMIMIC3` won't work on MIMIC-IV; use `MortalityPredictionMIMIC4` or `InHospitalMortalityMIMIC4` instead. The full mapping is in `references/tasks.md`.
- Pick `monitor` to match the task type. For binary classification use `"pr_auc"` or `"roc_auc"`. For multilabel (drug rec) use `"pr_auc_samples"` or `"jaccard_samples"`. For multiclass use `"accuracy"` or `"f1_macro"`. With the wrong monitor, checkpoint selection saves the wrong epoch.
- MIMIC-IV uses `ehr_root=`, not `root=`. This is the one inconsistency in the dataset constructors.
- For reproducible work, point `cache_dir=` somewhere persistent. PyHealth caches the parsed dataset; without `cache_dir`, you re-parse every run.
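The `monitor` guidance in the list above can be captured in a small lookup. The mapping restates the metric names given in the text. The helper `pick_monitor` and the mode strings `"binary"`, `"multilabel"`, and `"multiclass"` are hypothetical conveniences, not a PyHealth API.

```python
# Metric names copied from the guidance above; the first entry is the default.
MONITORS_BY_MODE = {
    "binary": ("pr_auc", "roc_auc"),
    "multilabel": ("pr_auc_samples", "jaccard_samples"),
    "multiclass": ("accuracy", "f1_macro"),
}


def pick_monitor(mode, preferred=None):
    """Return a recommended monitor metric for the given task mode."""
    try:
        allowed = MONITORS_BY_MODE[mode]
    except KeyError:
        raise ValueError(f"unknown task mode: {mode!r}") from None
    if preferred is None:
        return allowed[0]
    if preferred not in allowed:
        raise ValueError(
            f"{preferred!r} does not suit a {mode} task; use one of {allowed}"
        )
    return preferred


print(pick_monitor("binary"))
print(pick_monitor("multilabel", "jaccard_samples"))
```

The returned string would then be passed as `monitor=` to `trainer.train(...)`, so a mismatch surfaces as an error before training starts rather than as a poorly chosen checkpoint afterwards.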
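The last two items in the list above (`ehr_root=` versus `root=`, and a persistent `cache_dir=`) can be folded into one helper that builds constructor keyword arguments. This is only a sketch: `dataset_kwargs`, the `"mimic3"`/`"mimic4"` labels, and the example paths are hypothetical, and only the keyword names mentioned in the text are assumed to exist.

```python
from pathlib import Path

# Keyword that carries the data root, per the note above.
ROOT_KEYWORD = {"mimic3": "root", "mimic4": "ehr_root"}


def dataset_kwargs(dataset, data_root, cache_dir):
    """Build root/cache keyword arguments for a MIMIC dataset constructor."""
    if dataset not in ROOT_KEYWORD:
        raise ValueError(f"expected one of {sorted(ROOT_KEYWORD)}, got {dataset!r}")
    return {
        ROOT_KEYWORD[dataset]: str(data_root),
        # A fixed location lets the parsed dataset be reused across runs.
        "cache_dir": str(Path(cache_dir).expanduser()),
    }


# Hypothetical paths, for illustration only.
print(dataset_kwargs("mimic3", "/data/mimic3", "~/.cache/pyhealth-demo"))
print(dataset_kwargs("mimic4", "/data/mimic4", "~/.cache/pyhealth-demo"))
```

The resulting dict would be unpacked into the matching constructor together with whatever table-selection arguments that dataset class expects.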
How to use this skill
PyHealth has a large API surface, so there is no point loading it all at once. Read the reference file that matches the user's task:
| If the user is asking about… | Read |
|---|---|
| Installing, env setup, MIMIC access, GPU | `references/installation.md` |
| Which dataset class to use, loading patterns, splitting | |
| What prediction task to choose (mortality, readmission, drug rec, sleep…) | `references/tasks.md` |
| Picking a model architecture, model-specific arguments | `models.md` |
| Looking up or cross-mapping ICD/ATC/NDC/RxNorm/CCS codes, tokenizers | |
| End-to-end recipes for common scenarios | `examples.md` |
For multi-step tasks (e.g., "build a drug recommendation pipeline on MIMIC-IV"), read `tasks.md` + `models.md` + `examples.md` together; they cross-reference each other.
A note on style
Write minimal, idiomatic PyHealth. The library is opinionated; lean into its abstractions instead of reimplementing them in raw PyTorch. If you find yourself writing a custom training loop, ask whether `Trainer` would do the job; it almost always will, and it handles checkpointing, logging, and best-model selection for free.
When the user has private MIMIC access, point them at the local CSV root; for demos and learning, the synthetic MIMIC-III bucket (`https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/`) is fine and works without credentialing.
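One way to encode that choice is a helper that prefers a local MIMIC-III directory when it exists and otherwise falls back to the synthetic bucket. The function `resolve_mimic3_root` and the example path are hypothetical; only the bucket URL comes from the text.

```python
from pathlib import Path

SYNTHETIC_MIMIC3_URL = "https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/"


def resolve_mimic3_root(local_root=None):
    """Return the local CSV root if it is an existing directory, else the synthetic URL."""
    if local_root is not None:
        path = Path(local_root).expanduser()
        if path.is_dir():
            return str(path)
    return SYNTHETIC_MIMIC3_URL


# Hypothetical local path; falls back to the synthetic data when it is missing.
root = resolve_mimic3_root("~/data/mimic-iii/1.4")
print(root)
```

The returned value can be passed as `root=` to `MIMIC3Dataset`, so the same script works with or without credentialed data.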