pyhealth

PyHealth

PyHealth (https://pyhealth.dev/) is a Python toolkit for clinical deep learning. It provides a unified, modular pipeline across electronic health records (EHR), physiological signals, and medical imaging.
The library is built around a 5-stage pipeline, `Dataset → Task → Model → Trainer → Metrics`, where each stage is replaceable and the interfaces between stages are stable. Code that follows this pipeline shape composes well; code that bypasses it usually fights the library.

When to use this skill

Use this skill whenever the user is doing clinical/healthcare ML and any of the following are true:
  • They mention PyHealth, MIMIC-III/IV, eICU, OMOP-CDM, EHRShot, SleepEDF, SHHS, ISRUC, COVID19-CXR, ChestX-ray14, TUEV/TUAB.
  • They want to predict mortality, readmission, length of stay, drug recommendations, sleep stages, ICD codes, EEG events, or de-identification.
  • They need to look up or cross-map medical codes (ICD-9-CM, ICD-10-CM, ATC, NDC, RxNorm, CCS).
  • They have EHR-shaped data and want to train a clinical model without writing the plumbing themselves.
PyHealth is the right tool when the workflow fits its 5 stages. If the user just wants generic PyTorch on tabular data, this skill is not necessary.

Installation (uv)

PyHealth 2.0 requires Python ≥ 3.12, < 3.14. Use `uv` for environment management: it is faster and its environments are reproducible.

    # Create a project with the right Python
    uv init my-pyhealth-project
    cd my-pyhealth-project
    uv python pin 3.12

    # Add PyHealth (this also pulls in PyTorch and friends)
    uv add pyhealth

    # Run scripts inside the env
    uv run python train.py

For a one-off script without a project, use `uv run --with pyhealth python script.py`. For the legacy 1.x line (Python 3.9+), `uv add pyhealth==1.16`. Detailed install notes, MIMIC access, and GPU/CPU device tips are in `references/installation.md`.
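
The version bounds stated above can be checked at the top of a script before anything heavy is imported. The following is a minimal sketch, not part of PyHealth: the helper name `python_supported` and the two constants are hypothetical, and the bounds are simply the ones quoted in this section for PyHealth 2.0.

```python
import sys

# Bounds quoted in this section for PyHealth 2.0: 3.12 <= Python < 3.14.
MIN_PYTHON = (3, 12)
MAX_PYTHON_EXCLUSIVE = (3, 14)


def python_supported(version=None):
    """Return True if (major, minor) lies inside the supported range."""
    if version is None:
        version = sys.version_info[:2]
    return MIN_PYTHON <= tuple(version[:2]) < MAX_PYTHON_EXCLUSIVE


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:2])
    if python_supported():
        print(f"Python {current} is inside the range quoted for PyHealth 2.0")
    else:
        print(f"Python {current} is outside the range; consider `uv python pin 3.12`")
```

Failing early with a clear message is usually easier to debug than an import error raised deep inside a dependency.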

The 5-stage pipeline

A complete pipeline is typically <20 lines. This is the canonical shape — start here and modify pieces:

    from pyhealth.datasets import MIMIC3Dataset, split_by_patient, get_dataloader
    from pyhealth.tasks import MortalityPredictionMIMIC3
    from pyhealth.models import Transformer
    from pyhealth.trainer import Trainer
    from pyhealth.metrics.binary import binary_metrics_fn

    # 1. Dataset: raw patient registry
    base = MIMIC3Dataset(
        root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/",
        tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
    )

    # 2. Task: converts patients into supervised samples
    samples = base.set_task(MortalityPredictionMIMIC3())

    # 3. Split + DataLoaders (split by patient to avoid leakage)
    train_ds, val_ds, test_ds = split_by_patient(samples, [0.8, 0.1, 0.1])
    train_loader = get_dataloader(train_ds, batch_size=32, shuffle=True)
    val_loader = get_dataloader(val_ds, batch_size=32, shuffle=False)
    test_loader = get_dataloader(test_ds, batch_size=32, shuffle=False)

    # 4. Model: must be passed the SampleDataset, not the BaseDataset
    model = Transformer(dataset=samples)

    # 5. Train + evaluate
    trainer = Trainer(model=model)
    trainer.train(
        train_dataloader=train_loader,
        val_dataloader=val_loader,
        epochs=50,
        monitor="pr_auc",
    )
    y_true, y_prob, _ = trainer.inference(test_loader)
    print(binary_metrics_fn(y_true, y_prob, metrics=["pr_auc", "roc_auc"]))

A copy-pasteable starter is in `assets/starter_pipeline.py`.
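
Step 3 splits by patient rather than by sample. To see why that matters, here is a toy, pure-Python sketch of the idea. It is not PyHealth's `split_by_patient` implementation: the function `toy_split_by_patient`, the sample dictionaries, and their `patient_id` / `visit_id` / `label` keys are all assumptions made up for illustration.

```python
import random


def toy_split_by_patient(samples, ratios, seed=0):
    """Split sample dicts so every patient_id lands in exactly one subset."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    # Shuffle unique patients, not samples, so no patient straddles two subsets.
    patient_ids = sorted({s["patient_id"] for s in samples})
    random.Random(seed).shuffle(patient_ids)
    train_end = int(len(patient_ids) * ratios[0])
    val_end = train_end + int(len(patient_ids) * ratios[1])
    groups = [
        set(patient_ids[:train_end]),
        set(patient_ids[train_end:val_end]),
        set(patient_ids[val_end:]),
    ]
    return tuple([s for s in samples if s["patient_id"] in g] for g in groups)


# Toy data: 10 patients with 3 visits each (30 samples in total).
samples = [
    {"patient_id": f"p{p}", "visit_id": f"p{p}-v{v}", "label": (p + v) % 2}
    for p in range(10)
    for v in range(3)
]
train, val, test = toy_split_by_patient(samples, [0.8, 0.1, 0.1])
train_ids = {s["patient_id"] for s in train}
test_ids = {s["patient_id"] for s in test}
print(len(train), len(val), len(test))
print("patients shared by train and test:", train_ids & test_ids)
```

With this toy data the subsets hold 24, 3, and 3 samples, and no patient appears in both train and test. A sample-level shuffle of the same 30 rows would almost certainly place visits from one patient on both sides of the split.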

Critical things to get right

These are the mistakes PyHealth code most commonly makes. Internalize them before writing pipelines:
  1. Models take a `SampleDataset`, not a `BaseDataset`. `MIMIC3Dataset(...)` returns a `BaseDataset` (a queryable patient registry). Only after `.set_task(task)` do you get a `SampleDataset`, which is what models, splitters, and DataLoaders expect. If you pass `base` to a model, it will fail or behave incorrectly.
  2. Always split by patient (or visit), not by sample. Random sample-level splits leak information across train/test because the same patient can appear in both. Use `split_by_patient` for patient-level prediction, and `split_by_visit` only when visits are independent.
  3. Match the task to the dataset. Tasks are dataset-specific: `MortalityPredictionMIMIC3` won't work on MIMIC-IV; use `MortalityPredictionMIMIC4` or `InHospitalMortalityMIMIC4` instead. The full mapping is in `references/tasks.md`.
  4. Pick `monitor` to match the task type. For binary classification use `"pr_auc"` or `"roc_auc"`. For multilabel (drug rec) use `"pr_auc_samples"` or `"jaccard_samples"`. For multiclass use `"accuracy"` or `"f1_macro"`. With the wrong monitor, checkpoint selection saves the wrong epoch.
  5. MIMIC-IV uses `ehr_root=`, not `root=`. This is the one inconsistency in the dataset constructors.
  6. For reproducible work, point `cache_dir=` somewhere persistent. PyHealth caches the parsed dataset; without `cache_dir`, you re-parse every run.
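
The monitor guidance in the list above can be captured in a small lookup so that a pipeline cannot silently pair a task type with an unrelated metric. This is only a sketch: `MONITOR_BY_TASK_TYPE` and `pick_monitor` are hypothetical helpers, not PyHealth APIs, and the metric names are copied from this document rather than checked against a specific PyHealth release.

```python
# Metric names are copied from this document; the helper itself is hypothetical.
MONITOR_BY_TASK_TYPE = {
    "binary": ("pr_auc", "roc_auc"),
    "multilabel": ("pr_auc_samples", "jaccard_samples"),
    "multiclass": ("accuracy", "f1_macro"),
}


def pick_monitor(task_type, preferred=None):
    """Return a monitor name that is valid for the given task type."""
    try:
        allowed = MONITOR_BY_TASK_TYPE[task_type]
    except KeyError:
        known = ", ".join(sorted(MONITOR_BY_TASK_TYPE))
        raise ValueError(f"unknown task type {task_type!r}; expected one of: {known}") from None
    if preferred is None:
        return allowed[0]  # first listed metric acts as the default
    if preferred not in allowed:
        raise ValueError(f"{preferred!r} is not listed for {task_type} tasks: {allowed}")
    return preferred


print(pick_monitor("binary"))
print(pick_monitor("multilabel", preferred="jaccard_samples"))
```

The returned string would then be passed as `monitor=` to `trainer.train(...)`, in place of a hard-coded literal.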

How to use this skill

PyHealth has a large API surface — there's no point loading it all at once. Read the reference file that matches the user's task:

| If the user is asking about… | Read |
|---|---|
| Installing, env setup, MIMIC access, GPU | `references/installation.md` |
| Which dataset class to use, loading patterns, splitting | `references/datasets.md` |
| What prediction task to choose (mortality, readmission, drug rec, sleep…) | `references/tasks.md` |
| Picking a model architecture, model-specific arguments | `references/models.md` |
| Looking up or cross-mapping ICD/ATC/NDC/RxNorm/CCS codes, tokenizers | `references/medcode.md` |
| End-to-end recipes for common scenarios | `references/examples.md` |

For multi-step tasks (e.g., "build a drug recommendation pipeline on MIMIC-IV"), read `tasks.md` + `models.md` + `examples.md` together, since they cross-reference each other.

A note on style

Write minimal, idiomatic PyHealth. The library is opinionated; lean into its abstractions instead of reimplementing them in raw PyTorch. If you find yourself writing a custom training loop, ask whether `Trainer` would do the job. It almost always will, and it handles checkpointing, logging, and best-model selection for free.
When the user has private MIMIC access, point them at the local CSV root; for demos and learning, the synthetic MIMIC-III bucket (`https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/`) is fine and works without credentialing.