pyhealth

PyHealth

PyHealth (https://pyhealth.dev/) is a Python toolkit for clinical deep learning. It provides a unified, modular pipeline across electronic health records (EHR), physiological signals, and medical imaging.

The library is built around a 5-stage pipeline:

Dataset → Task → Model → Trainer → Metrics

Each stage is replaceable and the interfaces between stages are stable. Code that follows this pipeline shape composes well; code that bypasses it usually fights the library.

When to use this skill

Use this skill whenever the user is doing clinical/healthcare ML and any of the following are true:

They mention PyHealth, MIMIC-III/IV, eICU, OMOP-CDM, EHRShot, SleepEDF, SHHS, ISRUC, COVID19-CXR, ChestX-ray14, TUEV/TUAB.
They want to predict mortality, readmission, length of stay, drug recommendations, sleep stages, ICD codes, EEG events, or de-identification.
They need to look up or cross-map medical codes (ICD-9-CM, ICD-10-CM, ATC, NDC, RxNorm, CCS).
They have EHR-shaped data and want to train a clinical model without writing the plumbing themselves.

PyHealth is the right tool when the workflow fits its 5 stages. If the user just wants generic PyTorch on tabular data, this skill is not necessary.

Installation (uv)

PyHealth 2.0 requires Python ≥ 3.12, < 3.14. Use `uv` for environment management: it is faster and reproducible.

```bash
# Create a project with the right Python
uv init my-pyhealth-project
cd my-pyhealth-project
uv python pin 3.12

# Add PyHealth (this also pulls in PyTorch and friends)
uv add pyhealth

# Run scripts inside the env
uv run python train.py
```

For a one-off script without a project, use `uv run --with pyhealth python script.py`. For the legacy 1.x line (Python 3.9+), `uv add pyhealth==1.16`. Detailed install notes, MIMIC access, and GPU/CPU device tips are in `references/installation.md`.
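As a quick sanity check before installing, the version bounds stated above can be tested from Python itself. This is a minimal stdlib-only sketch; the function name `python_is_supported` and the two constants are illustrative names chosen here, not part of PyHealth or `uv`.

```python
import sys

# Bounds copied from the requirement stated above: Python >= 3.12 and < 3.14.
MIN_PYTHON = (3, 12)
MAX_PYTHON_EXCLUSIVE = (3, 14)


def python_is_supported(version=None):
    """Return True if the (major, minor) pair lies in the supported range.

    `version` may be any tuple starting with (major, minor); when omitted,
    the running interpreter's version is used.
    """
    major, minor = tuple(version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) < MAX_PYTHON_EXCLUSIVE


# Report whether the current interpreter falls inside the range.
print("supported interpreter:", python_is_supported())
```

Running this with the interpreter that `uv python pin 3.12` selects should report a supported version; an older interpreter suggests falling back to the legacy 1.x line mentioned above.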

The 5-stage pipeline

A complete pipeline typically fits in under 30 lines. This is the canonical shape; start here and modify pieces:

```python
from pyhealth.datasets import MIMIC3Dataset, split_by_patient, get_dataloader
from pyhealth.tasks import MortalityPredictionMIMIC3
from pyhealth.models import Transformer
from pyhealth.trainer import Trainer
from pyhealth.metrics.binary import binary_metrics_fn

# 1. Dataset: raw patient registry
base = MIMIC3Dataset(
    root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/",
    tables=["DIAGNOSES_ICD", "PROCEDURES_ICD", "PRESCRIPTIONS"],
)

# 2. Task: converts patients into supervised samples
samples = base.set_task(MortalityPredictionMIMIC3())

# 3. Split + DataLoaders (split by patient to avoid leakage)
train_ds, val_ds, test_ds = split_by_patient(samples, [0.8, 0.1, 0.1])
train_loader = get_dataloader(train_ds, batch_size=32, shuffle=True)
val_loader = get_dataloader(val_ds, batch_size=32, shuffle=False)
test_loader = get_dataloader(test_ds, batch_size=32, shuffle=False)

# 4. Model: must be passed the SampleDataset, not the BaseDataset
model = Transformer(dataset=samples)

# 5. Train + evaluate
trainer = Trainer(model=model)
trainer.train(
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    epochs=50,
    monitor="pr_auc",
)

y_true, y_prob, _ = trainer.inference(test_loader)
print(binary_metrics_fn(y_true, y_prob, metrics=["pr_auc", "roc_auc"]))
```

A copy-pasteable starter is in `assets/starter_pipeline.py`.
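Step 3 of the pipeline splits by patient to avoid leakage. To make that idea concrete, here is a stdlib-only sketch of a patient-level split. It is not PyHealth's implementation of `split_by_patient`; the function name `split_indices_by_patient`, its arguments, and the toy patient IDs are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_indices_by_patient(patient_ids, ratios, seed=0):
    """Split sample indices so each patient lands in exactly one partition.

    patient_ids: one patient identifier per sample, in sample order.
    ratios: fraction of patients per partition (e.g. train/val/test), summing to 1.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    # Group sample indices under the patient they belong to.
    by_patient = defaultdict(list)
    for index, patient_id in enumerate(patient_ids):
        by_patient[patient_id].append(index)

    # Shuffle patients (not samples), with a fixed seed for reproducibility.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Cut the shuffled patient list at cumulative ratio boundaries.
    partitions, start, cumulative = [], 0, 0.0
    for position, ratio in enumerate(ratios):
        cumulative += ratio
        is_last = position == len(ratios) - 1
        end = len(patients) if is_last else int(round(cumulative * len(patients)))
        chosen = patients[start:end]
        partitions.append(sorted(idx for p in chosen for idx in by_patient[p]))
        start = end
    return partitions


# Toy data: eight samples from five hypothetical patients.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]
train_idx, val_idx, test_idx = split_indices_by_patient(patient_ids, [0.6, 0.2, 0.2])

# No patient appears in more than one partition.
train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
print("overlap between train and test:", train_patients & test_patients)
```

A sample-level shuffle of the same eight indices could place one `p3` sample in train and another in test, which is exactly the leakage the pipeline comment warns about.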

Critical things to get right

These are the mistakes that PyHealth code most commonly trips on. Internalize them before writing pipelines:

Models take a `SampleDataset`, not a `BaseDataset`. `MIMIC3Dataset(...)` returns a `BaseDataset` (a queryable patient registry). Only after `.set_task(task)` do you get a `SampleDataset`, which is what models, splitters, and DataLoaders expect. If you pass `base` to a model, it will fail or behave wrong.
Always split by patient (or visit), not by sample. Random sample-level splits leak information across train/test because the same patient can appear in both. Use `split_by_patient` for patient-level prediction, `split_by_visit` only when visits are independent.
Match the task to the dataset. Tasks are dataset-specific: `MortalityPredictionMIMIC3` won't work on MIMIC-IV; use `MortalityPredictionMIMIC4` or `InHospitalMortalityMIMIC4`. The full mapping is in `references/tasks.md`.
Pick `monitor` to match the task type. For binary classification use `"pr_auc"` or `"roc_auc"`. For multilabel (drug rec) use `"pr_auc_samples"` or `"jaccard_samples"`. For multiclass use `"accuracy"` or `"f1_macro"`. Wrong monitor → checkpoint selection saves the wrong epoch.
MIMIC-IV uses `ehr_root=`, not `root=`. This is the one inconsistency in the dataset constructors.
For reproducible work, point `cache_dir=` somewhere persistent. PyHealth caches the parsed dataset; without `cache_dir`, you re-parse every run.
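The `monitor` rule above can be captured in a small lookup that fails early on a mismatch. This is a hedged sketch: the metric names are the ones listed above, while the task-type keys (`"binary"`, `"multilabel"`, `"multiclass"`) and the helper names are labels invented for this example rather than PyHealth identifiers.

```python
# Metric names come from the list above; the dictionary keys are labels
# chosen for this sketch and are not PyHealth constants.
MONITORS_BY_TASK_TYPE = {
    "binary": ("pr_auc", "roc_auc"),
    "multilabel": ("pr_auc_samples", "jaccard_samples"),
    "multiclass": ("accuracy", "f1_macro"),
}


def allowed_monitors(task_type):
    """Return the monitor names suggested for a task type."""
    try:
        return MONITORS_BY_TASK_TYPE[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None


def check_monitor(task_type, monitor):
    """Return `monitor` unchanged, or raise if it does not fit the task type."""
    allowed = allowed_monitors(task_type)
    if monitor not in allowed:
        raise ValueError(
            f"monitor {monitor!r} does not match task type {task_type!r}; "
            f"expected one of {allowed}"
        )
    return monitor


# Validate the choice before it is handed to a training call.
monitor = check_monitor("multilabel", "jaccard_samples")
print("monitoring:", monitor)
```

Checking the value up front is cheaper than discovering after a long training run that checkpoint selection tracked a metric that does not suit the task.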

How to use this skill

PyHealth has a large API surface — there's no point loading it all at once. Read the reference file that matches the user's task:

If the user is asking about…	Read
Installing, env setup, MIMIC access, GPU	`references/installation.md`
Which dataset class to use, loading patterns, splitting	`references/datasets.md`
What prediction task to choose (mortality, readmission, drug rec, sleep…)	`references/tasks.md`
Picking a model architecture, model-specific arguments	`references/models.md`
Looking up or cross-mapping ICD/ATC/NDC/RxNorm/CCS codes, tokenizers	`references/medcode.md`
End-to-end recipes for common scenarios	`references/examples.md`

For multi-step tasks (e.g., "build a drug recommendation pipeline on MIMIC-IV"), read `tasks.md`, `models.md`, and `examples.md` together; they cross-reference each other.

A note on style

Write minimal, idiomatic PyHealth. The library is opinionated; lean into its abstractions instead of reimplementing them in raw PyTorch. If you find yourself writing a custom training loop, ask whether `Trainer` would do the job: it almost always will, and it handles checkpointing, logging, and best-model selection for free.

When the user has private MIMIC access, point them at the local CSV root; for demos and learning, the synthetic MIMIC-III bucket (`https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/`) is fine and works without credentialing.
