Imaging Data Commons

Overview

Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication is required for data access.

Primary tool: `idc-index` (GitHub)

Check the current data scale for the latest version:

```python
from idc_index import IDCClient

client = IDCClient()

# Get the IDC data version
print(client.get_idc_version())

# Get collection count and total series
stats = client.sql_query("""
    SELECT
        COUNT(DISTINCT collection_id) AS collections,
        COUNT(DISTINCT analysis_result_id) AS analysis_results,
        COUNT(DISTINCT PatientID) AS patients,
        COUNT(DISTINCT StudyInstanceUID) AS studies,
        COUNT(DISTINCT SeriesInstanceUID) AS series,
        SUM(instanceCount) AS instances,
        SUM(series_size_MB) / 1000000 AS size_TB
    FROM index
""")
print(stats)
```

**Core workflow:**
1. Query metadata → `client.sql_query()`
2. Download DICOM files → `client.download_from_selection()`
3. Visualize in browser → `client.get_viewer_URL(seriesInstanceUID=...)`

When to Use This Skill

- Finding publicly available radiology (CT, MR, PET) or pathology (slide microscopy) images
- Selecting image subsets by cancer type, modality, anatomical site, or other metadata
- Downloading DICOM data from IDC
- Checking data licenses before use in research or commercial applications
- Visualizing medical images in a browser without local DICOM viewer software

IDC Data Model

IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):

- `collection_id`: Groups patients by disease, modality, or research focus (e.g., `tcga_luad`, `nlst`). A patient belongs to exactly one collection.
- `analysis_result_id`: Identifies derived objects (segmentations, annotations, radiomics features) across one or more original collections.

Use `collection_id` to find original imaging data, which may include annotations deposited along with the images; use `analysis_result_id` to find AI-generated or expert annotations.

Key identifiers for queries:

| Identifier | Scope | Use for |
|------------|-------|---------|
| `collection_id` | Dataset grouping | Filtering by project/study |
| `PatientID` | Patient | Grouping images by patient |
| `StudyInstanceUID` | DICOM study | Grouping of related series, visualization |
| `SeriesInstanceUID` | DICOM series | Grouping of related instances, downloads, visualization |
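
To make the grouping levels concrete, here is a minimal offline sketch that nests series-level rows into collection → patient → study → series. The rows and every identifier in them are made up for illustration; in practice the rows would come from the primary index, where one row is one DICOM series.

```python
from collections import defaultdict

# Toy rows standing in for the primary index (1 row = 1 DICOM series).
# All identifiers below are invented for this illustration.
rows = [
    {"collection_id": "toy_a", "PatientID": "P1", "StudyInstanceUID": "1.1", "SeriesInstanceUID": "1.1.1"},
    {"collection_id": "toy_a", "PatientID": "P1", "StudyInstanceUID": "1.1", "SeriesInstanceUID": "1.1.2"},
    {"collection_id": "toy_a", "PatientID": "P2", "StudyInstanceUID": "2.1", "SeriesInstanceUID": "2.1.1"},
    {"collection_id": "toy_b", "PatientID": "P3", "StudyInstanceUID": "3.1", "SeriesInstanceUID": "3.1.1"},
]


def build_hierarchy(series_rows):
    """Nest series rows as collection -> patient -> study -> [series UIDs]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for row in series_rows:
        study = tree[row["collection_id"]][row["PatientID"]][row["StudyInstanceUID"]]
        study.append(row["SeriesInstanceUID"])
    return tree


hierarchy = build_hierarchy(rows)
for collection, patients in hierarchy.items():
    n_series = sum(len(uids) for studies in patients.values() for uids in studies.values())
    print(f"{collection}: {len(patients)} patients, {n_series} series")
```

The same nesting is what the default download directory layout mirrors, so it can help when reasoning about where files will land.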

Index Tables

The `idc-index` package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.

Important: Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types; always query it when writing SQL or exploring the data structure.

Available Tables

| Table | Row Granularity | Loaded | Description |
|-------|-----------------|--------|-------------|
| `index` | 1 row = 1 DICOM series | Auto | Primary metadata for all current IDC data |
| `prior_versions_index` | 1 row = 1 DICOM series | Auto | Series from previous IDC releases; for downloading deprecated data |
| `collections_index` | 1 row = 1 collection | fetch_index() | Collection-level metadata and descriptions |
| `analysis_results_index` | 1 row = 1 analysis result collection | fetch_index() | Metadata about derived datasets (annotations, segmentations) |
| `clinical_index` | 1 row = 1 clinical data column | fetch_index() | Dictionary mapping clinical table columns to collections |
| `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
| `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
| `seg_index` | 1 row = 1 DICOM Segmentation series | fetch_index() | Segmentation metadata: algorithm, segment count, reference to source image series |

- Auto = loaded automatically when `IDCClient()` is instantiated
- fetch_index() = requires `client.fetch_index("table_name")` to load
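
Because some tables load automatically and others need an explicit fetch, a small guard can avoid redundant calls. The sketch below assumes that `client.indices_overview[name]["installed"]` reflects whether a table is already loaded. `ensure_index` and `_StubClient` are hypothetical names introduced here so the example runs offline, and the stub's behaviour after a fetch is an assumption about the real client.

```python
def ensure_index(client, table_name):
    """Fetch an on-demand index table only if it is not installed yet.

    Returns True when a fetch was triggered, False when the table was
    already available. Relies on the 'installed' flag in indices_overview.
    """
    overview = client.indices_overview
    if table_name not in overview:
        raise KeyError(f"Unknown index table: {table_name!r}")
    if not overview[table_name]["installed"]:
        client.fetch_index(table_name)
        return True
    return False


class _StubClient:
    """Hypothetical stand-in for IDCClient, used only to run the sketch offline."""

    def __init__(self):
        self.indices_overview = {
            "index": {"installed": True},
            "sm_index": {"installed": False},
        }
        self.fetched = []

    def fetch_index(self, name):
        # Record the call and mark the table as installed (assumed behaviour).
        self.fetched.append(name)
        self.indices_overview[name]["installed"] = True


stub = _StubClient()
print(ensure_index(stub, "index"))     # already loaded, no fetch
print(ensure_index(stub, "sm_index"))  # triggers a fetch
```

With a real client, the same function would be called as `ensure_index(client, "sm_index")` before querying that table.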

Joining Tables

Key columns are not explicitly labeled. The following is a subset of columns that can be used in joins.

| Join Column | Tables | Use Case |
|-------------|--------|----------|
| `collection_id` | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data |
| `SeriesInstanceUID` | index, prior_versions_index, sm_index, sm_instance_index | Link series across tables; connect to slide microscopy details |
| `StudyInstanceUID` | index, prior_versions_index | Link studies across current and historical data |
| `PatientID` | index, prior_versions_index | Link patients across current and historical data |
| `analysis_result_id` | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) |
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
| `Modality` | index, prior_versions_index | Filter by imaging modality |
| `SeriesInstanceUID` | index, seg_index | Link segmentation series to its index metadata |
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |

Note: `Subjects`, `Updated`, and `Description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).

Example joins:

```python
from idc_index import IDCClient

client = IDCClient()

# Join index with collections_index to get cancer types
client.fetch_index("collections_index")
result = client.sql_query("""
    SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations
    FROM index i
    JOIN collections_index c ON i.collection_id = c.collection_id
    WHERE i.Modality = 'MR'
    LIMIT 10
""")

# Join index with sm_index for slide microscopy details
client.fetch_index("sm_index")
result = client.sql_query("""
    SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
    FROM index i
    JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
    LIMIT 10
""")

# Join seg_index with index to find segmentations and their source images
client.fetch_index("seg_index")
result = client.sql_query("""
    SELECT
        s.SeriesInstanceUID AS seg_series,
        s.AlgorithmName,
        s.total_segments,
        src.collection_id,
        src.Modality AS source_modality,
        src.BodyPartExamined
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE s.AlgorithmType = 'AUTOMATIC'
    LIMIT 10
""")
```
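
The segmentation join is the least obvious one, because the join key has a different name on each side. The sketch below reproduces its shape offline with Python's built-in `sqlite3` and two toy tables. SQLite is only a stand-in here, not the engine behind `client.sql_query()`, and all rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# "index" is a reserved word in SQLite, so the toy table name is quoted.
con.executescript("""
    CREATE TABLE "index" (SeriesInstanceUID TEXT, collection_id TEXT, Modality TEXT);
    CREATE TABLE seg_index (
        SeriesInstanceUID TEXT,
        segmented_SeriesInstanceUID TEXT,
        total_segments INTEGER
    );
    INSERT INTO "index" VALUES
        ('1.1', 'toy_a', 'CT'),
        ('1.2', 'toy_a', 'SEG'),
        ('2.1', 'toy_b', 'MR');
    INSERT INTO seg_index VALUES ('1.2', '1.1', 3);
""")

# Each segmentation row points at its source image series through
# segmented_SeriesInstanceUID, which matches SeriesInstanceUID in the index.
rows = con.execute("""
    SELECT s.SeriesInstanceUID   AS seg_series,
           src.SeriesInstanceUID AS source_series,
           src.Modality          AS source_modality,
           s.total_segments
    FROM seg_index s
    JOIN "index" src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
""").fetchall()
print(rows)
```

The segmentation series `1.2` resolves to its CT source series `1.1`; the unrelated MR series does not appear because nothing in `seg_index` references it.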

Accessing Index Tables

**Via SQL (recommended for filtering/aggregation):**
```python
from idc_index import IDCClient

client = IDCClient()

# Query the primary index (always available)
results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")

# Fetch and query additional indices
client.fetch_index("collections_index")
collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")

client.fetch_index("analysis_results_index")
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
```

**As pandas DataFrames (direct access):**
```python
# Primary index (always available after client initialization)
df = client.index

# Fetch and access on-demand indices
client.fetch_index("sm_index")
sm_df = client.sm_index
```

Discovering Table Schemas (Essential for Query Writing)

The `indices_overview` dictionary contains complete schema information for all tables. Always consult this when writing queries or exploring data structure.

**DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute", or a reference to a DICOM tag). This lets you apply DICOM knowledge when querying: standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, and `BodyPartExamined` work as expected.

```python
from idc_index import IDCClient

client = IDCClient()

# List all available indices with descriptions
for name, info in client.indices_overview.items():
    print(f"\n{name}:")
    print(f"  Installed: {info['installed']}")
    print(f"  Description: {info['description']}")

# Get complete schema for a specific index (columns, types, descriptions)
schema = client.indices_overview["index"]["schema"]
print(f"\nTable: {schema['table_description']}")
print("\nColumns:")
for col in schema['columns']:
    desc = col.get('description', 'No description')
    # Description indicates if column is from DICOM attribute
    print(f"  {col['name']} ({col['type']}): {desc}")

# Find columns that are DICOM attributes (check description for "DICOM" reference)
dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()]
print(f"\nDICOM-sourced columns: {dicom_cols}")
```

**Alternative: use `get_index_schema()` method:**
```python
schema = client.get_index_schema("index")
# Returns same schema dict: {'table_description': ..., 'columns': [...]}
```

Key Columns in Primary `index` Table

Most common columns for queries (use `indices_overview` for the complete list and descriptions):

| Column | Type | DICOM | Description |
|--------|------|-------|-------------|
| `collection_id` | STRING | No | IDC collection identifier |
| `analysis_result_id` | STRING | No | If applicable, indicates which analysis results collection the given series is part of |
| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
| `PatientID` | STRING | Yes | Patient identifier |
| `StudyInstanceUID` | STRING | Yes | DICOM Study UID |
| `SeriesInstanceUID` | STRING | Yes | DICOM Series UID; use for downloads/viewing |
| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) |
| `BodyPartExamined` | STRING | Yes | Anatomical region |
| `SeriesDescription` | STRING | Yes | Description of the series |
| `Manufacturer` | STRING | Yes | Equipment manufacturer |
| `StudyDate` | STRING | Yes | Date study was performed |
| `PatientSex` | STRING | Yes | Patient sex |
| `PatientAge` | STRING | Yes | Patient age at time of study |
| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
| `series_size_MB` | FLOAT | No | Size of series in megabytes |
| `instanceCount` | INTEGER | No | Number of DICOM instances in series |

DICOM = Yes: The column value is extracted from the DICOM attribute with the same name. Refer to the DICOM standard for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
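
As an example of applying DICOM knowledge to column values, the sketch below parses two standard DICOM encodings: the DA date format (`YYYYMMDD`) and the AS age string (`nnnD`, `nnnW`, `nnnM`, `nnnY`). It assumes `StudyDate` and `PatientAge` arrive as strings in those encodings, which is worth confirming against the schema for the installed release. The helper names are hypothetical, and the month and year lengths are approximations.

```python
from datetime import date, datetime

# Approximate days per unit for the DICOM Age String (AS) suffixes.
_DAYS_PER_UNIT = {"D": 1.0, "W": 7.0, "M": 30.4375, "Y": 365.25}


def parse_study_date(value):
    """Parse a DICOM DA string (YYYYMMDD) into a date; None if empty or malformed."""
    if not value:
        return None
    try:
        return datetime.strptime(value.strip(), "%Y%m%d").date()
    except ValueError:
        return None


def parse_patient_age_years(value):
    """Convert a DICOM AS string such as '045Y' or '006M' to years; None if malformed."""
    if not value:
        return None
    value = value.strip().upper()
    if len(value) != 4 or not value[:3].isdigit() or value[3] not in _DAYS_PER_UNIT:
        return None
    days = int(value[:3]) * _DAYS_PER_UNIT[value[3]]
    return days / _DAYS_PER_UNIT["Y"]


print(parse_study_date("20050312"))
print(parse_patient_age_years("045Y"))
print(parse_patient_age_years("006M"))
```

Returning `None` for malformed input keeps the helpers usable with `Series.map()` on real data, where these attributes are often empty or non-conforming.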

Clinical Data Access

```python
from idc_index import IDCClient

client = IDCClient()

# Fetch clinical index (also downloads clinical data tables)
client.fetch_index("clinical_index")

# Query clinical index to find available tables and their columns
tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinical_index")

# Load a specific clinical table as DataFrame
clinical_df = client.get_clinical_table("table_name")
```

See `references/clinical_data_guide.md` for detailed workflows including value mapping patterns and joining clinical data with imaging.

Data Access Options

| Method | Auth Required | Best For |
|--------|---------------|----------|
| `idc-index` | No | Key queries and downloads (recommended) |
| IDC Portal | No | Interactive exploration, manual selection, browser-based download |
| BigQuery | Yes (GCP account) | Complex queries, full DICOM metadata |
| DICOMweb proxy | No | Tool integration via DICOMweb API |
| Cloud storage (S3/GCS) | No | Direct file access, bulk downloads, custom pipelines |

Cloud storage organization

IDC maintains all DICOM files in public cloud storage buckets mirrored between AWS S3 and Google Cloud Storage (GCS). Files are organized by CRDC UUIDs (not DICOM UIDs) to support versioning.

| Bucket (AWS / GCS) | License | Content |
|--------------------|---------|---------|
| `idc-open-data` / `idc-open-data` | No commercial restriction | >90% of IDC data |
| `idc-open-data-two` / `idc-open-idc1` | No commercial restriction | Collections with potential head scans |
| `idc-open-data-cr` / `idc-open-cr` | Commercial use restricted (CC BY-NC) | ~4% of data |

Files are stored as `<crdc_series_uuid>/<crdc_instance_uuid>.dcm`. Access is free (no egress fees) via AWS CLI, gsutil, or s5cmd with anonymous access. Use the `series_aws_url` column from the index for S3 URLs; GCS uses the same path structure.

See `references/cloud_storage_guide.md` for bucket details, access commands, UUID mapping, and versioning.
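
Since GCS uses the same path structure but two of the three buckets have different names, converting an S3 series URL to its GCS mirror comes down to a bucket lookup. The sketch below hard-codes the bucket pairs from the table above. The function name is hypothetical, the UUID is a placeholder, and carrying the trailing `/*` over unchanged is an assumption.

```python
# AWS bucket -> GCS bucket, copied from the bucket table in this section.
AWS_TO_GCS_BUCKET = {
    "idc-open-data": "idc-open-data",
    "idc-open-data-two": "idc-open-idc1",
    "idc-open-data-cr": "idc-open-cr",
}


def s3_to_gcs_url(s3_url):
    """Rewrite an s3:// series URL to the mirrored gs:// location."""
    prefix = "s3://"
    if not s3_url.startswith(prefix):
        raise ValueError(f"Not an S3 URL: {s3_url!r}")
    bucket, _, path = s3_url[len(prefix):].partition("/")
    if bucket not in AWS_TO_GCS_BUCKET:
        raise ValueError(f"Unknown IDC bucket: {bucket!r}")
    return f"gs://{AWS_TO_GCS_BUCKET[bucket]}/{path}"


# Placeholder UUID, not a real series.
example = "s3://idc-open-data-two/00000000-0000-0000-0000-000000000000/*"
print(s3_to_gcs_url(example))
```

Raising on an unknown bucket is deliberate: a silent pass-through would produce a URL that looks valid but points nowhere if the bucket layout changes.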

DICOMweb access

IDC data is available via a DICOMweb interface (Google Cloud Healthcare API implementation) for integration with PACS systems and DICOMweb-compatible tools.

| Endpoint | Auth | Use Case |
|----------|------|----------|
| Public proxy | No | Testing, moderate queries, daily quota |
| Google Healthcare | Yes (GCP) | Production use, higher quotas |

See `references/dicomweb_guide.md` for endpoint URLs, code examples, supported operations, and implementation details.

Installation and Setup

**Required (for basic access):**
```bash
pip install --upgrade idc-index
```

**Important:** A new IDC data release always triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless an older version is needed for reproducibility.

**Tested with:** idc-index 0.11.7 (IDC data version v23)

**Optional (for data analysis):**
```bash
pip install pandas numpy pydicom
```

Core Capabilities

1. Data Discovery and Exploration

Discover what imaging collections and data are available in IDC:

```python
from idc_index import IDCClient

client = IDCClient()

# Get summary statistics from primary index
query = """
SELECT
    collection_id,
    COUNT(DISTINCT PatientID) AS patients,
    COUNT(DISTINCT SeriesInstanceUID) AS series,
    SUM(series_size_MB) AS size_mb
FROM index
GROUP BY collection_id
ORDER BY patients DESC
"""
collections_summary = client.sql_query(query)

# For richer collection metadata, use collections_index
client.fetch_index("collections_index")
collections_info = client.sql_query("""
    SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
    FROM collections_index
""")

# For analysis results (annotations, segmentations), use analysis_results_index
client.fetch_index("analysis_results_index")
analysis_info = client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
    FROM analysis_results_index
""")
```

**`collections_index`** provides curated metadata per collection: cancer types, tumor locations, species, subject counts, and supporting data types, without needing to aggregate from the primary index.

**`analysis_results_index`** lists derived datasets (AI segmentations, expert annotations, radiomics features) with their source collections and modalities.

2. Querying Metadata with SQL

Query the IDC mini-index using SQL to find specific datasets.

**First, explore available values for filter columns:**
```python
from idc_index import IDCClient

client = IDCClient()

# Check what Modality values exist
modalities = client.sql_query("""
    SELECT DISTINCT Modality, COUNT(*) AS series_count
    FROM index
    GROUP BY Modality
    ORDER BY series_count DESC
""")
print(modalities)

# Check what BodyPartExamined values exist for MR modality
body_parts = client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) AS series_count
    FROM index
    WHERE Modality = 'MR' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined
    ORDER BY series_count DESC
    LIMIT 20
""")
print(body_parts)
```

**Then query with validated filter values:**
```python
# Find breast MRI scans (use actual values from exploration above)
results = client.sql_query("""
    SELECT
        collection_id,
        PatientID,
        SeriesInstanceUID,
        Modality,
        SeriesDescription,
        license_short_name
    FROM index
    WHERE Modality = 'MR'
      AND BodyPartExamined = 'BREAST'
    LIMIT 20
""")

# Access results as pandas DataFrame
for idx, row in results.iterrows():
    print(f"Patient: {row['PatientID']}, Series: {row['SeriesInstanceUID']}")
```

**To filter by cancer type, join with `collections_index`:**
```python
client.fetch_index("collections_index")
results = client.sql_query("""
    SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
    FROM index i
    JOIN collections_index c ON i.collection_id = c.collection_id
    WHERE c.CancerTypes LIKE '%Breast%'
      AND i.Modality = 'MR'
    LIMIT 20
""")
```

Available metadata fields (use `client.indices_overview` for complete list):

- Identifiers: collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID
- Imaging: Modality, BodyPartExamined, Manufacturer, ManufacturerModelName
- Clinical: PatientAge, PatientSex, StudyDate
- Descriptions: StudyDescription, SeriesDescription
- Licensing: license_short_name

Note: Cancer type is in `collections_index.CancerTypes`, not in the primary `index` table.
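
When the filter values come from an earlier exploration step rather than being typed by hand, it can help to assemble the SQL string programmatically. The sketch below is one possible approach, not part of `idc-index`: `build_series_query` is a hypothetical helper that only supports exact-match filters on the primary index and escapes single quotes by doubling them.

```python
def build_series_query(filters, columns=("collection_id", "PatientID", "SeriesInstanceUID"), limit=20):
    """Build a SELECT over the primary index from exact-match filters.

    `filters` maps column names to string values. Single quotes inside
    values are doubled so they stay within the SQL string literal.
    """
    for name in list(filters) + list(columns):
        if not name.isidentifier():
            raise ValueError(f"Suspicious column name: {name!r}")
    clauses = []
    for column, value in filters.items():
        escaped = str(value).replace("'", "''")
        clauses.append(f"{column} = '{escaped}'")
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    return f"SELECT {', '.join(columns)} FROM index{where} LIMIT {int(limit)}"


sql = build_series_query({"Modality": "MR", "BodyPartExamined": "BREAST"})
print(sql)
```

The resulting string can be passed to `client.sql_query(sql)`. Rejecting non-identifier column names is a cheap safeguard when column names come from user input.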

3. Downloading DICOM Files

Download imaging data efficiently from IDC's cloud storage:

**Download entire collection:**
```python
from idc_index import IDCClient

client = IDCClient()

# Download small collection (RIDER Pilot ~1GB)
client.download_from_selection(
    collection_id="rider_pilot",
    downloadDir="./data/rider"
)
```

**Download specific series:**
```python
# First, query for series UIDs
series_df = client.sql_query("""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Modality = 'CT'
      AND BodyPartExamined = 'CHEST'
      AND collection_id = 'nlst'
    LIMIT 5
""")

# Download only those series
client.download_from_selection(
    seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
    downloadDir="./data/lung_ct"
)
```

**Custom directory structure:**

Default `dirTemplate`: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`

```python
# Simplified hierarchy (omit StudyInstanceUID level)
client.download_from_selection(
    collection_id="tcga_luad",
    downloadDir="./data",
    dirTemplate="%collection_id/%PatientID/%Modality"
)
# Results in: ./data/tcga_luad/TCGA-05-4244/CT/

# Flat structure (all files in one directory)
client.download_from_selection(
    seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
    downloadDir="./data/flat",
    dirTemplate=""
)
# Results in: ./data/flat/*.dcm
```
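
To preview where files would land before starting a large download, the `%Name` placeholders can be expanded by hand. The sketch below is an illustration of the template idea only and is not the `idc-index` implementation: `expand_dir_template` is a hypothetical helper and the UIDs in the sample metadata are invented.

```python
import re


def expand_dir_template(template, metadata):
    """Substitute %Name placeholders in `template` with values from `metadata`.

    Illustration only; the real library may resolve templates differently.
    """
    if not template:
        return ""  # empty template means a flat layout
    # Try longer names first so no name is shadowed by a shorter one sharing its prefix.
    names = sorted(metadata, key=len, reverse=True)
    pattern = re.compile("%(" + "|".join(re.escape(n) for n in names) + ")")
    return pattern.sub(lambda m: str(metadata[m.group(1)]), template)


# Sample series metadata; the UIDs are made up.
series = {
    "collection_id": "tcga_luad",
    "PatientID": "TCGA-05-4244",
    "StudyInstanceUID": "1.2.3",
    "SeriesInstanceUID": "1.2.3.4",
    "Modality": "CT",
}

short_path = expand_dir_template("%collection_id/%PatientID/%Modality", series)
default_path = expand_dir_template(
    "%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID", series
)
print(short_path)
print(default_path)
```

Note that a simplified template such as `%collection_id/%PatientID/%Modality` maps several series of the same modality to one directory, which is worth keeping in mind when later grouping files by series.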

Command-Line Download

The `idc download` command provides command-line access to download functionality without writing Python code. It is available after installing `idc-index`.

**Auto-detects input type:** manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).

```bash
# Download entire collection
idc download rider_pilot --download-dir ./data

# Download specific series by UID
idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data

# Download multiple items (comma-separated)
idc download "tcga_luad,tcga_lusc" --download-dir ./data

# Download from manifest file (auto-detected)
idc download manifest.txt --download-dir ./data
```

**Options:**

| Option | Description |
|--------|-------------|
| `--download-dir` | Output directory (default: current directory) |
| `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) |
| `--log-level` | Verbosity: debug, info, warning, error, critical |

**Manifest files:**

Manifest files contain S3 URLs (one per line) and can be:
- Exported from the IDC Portal after cohort selection
- Shared by collaborators for reproducible data access
- Generated programmatically from query results

Format (one S3 URL per line):
```
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
```

**Example: Generate manifest from Python query:**

```python
from idc_index import IDCClient

client = IDCClient()

# Query for series URLs
results = client.sql_query("""
    SELECT series_aws_url
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
""")

# Save as manifest file
with open('ct_manifest.txt', 'w') as f:
    for url in results['series_aws_url']:
        f.write(url + '\n')
```

Then download:
```bash
idc download ct_manifest.txt --download-dir ./ct_data
```
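
A manifest received from a collaborator can be sanity-checked before it is handed to `idc download`. The sketch below assumes the one-URL-per-line format described in this section (`s3://<bucket>/<crdc_series_uuid>/*`). `parse_manifest` is a hypothetical helper, and the two sample lines are the ones from the format example.

```python
def parse_manifest(text):
    """Return (bucket, series_uuid) pairs from manifest text, skipping blank lines."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if not line.startswith("s3://"):
            raise ValueError(f"Line {line_no}: expected an s3:// URL, got {line!r}")
        parts = line[len("s3://"):].split("/")
        if len(parts) < 2 or not parts[0] or not parts[1]:
            raise ValueError(f"Line {line_no}: missing bucket or series UUID")
        entries.append((parts[0], parts[1]))
    return entries


manifest_text = """\
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
"""
entries = parse_manifest(manifest_text)
print(entries)
```

Counting distinct buckets in the parsed entries is also a quick way to spot series stored in the commercially restricted bucket.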

4. Visualizing IDC Images

View DICOM data in the browser without downloading:

```python
from idc_index import IDCClient
import webbrowser

client = IDCClient()

# First query to get valid UIDs
results = client.sql_query("""
    SELECT SeriesInstanceUID, StudyInstanceUID
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
    LIMIT 1
""")

# View single series
viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID'])
webbrowser.open(viewer_url)

# View all series in a study (useful for multi-series exams like MRI protocols)
viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID'])
webbrowser.open(viewer_url)
```

The method automatically selects OHIF v3 for radiology or SLIM for slide microscopy. Viewing by study is useful when a DICOM Study contains multiple Series (e.g., T1, T2, DWI sequences from a single MRI session).

5. Understanding and Checking Licenses

Check data licensing before use (critical for commercial applications):

```python
from idc_index import IDCClient

client = IDCClient()

# Check licenses for all collections
query = """
SELECT DISTINCT
    collection_id,
    license_short_name,
    COUNT(DISTINCT SeriesInstanceUID) as series_count
FROM index
GROUP BY collection_id, license_short_name
ORDER BY collection_id
"""

licenses = client.sql_query(query)
print(licenses)
```

**License types in IDC:**
- **CC BY 4.0** / **CC BY 3.0** (~97% of data) - Allows commercial use with attribution
- **CC BY-NC 4.0** / **CC BY-NC 3.0** (~3% of data) - Non-commercial use only
- **Custom licenses** (rare) - Some collections have specific terms (e.g., NLM Terms and Conditions)

**Important:** Always check the license before using IDC data in publications or commercial applications. Each DICOM file is tagged with its specific license in metadata.
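The license families listed above can be turned into a small check to run before downloading. The sketch below is an illustration only and not part of idc-index: the helper name `allows_commercial_use` is hypothetical, and it assumes `license_short_name` values follow the "CC BY ..." / "CC BY-NC ..." naming shown in this section. Anything it does not recognize, such as a custom license, is flagged for manual review rather than guessed.

```python
def allows_commercial_use(license_short_name):
    """Return True/False for recognized CC licenses, None when manual review is needed."""
    if not license_short_name:
        return None
    name = license_short_name.strip().upper()
    if not name.startswith("CC BY"):
        return None  # custom terms: read the license text
    # The token after "CC" holds the license elements, e.g. "BY" or "BY-NC"
    elements = name.split()[1].split("-")
    if "NC" in elements:
        return False
    if elements == ["BY"]:
        return True
    return None  # other variants are not covered in this section


for example in ["CC BY 4.0", "CC BY-NC 3.0", "NLM Terms and Conditions"]:
    print(example, "->", allows_commercial_use(example))
```

Returning `None` instead of `False` for unknown names keeps "not allowed" separate from "not checked", which matters when the result feeds an automated filter.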

Generating Citations for Attribution

The `source_DOI` column contains DOIs linking to publications describing how the data was generated. To satisfy attribution requirements, use `citations_from_selection()` to generate properly formatted citations:

```python
from idc_index import IDCClient

client = IDCClient()

# Get citations for a collection (APA format by default)
citations = client.citations_from_selection(collection_id="rider_pilot")
for citation in citations:
    print(citation)

# Get citations for specific series
results = client.sql_query("""
    SELECT SeriesInstanceUID
    FROM index
    WHERE collection_id = 'tcga_luad'
    LIMIT 5
""")
citations = client.citations_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values)
)

# Alternative format: BibTeX (for LaTeX documents)
bibtex_citations = client.citations_from_selection(
    collection_id="tcga_luad",
    citation_format=IDCClient.CITATION_FORMAT_BIBTEX
)
```

**Parameters:**
- `collection_id`: Filter by collection(s)
- `patientId`: Filter by patient ID(s)
- `studyInstanceUID`: Filter by study UID(s)
- `seriesInstanceUID`: Filter by series UID(s)
- `citation_format`: Use `IDCClient.CITATION_FORMAT_*` constants:
  - `CITATION_FORMAT_APA` (default) - APA style
  - `CITATION_FORMAT_BIBTEX` - BibTeX for LaTeX
  - `CITATION_FORMAT_JSON` - CSL JSON
  - `CITATION_FORMAT_TURTLE` - RDF Turtle

**Best practice:** When publishing results using IDC data, include the generated citations to properly attribute the data sources and satisfy license requirements.

6. Batch Processing and Filtering

Process large datasets efficiently with filtering:

```python
from idc_index import IDCClient
import pandas as pd

client = IDCClient()

# Find chest CT scans from GE scanners
query = """
SELECT
    SeriesInstanceUID,
    PatientID,
    collection_id,
    ManufacturerModelName
FROM index
WHERE Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
  AND Manufacturer = 'GE MEDICAL SYSTEMS'
  AND license_short_name = 'CC BY 4.0'
LIMIT 100
"""

results = client.sql_query(query)

# Save manifest for later
results.to_csv('lung_ct_manifest.csv', index=False)

# Download in batches to avoid timeout
batch_size = 10
for i in range(0, len(results), batch_size):
    batch = results.iloc[i:i+batch_size]
    client.download_from_selection(
        seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
        downloadDir=f"./data/batch_{i//batch_size}"
    )
```
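The batching loop above slices a DataFrame by position. The same idea can be written as a small library-free helper that works on any list of UIDs. This is a sketch under stated assumptions: `chunked` is a hypothetical name, and the UIDs below are made-up placeholders, not real series identifiers.

```python
def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Made-up placeholder UIDs for illustration
uids = [f"uid-{n}" for n in range(7)]
for batch_number, batch in enumerate(chunked(uids, 3)):
    print(f"./data/batch_{batch_number}", batch)
```

Each yielded batch could then be passed as `seriesInstanceUID=batch` to `client.download_from_selection()`, with the batch number used to build `downloadDir` as in the loop above.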

7. Advanced Queries with BigQuery

For queries requiring full DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. This requires a GCP account with billing enabled.

Quick reference:
- Dataset: `bigquery-public-data.idc_current.*`
- Main table: `dicom_all` (combined metadata)
- Full metadata: `dicom_metadata` (all DICOM tags)
- Private elements: `OtherElements` column (vendor-specific tags like diffusion b-values)

See `references/bigquery_guide.md` for setup, table schemas, query patterns, private element access, and cost optimization.

8. Tool Selection Guide

| Task | Tool | Reference |
|------|------|-----------|
| Programmatic queries & downloads | `idc-index` | This document |
| Interactive exploration | IDC Portal | https://portal.imaging.datacommons.cancer.gov/ |
| Complex metadata queries | BigQuery | `references/bigquery_guide.md` |
| 3D visualization & analysis | SlicerIDCBrowser | https://github.com/ImagingDataCommons/SlicerIDCBrowser |

**Default choice:** Use `idc-index` for most tasks (no auth, easy API, batch downloads).

9. Integration with Analysis Pipelines

Integrate IDC data into imaging analysis workflows:

**Read downloaded DICOM files:**
```python
import pydicom
import os

# Read DICOM files from downloaded series
series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."

dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
               if f.endswith('.dcm')]

# Load first image
ds = pydicom.dcmread(dicom_files[0])
print(f"Patient ID: {ds.PatientID}")
print(f"Modality: {ds.Modality}")
print(f"Image shape: {ds.pixel_array.shape}")
```

**Build 3D volume from CT series:**
```python
import pydicom
import numpy as np
from pathlib import Path

def load_ct_series(series_path):
    """Load CT series as 3D numpy array"""
    files = sorted(Path(series_path).glob('*.dcm'))
    slices = [pydicom.dcmread(str(f)) for f in files]

    # Sort by slice location (z component of ImagePositionPatient; assumes axial slices)
    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))

    # Stack into 3D array
    volume = np.stack([s.pixel_array for s in slices])

    return volume, slices[0]  # Return volume and first slice for metadata

volume, metadata = load_ct_series("./data/lung_ct/series_dir")
print(f"Volume shape: {volume.shape}")  # (z, y, x)
```

**Integrate with SimpleITK:**
```python
import SimpleITK as sitk
from pathlib import Path

# Read DICOM series
series_path = "./data/ct_series"
reader = sitk.ImageSeriesReader()
dicom_names = reader.GetGDCMSeriesFileNames(series_path)
reader.SetFileNames(dicom_names)
image = reader.Execute()

# Apply processing
# CurvatureFlow expects a floating-point image, and CT series are usually
# stored as integers, so cast first
image = sitk.Cast(image, sitk.sitkFloat32)
smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)

# Save as NIfTI
sitk.WriteImage(smoothed, "processed_volume.nii.gz")
```
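Sorting by `ImagePositionPatient[2]`, as `load_ct_series` does, works for axial acquisitions. A more general ordering, sketched below, projects each slice position onto the slice normal computed from `ImageOrientationPatient` (both are standard DICOM attributes). The helper names are hypothetical and the sample values are made up; the functions take plain lists so the sketch runs without pydicom.

```python
def slice_normal(orientation):
    """Cross product of the row and column direction cosines (ImageOrientationPatient)."""
    rx, ry, rz, cx, cy, cz = (float(v) for v in orientation)
    return (ry * cz - rz * cy, rz * cx - rx * cz, rx * cy - ry * cx)


def position_along_normal(position, orientation):
    """Signed distance of a slice origin (ImagePositionPatient) along the slice normal."""
    normal = slice_normal(orientation)
    return sum(float(p) * n for p, n in zip(position, normal))


# Made-up axial example: row direction +x, column direction +y, so the normal is +z
axial = [1, 0, 0, 0, 1, 0]
positions = [[-100, -100, 5.0], [-100, -100, -2.5], [-100, -100, 0.0]]
ordered = sorted(positions, key=lambda p: position_along_normal(p, axial))
print([p[2] for p in ordered])
```

With pydicom datasets, the sort key would become `lambda s: position_along_normal(s.ImagePositionPatient, s.ImageOrientationPatient)`.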

Common Use Cases

Use Case 1: Find and Download Lung CT Scans for Deep Learning

**Objective:** Build training dataset of lung CT scans from NLST collection

**Steps:**
```python
from idc_index import IDCClient

client = IDCClient()

# 1. Query for lung CT scans with specific criteria
query = """
SELECT
    PatientID,
    SeriesInstanceUID,
    SeriesDescription
FROM index
WHERE collection_id = 'nlst'
  AND Modality = 'CT'
  AND BodyPartExamined = 'CHEST'
  AND license_short_name = 'CC BY 4.0'
ORDER BY PatientID
LIMIT 100
"""

results = client.sql_query(query)
print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")

# 2. Download data organized by patient
client.download_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values),
    downloadDir="./training_data",
    dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
)

# 3. Save manifest for reproducibility
results.to_csv('training_manifest.csv', index=False)
```

Use Case 2: Query Brain MRI by Manufacturer for Quality Study

**Objective:** Compare image quality across different MRI scanner manufacturers

**Steps:**
```python
from idc_index import IDCClient
import pandas as pd

client = IDCClient()

# Query for brain MRI grouped by manufacturer
query = """
SELECT
    Manufacturer,
    ManufacturerModelName,
    COUNT(DISTINCT SeriesInstanceUID) as num_series,
    COUNT(DISTINCT PatientID) as num_patients
FROM index
WHERE Modality = 'MR'
  AND BodyPartExamined LIKE '%BRAIN%'
GROUP BY Manufacturer, ManufacturerModelName
HAVING num_series >= 10
ORDER BY num_series DESC
"""

manufacturers = client.sql_query(query)
print(manufacturers)

# Download sample from each manufacturer for comparison
for _, row in manufacturers.head(3).iterrows():
    mfr = row['Manufacturer']
    model = row['ManufacturerModelName']

    query = f"""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Manufacturer = '{mfr}'
      AND ManufacturerModelName = '{model}'
      AND Modality = 'MR'
      AND BodyPartExamined LIKE '%BRAIN%'
    LIMIT 5
    """

    series = client.sql_query(query)
    client.download_from_selection(
        seriesInstanceUID=list(series['SeriesInstanceUID'].values),
        downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
    )
```
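The loop above interpolates manufacturer and model names directly into SQL. A name containing a single quote would break the query, so one option is to escape values first. The sketch below uses standard SQL quoting (doubling single quotes); `sql_literal` and `build_series_query` are hypothetical helpers rather than idc-index functions, and the manufacturer and model names are made up.

```python
def sql_literal(value):
    """Quote a Python string as a SQL string literal by doubling single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_series_query(manufacturer, model, limit=5):
    """Build the per-manufacturer query from this use case with escaped values."""
    return (
        "SELECT SeriesInstanceUID FROM index "
        f"WHERE Manufacturer = {sql_literal(manufacturer)} "
        f"AND ManufacturerModelName = {sql_literal(model)} "
        "AND Modality = 'MR' AND BodyPartExamined LIKE '%BRAIN%' "
        f"LIMIT {int(limit)}"
    )


# Made-up names; the model name contains quotes on purpose
print(build_series_query("ACME Imaging", "Model 'X'"))
```

The resulting string can be passed to `client.sql_query()` in place of the f-string built inside the loop.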

Use Case 3: Visualize Series Without Downloading

**Objective:** Preview imaging data before committing to download

```python
from idc_index import IDCClient
import webbrowser

client = IDCClient()

series_list = client.sql_query("""
    SELECT SeriesInstanceUID, PatientID, SeriesDescription
    FROM index
    WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
    LIMIT 10
""")

# Preview each in browser
for _, row in series_list.iterrows():
    viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
    print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
    print(f"  View at: {viewer_url}")
    # webbrowser.open(viewer_url)  # Uncomment to open automatically
```

For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration.

Use Case 4: License-Aware Batch Download for Commercial Use

**Objective:** Download only CC-BY licensed data suitable for commercial applications

**Steps:**
```python
from idc_index import IDCClient

client = IDCClient()

# Query ONLY for CC BY licensed data (allows commercial use with attribution)
query = """
SELECT
    SeriesInstanceUID,
    collection_id,
    PatientID,
    Modality
FROM index
WHERE license_short_name LIKE 'CC BY%'
  AND license_short_name NOT LIKE '%NC%'
  AND Modality IN ('CT', 'MR')
  AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
LIMIT 200
"""

cc_by_data = client.sql_query(query)

print(f"Found {len(cc_by_data)} CC BY licensed series")
print(f"Collections: {cc_by_data['collection_id'].unique()}")

# Download the license-filtered selection
client.download_from_selection(
    seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
    downloadDir="./commercial_dataset",
    dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
)

# Save the manifest of the license-filtered selection
cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
```

Best Practices

- **Check licenses before use** - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
- **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
- **Start with small queries** - Use a `LIMIT` clause when exploring to avoid long downloads and to understand the data structure
- **Use mini-index for simple queries** - Only use BigQuery when you need comprehensive metadata or complex JOINs
- **Organize downloads with dirTemplate** - Use meaningful directory structures like `%collection_id/%PatientID/%Modality`
- **Cache query results** - Save DataFrames to CSV files to avoid re-querying and ensure reproducibility
- **Estimate size first** - Check collection size before downloading; some collections are terabytes in size
- **Save manifests** - Always save query results with Series UIDs for reproducibility and data provenance
- **Read documentation** - IDC data structure and metadata fields are documented at https://learn.canceridc.dev/
- **Use IDC forum** - Search for existing questions and answers, or ask the IDC maintainers and users, at https://discourse.canceridc.dev/

Troubleshooting

**Issue:** `ModuleNotFoundError: No module named 'idc_index'`
- **Cause:** idc-index package not installed
- **Solution:** Install with `pip install --upgrade idc-index`
**Issue:** Download fails with connection timeout
- **Cause:** Network instability or large download size
- **Solution:**
  - Download smaller batches (e.g., 10-20 series at a time)
  - Check network connection
  - Use `dirTemplate` to organize downloads by batch
  - Implement retry logic with delays
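The last suggestion, retrying with delays, could look like the sketch below. `retry_with_backoff` is a hypothetical helper, not an idc-index feature; it wraps any callable, so a real call would pass something like `lambda: client.download_from_selection(...)`. The demo uses a stand-in function that fails twice so the sketch runs offline, and it records delays instead of sleeping.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:  # a real version would catch narrower network errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky download: fails twice, then succeeds
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "done"


delays = []
result = retry_with_backoff(flaky_download, attempts=4, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable; in normal use the default `time.sleep` applies the delays.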

**Issue:** `BigQuery quota exceeded` or billing errors
- **Cause:** BigQuery requires billing-enabled GCP project
- **Solution:** Use idc-index mini-index for simple queries (no billing required), or see `references/bigquery_guide.md` for cost optimization tips

**Issue:** Series UID not found or no data returned
- **Cause:** Typo in UID, data not in current IDC version, or wrong field name
- **Solution:**
  - Check if data is in current IDC version (some old data may be deprecated)
  - Use `LIMIT 5` to test query first
  - Check field names against metadata schema documentation

**Issue:** Downloaded DICOM files won't open
- **Cause:** Corrupted download or incompatible viewer
- **Solution:**
  - Check DICOM object type (Modality and SOPClassUID attributes) - some object types require specialized tools
  - Verify file integrity (check file sizes)
  - Use pydicom to validate: `pydicom.dcmread(file, force=True)`
  - Try a different DICOM viewer (3D Slicer, Horos, RadiAnt, QuPath)
  - Re-download the series
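A quick integrity screen needs no DICOM library: files in the DICOM Part 10 file format carry a 128-byte preamble followed by the four bytes `DICM`. The sketch below tests for that marker. `looks_like_dicom` is a hypothetical helper, and a file that passes can still be truncated or otherwise damaged, so this only screens out obvious failures. The demo writes two synthetic files, not real DICOM data.

```python
import os
import tempfile


def looks_like_dicom(path):
    """Return True if the file has the 'DICM' marker after the 128-byte preamble."""
    with open(path, "rb") as handle:
        header = handle.read(132)
    return len(header) == 132 and header[128:132] == b"DICM"


# Offline demo with two synthetic files (not real DICOM data)
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.dcm")
    bad = os.path.join(tmp, "bad.dcm")
    with open(good, "wb") as handle:
        handle.write(b"\x00" * 128 + b"DICM")
    with open(bad, "wb") as handle:
        handle.write(b"not a dicom file")
    good_ok = looks_like_dicom(good)
    bad_ok = looks_like_dicom(bad)

print(good_ok, bad_ok)
```

Files that fail this check are good candidates for re-downloading before trying other viewers.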


Common SQL Query Patterns

Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above.

Discover available filter values

```python
# What modalities exist?
client.sql_query("SELECT DISTINCT Modality FROM index")

# What body parts for a specific modality?
client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) as n
    FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined ORDER BY n DESC
""")

# What manufacturers for MR?
client.sql_query("""
    SELECT DISTINCT Manufacturer, COUNT(*) as n
    FROM index WHERE Modality = 'MR'
    GROUP BY Manufacturer ORDER BY n DESC
""")
```

Find annotations and segmentations

**Note:** Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.

```python
# Find ALL segmentations and structure sets by DICOM Modality
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
client.sql_query("""
    SELECT collection_id, Modality, COUNT(*) as series_count
    FROM index
    WHERE Modality IN ('SEG', 'RTSTRUCT')
    GROUP BY collection_id, Modality
    ORDER BY series_count DESC
""")

# Find segmentations for a specific collection (includes non-analysis-result items)
client.sql_query("""
    SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
    FROM index
    WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
""")

# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Collections, Modalities
    FROM analysis_results_index
""")

# Find analysis results for a specific source collection
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title
    FROM analysis_results_index
    WHERE Collections LIKE '%tcga_luad%'
""")

# Use seg_index for detailed DICOM Segmentation metadata
client.fetch_index("seg_index")

# Get segmentation statistics by algorithm
client.sql_query("""
    SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
    FROM seg_index
    WHERE AlgorithmName IS NOT NULL
    GROUP BY AlgorithmName, AlgorithmType
    ORDER BY seg_count DESC
    LIMIT 10
""")

# Find segmentations for specific source images (e.g., chest CT)
client.sql_query("""
    SELECT
        s.SeriesInstanceUID as seg_series,
        s.AlgorithmName,
        s.total_segments,
        s.segmented_SeriesInstanceUID as source_series
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
    LIMIT 10
""")

# Find TotalSegmentator results with source image context
client.sql_query("""
    SELECT
        seg_info.collection_id,
        COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
        SUM(s.total_segments) as total_segments
    FROM seg_index s
    JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
    WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
    GROUP BY seg_info.collection_id
    ORDER BY seg_count DESC
""")
```

Query slide microscopy data

```python
# sm_index has detailed metadata; join with index for collection_id
client.fetch_index("sm_index")
client.sql_query("""
    SELECT i.collection_id, COUNT(*) as slides,
           MIN(s.min_PixelSpacing_2sf) as min_resolution
    FROM sm_index s
    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
    GROUP BY i.collection_id
    ORDER BY slides DESC
""")
```

Estimate download size

```python
# Size for specific criteria
client.sql_query("""
    SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
    FROM index
    WHERE collection_id = 'nlst' AND Modality = 'CT'
""")
```
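Sums of `series_size_MB` are easier to judge in larger units. The helper below is a hypothetical convenience, assuming decimal units (1 GB = 1,000 MB, 1 TB = 1,000,000 MB), which matches the `/1000000 as size_TB` conversion used in the overview query of this document. The totals in the demo are made up.

```python
def format_size_mb(total_mb):
    """Format a size given in MB using decimal units (MB, GB, TB)."""
    if total_mb >= 1_000_000:
        return f"{total_mb / 1_000_000:.2f} TB"
    if total_mb >= 1_000:
        return f"{total_mb / 1_000:.2f} GB"
    return f"{total_mb:.2f} MB"


# Made-up totals for illustration, not real IDC figures
for total in (512.0, 48_250.0, 11_300_000.0):
    print(format_size_mb(total))
```

With a query result `df` from the size query above, this could be applied as `format_size_mb(df['total_mb'].iloc[0])`.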

Link to clinical data

```python
client.fetch_index("clinical_index")

# Find collections with clinical data and their tables
client.sql_query("""
    SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
    FROM clinical_index
    GROUP BY collection_id, table_name
    ORDER BY collection_id
""")
```

See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.

Related Skills

Resources

Schema Reference (Primary Source)

Always use `client.indices_overview` for current column schemas. This ensures accuracy with the installed idc-index version:

```python
# Get all column names and types for any table
schema = client.indices_overview["index"]["schema"]
columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']]
```
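Once a schema dictionary is in hand, plain Python is enough to search it. The sketch below assumes the structure used in the snippet above: a `columns` list whose entries have `name`, `type`, and an optional `description`. The sample schema is a small made-up stand-in so the code runs without idc-index, and `find_columns` is a hypothetical helper.

```python
def find_columns(schema, keyword):
    """Return (name, type) pairs for columns whose name or description mentions keyword."""
    needle = keyword.lower()
    return [
        (col["name"], col["type"])
        for col in schema["columns"]
        if needle in col["name"].lower() or needle in col.get("description", "").lower()
    ]


# Made-up stand-in for client.indices_overview["index"]["schema"]
sample_schema = {
    "table_description": "illustrative sample only",
    "columns": [
        {"name": "SeriesInstanceUID", "type": "STRING", "description": "DICOM SeriesInstanceUID"},
        {"name": "collection_id", "type": "STRING", "description": "short collection identifier"},
        {"name": "series_size_MB", "type": "FLOAT"},
    ],
}

print(find_columns(sample_schema, "dicom"))
```

Against a live client, the same call would take `client.indices_overview["index"]["schema"]` as its first argument.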

Reference Documentation

- clinical_data_guide.md - Clinical/tabular data navigation, value mapping, and joining with imaging data
- cloud_storage_guide.md - Direct cloud bucket access (S3/GCS), file organization, CRDC UUIDs, versioning, and reproducibility
- cli_guide.md - Complete idc-index command-line interface reference (`idc download`, `idc download-from-manifest`, `idc download-from-selection`)
- bigquery_guide.md - Advanced BigQuery usage guide for complex metadata queries
- dicomweb_guide.md - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details
- indices_reference - External documentation for index tables (may be ahead of the installed version)

External Links

- IDC Portal: https://portal.imaging.datacommons.cancer.gov/explore/
- Documentation: https://learn.canceridc.dev/
- Tutorials: https://github.com/ImagingDataCommons/IDC-Tutorials
- User Forum: https://discourse.canceridc.dev/
- idc-index GitHub: https://github.com/ImagingDataCommons/idc-index
- Citation: Fedorov, A., et al. "National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence." RadioGraphics 43.12 (2023). https://doi.org/10.1148/rg.230180

Skill Updates

This skill version is available in skill metadata. To check for updates:

- Visit the releases page
- Watch the repository on GitHub (Watch → Custom → Releases)
