imaging-data-commons
Imaging Data Commons
Overview
Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication is required for data access.

**Primary tool:** `idc-index` (GitHub)

**Check current data scale for the latest version:**
```python
from idc_index import IDCClient

client = IDCClient()

# Get IDC data version
print(client.get_idc_version())

# Get collection count and total series
stats = client.sql_query("""
    SELECT
        COUNT(DISTINCT collection_id) AS collections,
        COUNT(DISTINCT analysis_result_id) AS analysis_results,
        COUNT(DISTINCT PatientID) AS patients,
        COUNT(DISTINCT StudyInstanceUID) AS studies,
        COUNT(DISTINCT SeriesInstanceUID) AS series,
        SUM(instanceCount) AS instances,
        SUM(series_size_MB) / 1000000 AS size_TB
    FROM index
""")
print(stats)
```

**Core workflow:**
1. Query metadata → `client.sql_query()`
2. Download DICOM files → `client.download_from_selection()`
3. Visualize in browser → `client.get_viewer_URL(seriesInstanceUID=...)`

When to Use This Skill
- Finding publicly available radiology (CT, MR, PET) or pathology (slide microscopy) images
- Selecting image subsets by cancer type, modality, anatomical site, or other metadata
- Downloading DICOM data from IDC
- Checking data licenses before use in research or commercial applications
- Visualizing medical images in a browser without local DICOM viewer software
IDC Data Model
IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):
- **collection_id**: Groups patients by disease, modality, or research focus (e.g., `tcga_luad`, `nlst`). A patient belongs to exactly one collection.
- **analysis_result_id**: Identifies derived objects (segmentations, annotations, radiomics features) across one or more original collections.

Use `collection_id` to find original imaging data, which may include annotations deposited along with the images; use `analysis_result_id` to find AI-generated or expert annotations.

**Key identifiers for queries:**
| Identifier | Scope | Use for |
|---|---|---|
| `collection_id` | Dataset grouping | Filtering by project/study |
| `PatientID` | Patient | Grouping images by patient |
| `StudyInstanceUID` | DICOM study | Grouping of related series, visualization |
| `SeriesInstanceUID` | DICOM series | Grouping of related series, visualization |
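
The grouping levels described above can be pictured with a few toy rows. This is only an illustrative sketch: the identifier values are made up, and the helper names `build_hierarchy` and `collections_per_patient` are hypothetical, not part of `idc-index`.

```python
from collections import defaultdict

# Toy rows shaped like the primary index: one row per DICOM series.
# All identifier values below are invented for illustration.
rows = [
    {"collection_id": "nlst", "PatientID": "P1", "StudyInstanceUID": "1.1", "SeriesInstanceUID": "1.1.1"},
    {"collection_id": "nlst", "PatientID": "P1", "StudyInstanceUID": "1.1", "SeriesInstanceUID": "1.1.2"},
    {"collection_id": "nlst", "PatientID": "P1", "StudyInstanceUID": "1.2", "SeriesInstanceUID": "1.2.1"},
    {"collection_id": "tcga_luad", "PatientID": "P2", "StudyInstanceUID": "2.1", "SeriesInstanceUID": "2.1.1"},
]


def build_hierarchy(rows):
    """Nest series rows as collection -> patient -> study -> [series UIDs]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for r in rows:
        tree[r["collection_id"]][r["PatientID"]][r["StudyInstanceUID"]].append(
            r["SeriesInstanceUID"]
        )
    return tree


def collections_per_patient(rows):
    """Map each PatientID to the set of collections it appears in."""
    seen = defaultdict(set)
    for r in rows:
        seen[r["PatientID"]].add(r["collection_id"])
    return dict(seen)


tree = build_hierarchy(rows)
print(sorted(tree["nlst"]["P1"]))     # study UIDs for patient P1
print(collections_per_patient(rows))  # each patient maps to one collection
```

The second helper is a quick way to check the "exactly one collection per patient" property on any set of rows.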
Index Tables
The `idc-index` package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.

**Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types; always query it when writing SQL or exploring data structure.

Available Tables
| Table | Row Granularity | Loaded | Description |
|---|---|---|---|
| `index` | 1 row = 1 DICOM series | Auto | Primary metadata for all current IDC data |
| `prior_versions_index` | 1 row = 1 DICOM series | Auto | Series from previous IDC releases; for downloading deprecated data |
| `collections_index` | 1 row = 1 collection | fetch_index() | Collection-level metadata and descriptions |
| `analysis_results_index` | 1 row = 1 analysis result collection | fetch_index() | Metadata about derived datasets (annotations, segmentations) |
| `clinical_index` | 1 row = 1 clinical data column | fetch_index() | Dictionary mapping clinical table columns to collections |
| `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
| `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
| `seg_index` | 1 row = 1 DICOM Segmentation series | fetch_index() | Segmentation metadata: algorithm, segment count, reference to source image series |

**Auto** = loaded automatically when `IDCClient()` is instantiated

**fetch_index()** = requires `client.fetch_index("table_name")` to load
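
The difference between auto-loaded and on-demand tables suggests a small "fetch only if needed" pattern. The sketch below runs offline against `FakeClient`, a hypothetical stand-in for `IDCClient`; it assumes each `indices_overview` entry carries an `installed` flag and that fetching flips it, which should be verified against the real client. `ensure_index` is likewise an illustrative helper, not a library function.

```python
class FakeClient:
    """Hypothetical stand-in for IDCClient so the sketch runs offline."""

    def __init__(self):
        self.indices_overview = {
            "index": {"installed": True},      # auto-loaded table
            "sm_index": {"installed": False},  # on-demand table
        }
        self.fetch_calls = []

    def fetch_index(self, name):
        # Assumption: fetching marks the table as installed.
        self.fetch_calls.append(name)
        self.indices_overview[name]["installed"] = True


def ensure_index(client, name):
    """Fetch an on-demand index only if it is not already loaded.

    Returns True when a fetch was triggered, False otherwise.
    """
    info = client.indices_overview.get(name)
    if info is None:
        raise KeyError(f"Unknown index table: {name}")
    if not info["installed"]:
        client.fetch_index(name)
        return True
    return False


client = FakeClient()
print(ensure_index(client, "index"))     # auto-loaded, no fetch needed
print(ensure_index(client, "sm_index"))  # first use triggers a fetch
print(ensure_index(client, "sm_index"))  # already loaded on second use
```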
Joining Tables
Key columns are not explicitly labeled; the following is a subset that can be used in joins.
| Join Column | Tables | Use Case |
|---|---|---|
| `collection_id` | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data |
| `SeriesInstanceUID` | index, prior_versions_index, sm_index, sm_instance_index | Link series across tables; connect to slide microscopy details |
| `StudyInstanceUID` | index, prior_versions_index | Link studies across current and historical data |
| `PatientID` | index, prior_versions_index | Link patients across current and historical data |
| `analysis_result_id` | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) |
| `source_DOI` | index, analysis_results_index | Link by publication DOI |
| `crdc_series_uuid` | index, prior_versions_index | Link by CRDC unique identifier |
| `Modality` | index, prior_versions_index | Filter by imaging modality |
| `SeriesInstanceUID` | index, seg_index | Link segmentation series to its index metadata |
| `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |

**Note:** `Subjects`, `Updated`, and `Description` appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).

**Example joins:**
```python
from idc_index import IDCClient

client = IDCClient()

# Join index with collections_index to get cancer types
client.fetch_index("collections_index")
result = client.sql_query("""
    SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations
    FROM index i
    JOIN collections_index c ON i.collection_id = c.collection_id
    WHERE i.Modality = 'MR'
    LIMIT 10
""")

# Join index with sm_index for slide microscopy details
client.fetch_index("sm_index")
result = client.sql_query("""
    SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
    FROM index i
    JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
    LIMIT 10
""")

# Join seg_index with index to find segmentations and their source images
client.fetch_index("seg_index")
result = client.sql_query("""
    SELECT
        s.SeriesInstanceUID AS seg_series,
        s.AlgorithmName,
        s.total_segments,
        src.collection_id,
        src.Modality AS source_modality,
        src.BodyPartExamined
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE s.AlgorithmType = 'AUTOMATIC'
    LIMIT 10
""")
```
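
To see what the segmentation-to-source join does without downloading any index, here is a minimal offline sketch using Python's built-in `sqlite3` with invented rows. SQLite is only a stand-in for the SQL engine behind `client.sql_query()`, and the table is named `main_index` because `INDEX` is a reserved word in SQLite; only the join condition mirrors the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# main_index is a toy stand-in for the IDC index table; all rows are invented.
conn.executescript("""
    CREATE TABLE main_index (SeriesInstanceUID TEXT, collection_id TEXT, Modality TEXT);
    CREATE TABLE seg_index (SeriesInstanceUID TEXT, segmented_SeriesInstanceUID TEXT, AlgorithmType TEXT);

    INSERT INTO main_index VALUES
        ('ct.1',  'demo_collection', 'CT'),
        ('mr.1',  'demo_collection', 'MR'),
        ('seg.1', 'demo_collection', 'SEG');
    INSERT INTO seg_index VALUES
        ('seg.1', 'ct.1', 'AUTOMATIC');
""")

# Each segmentation row is matched to the image series it was derived from.
rows = conn.execute("""
    SELECT s.SeriesInstanceUID   AS seg_series,
           src.SeriesInstanceUID AS source_series,
           src.Modality          AS source_modality
    FROM seg_index s
    JOIN main_index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE s.AlgorithmType = 'AUTOMATIC'
""").fetchall()

print(rows)
conn.close()
```

The MR series does not appear in the result because no segmentation references it; an inner join keeps only source series that have a matching segmentation row.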
Accessing Index Tables
**Via SQL (recommended for filtering/aggregation):**
```python
from idc_index import IDCClient

client = IDCClient()

# Query the primary index (always available)
results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")

# Fetch and query additional indices
client.fetch_index("collections_index")
collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")

client.fetch_index("analysis_results_index")
analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")
```

**As pandas DataFrames (direct access):**
```python
# Primary index (always available after client initialization)
df = client.index

# Fetch and access on-demand indices
client.fetch_index("sm_index")
sm_df = client.sm_index
```

Discovering Table Schemas (Essential for Query Writing)
The `indices_overview` dictionary contains complete schema information for all tables. Always consult this when writing queries or exploring data structure.

**DICOM attribute mapping:** Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or a reference to a DICOM tag). This allows leveraging DICOM knowledge when querying: standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, and `BodyPartExamined` work as expected.

```python
from idc_index import IDCClient

client = IDCClient()

# List all available indices with descriptions
for name, info in client.indices_overview.items():
    print(f"\n{name}:")
    print(f"  Installed: {info['installed']}")
    print(f"  Description: {info['description']}")

# Get complete schema for a specific index (columns, types, descriptions)
schema = client.indices_overview["index"]["schema"]
print(f"\nTable: {schema['table_description']}")
print("\nColumns:")
for col in schema['columns']:
    desc = col.get('description', 'No description')
    # Description indicates if column is from DICOM attribute
    print(f"  {col['name']} ({col['type']}): {desc}")

# Find columns that are DICOM attributes (check description for "DICOM" reference)
dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()]
print(f"\nDICOM-sourced columns: {dicom_cols}")
```

**Alternative: use `get_index_schema()` method:**
```python
schema = client.get_index_schema("index")
# Returns same schema dict: {'table_description': ..., 'columns': [...]}
```
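
The schema dictionary is plain Python data, so it can be searched with ordinary list comprehensions. The sketch below uses a hand-written toy schema in the shape described above (`table_description` plus a `columns` list of `name`/`type`/`description` entries); the helper names `find_columns` and `columns_of_type` are hypothetical, and real schemas may carry additional keys.

```python
# Hand-written toy schema with the shape described in the text.
schema = {
    "table_description": "Toy schema for illustration only",
    "columns": [
        {"name": "collection_id", "type": "STRING", "description": "IDC collection identifier"},
        {"name": "Modality", "type": "STRING", "description": "DICOM Modality attribute"},
        {"name": "series_size_MB", "type": "FLOAT"},  # description deliberately missing
    ],
}


def find_columns(schema, keyword):
    """Return names of columns whose description mentions keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        col["name"]
        for col in schema["columns"]
        if needle in (col.get("description") or "").lower()
    ]


def columns_of_type(schema, type_name):
    """Return names of columns declared with the given type."""
    return [col["name"] for col in schema["columns"] if col["type"] == type_name]


print(find_columns(schema, "dicom"))
print(columns_of_type(schema, "STRING"))
```

Using `col.get("description") or ""` guards against both a missing key and an explicit `None`, either of which would otherwise break the substring test.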
Key Columns in Primary `index` Table
Most common columns for queries (use `indices_overview` for the complete list and descriptions):
| Column | Type | DICOM | Description |
|---|---|---|---|
| `collection_id` | STRING | No | IDC collection identifier |
| `analysis_result_id` | STRING | No | If applicable, indicates which analysis results collection the series is part of |
| `source_DOI` | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
| `PatientID` | STRING | Yes | Patient identifier |
| `StudyInstanceUID` | STRING | Yes | DICOM Study UID |
| `SeriesInstanceUID` | STRING | Yes | DICOM Series UID; use for downloads/viewing |
| `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) |
| `BodyPartExamined` | STRING | Yes | Anatomical region |
| `SeriesDescription` | STRING | Yes | Description of the series |
| `Manufacturer` | STRING | Yes | Equipment manufacturer |
| `StudyDate` | STRING | Yes | Date study was performed |
| `PatientSex` | STRING | Yes | Patient sex |
| `PatientAge` | STRING | Yes | Patient age at time of study |
| `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
| `series_size_MB` | FLOAT | No | Size of series in megabytes |
| `instanceCount` | INTEGER | No | Number of DICOM instances in series |

**DICOM = Yes:** Column value is extracted from the DICOM attribute with the same name. Refer to the DICOM standard for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
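
As an example of applying DICOM knowledge to these columns, the sketch below parses two standard DICOM value representations: Age String (three digits plus `D`, `W`, `M`, or `Y`, e.g. `045Y`) and Date (`YYYYMMDD`). It assumes `PatientAge` and `StudyDate` hold values in these raw formats, which should be confirmed against the schema for the installed index version; the helper names are hypothetical.

```python
import re
from datetime import datetime

# DICOM Age String (AS): exactly three digits followed by a unit letter.
_AGE_RE = re.compile(r"^(\d{3})([DWMY])$")
_DAYS_PER_UNIT = {"D": 1, "W": 7, "M": 30.4375, "Y": 365.25}  # approximate


def parse_dicom_age(value):
    """Convert a DICOM Age String such as '045Y' to approximate years, or None."""
    if not value:
        return None
    match = _AGE_RE.match(value.strip())
    if match is None:
        return None
    count, unit = int(match.group(1)), match.group(2)
    return count * _DAYS_PER_UNIT[unit] / 365.25


def parse_dicom_date(value):
    """Convert a DICOM Date string 'YYYYMMDD' to a datetime.date, or None."""
    try:
        return datetime.strptime(value.strip(), "%Y%m%d").date()
    except (AttributeError, ValueError):
        return None


print(parse_dicom_age("045Y"))
print(parse_dicom_age("006M"))
print(parse_dicom_date("20040115"))
print(parse_dicom_age("unknown"))
```

Both helpers return `None` for missing or malformed input, so they can be applied to a whole column without raising on the first bad value.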
Clinical Data Access
```python
from idc_index import IDCClient

client = IDCClient()

# Fetch clinical index (also downloads clinical data tables)
client.fetch_index("clinical_index")

# Query clinical index to find available tables and their columns
tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinical_index")

# Load a specific clinical table as DataFrame
clinical_df = client.get_clinical_table("table_name")
```

See `references/clinical_data_guide.md` for detailed workflows including value mapping patterns and joining clinical data with imaging.

Data Access Options
| Method | Auth Required | Best For |
|---|---|---|
| `idc-index` | No | Key queries and downloads (recommended) |
| IDC Portal | No | Interactive exploration, manual selection, browser-based download |
| BigQuery | Yes (GCP account) | Complex queries, full DICOM metadata |
| DICOMweb proxy | No | Tool integration via DICOMweb API |
| Cloud storage (S3/GCS) | No | Direct file access, bulk downloads, custom pipelines |

Cloud storage organization
IDC maintains all DICOM files in public cloud storage buckets mirrored between AWS S3 and Google Cloud Storage. Files are organized by CRDC UUIDs (not DICOM UIDs) to support versioning.
| Bucket (AWS / GCS) | License | Content |
|---|---|---|
| `idc-open-data` / `idc-open-data` | No commercial restriction | >90% of IDC data |
| `idc-open-data-two` / `idc-open-idc1` | No commercial restriction | Collections with potential head scans |
| `idc-open-data-cr` / `idc-open-cr` | Commercial use restricted (CC BY-NC) | ~4% of data |

Files are stored as `<crdc_series_uuid>/<crdc_instance_uuid>.dcm`. Access is free (no egress fees) via AWS CLI, gsutil, or s5cmd with anonymous access. Use the `series_aws_url` column from the index for S3 URLs; GCS uses the same path structure.

See `references/cloud_storage_guide.md` for bucket details, access commands, UUID mapping, and versioning.

DICOMweb access
IDC data is available via a DICOMweb interface (Google Cloud Healthcare API implementation) for integration with PACS systems and DICOMweb-compatible tools.
| Endpoint | Auth | Use Case |
|---|---|---|
| Public proxy | No | Testing, moderate queries, daily quota |
| Google Healthcare | Yes (GCP) | Production use, higher quotas |

See `references/dicomweb_guide.md` for endpoint URLs, code examples, supported operations, and implementation details.
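
Returning to the cloud storage layout described under "Cloud storage organization", the following sketch shows how a series-level S3 URL of the form `s3://<bucket>/<crdc_series_uuid>/*` can be split into its parts, and how an instance object key is composed. The helpers `split_series_url` and `instance_key` are hypothetical, the instance UUID in the example is a placeholder, and the sketch assumes `series_aws_url` values follow the form shown in the manifest examples of this document.

```python
from urllib.parse import urlparse


def split_series_url(series_url):
    """Split 's3://<bucket>/<crdc_series_uuid>/*' into (bucket, series_uuid)."""
    parsed = urlparse(series_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URL: {series_url!r}")
    # Drop empty segments and the trailing wildcard.
    parts = [p for p in parsed.path.split("/") if p and p != "*"]
    if len(parts) != 1:
        raise ValueError(f"Expected exactly one series UUID in: {series_url!r}")
    return parsed.netloc, parts[0]


def instance_key(series_uuid, instance_uuid):
    """Compose the object key '<crdc_series_uuid>/<crdc_instance_uuid>.dcm'."""
    return f"{series_uuid}/{instance_uuid}.dcm"


bucket, series_uuid = split_series_url(
    "s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*"
)
print(bucket)
print(series_uuid)
print(instance_key(series_uuid, "<crdc_instance_uuid>"))  # placeholder instance UUID
```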
Installation and Setup
**Required (for basic access):**
```bash
pip install --upgrade idc-index
```

**Important:** A new IDC data release always triggers a new version of `idc-index`. Always use the `--upgrade` flag when installing, unless an older version is needed for reproducibility.

**Tested with:** idc-index 0.11.7 (IDC data version v23)

**Optional (for data analysis):**
```bash
pip install pandas numpy pydicom
```
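
Because the package version is tied to the IDC data release, it can help to record which version is installed. A minimal sketch using only the standard library follows; `installed_version` is a hypothetical helper, and it reports `None` rather than failing when the package is absent.

```python
from importlib import metadata


def installed_version(package="idc-index"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("idc-index is not installed; run: pip install --upgrade idc-index")
else:
    print(f"idc-index {version} is installed")
```

Logging this value alongside `client.get_idc_version()` gives a record of both the package and the data release used for an analysis.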
Core Capabilities
1. Data Discovery and Exploration
Discover what imaging collections and data are available in IDC:
```python
from idc_index import IDCClient

client = IDCClient()

# Get summary statistics from primary index
query = """
SELECT
    collection_id,
    COUNT(DISTINCT PatientID) AS patients,
    COUNT(DISTINCT SeriesInstanceUID) AS series,
    SUM(series_size_MB) AS size_mb
FROM index
GROUP BY collection_id
ORDER BY patients DESC
"""
collections_summary = client.sql_query(query)

# For richer collection metadata, use collections_index
client.fetch_index("collections_index")
collections_info = client.sql_query("""
    SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
    FROM collections_index
""")

# For analysis results (annotations, segmentations), use analysis_results_index
client.fetch_index("analysis_results_index")
analysis_info = client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
    FROM analysis_results_index
""")
```

**`collections_index`** provides curated metadata per collection (cancer types, tumor locations, species, subject counts, and supporting data types) without needing to aggregate from the primary index.

**`analysis_results_index`** lists derived datasets (AI segmentations, expert annotations, radiomics features) with their source collections and modalities.

2. Querying Metadata with SQL
Query the IDC mini-index using SQL to find specific datasets.

**First, explore available values for filter columns:**
```python
from idc_index import IDCClient

client = IDCClient()

# Check what Modality values exist
modalities = client.sql_query("""
    SELECT Modality, COUNT(*) AS series_count
    FROM index
    GROUP BY Modality
    ORDER BY series_count DESC
""")
print(modalities)

# Check what BodyPartExamined values exist for MR modality
body_parts = client.sql_query("""
    SELECT BodyPartExamined, COUNT(*) AS series_count
    FROM index
    WHERE Modality = 'MR' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined
    ORDER BY series_count DESC
    LIMIT 20
""")
print(body_parts)
```

**Then query with validated filter values:**
```python
# Find breast MRI scans (use actual values from exploration above)
results = client.sql_query("""
    SELECT
        collection_id,
        PatientID,
        SeriesInstanceUID,
        Modality,
        SeriesDescription,
        license_short_name
    FROM index
    WHERE Modality = 'MR'
      AND BodyPartExamined = 'BREAST'
    LIMIT 20
""")

# Access results as pandas DataFrame
for idx, row in results.iterrows():
    print(f"Patient: {row['PatientID']}, Series: {row['SeriesInstanceUID']}")
```

**To filter by cancer type, join with `collections_index`:**
```python
client.fetch_index("collections_index")
results = client.sql_query("""
    SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
    FROM index i
    JOIN collections_index c ON i.collection_id = c.collection_id
    WHERE c.CancerTypes LIKE '%Breast%'
      AND i.Modality = 'MR'
    LIMIT 20
""")
```

**Available metadata fields** (use `client.indices_overview` for the complete list):
- Identifiers: collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID
- Imaging: Modality, BodyPartExamined, Manufacturer, ManufacturerModelName
- Clinical: PatientAge, PatientSex, StudyDate
- Descriptions: StudyDescription, SeriesDescription
- Licensing: license_short_name

**Note:** Cancer type is in `collections_index.CancerTypes`, not in the primary `index` table.

3. Downloading DICOM Files
Download imaging data efficiently from IDC's cloud storage:

**Download entire collection:**
```python
from idc_index import IDCClient

client = IDCClient()

# Download small collection (RIDER Pilot ~1GB)
client.download_from_selection(
    collection_id="rider_pilot",
    downloadDir="./data/rider"
)
```

**Download specific series:**
```python
# First, query for series UIDs
series_df = client.sql_query("""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Modality = 'CT'
      AND BodyPartExamined = 'CHEST'
      AND collection_id = 'nlst'
    LIMIT 5
""")

# Download only those series
client.download_from_selection(
    seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
    downloadDir="./data/lung_ct"
)
```

**Custom directory structure:**

Default `dirTemplate`: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`
```python
# Simplified hierarchy (omit StudyInstanceUID level)
client.download_from_selection(
    collection_id="tcga_luad",
    downloadDir="./data",
    dirTemplate="%collection_id/%PatientID/%Modality"
)
# Results in: ./data/tcga_luad/TCGA-05-4244/CT/

# Flat structure (all files in one directory)
client.download_from_selection(
    seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
    downloadDir="./data/flat",
    dirTemplate=""
)
# Results in: ./data/flat/*.dcm
```
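
To preview where files would land for a given `dirTemplate` before starting a large download, the placeholder substitution can be imitated offline. This is only an illustration of the template idea, not the library's actual implementation; `expand_dir_template` is a hypothetical helper and the metadata values are invented except for the patient ID taken from the example above.

```python
from pathlib import PurePosixPath

DEFAULT_TEMPLATE = "%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID"


def expand_dir_template(template, series, download_dir="."):
    """Substitute %Key placeholders with values from a series metadata dict."""
    path = template
    # Replace longer keys first so a short key never matches inside a longer one.
    for key in sorted(series, key=len, reverse=True):
        path = path.replace(f"%{key}", str(series[key]))
    # An empty template yields the download directory itself (flat layout).
    return str(PurePosixPath(download_dir) / path)


series = {
    "collection_id": "tcga_luad",
    "PatientID": "TCGA-05-4244",
    "StudyInstanceUID": "1.2.3",      # invented UID
    "Modality": "CT",
    "SeriesInstanceUID": "1.2.3.4",   # invented UID
}

print(expand_dir_template(DEFAULT_TEMPLATE, series, "./data"))
print(expand_dir_template("%collection_id/%PatientID/%Modality", series, "./data"))
print(expand_dir_template("", series, "./data/flat"))
```

Note that `PurePosixPath` normalizes a leading `./`, so the printed paths start with `data/` rather than `./data/`.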
Command-Line Download
The `idc download` command provides command-line access to download functionality without writing Python code. It is available after installing `idc-index`.

**Auto-detects input type:** manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).
```bash
# Download entire collection
idc download rider_pilot --download-dir ./data

# Download specific series by UID
idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data

# Download multiple items (comma-separated)
idc download "tcga_luad,tcga_lusc" --download-dir ./data

# Download from manifest file (auto-detected)
idc download manifest.txt --download-dir ./data
```

**Options:**
| Option | Description |
|--------|-------------|
| `--download-dir` | Output directory (default: current directory) |
| `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) |
| `--log-level` | Verbosity: debug, info, warning, error, critical |

**Manifest files:**

Manifest files contain S3 URLs (one per line) and can be:
- Exported from the IDC Portal after cohort selection
- Shared by collaborators for reproducible data access
- Generated programmatically from query results

Format (one S3 URL per line):
```
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
```

**Example: Generate manifest from Python query:**
```python
from idc_index import IDCClient

client = IDCClient()

# Query for series URLs
results = client.sql_query("""
    SELECT series_aws_url
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
""")

# Save as manifest file
with open('ct_manifest.txt', 'w') as f:
    for url in results['series_aws_url']:
        f.write(url + '\n')
```

Then download:
```bash
idc download ct_manifest.txt --download-dir ./ct_data
```
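
Before handing a shared or hand-edited manifest to `idc download`, a quick sanity check can catch stray lines. The sketch below validates lines against the series-level S3 URL form shown in the manifest format above; the regular expression and the helper `check_manifest_lines` are assumptions made for illustration, and manifests exported by other tools may legitimately use other forms.

```python
import re

# Assumed pattern: s3://<bucket>/<36-character UUID>/*
_MANIFEST_LINE = re.compile(r"^s3://[a-z0-9][a-z0-9.\-]*/[0-9a-f\-]{36}/\*$")


def check_manifest_lines(lines):
    """Return (valid_urls, problems); problems holds (line_number, text) pairs.

    Blank lines are skipped and duplicate URLs are kept only once.
    """
    valid, problems, seen = [], [], set()
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue
        if not _MANIFEST_LINE.match(text):
            problems.append((number, text))
        elif text not in seen:
            seen.add(text)
            valid.append(text)
    return valid, problems


sample = [
    "s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*\n",
    "\n",
    "https://example.com/not-a-manifest-line\n",
    "s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*\n",
]
valid, problems = check_manifest_lines(sample)
print(valid)
print(problems)
```

The same function accepts an open file object, since iterating a text file yields its lines.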
4. Visualizing IDC Images
View DICOM data in browser without downloading:
python
from idc_index import IDCClient
import webbrowser
client = IDCClient()无需下载即可在浏览器中查看DICOM数据:
python
from idc_index import IDCClient
import webbrowser
client = IDCClient()First query to get valid UIDs
首先查询获取有效的UID
results = client.sql_query("""
SELECT SeriesInstanceUID, StudyInstanceUID
FROM index
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
LIMIT 1
""")
results = client.sql_query("""
SELECT SeriesInstanceUID, StudyInstanceUID
FROM index
WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
LIMIT 1
""")
View single series
查看单个序列
viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID'])
webbrowser.open(viewer_url)
viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID'])
webbrowser.open(viewer_url)
View all series in a study (useful for multi-series exams like MRI protocols)
查看检查中的所有序列(适用于多序列检查,如MRI协议)
viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID'])
webbrowser.open(viewer_url)
The method automatically selects OHIF v3 for radiology or SLIM for slide microscopy. Viewing by study is useful when a DICOM Study contains multiple Series (e.g., T1, T2, DWI sequences from a single MRI session).viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID'])
webbrowser.open(viewer_url)
5. Understanding and Checking Licenses

Check data licensing before use (critical for commercial applications):
```python
from idc_index import IDCClient

client = IDCClient()

# Check licenses for all collections
query = """
SELECT DISTINCT
    collection_id,
    license_short_name,
    COUNT(DISTINCT SeriesInstanceUID) as series_count
FROM index
GROUP BY collection_id, license_short_name
ORDER BY collection_id
"""
licenses = client.sql_query(query)
print(licenses)
```

**License types in IDC:**
- **CC BY 4.0** / **CC BY 3.0** (~97% of data) - Allows commercial use with attribution
- **CC BY-NC 4.0** / **CC BY-NC 3.0** (~3% of data) - Non-commercial use only
- **Custom licenses** (rare) - Some collections have specific terms (e.g., NLM Terms and Conditions)

**Important:** Always check the license before using IDC data in publications or commercial applications. Each DICOM file is tagged with its specific license in metadata.
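The three license groups listed above can be turned into a small screening step before any download. This is a minimal sketch, assuming `license_short_name` values look like the ones quoted in this document (`CC BY 4.0`, `CC BY-NC 3.0`); `commercial_use_status` is a hypothetical helper name, and anything it does not recognize is deliberately sent to manual review rather than guessed.

```python
import re

# Patterns mirror the license names quoted in this document.
_CC_BY = re.compile(r"^CC BY \d+(\.\d+)?$")
_CC_BY_NC = re.compile(r"^CC BY-NC \d+(\.\d+)?$")


def commercial_use_status(license_short_name):
    """Classify a license_short_name for commercial-use screening.

    Returns "allowed" for CC BY, "non-commercial" for CC BY-NC, and
    "review" for everything else (custom terms need a manual read).
    """
    name = (license_short_name or "").strip()
    if _CC_BY.match(name):
        return "allowed"
    if _CC_BY_NC.match(name):
        return "non-commercial"
    return "review"
```

Applied to a query result, `licenses['license_short_name'].map(commercial_use_status)` would add a status per row; it is a convenience filter and not a substitute for reading the license terms.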
Generating Citations for Attribution

The `source_DOI` column contains DOIs linking to publications describing how the data was generated. To satisfy attribution requirements, use `citations_from_selection()` to generate properly formatted citations:
```python
from idc_index import IDCClient

client = IDCClient()

# Get citations for a collection (APA format by default)
citations = client.citations_from_selection(collection_id="rider_pilot")
for citation in citations:
    print(citation)

# Get citations for specific series
results = client.sql_query("""
    SELECT SeriesInstanceUID FROM index
    WHERE collection_id = 'tcga_luad' LIMIT 5
""")
citations = client.citations_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values)
)

# Alternative format: BibTeX (for LaTeX documents)
bibtex_citations = client.citations_from_selection(
    collection_id="tcga_luad",
    citation_format=IDCClient.CITATION_FORMAT_BIBTEX
)
```

**Parameters:**
- `collection_id`: Filter by collection(s)
- `patientId`: Filter by patient ID(s)
- `studyInstanceUID`: Filter by study UID(s)
- `seriesInstanceUID`: Filter by series UID(s)
- `citation_format`: Use `IDCClient.CITATION_FORMAT_*` constants:
  - `CITATION_FORMAT_APA` (default) - APA style
  - `CITATION_FORMAT_BIBTEX` - BibTeX for LaTeX
  - `CITATION_FORMAT_JSON` - CSL JSON
  - `CITATION_FORMAT_TURTLE` - RDF Turtle

**Best practice:** When publishing results using IDC data, include the generated citations to properly attribute the data sources and satisfy license requirements.
6. Batch Processing and Filtering

Process large datasets efficiently with filtering:
```python
from idc_index import IDCClient
import pandas as pd

client = IDCClient()

# Find chest CT scans from GE scanners
query = """
SELECT
    SeriesInstanceUID,
    PatientID,
    collection_id,
    ManufacturerModelName
FROM index
WHERE Modality = 'CT'
    AND BodyPartExamined = 'CHEST'
    AND Manufacturer = 'GE MEDICAL SYSTEMS'
    AND license_short_name = 'CC BY 4.0'
LIMIT 100
"""
results = client.sql_query(query)

# Save manifest for later
results.to_csv('lung_ct_manifest.csv', index=False)

# Download in batches to avoid timeout
batch_size = 10
for i in range(0, len(results), batch_size):
    batch = results.iloc[i:i+batch_size]
    client.download_from_selection(
        seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
        downloadDir=f"./data/batch_{i//batch_size}"
    )
```
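The batching arithmetic in the download loop (`range` with a step, then `i // batch_size` for the directory name) can be factored into a reusable generator. The sketch below is illustrative only; `iter_batches` is a hypothetical name and the function knows nothing about IDC, it just splits any sequence into numbered chunks.

```python
def iter_batches(items, batch_size):
    """Yield (batch_number, batch) pairs of at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    items = list(items)
    for start in range(0, len(items), batch_size):
        # Batch numbers start at 0 and increase by one per chunk.
        yield start // batch_size, items[start:start + batch_size]
```

In the download loop this would read `for n, uids in iter_batches(results['SeriesInstanceUID'], 10):` followed by a `download_from_selection(seriesInstanceUID=uids, downloadDir=f"./data/batch_{n}")` call.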
7. Advanced Queries with BigQuery

For queries requiring full DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. This requires a GCP account with billing enabled.

Quick reference:
- Dataset: `bigquery-public-data.idc_current.*`
- Main table: `dicom_all` (combined metadata)
- Full metadata: `dicom_metadata` (all DICOM tags)
- Private elements: `OtherElements` column (vendor-specific tags like diffusion b-values)

See `references/bigquery_guide.md` for setup, table schemas, query patterns, private element access, and cost optimization.
8. Tool Selection Guide

| Task | Tool | Reference |
|---|---|---|
| Programmatic queries & downloads | `idc-index` | This document |
| Interactive exploration | IDC Portal | https://portal.imaging.datacommons.cancer.gov/ |
| Complex metadata queries | BigQuery | `references/bigquery_guide.md` |
| 3D visualization & analysis | SlicerIDCBrowser | https://github.com/ImagingDataCommons/SlicerIDCBrowser |

Default choice: Use `idc-index` for most tasks (no auth, easy API, batch downloads).
9. Integration with Analysis Pipelines

Integrate IDC data into imaging analysis workflows:

Read downloaded DICOM files:
```python
import pydicom
import os

# Read DICOM files from downloaded series
series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."
dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
               if f.endswith('.dcm')]

# Load first image
ds = pydicom.dcmread(dicom_files[0])
print(f"Patient ID: {ds.PatientID}")
print(f"Modality: {ds.Modality}")
print(f"Image shape: {ds.pixel_array.shape}")
```

**Build 3D volume from CT series:**
```python
import pydicom
import numpy as np
from pathlib import Path

def load_ct_series(series_path):
    """Load CT series as 3D numpy array"""
    files = sorted(Path(series_path).glob('*.dcm'))
    slices = [pydicom.dcmread(str(f)) for f in files]
    # Sort by slice location
    slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))
    # Stack into 3D array
    volume = np.stack([s.pixel_array for s in slices])
    return volume, slices[0]  # Return volume and first slice for metadata

volume, metadata = load_ct_series("./data/lung_ct/series_dir")
print(f"Volume shape: {volume.shape}")  # (z, y, x)
```

Integrate with SimpleITK:
```python
import SimpleITK as sitk
from pathlib import Path

# Read DICOM series
series_path = "./data/ct_series"
reader = sitk.ImageSeriesReader()
dicom_names = reader.GetGDCMSeriesFileNames(series_path)
reader.SetFileNames(dicom_names)
image = reader.Execute()

# Apply processing
smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5)

# Save as NIfTI
sitk.WriteImage(smoothed, "processed_volume.nii.gz")
```
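Sorting slices by the third component of `ImagePositionPatient` works for axial acquisitions but can misorder sagittal, coronal, or oblique series. A more general ordering projects each slice position onto the normal of the image plane, which is the cross product of the row and column direction cosines in `ImageOrientationPatient`. The function below is a sketch of that projection in plain Python; `slice_sort_key` is a hypothetical name, and it assumes every slice in the series shares the same orientation.

```python
def slice_sort_key(image_orientation, image_position):
    """Return the position of a slice along the stack normal.

    image_orientation: six numbers (row direction cosines, then column
    direction cosines), as in DICOM ImageOrientationPatient.
    image_position: three numbers, as in DICOM ImagePositionPatient.
    """
    rx, ry, rz, cx, cy, cz = (float(v) for v in image_orientation)
    px, py, pz = (float(v) for v in image_position)
    # Normal to the image plane = row direction x column direction.
    nx = ry * cz - rz * cy
    ny = rz * cx - rx * cz
    nz = rx * cy - ry * cx
    # Signed distance of the slice origin along that normal.
    return px * nx + py * ny + pz * nz
```

With pydicom datasets this could be used as `slices.sort(key=lambda s: slice_sort_key(s.ImageOrientationPatient, s.ImagePositionPatient))`.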
Common Use Cases

Use Case 1: Find and Download Lung CT Scans for Deep Learning

Objective: Build training dataset of lung CT scans from NLST collection

Steps:
```python
from idc_index import IDCClient

client = IDCClient()

# 1. Query for lung CT scans with specific criteria
query = """
SELECT
    PatientID,
    SeriesInstanceUID,
    SeriesDescription
FROM index
WHERE collection_id = 'nlst'
    AND Modality = 'CT'
    AND BodyPartExamined = 'CHEST'
    AND license_short_name = 'CC BY 4.0'
ORDER BY PatientID
LIMIT 100
"""
results = client.sql_query(query)
print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")

# 2. Download data organized by patient
client.download_from_selection(
    seriesInstanceUID=list(results['SeriesInstanceUID'].values),
    downloadDir="./training_data",
    dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
)

# 3. Save manifest for reproducibility
results.to_csv('training_manifest.csv', index=False)
```
Use Case 2: Query Brain MRI by Manufacturer for Quality Study

Objective: Compare image quality across different MRI scanner manufacturers

Steps:
```python
from idc_index import IDCClient
import pandas as pd

client = IDCClient()

# Query for brain MRI grouped by manufacturer
query = """
SELECT
    Manufacturer,
    ManufacturerModelName,
    COUNT(DISTINCT SeriesInstanceUID) as num_series,
    COUNT(DISTINCT PatientID) as num_patients
FROM index
WHERE Modality = 'MR'
    AND BodyPartExamined LIKE '%BRAIN%'
GROUP BY Manufacturer, ManufacturerModelName
HAVING num_series >= 10
ORDER BY num_series DESC
"""
manufacturers = client.sql_query(query)
print(manufacturers)

# Download sample from each manufacturer for comparison
for _, row in manufacturers.head(3).iterrows():
    mfr = row['Manufacturer']
    model = row['ManufacturerModelName']
    query = f"""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Manufacturer = '{mfr}'
        AND ManufacturerModelName = '{model}'
        AND Modality = 'MR'
        AND BodyPartExamined LIKE '%BRAIN%'
    LIMIT 5
    """
    series = client.sql_query(query)
    client.download_from_selection(
        seriesInstanceUID=list(series['SeriesInstanceUID'].values),
        downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
    )
```
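The per-manufacturer query places `Manufacturer` and `ManufacturerModelName` values directly into the SQL text, so a value containing an apostrophe would end the string literal early and break the statement. One way to guard against that is to escape the value before interpolation. This is a sketch, assuming the SQL engine behind `sql_query()` follows the standard SQL convention of doubling single quotes inside string literals; `sql_literal` is a hypothetical helper, not an idc-index function.

```python
def sql_literal(value):
    """Render a Python string as a single-quoted SQL string literal.

    Embedded single quotes are doubled, which is the standard SQL way
    to include a quote character inside a literal.
    """
    if not isinstance(value, str):
        raise TypeError("sql_literal expects a str")
    return "'" + value.replace("'", "''") + "'"
```

The interpolation would then read `WHERE Manufacturer = {sql_literal(mfr)}` with no surrounding quotes in the template.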
Use Case 3: Visualize Series Without Downloading

Objective: Preview imaging data before committing to download
```python
from idc_index import IDCClient
import webbrowser

client = IDCClient()

series_list = client.sql_query("""
    SELECT SeriesInstanceUID, PatientID, SeriesDescription
    FROM index
    WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
    LIMIT 10
""")

# Preview each in browser
for _, row in series_list.iterrows():
    viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
    print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
    print(f"  View at: {viewer_url}")
    # webbrowser.open(viewer_url)  # Uncomment to open automatically
```

For additional visualization options, see the [IDC Portal getting started guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration.
Use Case 4: License-Aware Batch Download for Commercial Use

Objective: Download only CC-BY licensed data suitable for commercial applications

Steps:
```python
from idc_index import IDCClient

client = IDCClient()

# Query ONLY for CC BY licensed data (allows commercial use with attribution)
query = """
SELECT
    SeriesInstanceUID,
    collection_id,
    PatientID,
    Modality
FROM index
WHERE license_short_name LIKE 'CC BY%'
    AND license_short_name NOT LIKE '%NC%'
    AND Modality IN ('CT', 'MR')
    AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
LIMIT 200
"""
cc_by_data = client.sql_query(query)
print(f"Found {len(cc_by_data)} CC BY licensed series")
print(f"Collections: {cc_by_data['collection_id'].unique()}")

# Download with license verification
client.download_from_selection(
    seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
    downloadDir="./commercial_dataset",
    dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
)

# Save license information
cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
```
Best Practices

- Check licenses before use - Always query the `license_short_name` field and respect licensing terms (CC BY vs CC BY-NC)
- Generate citations for attribution - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values; include these in publications
- Start with small queries - Use a `LIMIT` clause when exploring to avoid long downloads and to understand the data structure
- Use mini-index for simple queries - Only use BigQuery when you need comprehensive metadata or complex JOINs
- Organize downloads with dirTemplate - Use meaningful directory structures like `%collection_id/%PatientID/%Modality`
- Cache query results - Save DataFrames to CSV files to avoid re-querying and ensure reproducibility
- Estimate size first - Check collection size before downloading; some collections are terabytes in size
- Save manifests - Always save query results with Series UIDs for reproducibility and data provenance
- Read documentation - IDC data structure and metadata fields are documented at https://learn.canceridc.dev/
- Use IDC forum - Search for questions and answers, and ask the IDC maintainers and users your own questions, at https://discourse.canceridc.dev/
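The "estimate size first" practice is easier to act on when the summed `series_size_MB` value is turned into a readable figure and compared against free disk space. The helpers below are a sketch with hypothetical names; they assume decimal units (1 TB = 1,000,000 MB), matching the `SUM(series_size_MB)/1000000 as size_TB` conversion used in the overview query, and the 20% headroom is an arbitrary illustrative default.

```python
def describe_size_mb(total_mb):
    """Format a size given in megabytes using decimal units."""
    total_mb = float(total_mb)
    if total_mb < 0:
        raise ValueError("size must be non-negative")
    if total_mb >= 1_000_000:
        return f"{total_mb / 1_000_000:.2f} TB"
    if total_mb >= 1_000:
        return f"{total_mb / 1_000:.2f} GB"
    return f"{total_mb:.2f} MB"


def fits_on_disk(total_mb, free_bytes, safety_factor=1.2):
    """Return True if total_mb plus headroom fits into free_bytes."""
    return float(total_mb) * 1_000_000 * safety_factor <= free_bytes
```

The free space for a target directory can be obtained with `shutil.disk_usage(path).free` from the standard library.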
Troubleshooting

Issue: `ModuleNotFoundError: No module named 'idc_index'`
- Cause: idc-index package not installed
- Solution: Install with `pip install --upgrade idc-index`

Issue: Download fails with connection timeout
- Cause: Network instability or large download size
- Solution:
  - Download smaller batches (e.g., 10-20 series at a time)
  - Check network connection
  - Use `dirTemplate` to organize downloads by batch
  - Implement retry logic with delays

Issue: `BigQuery quota exceeded` or billing errors
- Cause: BigQuery requires billing-enabled GCP project
- Solution: Use idc-index mini-index for simple queries (no billing required), or see `references/bigquery_guide.md` for cost optimization tips

Issue: Series UID not found or no data returned
- Cause: Typo in UID, data not in current IDC version, or wrong field name
- Solution:
  - Check if data is in current IDC version (some old data may be deprecated)
  - Use `LIMIT 5` to test query first
  - Check field names against metadata schema documentation

Issue: Downloaded DICOM files won't open
- Cause: Corrupted download or incompatible viewer
- Solution:
  - Check DICOM object type (Modality and SOPClassUID attributes) - some object types require specialized tools
  - Verify file integrity (check file sizes)
  - Use pydicom to validate: `pydicom.dcmread(file, force=True)`
  - Try different DICOM viewer (3D Slicer, Horos, RadiAnt, QuPath)
  - Re-download the series
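The "retry logic with delays" suggested for download timeouts can be sketched as a small wrapper with exponential backoff. `with_retries` is a hypothetical helper, not part of idc-index, and whether `download_from_selection()` raises an exception on a failed transfer is an assumption to verify against the installed version before relying on this pattern.

```python
import time


def with_retries(func, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call func() until it succeeds or attempts are exhausted.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries.
    The last exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))
```

A download call would be wrapped as `with_retries(lambda: client.download_from_selection(seriesInstanceUID=uids, downloadDir="./data"))`. The `sleep` parameter exists so the delay can be replaced in tests.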
Common SQL Query Patterns

Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above.

Discover available filter values
```python
# What modalities exist?
client.sql_query("SELECT DISTINCT Modality FROM index")

# What body parts for a specific modality?
client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) as n
    FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined ORDER BY n DESC
""")

# What manufacturers for MR?
client.sql_query("""
    SELECT DISTINCT Manufacturer, COUNT(*) as n
    FROM index WHERE Modality = 'MR'
    GROUP BY Manufacturer ORDER BY n DESC
""")
```
Find annotations and segmentations

Note: Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.
```python
# Find ALL segmentations and structure sets by DICOM Modality
# SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
client.sql_query("""
    SELECT collection_id, Modality, COUNT(*) as series_count
    FROM index
    WHERE Modality IN ('SEG', 'RTSTRUCT')
    GROUP BY collection_id, Modality
    ORDER BY series_count DESC
""")

# Find segmentations for a specific collection (includes non-analysis-result items)
client.sql_query("""
    SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
    FROM index
    WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
""")

# List analysis result collections (curated derived datasets)
client.fetch_index("analysis_results_index")
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Collections, Modalities
    FROM analysis_results_index
""")

# Find analysis results for a specific source collection
client.sql_query("""
    SELECT analysis_result_id, analysis_result_title
    FROM analysis_results_index
    WHERE Collections LIKE '%tcga_luad%'
""")

# Use seg_index for detailed DICOM Segmentation metadata
client.fetch_index("seg_index")

# Get segmentation statistics by algorithm
client.sql_query("""
    SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
    FROM seg_index
    WHERE AlgorithmName IS NOT NULL
    GROUP BY AlgorithmName, AlgorithmType
    ORDER BY seg_count DESC
    LIMIT 10
""")

# Find segmentations for specific source images (e.g., chest CT)
client.sql_query("""
    SELECT
        s.SeriesInstanceUID as seg_series,
        s.AlgorithmName,
        s.total_segments,
        s.segmented_SeriesInstanceUID as source_series
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
    LIMIT 10
""")

# Find TotalSegmentator results with source image context
client.sql_query("""
    SELECT
        seg_info.collection_id,
        COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
        SUM(s.total_segments) as total_segments
    FROM seg_index s
    JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
    WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
    GROUP BY seg_info.collection_id
    ORDER BY seg_count DESC
""")
```
Query slide microscopy data
```python
# sm_index has detailed metadata; join with index for collection_id
client.fetch_index("sm_index")
client.sql_query("""
    SELECT i.collection_id, COUNT(*) as slides,
           MIN(s.min_PixelSpacing_2sf) as min_resolution
    FROM sm_index s
    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
    GROUP BY i.collection_id
    ORDER BY slides DESC
""")
```
Estimate download size
```python
# Size for specific criteria
client.sql_query("""
    SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
    FROM index
    WHERE collection_id = 'nlst' AND Modality = 'CT'
""")
```
Link to clinical data
```python
client.fetch_index("clinical_index")

# Find collections with clinical data and their tables
client.sql_query("""
    SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
    FROM clinical_index
    GROUP BY collection_id, table_name
    ORDER BY collection_id
""")
```

See `references/clinical_data_guide.md` for complete patterns including value mapping and patient cohort selection.
Related Skills

The following skills complement IDC workflows for downstream analysis and visualization:

DICOM Processing
- pydicom - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET).

Pathology and Slide Microscopy
- histolab - Lightweight tile extraction and preprocessing for whole slide images. Use for basic slide processing, tissue detection, and dataset preparation from IDC slide microscopy data.
- pathml - Full-featured computational pathology toolkit. Use for advanced WSI analysis including multiplexed imaging, nucleus segmentation, and ML model training on pathology data downloaded from IDC.

Metadata Visualization
- matplotlib - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.).
- seaborn - Statistical visualization with pandas integration. Use for quick exploration of IDC metadata distributions, relationships between variables, and categorical comparisons with attractive defaults.
- plotly - Interactive visualization. Use when you need hover info, zoom, and pan for exploring IDC metadata, or for creating web-embeddable dashboards of collection statistics.

Data Exploration
- exploratory-data-analysis - Comprehensive EDA on scientific data files. Use after downloading IDC data to understand file structure, quality, and characteristics before analysis.
Resources
资源
Schema Reference (Primary Source)
Always use `client.indices_overview` for current column schemas. This ensures accuracy with the installed idc-index version:

```python
# Get all column names and types for any table
schema = client.indices_overview["index"]["schema"]
columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']]
```
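Once the schema is in hand, a keyword search over column names and descriptions can help locate the right field before writing a query. The sketch below assumes only the shape used in the snippet above (a dict with a `columns` list of `name`/`type`/optional `description` entries). `sample_schema`, its type strings and descriptions, and the helper `find_columns` are hypothetical stand-ins so the example runs without idc-index installed. In practice you would pass `client.indices_overview["index"]["schema"]`.

```python
# Hypothetical stand-in mirroring the schema shape used above; the types and
# descriptions here are illustrative, not taken from a real idc-index release.
sample_schema = {
    "columns": [
        {"name": "collection_id", "type": "STRING", "description": "Collection identifier"},
        {"name": "Modality", "type": "STRING", "description": "DICOM modality of the series"},
        {"name": "series_size_MB", "type": "FLOAT"},  # description intentionally missing
    ]
}


def find_columns(schema, keyword):
    """Return (name, type, description) tuples whose name or description contains keyword."""
    needle = keyword.lower()
    matches = []
    for column in schema["columns"]:
        description = column.get("description", "")
        if needle in column["name"].lower() or needle in description.lower():
            matches.append((column["name"], column["type"], description))
    return matches


for name, col_type, description in find_columns(sample_schema, "size"):
    print(f"{name} ({col_type}) {description}")
```

Using `column.get("description", "")` keeps the search tolerant of columns that carry no description, matching the defensive access in the original snippet.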
Reference Documentation
- clinical_data_guide.md - Clinical/tabular data navigation, value mapping, and joining with imaging data
- cloud_storage_guide.md - Direct cloud bucket access (S3/GCS), file organization, CRDC UUIDs, versioning, and reproducibility
- cli_guide.md - Complete idc-index command-line interface reference (`idc download`, `idc download-from-manifest`, `idc download-from-selection`)
- bigquery_guide.md - Advanced BigQuery usage guide for complex metadata queries
- dicomweb_guide.md - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details
- indices_reference - External documentation for index tables (may be ahead of the installed version)
External Links
- IDC Portal: https://portal.imaging.datacommons.cancer.gov/explore/
- Documentation: https://learn.canceridc.dev/
- Tutorials: https://github.com/ImagingDataCommons/IDC-Tutorials
- User Forum: https://discourse.canceridc.dev/
- idc-index GitHub: https://github.com/ImagingDataCommons/idc-index
- Citation: Fedorov, A., et al. "National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence." RadioGraphics 43.12 (2023). https://doi.org/10.1148/rg.230180
Skill Updates
The version of this skill is recorded in the skill metadata. To check for updates:
- Visit the releases page
- Watch the repository on GitHub (Watch → Custom → Releases)