tcga-bulk-data-preprocessing-with-omicverse
TCGA bulk data preprocessing with omicverse
Overview
Follow this skill to recreate the preprocessing routine from `t_tcga.ipynb`. It automates loading TCGA downloads, generating raw/normalised matrices, initialising metadata, and running survival analyses through `ov.bulk.pyTCGA`.
Instructions
- Gather required downloads
  - Confirm the user has:
    - `gdc_sample_sheet.<date>.tsv` from the TCGA Sample Sheet export.
    - The decompressed `gdc_download_xxxxx` directory containing expression archives.
    - The `clinical.cart.<date>` directory with clinical XML/JSON files.
  - Mention that sample data are available under `omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/`.
- Initialise the TCGA helper
  - Import `omicverse as ov` (and `scanpy as sc` if plotting), then call `ov.plot_set()`.
  - Instantiate `aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)`.
  - Run `aml_tcga.adata_init()` to build the AnnData object with raw counts, FPKM, and TPM layers.
- Persist the dataset
  - Encourage saving the initial AnnData: `aml_tcga.adata.write_h5ad('data/TCGA_OV/ov_tcga_raw.h5ad', compression='gzip')`.
  - When reloading, reconstruct the class with the same paths and call `aml_tcga.adata_read(<path>)`.
- Initialise metadata and clinical information
  - Populate sample metadata using `aml_tcga.adata_meta_init()` to convert gene IDs to symbols and attach patient info.
  - Add survival attributes via `aml_tcga.survial_init()` (note the intentional spelling in the API).
- Perform survival analyses
  - Plot gene-level survival curves with `aml_tcga.survival_analysis('GENE', layer='deseq_normalize', plot=True)`.
  - To process all genes, call `aml_tcga.survial_analysis_all()`; warn that it may take time.
- Export results
  - Save enriched metadata to a new AnnData file (`aml_tcga.adata.write_h5ad('.../ov_tcga_survial_all.h5ad', compression='gzip')`).
  - Suggest exporting summary tables (e.g., survival statistics) if users need to share outputs outside Python.
- Troubleshooting tips
  - Ensure TCGA archives are fully extracted; missing XML/TSV files trigger parsing errors.
  - The helper expects matching case IDs between the sample sheet and expression files; direct users to re-download if the IDs do not align.
  - Survival plots require clinical dates; if they are absent, instruct users to check the contents of the `clinical_cart` directory.
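The steps above can be sketched as a single first-pass function. This is a minimal illustration, not the tutorial's own code: `missing_inputs` and `first_pass` are hypothetical helper names, the path check is a rough stand-in for the troubleshooting tips, and the omicverse calls are limited to those named in the instructions, whose behaviour should be confirmed against the installed version.

```python
from pathlib import Path


def missing_inputs(sample_sheet, download_dir, clinical_dir):
    """Return a list of problems with the three inputs; empty means they look usable."""
    problems = []
    if not Path(sample_sheet).is_file():
        problems.append(f"sample sheet not found: {sample_sheet}")
    for label, folder in (("download", download_dir), ("clinical", clinical_dir)):
        path = Path(folder)
        if not path.is_dir():
            problems.append(f"{label} directory not found: {folder}")
        elif not any(path.iterdir()):
            # An empty folder usually means the archive was never extracted.
            problems.append(f"{label} directory is empty: {folder}")
    return problems


def first_pass(sample_sheet, download_dir, clinical_dir, raw_h5ad, gene="MYC"):
    """Build the AnnData object, save it, attach metadata, and plot one gene."""
    problems = missing_inputs(sample_sheet, download_dir, clinical_dir)
    if problems:
        raise FileNotFoundError("; ".join(problems))

    # Imported lazily so the path check above works without omicverse installed.
    import omicverse as ov

    ov.plot_set()
    tcga = ov.bulk.pyTCGA(sample_sheet, download_dir, clinical_dir)
    tcga.adata_init()                                     # build the expression layers
    tcga.adata.write_h5ad(raw_h5ad, compression="gzip")   # persist the raw object
    tcga.adata_meta_init()                                # gene symbols and patient info
    tcga.survial_init()                                   # spelling follows the API
    tcga.survival_analysis(gene, layer="deseq_normalize", plot=True)
    return tcga
```

Running the path check before constructing the helper surfaces incomplete downloads as one readable message rather than a parsing error deep inside the loader.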
Examples
- "Read my TCGA OV download, initialise metadata, and plot MYC survival curves using DESeq-normalised counts."
- "Reload a saved AnnData file, attach survival annotations, and export the updated `.h5ad`."
- "Run survival analysis for all genes and store the enriched dataset."
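The second and third requests could be handled along the following lines. This is a hedged sketch: `reload_annotate_export` is a hypothetical wrapper, it expects an already constructed `ov.bulk.pyTCGA` helper, and writing `adata.var` to CSV assumes the per-gene table is where any survival statistics end up, which should be checked after a run.

```python
def reload_annotate_export(tcga, raw_h5ad, out_h5ad, out_csv, all_genes=False):
    """Reload a saved AnnData into a pyTCGA helper, annotate it, and export it.

    `tcga` is assumed to be an ov.bulk.pyTCGA instance built with the same
    three paths that were used when `raw_h5ad` was first written.
    """
    tcga.adata_read(raw_h5ad)        # restore the saved AnnData
    tcga.adata_meta_init()           # gene symbols and patient info
    tcga.survial_init()              # spelling follows the API
    if all_genes:
        tcga.survial_analysis_all()  # loops over every gene; may take a long time
    tcga.adata.write_h5ad(out_h5ad, compression="gzip")
    tcga.adata.var.to_csv(out_csv)   # per-gene table for use outside Python
    return tcga
```

Passing the helper in, rather than constructing it inside the function, keeps the paths in one place and lets the same wrapper serve both the quick reload and the all-genes run.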
References
- Tutorial notebook: `t_tcga.ipynb`
- Sample dataset: `data/TCGA_OV/`
- Quick copy/paste commands: `reference.md`