tcga-bulk-data-preprocessing-with-omicverse

TCGA bulk data preprocessing with omicverse

Overview

Follow this skill to recreate the preprocessing routine from t_tcga.ipynb. It automates loading TCGA downloads, generating raw and normalised matrices, initialising metadata, and running survival analyses through ov.bulk.pyTCGA.
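
The routine can be sketched as a single function. This is a minimal sketch rather than the notebook's exact code: the function name build_tcga_dataset, its parameters, and the default output path are hypothetical, while the omicverse calls are the ones named in this guide. The import is deferred so the sketch can be loaded on a machine without omicverse installed.

```python
def build_tcga_dataset(sample_sheet_path, download_dir, clinical_dir,
                       gene="MYC", out_path="ov_tcga_raw.h5ad"):
    """Build, checkpoint, and annotate a TCGA AnnData object (hypothetical helper)."""
    # Deferred import: omicverse is only needed when the function is called.
    import omicverse as ov

    ov.plot_set()  # apply omicverse plotting defaults

    # Argument order as given in this guide: sample sheet, expression dir, clinical dir.
    tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)
    tcga.adata_init()  # builds the AnnData object with raw count, FPKM, and TPM layers

    # Checkpoint the freshly built object before adding annotations.
    tcga.adata.write_h5ad(out_path, compression="gzip")

    tcga.adata_meta_init()  # gene IDs to symbols, patient info attached
    tcga.survial_init()     # spelling follows the API
    tcga.survival_analysis(gene, layer="deseq_normalize", plot=True)
    return tcga
```

Returning the helper object keeps tcga.adata available for further analysis or export.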

Instructions

  1. Gather required downloads
    • Confirm the user has:
      • gdc_sample_sheet.<date>.tsv from the TCGA Sample Sheet export.
      • The decompressed gdc_download_xxxxx directory containing expression archives.
      • The clinical.cart.<date> directory with clinical XML/JSON files.
    • Mention that sample data are available under omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/.
  2. Initialise the TCGA helper
    • Import omicverse as ov (and scanpy as sc if plotting), then call ov.plot_set().
    • Instantiate aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir).
    • Run aml_tcga.adata_init() to build the AnnData object with raw counts, FPKM, and TPM layers.
  3. Persist the dataset
    • Encourage saving the initial AnnData: aml_tcga.adata.write_h5ad('data/TCGA_OV/ov_tcga_raw.h5ad', compression='gzip').
    • When reloading, reconstruct the class with the same paths and call aml_tcga.adata_read(<path>).
  4. Initialise metadata and clinical information
    • Populate sample metadata using aml_tcga.adata_meta_init() to convert gene IDs to symbols and attach patient info.
    • Add survival attributes via aml_tcga.survial_init() (the API spells it "survial"; keep that spelling when calling it).
  5. Perform survival analyses
    • Plot gene-level survival curves with aml_tcga.survival_analysis('GENE', layer='deseq_normalize', plot=True), replacing 'GENE' with a gene symbol.
    • To process all genes, call aml_tcga.survial_analysis_all() (again spelled "survial", unlike the single-gene method); warn that it may take a long time.
  6. Export results
    • Save enriched metadata to a new AnnData file: aml_tcga.adata.write_h5ad('.../ov_tcga_survial_all.h5ad', compression='gzip').
    • Suggest exporting summary tables (e.g., survival statistics) if users need to share outputs outside Python.
  7. Troubleshooting tips
    • Ensure TCGA archives are fully extracted; missing XML/TSV files trigger parsing errors.
    • The helper expects matching case IDs between the sample sheet and expression files; direct users to re-download if the IDs do not align.
    • Survival plots require clinical dates; if they are absent, instruct users to check the contents of the clinical.cart.<date> directory.
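
Several of the troubleshooting cases above can be caught before the helper is instantiated. The following is a minimal pre-flight sketch using only the standard library; check_tcga_inputs is a hypothetical helper and not part of omicverse. It assumes the sample sheet has a "File ID" column and that the download directory holds one subfolder per file ID, which is how GDC exports are usually laid out. Adjust the column name if your export differs.

```python
import csv
from pathlib import Path


def check_tcga_inputs(sample_sheet, download_dir, clinical_dir):
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems = []
    sheet, download, clinical = Path(sample_sheet), Path(download_dir), Path(clinical_dir)

    if not sheet.is_file():
        problems.append(f"sample sheet not found: {sheet}")
    for label, folder in (("expression", download), ("clinical", clinical)):
        if not folder.is_dir():
            problems.append(f"{label} directory not found: {folder}")
        elif not any(folder.iterdir()):
            problems.append(f"{label} directory is empty: {folder}")

    # Assumed layout: a "File ID" column in the sheet, one subfolder per ID on disk.
    if sheet.is_file() and download.is_dir():
        with sheet.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle, delimiter="\t"))
        missing = [row["File ID"] for row in rows
                   if row.get("File ID") and not (download / row["File ID"]).is_dir()]
        if missing:
            problems.append(
                f"{len(missing)} file ID(s) in the sheet have no folder, e.g. {missing[0]}"
            )
    return problems
```

An empty return value does not guarantee that parsing will succeed; it only rules out missing paths, empty directories, and sheet entries with no matching download folder.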

Examples

  • "Read my TCGA OV download, initialise metadata, and plot MYC survival curves using DESeq-normalised counts."
  • "Reload a saved AnnData file, attach survival annotations, and export the updated .h5ad."
  • "Run survival analysis for all genes and store the enriched dataset."
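
As an illustration, the reload-and-export requests above might translate into the calls below. This is a sketch under assumptions: reload_and_annotate and its parameters are hypothetical, and the saved file is assumed to have been written straight after adata_init(), so metadata and survival fields still need to be initialised after reloading. Only omicverse calls named in this guide are used.

```python
def reload_and_annotate(sample_sheet_path, download_dir, clinical_dir,
                        saved_path, out_path, all_genes=False):
    """Reload a saved AnnData file, add survival annotations, and export it."""
    # Deferred import: omicverse is only needed when the function is called.
    import omicverse as ov

    # Rebuild the helper with the same three paths used when the file was created.
    tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)
    tcga.adata_read(saved_path)

    tcga.adata_meta_init()  # gene symbols and patient info
    tcga.survial_init()     # spelling follows the API

    if all_genes:
        # Runs the survival analysis for every gene; can be slow.
        tcga.survial_analysis_all()

    tcga.adata.write_h5ad(out_path, compression="gzip")
    return tcga
```

Keeping all_genes off by default avoids triggering the long-running all-gene pass unless the user explicitly asks for it.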

References

  • Tutorial notebook: t_tcga.ipynb
  • Sample dataset: data/TCGA_OV/
  • Quick copy/paste commands: reference.md