COSMIC Database
Overview
COSMIC (Catalogue of Somatic Mutations in Cancer) is the world's largest and most comprehensive database for exploring somatic mutations in human cancer. This skill provides programmatic access to COSMIC's cancer genomics data: millions of mutations across thousands of cancer types, curated gene lists, mutational signatures, and clinical annotations.
When to Use This Skill
This skill should be used when:
- Downloading cancer mutation data from COSMIC
- Accessing the Cancer Gene Census for curated cancer gene lists
- Retrieving mutational signature profiles
- Querying structural variants, copy number alterations, or gene fusions
- Analyzing drug resistance mutations
- Working with cancer cell line genomics data
- Integrating cancer mutation data into bioinformatics pipelines
- Researching specific genes or mutations in cancer contexts
Prerequisites
Account Registration
COSMIC requires authentication for data downloads:
- Academic users: Free access with registration at https://cancer.sanger.ac.uk/cosmic/register
- Commercial users: License required (contact QIAGEN)
Python Requirements
```bash
uv pip install requests pandas
```
The VCF examples in this document also import pysam, which this command does not install.
Quick Start
1. Basic File Download
Use the `scripts/download_cosmic.py` script to download COSMIC data files:
```python
from scripts.download_cosmic import download_cosmic_file

# Download mutation data
download_cosmic_file(
    email="your_email@institution.edu",
    password="your_password",
    filepath="GRCh38/cosmic/latest/CosmicMutantExport.tsv.gz",
    output_filename="cosmic_mutations.tsv.gz"
)
```
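Hard-coding a password in a script makes it easy to leak through version control. A minimal sketch of one alternative is to read the credentials from environment variables. The variable names `COSMIC_EMAIL` and `COSMIC_PASSWORD` and the helper `load_cosmic_credentials` are hypothetical and not part of the download script.

```python
import os


def load_cosmic_credentials(env=None):
    """Return (email, password) read from environment variables.

    COSMIC_EMAIL and COSMIC_PASSWORD are hypothetical names chosen for this
    sketch; the download script itself does not define them.
    """
    env = os.environ if env is None else env
    required = ("COSMIC_EMAIL", "COSMIC_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["COSMIC_EMAIL"], env["COSMIC_PASSWORD"]
```

The returned pair can then be passed as the `email` and `password` arguments of `download_cosmic_file`.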
2. Command-Line Usage
```bash
# Download using shorthand data type
python scripts/download_cosmic.py user@email.com --data-type mutations

# Download specific file
python scripts/download_cosmic.py user@email.com \
    --filepath GRCh38/cosmic/latest/cancer_gene_census.csv

# Download for specific genome assembly
python scripts/download_cosmic.py user@email.com \
    --data-type gene_census --assembly GRCh37 -o cancer_genes.csv
```
3. Working with Downloaded Data
```python
import pandas as pd

# Read mutation data
mutations = pd.read_csv('cosmic_mutations.tsv.gz', sep='\t', compression='gzip')

# Read Cancer Gene Census
gene_census = pd.read_csv('cancer_gene_census.csv')

# Read VCF format
import pysam
vcf = pysam.VariantFile('CosmicCodingMuts.vcf.gz')
```
Available Data Types
Core Mutations
Download comprehensive mutation data including point mutations, indels, and genomic annotations.
Common data types:
- `mutations` - Complete coding mutations (TSV format)
- `mutations_vcf` - Coding mutations in VCF format
- `sample_info` - Sample metadata and tumor information
```python
# Download all coding mutations
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="GRCh38/cosmic/latest/CosmicMutantExport.tsv.gz"
)
```
Cancer Gene Census
Access the expert-curated list of more than 700 cancer genes with substantial evidence of cancer involvement.
```python
# Download Cancer Gene Census
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="GRCh38/cosmic/latest/cancer_gene_census.csv"
)
```
**Use cases**:
- Identifying known cancer genes
- Filtering variants by cancer relevance
- Understanding gene roles (oncogene vs tumor suppressor)
- Target gene selection for research
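As one way to approach the first two use cases, the following minimal sketch loads gene symbols from the census file and keeps only variants in those genes. The `Gene Symbol` column name is an assumption about the census file layout (check the header of your download). `Role in Cancer` is the column used elsewhere in this document. The helper names and the toy rows are illustrative only.

```python
import csv
import io


def load_census_genes(handle, role_keyword=None):
    """Collect gene symbols from an open Cancer Gene Census CSV.

    Assumes a 'Gene Symbol' column (an assumption, verify against your file)
    and, when role_keyword is given, a 'Role in Cancer' column.
    """
    genes = set()
    for row in csv.DictReader(handle):
        role = row.get("Role in Cancer") or ""
        if role_keyword is None or role_keyword in role:
            genes.add(row["Gene Symbol"])
    return genes


def filter_by_genes(variants, genes, gene_key="gene"):
    """Keep only variant records whose gene is in the given set."""
    return [v for v in variants if v[gene_key] in genes]


# Toy rows for illustration only, standing in for cancer_gene_census.csv
census_csv = io.StringIO(
    'Gene Symbol,Role in Cancer\n'
    'TP53,"oncogene, TSG"\n'
    'KRAS,oncogene\n'
    'RB1,TSG\n'
)
all_genes = load_census_genes(census_csv)
census_csv.seek(0)  # rewind so the same toy data can be read again
tsg_genes = load_census_genes(census_csv, role_keyword="TSG")

# Toy variant records; real records would come from a COSMIC export or a VCF
variants = [{"gene": "KRAS"}, {"gene": "TTN"}]
cancer_variants = filter_by_genes(variants, all_genes)
```

With a real download, `open('cancer_gene_census.csv', newline='')` would take the place of the in-memory `StringIO` object.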
Mutational Signatures
Download signature profiles for mutational signature analysis.
```python
# Download signature definitions
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="signatures/signatures.tsv"
)
```
**Signature types**:
- Single Base Substitution (SBS) signatures
- Doublet Base Substitution (DBS) signatures
- Insertion/Deletion (ID) signatures
Structural Variants and Fusions
Access gene fusion data and structural rearrangements.
Available data types:
- `structural_variants` - Structural breakpoints
- `fusion_genes` - Gene fusion events
```python
# Download gene fusions
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="GRCh38/cosmic/latest/CosmicFusionExport.tsv.gz"
)
```
Copy Number and Expression
Retrieve copy number alterations and gene expression data.
Available data types:
- `copy_number` - Copy number gains/losses
- `gene_expression` - Over/under-expression data
```python
# Download copy number data
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="GRCh38/cosmic/latest/CosmicCompleteCNA.tsv.gz"
)
```
Resistance Mutations
Access drug resistance mutation data with clinical annotations.
```python
# Download resistance mutations
download_cosmic_file(
    email="user@email.com",
    password="password",
    filepath="GRCh38/cosmic/latest/CosmicResistanceMutations.tsv.gz"
)
```
Working with COSMIC Data
Genome Assemblies
COSMIC provides data for two reference genomes:
- GRCh38 (recommended, current standard)
- GRCh37 (legacy, for older pipelines)
Specify the assembly in file paths:
```python
# GRCh38 (recommended)
filepath = "GRCh38/cosmic/latest/CosmicMutantExport.tsv.gz"

# GRCh37 (legacy)
filepath = "GRCh37/cosmic/latest/CosmicMutantExport.tsv.gz"
```
Versioning
- Use `latest` in file paths to always get the most recent release
- COSMIC is updated quarterly (current version: v102, May 2025)
- Specific versions can be used for reproducibility: `v102`, `v101`, etc.
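For reproducible analyses it can help to pin the release explicitly rather than rely on `latest`. A minimal sketch follows, assuming the version string simply takes the place of `latest` in the path, as the list above suggests. `pin_cosmic_version` is a hypothetical helper, not part of the download script.

```python
def pin_cosmic_version(filepath, version):
    """Replace the 'latest' path component with a specific release such as 'v102'.

    Assumes versioned paths have the same layout as 'latest' paths.
    """
    parts = filepath.split("/")
    if "latest" not in parts:
        raise ValueError(f"No 'latest' component in {filepath!r}")
    return "/".join(version if part == "latest" else part for part in parts)


pinned = pin_cosmic_version("GRCh38/cosmic/latest/CosmicMutantExport.tsv.gz", "v102")
```

Recording the pinned path alongside analysis results makes it clear which release produced them.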
File Formats
- TSV/CSV: Tab- or comma-separated text, usually gzip-compressed (`.gz`); pandas reads both compressed and uncompressed files
- VCF: Standard variant format, use with pysam, bcftools, or GATK
- All files include headers describing column contents
Common Analysis Patterns
Filter mutations by gene:
```python
import pandas as pd

mutations = pd.read_csv('cosmic_mutations.tsv.gz', sep='\t', compression='gzip')
tp53_mutations = mutations[mutations['Gene name'] == 'TP53']
```
Identify cancer genes by role:
```python
import pandas as pd

gene_census = pd.read_csv('cancer_gene_census.csv')
oncogenes = gene_census[gene_census['Role in Cancer'].str.contains('oncogene', na=False)]
tumor_suppressors = gene_census[gene_census['Role in Cancer'].str.contains('TSG', na=False)]
```
Extract mutations by cancer type:
```python
import pandas as pd

mutations = pd.read_csv('cosmic_mutations.tsv.gz', sep='\t', compression='gzip')
lung_mutations = mutations[mutations['Primary site'] == 'lung']
```
Work with VCF files:
```python
import pysam

# fetch() needs a tabix index (.tbi) next to the bgzip-compressed VCF
vcf = pysam.VariantFile('CosmicCodingMuts.vcf.gz')
# These coordinates cover part of TP53 on GRCh37; adjust them for GRCh38 files
for record in vcf.fetch('17', 7577000, 7579000):
    print(record.id, record.ref, record.alts, record.info)
```
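The pandas patterns above load the whole file into memory, which can be a problem for multi-gigabyte exports. A minimal sketch of a streaming alternative uses only the standard library. It assumes the `Gene name` and `Primary site` columns used above. The helper names `iter_gene_rows` and `count_primary_sites` are illustrative.

```python
import csv
import gzip
from collections import Counter


def iter_gene_rows(path, gene, gene_column="Gene name"):
    """Yield rows (as dicts) for one gene from a gzip-compressed TSV, one at a time."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if row[gene_column] == gene:
                yield row


def count_primary_sites(path, gene):
    """Tally the 'Primary site' values for one gene without loading the full file."""
    return Counter(row["Primary site"] for row in iter_gene_rows(path, gene))
```

For example, `count_primary_sites('cosmic_mutations.tsv.gz', 'TP53')` would return a `Counter` keyed by primary site. Memory use stays roughly constant because only one row is held at a time.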
Data Reference
For comprehensive information about COSMIC data structure, available files, and field descriptions, see `references/cosmic_data_reference.md`. This reference includes:
- Complete list of available data types and files
- Detailed field descriptions for each file type
- File format specifications
- Common file paths and naming conventions
- Data update schedule and versioning
- Citation information
Use this reference when:
- Exploring what data is available in COSMIC
- Understanding specific field meanings
- Determining the correct file path for a data type
- Planning analysis workflows with COSMIC data
Helper Functions
The download script includes helper functions for common operations:
Get Common File Paths
```python
from scripts.download_cosmic import get_common_file_path

# Get path for mutations file
path = get_common_file_path('mutations', genome_assembly='GRCh38')
# Returns: 'GRCh38/cosmic/latest/CosmicMutantExport.tsv.gz'

# Get path for gene census
path = get_common_file_path('gene_census')
# Returns: 'GRCh38/cosmic/latest/cancer_gene_census.csv'
```
**Available shortcuts**:
- `mutations` - Core coding mutations
- `mutations_vcf` - VCF format mutations
- `gene_census` - Cancer Gene Census
- `resistance_mutations` - Drug resistance data
- `structural_variants` - Structural variants
- `gene_expression` - Expression data
- `copy_number` - Copy number alterations
- `fusion_genes` - Gene fusions
- `signatures` - Mutational signatures
- `sample_info` - Sample metadata
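To illustrate how shortcuts like these could resolve to file paths, here is a minimal sketch of a lookup table. It is not the script's actual implementation: it covers only the shortcuts whose paths appear in this document, and `resolve_shortcut` is a hypothetical name.

```python
# File names taken from the examples in this document. Shortcuts whose file
# names are not shown here are deliberately left out rather than guessed.
ASSEMBLY_FILES = {
    "mutations": "CosmicMutantExport.tsv.gz",
    "gene_census": "cancer_gene_census.csv",
    "fusion_genes": "CosmicFusionExport.tsv.gz",
    "copy_number": "CosmicCompleteCNA.tsv.gz",
    "resistance_mutations": "CosmicResistanceMutations.tsv.gz",
}


def resolve_shortcut(data_type, genome_assembly="GRCh38"):
    """Map a shortcut name to a COSMIC file path (illustrative sketch only)."""
    if genome_assembly not in ("GRCh37", "GRCh38"):
        raise ValueError(f"Unsupported assembly: {genome_assembly!r}")
    if data_type == "signatures":
        # The signatures path shown in this document has no assembly prefix
        return "signatures/signatures.tsv"
    if data_type not in ASSEMBLY_FILES:
        raise ValueError(f"Unknown data type: {data_type!r}")
    return f"{genome_assembly}/cosmic/latest/{ASSEMBLY_FILES[data_type]}"
```

Failing loudly on an unknown shortcut or assembly surfaces a typo before any network request is made.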
Troubleshooting
Authentication Errors
- Verify email and password are correct
- Ensure account is registered at cancer.sanger.ac.uk/cosmic
- Check if commercial license is required for your use case
File Not Found
- Verify the filepath is correct
- Check that the requested version exists
- Use `latest` for the most recent version
- Confirm genome assembly (GRCh37 vs GRCh38) is correct
Large File Downloads
- COSMIC files can be several GB in size
- Ensure sufficient disk space
- Download may take several minutes depending on connection
- The script shows download progress for large files
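Because a failed multi-gigabyte download wastes time, it can be worth checking free disk space first. A minimal sketch uses the standard library. `has_free_space` is a hypothetical helper, and the size you pass in is your own estimate, since this document does not list exact file sizes.

```python
import shutil


def has_free_space(directory, required_bytes):
    """Return True if the filesystem holding `directory` has at least required_bytes free."""
    return shutil.disk_usage(directory).free >= required_bytes
```

For example, `has_free_space(".", 5 * 1024**3)` checks for roughly 5 GB in the current directory; the 5 GB figure is an arbitrary illustration.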
Commercial Use
- Commercial users must license COSMIC through QIAGEN
- Contact: cosmic-translation@sanger.ac.uk
- Academic access is free but requires registration
Integration with Other Tools
COSMIC data integrates well with:
- Variant annotation: VEP, ANNOVAR, SnpEff
- Signature analysis: SigProfiler, deconstructSigs, MuSiCa
- Cancer genomics: cBioPortal, OncoKB, CIViC
- Bioinformatics: Bioconductor, TCGA analysis tools
- Data science: pandas, scikit-learn, PyTorch
Additional Resources
- COSMIC Website: https://cancer.sanger.ac.uk/cosmic
- Documentation: https://cancer.sanger.ac.uk/cosmic/help
- Release Notes: https://cancer.sanger.ac.uk/cosmic/release_notes
- Contact: cosmic@sanger.ac.uk
Citation
When using COSMIC data, cite:
Tate JG, Bamford S, Jubb HC, et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Research. 2019;47(D1):D941-D947.