bio-cosmic

COSMIC Toolkit

Query COSMIC Cancer Gene Census for cancer gene annotation. Check if genes are known cancer genes and retrieve their properties (role, tier, tumor types, etc.).

Quick Start

Install

Install Python dependencies:
bash
uv pip install pandas typer

Set Up COSMIC Data

Download the Cancer Gene Census from COSMIC and place it in the data/ directory:
  1. Register at https://cancer.sanger.ac.uk/cosmic/register (free for academic use)
  2. Download cancer_gene_census.csv from https://cancer.sanger.ac.uk/cosmic/download
  3. Place the file at cosmic-toolkit/data/cancer_gene_census.csv
See data/README.md for detailed instructions.

Basic Usage

bash
# Query a single gene
python scripts/query_cosmic_genes.py --gene TP53

# Query multiple genes
python scripts/query_cosmic_genes.py --genes TP53 BRCA1 EGFR

# Query from a file
python scripts/query_cosmic_genes.py --gene-list genes.txt --output results.json

Scripts

query_cosmic_genes.py - Cancer Gene Census Query

Query COSMIC Cancer Gene Census to check if genes are known cancer genes and retrieve their properties.

Required Arguments

One of the following:
  • --gene TEXT - Single gene symbol to query
  • --genes TEXT [TEXT ...] - Multiple gene symbols (space-separated)
  • --gene-list PATH - File containing gene symbols (one per line)

Optional Arguments

Data Source:
  • --gene-census PATH - Path to cancer_gene_census.csv (default: data/cancer_gene_census.csv)
Output:
  • --output PATH - Output JSON file path (default: stdout)

Output Format (JSON)

The script outputs all columns from the Cancer Gene Census CSV as JSON. Common fields include:
json
{
  "summary": {
    "total_genes": 3,
    "found_in_cancer_census": 2,
    "not_found": 1
  },
  "genes": {
    "TP53": {
      "found": true,
      "Gene Symbol": "TP53",
      "Name": "tumor protein p53",
      "Entrez GeneId": "7157",
      "Genome Location": "17:7661779-7687538",
      "Tier": "1",
      "Hallmark": "Yes",
      "Chr Band": "17p13.1",
      "Somatic": "yes",
      "Germline": "yes",
      "Tumour Types(Somatic)": "lung NS, breast NS, colorectal NS, ...",
      "Tumour Types(Germline)": "Li-Fraumeni syndrome",
      "Cancer Syndrome": "Li-Fraumeni syndrome",
      "Tissue Type": "E",
      "Molecular Genetics": "Dom",
      "Role in Cancer": "TSG",
      "Mutation Types": "Mis, N, F, D"
    },
    "BRCA1": {
      "found": true,
      "Gene Symbol": "BRCA1",
      "Name": "BRCA1 DNA repair associated",
      "Entrez GeneId": "672",
      "Genome Location": "17:43044295-43125483",
      "Tier": "1",
      "Hallmark": "Yes",
      "Role in Cancer": "TSG",
      "Somatic": "yes",
      "Germline": "yes",
      "Tumour Types(Somatic)": "breast, ovary",
      "Cancer Syndrome": "Breast-ovarian cancer, familial, susceptibility to, 1"
    },
    "UNKNOWN_GENE": {
      "found": false
    }
  }
}
Note: All columns from the Cancer Gene Census CSV are included in the output. The script dynamically adapts to COSMIC format updates.
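
To make the output shape above more concrete, here is a minimal sketch of how such a lookup could be written. It is not the actual code of query_cosmic_genes.py: the function name `lookup_genes`, the use of the standard-library `csv` module, and the abbreviated in-memory census are all assumptions made for illustration.

```python
import csv
import io


def lookup_genes(census_file, symbols, key_column="Gene Symbol"):
    """Return a dict shaped like the JSON output shown above.

    census_file is any open text file (or file-like object) holding the CSV.
    Hypothetical helper for illustration only.
    """
    # Index every row by its gene symbol. All columns are kept as-is,
    # so new or renamed columns pass straight through to the output.
    rows = {row[key_column]: row for row in csv.DictReader(census_file)}

    genes = {}
    for symbol in symbols:
        if symbol in rows:
            genes[symbol] = {"found": True, **rows[symbol]}
        else:
            genes[symbol] = {"found": False}

    found = sum(1 for entry in genes.values() if entry["found"])
    return {
        "summary": {
            "total_genes": len(genes),
            "found_in_cancer_census": found,
            "not_found": len(genes) - found,
        },
        "genes": genes,
    }


# Abbreviated in-memory census standing in for the real downloaded file.
demo_csv = io.StringIO(
    "Gene Symbol,Tier,Role in Cancer\n"
    "TP53,1,TSG\n"
    "EGFR,1,oncogene\n"
)
result = lookup_genes(demo_csv, ["TP53", "EGFR", "UNKNOWN_GENE"])
print(result["summary"])
```

Because the rows are copied column by column, a sketch like this needs no changes when a COSMIC release adds columns, as long as the key column keeps its name.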

Usage Examples

bash
# Query a single gene
python scripts/query_cosmic_genes.py --gene TP53

# Query multiple genes
python scripts/query_cosmic_genes.py --genes TP53 BRCA1 EGFR KRAS

# Query from a gene list file
python scripts/query_cosmic_genes.py --gene-list candidate_genes.txt

# Save output to a file
python scripts/query_cosmic_genes.py \
  --genes TP53 BRCA1 EGFR \
  --output cancer_genes.json

# Use a custom Cancer Gene Census file
python scripts/query_cosmic_genes.py \
  --gene TP53 \
  --gene-census /path/to/cancer_gene_census.csv

Workflow Examples

Example 1: Annotate WGS Candidate Genes

Filter WGS results to known cancer genes:
bash
# Step 1: Extract gene names from VCF (using bcftools or grep).
# This assumes the variants carry a GENE tag in the INFO column.
bcftools query -f '%INFO/GENE\n' variants.vcf | sort -u > candidate_genes.txt

# Step 2: Check which genes are in Cancer Gene Census
python scripts/query_cosmic_genes.py \
  --gene-list candidate_genes.txt \
  --output cancer_gene_annotation.json

# Step 3: Parse results to filter cancer genes only
jq '.genes | to_entries | map(select(.value.found == true)) | from_entries' cancer_gene_annotation.json

Example 2: Identify Tier 1 Cancer Genes

Filter results to only Tier 1 cancer genes (highest confidence):
bash
# Query genes
python scripts/query_cosmic_genes.py \
  --gene-list genes.txt \
  --output results.json

# Filter to Tier 1 genes only
jq '.genes | to_entries | map(select(.value.Tier == "1")) | from_entries' results.json
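
If jq is not available, the same kind of filter can be sketched in Python. The helper name `select_genes` and the placeholder symbol `GENE_X` are hypothetical, and the inline JSON stands in for the contents of results.json.

```python
import json


def select_genes(results, predicate):
    """Keep only the gene entries for which predicate(entry) is true."""
    return {
        symbol: entry
        for symbol, entry in results["genes"].items()
        if predicate(entry)
    }


# Inline stand-in for loading results.json with json.load().
# GENE_X is a placeholder symbol, not a real census entry.
results = json.loads("""
{
  "genes": {
    "TP53": {"found": true, "Tier": "1"},
    "GENE_X": {"found": true, "Tier": "2"},
    "UNKNOWN_GENE": {"found": false}
  }
}
""")

# .get() returns None for entries without a "Tier" field (for example
# genes that were not found), so those entries are skipped.
tier1 = select_genes(results, lambda entry: entry.get("Tier") == "1")
print(sorted(tier1))
```

The predicate can be swapped for any other field test, such as `entry.get("Germline") == "yes"`.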

Example 3: Separate Oncogenes and Tumor Suppressors

Classify cancer genes by their role:
bash
# Query genes
python scripts/query_cosmic_genes.py \
  --genes TP53 BRCA1 EGFR KRAS MYC \
  --output cancer_genes.json

# Extract tumor suppressor genes (TSG).
# The `// ""` fallback keeps jq from failing on entries without a role (e.g. "found": false).
jq '.genes | to_entries | map(select((.value."Role in Cancer" // "") | contains("TSG"))) | from_entries' cancer_genes.json

# Extract oncogenes
jq '.genes | to_entries | map(select((.value."Role in Cancer" // "") | contains("oncogene"))) | from_entries' cancer_genes.json
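
A gene can carry more than one role, so grouping by role is sometimes more convenient than running one filter per role. The sketch below assumes that multiple roles are stored as a comma-separated string in "Role in Cancer"; the function name `split_by_role` and the placeholder symbol `GENE_A` are hypothetical.

```python
def split_by_role(genes):
    """Group gene symbols by each role listed in "Role in Cancer"."""
    groups = {}
    for symbol, entry in genes.items():
        value = entry.get("Role in Cancer")
        # Entries without a usable role string (for example "found": false)
        # contribute nothing.
        if not isinstance(value, str):
            continue
        for role in (part.strip() for part in value.split(",")):
            if role:
                groups.setdefault(role, []).append(symbol)
    return groups


# Stand-in for the "genes" object of the script output.
# GENE_A is a placeholder symbol used to show a multi-role value.
genes = {
    "TP53": {"found": True, "Role in Cancer": "TSG"},
    "GENE_A": {"found": True, "Role in Cancer": "oncogene, fusion"},
    "UNKNOWN_GENE": {"found": False},
}
groups = split_by_role(genes)
print(groups)
```

Unlike a substring test, this matches whole role names, so a gene appears under every role it is annotated with and under no others.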

Example 4: Check Germline vs Somatic Cancer Genes

Identify genes involved in germline or somatic cancer:
bash
# Query genes
python scripts/query_cosmic_genes.py \
  --gene-list genes.txt \
  --output results.json

# Filter germline cancer genes
jq '.genes | to_entries | map(select(.value.Germline == "yes")) | from_entries' results.json

# Filter somatic cancer genes
jq '.genes | to_entries | map(select(.value.Somatic == "yes")) | from_entries' results.json

Cancer Gene Census Fields

Common fields in the output (exact fields depend on COSMIC version):
  • Gene Symbol - Official gene symbol
  • Name - Full gene name
  • Entrez GeneId - NCBI Entrez Gene ID
  • Genome Location - Chromosomal location (GRCh38)
  • Tier - 1 (high confidence) or 2 (lower confidence)
  • Hallmark - Hallmark cancer gene (Yes/No)
  • Chr Band - Cytogenetic band
  • Somatic - Involved in somatic cancer (yes/no)
  • Germline - Involved in germline cancer (yes/no)
  • Tumour Types(Somatic) - Cancer types (somatic)
  • Tumour Types(Germline) - Cancer syndromes (germline)
  • Cancer Syndrome - Associated cancer syndrome
  • Tissue Type - Tissue type (E=epithelial, M=mesenchymal, L=leukemia/lymphoma, etc.)
  • Molecular Genetics - Inheritance pattern (Dom, Rec)
  • Role in Cancer - TSG (tumor suppressor), oncogene, or fusion
  • Mutation Types - Types of mutations (Mis=missense, N=nonsense, F=frameshift, etc.)
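
Several of these fields hold abbreviated codes. As a small illustration, the sketch below expands only the abbreviations spelled out in the list above and leaves any other code untouched rather than guessing its meaning. The names `MUTATION_TYPES`, `TISSUE_TYPES`, and `expand_codes` are hypothetical.

```python
# Abbreviations taken from the field list above. Codes not listed there
# are passed through unchanged rather than guessed.
MUTATION_TYPES = {"Mis": "missense", "N": "nonsense", "F": "frameshift"}
TISSUE_TYPES = {"E": "epithelial", "M": "mesenchymal", "L": "leukemia/lymphoma"}


def expand_codes(value, table):
    """Expand a comma-separated code string, keeping unknown codes as-is."""
    codes = [code.strip() for code in value.split(",") if code.strip()]
    return [table.get(code, code) for code in codes]


print(expand_codes("Mis, N, F, D", MUTATION_TYPES))
# → ['missense', 'nonsense', 'frameshift', 'D']
print(expand_codes("E", TISSUE_TYPES))
# → ['epithelial']
```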

Error Handling

Cancer Gene Census File Not Found

bash
$ python scripts/query_cosmic_genes.py --gene TP53

Error: Cancer Gene Census file not found at: data/cancer_gene_census.csv

To use this tool, please download COSMIC data:

1. Register for free academic access:
   https://cancer.sanger.ac.uk/cosmic/register

2. Download Cancer Gene Census:
   https://cancer.sanger.ac.uk/cosmic/download
   File: cancer_gene_census.csv (GRCh38)

3. Place the file at:
   cosmic-toolkit/data/cancer_gene_census.csv

For more information, see: cosmic-toolkit/data/README.md
Solution: Follow the instructions in data/README.md to download and place the Cancer Gene Census file.

No Input Specified

bash
$ python scripts/query_cosmic_genes.py

Error: Must specify --gene, --genes, or --gene-list
Solution: Provide at least one gene to query:
bash
python scripts/query_cosmic_genes.py --gene TP53

Gene Not Found

Genes not in the Cancer Gene Census will have "found": false:
json
{
  "UNKNOWN_GENE": {
    "found": false
  }
}
This is normal and indicates the gene is not in the expert-curated cancer gene list.

Best Practices

1. Keep Cancer Gene Census Updated

COSMIC is updated quarterly. Periodically download the latest version:
bash
# Download the new version and replace the existing file
mv ~/Downloads/cancer_gene_census.csv cosmic-toolkit/data/

2. Use Gene List Files for Batch Queries

For multiple genes, use a gene list file instead of command-line arguments:
bash
# ✅ Good: Use a file for many genes
python scripts/query_cosmic_genes.py --gene-list genes.txt

# ❌ Bad: Long command line
python scripts/query_cosmic_genes.py --genes GENE1 GENE2 GENE3 ... GENE100
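
When the gene symbols come from another program rather than a hand-edited file, a gene list file in the expected one-symbol-per-line layout can be produced with a few lines of Python. This is only a sketch: the helper name `write_gene_list` is hypothetical, and the demo writes to a temporary directory instead of a real genes.txt.

```python
import tempfile
from pathlib import Path


def write_gene_list(symbols, path):
    """Write unique, non-empty symbols to path, one per line, keeping order."""
    seen = {}
    for symbol in symbols:
        cleaned = symbol.strip()
        if cleaned:
            seen.setdefault(cleaned, None)  # dicts preserve insertion order
    Path(path).write_text("".join(f"{s}\n" for s in seen), encoding="utf-8")
    return list(seen)


# Demo: duplicates, blanks, and stray whitespace are cleaned up.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "genes.txt"
    written = write_gene_list(["TP53", " BRCA1 ", "", "TP53", "EGFR"], target)
    file_text = target.read_text(encoding="utf-8")

print(written)
```

The resulting file can then be passed to the script with --gene-list.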

3. Filter Results with jq

Use jq to post-process JSON output:
bash
# Extract only Tier 1 genes
python scripts/query_cosmic_genes.py --gene-list genes.txt |
  jq '.genes | to_entries | map(select(.value.Tier == "1"))'

# Count tumor suppressor genes.
# The `// ""` fallback keeps jq from failing on genes that were not found.
python scripts/query_cosmic_genes.py --gene-list genes.txt |
  jq '[.genes[] | select((."Role in Cancer" // "") | contains("TSG"))] | length'

4. Combine with Other Tools

Integrate with a WGS analysis workflow:
bash
# Extract genes from VCF
bcftools query -f '%INFO/GENE\n' variants.vcf | sort -u > genes.txt

# Annotate with COSMIC
python scripts/query_cosmic_genes.py --gene-list genes.txt --output cosmic_annotation.json

# Filter VCF to cancer genes only (using cancer gene list)
jq -r '.genes | to_entries | map(select(.value.found == true)) | .[].key' cosmic_annotation.json > cancer_genes.txt
bcftools view -i "GENE=@cancer_genes.txt" variants.vcf > cancer_variants.vcf

Integration with WGS Pipeline

Typical WGS Workflow

  1. Variant Calling → VCF file
  2. Gene Extraction → Gene list
  3. COSMIC Annotation → Identify cancer genes
  4. Filtering → Focus on cancer-relevant variants

Example Pipeline

bash
# 1. Extract genes from VCF
bcftools query -f '%INFO/GENE\n' variants.vcf | sort -u > all_genes.txt

# 2. Query COSMIC
python scripts/query_cosmic_genes.py \
  --gene-list all_genes.txt \
  --output cosmic_results.json

# 3. Extract Tier 1 cancer gene names
jq -r '.genes | to_entries | map(select(.value.found == true and .value.Tier == "1")) | .[].key' \
  cosmic_results.json > tier1_cancer_genes.txt

# 4. Filter VCF to Tier 1 cancer genes
bcftools view -i "GENE=@tier1_cancer_genes.txt" variants.vcf > cancer_variants.vcf
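
The final filtering step can also be sketched in plain Python, which makes the matching rule explicit: a record is kept only when its gene annotation equals a wanted symbol exactly. The sketch assumes, as the commands above do, that each record carries a GENE tag in the INFO column. The function name `filter_vcf_lines` is hypothetical, and the demo records use made-up positions.

```python
def filter_vcf_lines(lines, wanted_genes, tag="GENE"):
    """Yield header lines plus records whose INFO tag matches a wanted gene."""
    wanted = set(wanted_genes)
    prefix = tag + "="
    for line in lines:
        if line.startswith("#"):
            yield line  # keep all header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # malformed record without an INFO column
        for item in fields[7].split(";"):
            if item.startswith(prefix):
                # A tag may hold several comma-separated values.
                if wanted.intersection(item[len(prefix):].split(",")):
                    yield line
                break


# Demo records with made-up positions; TP53BP2 must not match "TP53".
vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "17\t100\t.\tC\tT\t.\tPASS\tDP=30;GENE=TP53\n",
    "1\t200\t.\tG\tA\t.\tPASS\tGENE=TP53BP2\n",
]
kept = list(filter_vcf_lines(vcf_lines, ["TP53"]))
print("".join(kept), end="")
```

Exact matching matters here because substring-based tools would also keep genes whose names merely contain a wanted symbol.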

Related Skills

  • vcf-toolkit - VCF variant analysis and filtering
  • bam-toolkit - BAM alignment file operations
  • sequence-io - FASTA/GenBank sequence operations

Troubleshooting

CSV Format Changes

The script dynamically reads all columns, so it should adapt to COSMIC format updates. If issues occur:
  1. Check the CSV file has a "Gene Symbol" column
  2. Verify the file is properly formatted (no corruption)
  3. Try re-downloading the file
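
The first check in the list can be automated with a short script. This is a sketch under assumptions: the helper name `check_census_header` is hypothetical, and the in-memory CSV snippets stand in for a real downloaded file.

```python
import csv
import io


def check_census_header(census_file, required=("Gene Symbol",)):
    """Return the list of required columns missing from the CSV header."""
    reader = csv.reader(census_file)
    header = next(reader, [])  # an empty file yields an empty header
    # Strip whitespace and a possible UTF-8 byte-order mark from the names.
    names = {name.strip().lstrip("\ufeff") for name in header}
    return [column for column in required if column not in names]


# In-memory stand-ins for a well-formed and a malformed census file.
good = io.StringIO("Gene Symbol,Tier\nTP53,1\n")
bad = io.StringIO("Symbol,Tier\nTP53,1\n")

print(check_census_header(good))
# → []
print(check_census_header(bad))
# → ['Gene Symbol']
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to the function.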

Memory Issues with Large Gene Lists

For very large gene lists (>10,000 genes), consider splitting:
bash
# Split gene list
split -l 1000 large_gene_list.txt genes_part_

# Process each part
for file in genes_part_*; do
  python scripts/query_cosmic_genes.py --gene-list "$file" --output "${file}.json"
done

# Merge results: combine gene entries and sum the per-part summary counts
jq -s '{summary: {total_genes: (map(.summary.total_genes) | add),
                  found_in_cancer_census: (map(.summary.found_in_cancer_census) | add),
                  not_found: (map(.summary.not_found) | add)},
        genes: (map(.genes) | add)}' genes_part_*.json > merged_results.json

Citation

When using COSMIC data, please cite:
Tate JG, Bamford S, Jubb HC, et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Research. 2019;47(D1):D941-D947.

Additional Resources
