clinvar-database


ClinVar Database


Overview


ClinVar is NCBI's freely accessible archive of reports on relationships between human genetic variants and phenotypes, with supporting evidence. The database aggregates information about genomic variation and its relationship to human health, providing standardized variant classifications used in clinical genetics and research.

When to Use This Skill


This skill should be used when:
  • Searching for variants by gene, condition, or clinical significance
  • Interpreting clinical significance classifications (pathogenic, benign, VUS)
  • Accessing ClinVar data programmatically via the E-utilities API
  • Downloading and processing bulk data from the FTP site
  • Understanding review status and star ratings
  • Resolving conflicting variant interpretations
  • Annotating variant call sets with clinical significance

Core Capabilities


1. Search and Query ClinVar


Web Interface Queries


Search ClinVar using the web interface at https://www.ncbi.nlm.nih.gov/clinvar/
Common search patterns:
  • By gene: `BRCA1[gene]`
  • By clinical significance: `pathogenic[CLNSIG]`
  • By condition: `breast cancer[disorder]`
  • By variant: `NM_000059.3:c.1310_1313del[variant name]`
  • By chromosome: `13[chr]`
  • Combined: `BRCA1[gene] AND pathogenic[CLNSIG]`
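
The search patterns above all follow the same `value[field]` shape, so they can be composed programmatically. The sketch below is a hypothetical helper (`build_clinvar_query` is not part of any NCBI library); it only formats text and does not check that a field tag is one ClinVar recognizes.

```python
def build_clinvar_query(terms, operator="AND"):
    """Join (value, field) pairs into a ClinVar search string.

    `terms` is a list of (value, field) tuples such as ("BRCA1", "gene").
    Hypothetical helper: it formats text only and does not validate
    field names against ClinVar.
    """
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    parts = []
    for value, field in terms:
        value = value.strip()
        # Quote values that contain a comma so they stay one phrase.
        if "," in value and not value.startswith('"'):
            value = f'"{value}"'
        parts.append(f"{value}[{field}]")
    return f" {operator} ".join(parts)


query = build_clinvar_query([("BRCA1", "gene"), ("pathogenic", "CLNSIG")])
print(query)  # BRCA1[gene] AND pathogenic[CLNSIG]
```

The resulting string can be pasted into the web interface for a manual check before it is used in an automated query.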

Programmatic Access via E-utilities


Access ClinVar programmatically using NCBI's E-utilities API. Refer to `references/api_reference.md` for comprehensive API documentation including:
  • esearch - Search for variants matching criteria
  • esummary - Retrieve variant summaries
  • efetch - Download full XML records
  • elink - Find related records in other NCBI databases
Quick example using curl: an esearch request for pathogenic BRCA1 variants, with `BRCA1[gene] AND pathogenic[CLNSIG]` as the search term.

**Best practices:**
- Test queries on the web interface before automating
- Use API keys to increase rate limits from 3 to 10 requests/second
- Implement exponential backoff for rate limit errors
- Set `Entrez.email` when using Biopython
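
To illustrate the esearch call and the backoff advice above, here is a minimal Python sketch using only the standard library. The helper names `esearch_url` and `call_with_backoff` are assumptions made for this example. The endpoint and the `db`, `term`, `retmode`, `retmax`, and `api_key` parameters are standard E-utilities ones, but check `references/api_reference.md` before relying on them.

```python
import time
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, api_key=None, retmax=20):
    """Build an esearch URL that queries the ClinVar database."""
    params = {"db": "clinvar", "term": term, "retmode": "json", "retmax": retmax}
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


def call_with_backoff(func, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt, then retry.

    Catching every Exception keeps the sketch short; real code would
    retry only on rate-limit or transient network errors.
    """
    for attempt in range(max_tries):
        try:
            return func()
        except Exception:
            if attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


url = esearch_url("BRCA1[gene] AND pathogenic[CLNSIG]")
print(url)
```

An actual request could then be made with something like `call_with_backoff(lambda: urllib.request.urlopen(url).read())`; the `sleep` parameter exists so the retry logic can be exercised without real delays.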


2. Interpret Clinical Significance


Understanding Classifications


ClinVar uses standardized terminology for variant classifications. Refer to `references/clinical_significance.md` for detailed interpretation guidelines.
Key germline classification terms (ACMG/AMP):
  • Pathogenic (P) - Variant causes disease (~99% probability)
  • Likely Pathogenic (LP) - Variant likely causes disease (~90% probability)
  • Uncertain Significance (VUS) - Insufficient evidence to classify
  • Likely Benign (LB) - Variant likely does not cause disease
  • Benign (B) - Variant does not cause disease
Review status (star ratings):
  • ★★★★ Practice guideline - Highest confidence
  • ★★★ Expert panel review (e.g., ClinGen) - High confidence
  • ★★ Criteria provided, multiple submitters, no conflicts - Moderate confidence
  • ★ Criteria provided, single submitter (or criteria provided with conflicting classifications) - Standard weight
  • ☆ No assertion criteria provided - Low confidence
Critical considerations:
  • Always check review status - prefer ★★★ or ★★★★ ratings
  • Conflicting interpretations require manual evaluation
  • Classifications may change as new evidence emerges
  • VUS (uncertain significance) variants lack sufficient evidence for clinical use
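
The star ratings above can be turned into a simple filter. The sketch below assumes the review-status strings match ClinVar's usual wording (with underscores instead of spaces in the VCF `CLNREVSTAT` field); `stars_for` and `is_high_confidence` are hypothetical names, and any status the table does not recognize is treated conservatively as zero stars.

```python
# Assumed mapping from ClinVar review-status wording to star count.
REVIEW_STATUS_STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, single submitter": 1,
    "criteria provided, conflicting classifications": 1,
    "criteria provided, conflicting interpretations": 1,  # older wording
    "no assertion criteria provided": 0,
}


def stars_for(review_status):
    """Return the star count for a review status; unknown statuses give 0."""
    # VCF values use underscores, e.g. "reviewed_by_expert_panel".
    key = review_status.strip().lower().replace("_", " ")
    return REVIEW_STATUS_STARS.get(key, 0)


def is_high_confidence(review_status, min_stars=3):
    """True when the status reaches the preferred three- or four-star level."""
    return stars_for(review_status) >= min_stars


print(stars_for("criteria_provided,_multiple_submitters,_no_conflicts"))  # 2
print(is_high_confidence("reviewed by expert panel"))  # True
```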

3. Download Bulk Data from FTP


Access ClinVar FTP Site


Download complete datasets from ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/. Refer to `references/data_formats.md` for comprehensive documentation on file formats and processing.
Update schedule:
  • Monthly releases: First Thursday of each month (complete dataset, archived)
  • Weekly updates: Every Monday (incremental updates)
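
For scheduling downloads around the monthly release, the "first Thursday" rule can be computed directly. This is a small sketch under the assumption that the schedule above holds; actual release dates can shift, so the FTP listing remains the authority. Both function names are made up for this example.

```python
import calendar
from datetime import date, timedelta


def first_thursday(year, month):
    """Return the date of the first Thursday in the given month."""
    first = date(year, month, 1)
    # date.weekday() counts Monday as 0, so Thursday is 3.
    offset = (calendar.THURSDAY - first.weekday()) % 7
    return first + timedelta(days=offset)


def most_recent_first_thursday(today):
    """Return the latest first-Thursday date that is not after `today`."""
    candidate = first_thursday(today.year, today.month)
    if candidate <= today:
        return candidate
    # Step back one month, wrapping January to December of the prior year.
    if today.month > 1:
        return first_thursday(today.year, today.month - 1)
    return first_thursday(today.year - 1, 12)


print(first_thursday(2024, 1))                       # 2024-01-04
print(most_recent_first_thursday(date(2024, 3, 5)))  # 2024-02-01
```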

Available Formats


XML files (most comprehensive):
  • VCV (Variation) files: `xml/clinvar_variation/` - Variant-centric aggregation
  • RCV (Record) files: `xml/RCV/` - Variant-condition pairs
  • Include full submission details, evidence, and metadata
VCF files (for genomic pipelines):
  • GRCh37: `vcf_GRCh37/clinvar.vcf.gz`
  • GRCh38: `vcf_GRCh38/clinvar.vcf.gz`
  • Limitations: Excludes variants >10kb and complex structural variants
Tab-delimited files (for quick analysis):
  • `tab_delimited/variant_summary.txt.gz` - Summary of all variants
  • `tab_delimited/var_citations.txt.gz` - PubMed citations
  • `tab_delimited/cross_references.txt.gz` - Database cross-references
Example download:
```bash
# Download latest monthly XML release
wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/ClinVarVariationRelease_00-latest.xml.gz

# Download VCF for GRCh38
wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
```

4. Process and Analyze ClinVar Data


Working with XML Files


Process XML files to extract variant details, classifications, and evidence.
Python example with xml.etree:
```python
import gzip
import xml.etree.ElementTree as ET

# Stream the release so the multi-GB XML never has to fit in memory.
# Binary mode lets the parser honor the encoding declared in the file.
with gzip.open('ClinVarVariationRelease.xml.gz', 'rb') as f:
    for event, elem in ET.iterparse(f, events=('end',)):
        if elem.tag == 'VariationArchive':
            variation_id = elem.attrib.get('VariationID')
            # Extract clinical significance, review status, etc.
            elem.clear()  # Free memory
```

Working with VCF Files


Annotate variant calls or filter by clinical significance using bcftools or Python.
Using bcftools:
```bash
# Filter pathogenic variants
bcftools view -i 'INFO/CLNSIG~"Pathogenic"' clinvar.vcf.gz

# Extract specific genes
bcftools view -i 'INFO/GENEINFO~"BRCA"' clinvar.vcf.gz

# Annotate your VCF with ClinVar
bcftools annotate -a clinvar.vcf.gz -c INFO your_variants.vcf
```

**Using PyVCF in Python:**
```python
import vcf

vcf_reader = vcf.Reader(filename='clinvar.vcf.gz')
for record in vcf_reader:
    # CLNSIG is parsed as a list of strings (or is missing entirely)
    clnsig = record.INFO.get('CLNSIG') or []
    if 'Pathogenic' in clnsig:
        # GENEINFO looks like "BRCA1:672"; depending on the header, PyVCF
        # may return it as a plain string or as a one-element list
        geneinfo = record.INFO.get('GENEINFO') or ''
        if isinstance(geneinfo, list):
            geneinfo = geneinfo[0]
        gene = geneinfo.split(':')[0]
        print(f"{record.CHROM}:{record.POS} {gene} - {clnsig}")
```
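
When no VCF library is available, the INFO column can also be parsed with plain string handling. The sketch below is an assumption-laden illustration: `parse_info` and `is_pathogenic` are hypothetical helpers, and the sample line is made-up data in ClinVar's VCF layout, not a real record.

```python
def parse_info(info_field):
    """Turn a VCF INFO string into a dict; flag-style entries map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def is_pathogenic(info):
    """True when CLNSIG lists 'Pathogenic' as one of its terms."""
    # CLNSIG can combine terms, e.g. "Pathogenic/Likely_pathogenic".
    terms = str(info.get("CLNSIG", "")).replace("|", "/").split("/")
    return "Pathogenic" in terms


# Illustrative line only: coordinates and IDs are placeholders.
line = ("17\t100\t1\tG\tA\t.\t.\t"
        "ALLELEID=1;CLNSIG=Pathogenic/Likely_pathogenic;"
        "GENEINFO=BRCA1:672;CLNREVSTAT=reviewed_by_expert_panel")
info = parse_info(line.split("\t")[7])
if is_pathogenic(info):
    print(info["GENEINFO"].split(":")[0], info["CLNSIG"], info["CLNREVSTAT"])
```

Matching whole terms rather than substrings means `Likely_pathogenic` and conflict labels are not counted as pathogenic by accident.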

Working with Tab-Delimited Files


Use pandas or command-line tools for rapid filtering and analysis.
Using pandas:
```python
import pandas as pd

# Load variant summary
df = pd.read_csv('variant_summary.txt.gz', sep='\t', compression='gzip')

# Filter pathogenic variants in specific gene
pathogenic_brca = df[
    (df['GeneSymbol'] == 'BRCA1')
    & (df['ClinicalSignificance'].str.contains('Pathogenic', na=False))
]

# Count variants by clinical significance
sig_counts = df['ClinicalSignificance'].value_counts()
```

**Using command-line tools:**
```bash
# Extract pathogenic variants for specific gene
# Columns are positional (5 = GeneSymbol, 7 = ClinicalSignificance);
# confirm them against the header line of your release
zcat variant_summary.txt.gz | \
  awk -F'\t' '$5=="TP53" && $7~"Pathogenic"' | \
  cut -f1,5,7,13,14
```

5. Handle Conflicting Interpretations


When multiple submitters provide different classifications for the same variant, ClinVar reports "Conflicting interpretations of pathogenicity."
Resolution strategy:
  1. Check review status (star rating) - higher ratings carry more weight
  2. Examine evidence and assertion criteria from each submitter
  3. Consider submission dates - newer submissions may reflect updated evidence
  4. Review population frequency data (e.g., gnomAD) for context
  5. Consult expert panel classifications (★★★) when available
  6. For clinical use, always defer to a genetics professional
Search query to exclude conflicts:
TP53[gene] AND pathogenic[CLNSIG] NOT conflicting[RVSTAT]
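
The first and third steps of the resolution strategy (weigh by star rating, then by recency) can be sketched as a sort. The record layout, field names, and sample submissions below are all hypothetical, and the ordering is only a triage aid for manual review, not a way to decide a classification.

```python
from datetime import date

# Assumed star values for ClinVar review-status wording; unknown gives 0.
STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
}


def rank_submissions(submissions):
    """Order submissions by star rating, then by most recent evaluation."""
    return sorted(
        submissions,
        key=lambda s: (STARS.get(s["review_status"], 0), s["last_evaluated"]),
        reverse=True,
    )


def classifications_differ(submissions):
    """True when submitters do not all report the same classification."""
    return len({s["classification"] for s in submissions}) > 1


# Illustrative records only; these are not real ClinVar submissions.
example = [
    {"submitter": "Lab A", "classification": "Pathogenic",
     "review_status": "criteria provided, single submitter",
     "last_evaluated": date(2019, 5, 1)},
    {"submitter": "Expert Panel", "classification": "Likely pathogenic",
     "review_status": "reviewed by expert panel",
     "last_evaluated": date(2021, 3, 15)},
    {"submitter": "Lab B", "classification": "Uncertain significance",
     "review_status": "criteria provided, single submitter",
     "last_evaluated": date(2022, 8, 30)},
]

if classifications_differ(example):
    for sub in rank_submissions(example):
        print(sub["submitter"], sub["classification"], sub["last_evaluated"])
```

Note that `classifications_differ` flags any disagreement in wording, which is broader than what ClinVar itself reports as a conflict.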

6. Track Classification Updates


Variant classifications may change over time as new evidence emerges.
Why classifications change:
  • New functional studies or clinical data
  • Updated population frequency information
  • Revised ACMG/AMP guidelines
  • Segregation data from additional families
Best practices:
  • Document ClinVar version and access date for reproducibility
  • Re-check classifications periodically for critical variants
  • Subscribe to ClinVar mailing list for major updates
  • Use monthly archived releases for stable datasets
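
One way to re-check classifications between releases is to diff two snapshots. The sketch below assumes each snapshot has already been reduced to a dictionary mapping a variation ID to its classification string; the function name, the IDs, and the classifications are placeholders for illustration.

```python
def classification_changes(old, new):
    """Compare two snapshots (variation ID -> classification).

    Returns a dict with three keys:
      changed: ID -> (old classification, new classification)
      added:   IDs present only in the newer snapshot
      removed: IDs present only in the older snapshot
    """
    changed = {
        vid: (old[vid], new[vid])
        for vid in old.keys() & new.keys()
        if old[vid] != new[vid]
    }
    return {
        "changed": changed,
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
    }


# Placeholder data standing in for two monthly releases.
release_a = {101: "Uncertain significance", 102: "Pathogenic", 103: "Benign"}
release_b = {101: "Likely pathogenic", 102: "Pathogenic", 104: "Likely benign"}

report = classification_changes(release_a, release_b)
print(report["changed"])  # {101: ('Uncertain significance', 'Likely pathogenic')}
print(report["added"], report["removed"])  # [104] [103]
```

Storing the release label and access date alongside each report keeps the comparison reproducible.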

7. Submit Data to ClinVar


Organizations can submit variant interpretations to ClinVar.
Submission methods:
Requirements:
  • Organizational account with NCBI
  • Assertion criteria (preferably ACMG/AMP guidelines)
  • Supporting evidence for classification
Contact: clinvar@ncbi.nlm.nih.gov for submission account setup.

Workflow Examples


Example 1: Identify High-Confidence Pathogenic Variants in a Gene


Objective: Find pathogenic variants in the CFTR gene that have expert panel review.
Steps:
  1. Search using web interface or E-utilities:
    CFTR[gene] AND pathogenic[CLNSIG] AND (reviewed by expert panel[RVSTAT] OR practice guideline[RVSTAT])
  2. Review results, noting review status (should be ★★★ or ★★★★)
  3. Export variant list or retrieve full records via efetch
  4. Cross-reference with clinical presentation if applicable

Example 2: Annotate VCF with ClinVar Classifications


Objective: Add clinical significance annotations to variant calls.
Steps:
  1. Download appropriate ClinVar VCF (match genome build: GRCh37 or GRCh38):
    ```bash
    wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
    wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi
    ```
  2. Annotate using bcftools:
    ```bash
    bcftools annotate -a clinvar.vcf.gz \
      -c INFO/CLNSIG,INFO/CLNDN,INFO/CLNREVSTAT \
      -o annotated_variants.vcf \
      your_variants.vcf
    ```
  3. Filter annotated VCF for pathogenic variants:
    ```bash
    bcftools view -i 'INFO/CLNSIG~"Pathogenic"' annotated_variants.vcf
    ```

Example 3: Analyze Variants for a Specific Disease


Objective: Study all variants associated with hereditary breast cancer.
Steps:
  1. Search by condition:
    hereditary breast cancer[disorder] OR "Breast-ovarian cancer, familial"[disorder]
  2. Download results as CSV or retrieve via E-utilities
  3. Filter by review status to prioritize high-confidence variants
  4. Analyze distribution across genes (BRCA1, BRCA2, PALB2, etc.)
  5. Examine variants with conflicting interpretations separately

Example 4: Bulk Download and Database Construction


Objective: Build a local ClinVar database for an analysis pipeline.
Steps:
  1. Download monthly release for reproducibility:
    ```bash
    wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/ClinVarVariationRelease_YYYY-MM.xml.gz
    ```
  2. Parse XML and load into database (PostgreSQL, MySQL, MongoDB)
  3. Index by gene, position, clinical significance, review status
  4. Implement version tracking for updates
  5. Schedule monthly updates from FTP site
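
Steps 2 to 4 can be sketched with a small schema. The example below swaps in SQLite (from Python's standard library) for the PostgreSQL, MySQL, or MongoDB options named above so that it runs without a server. The table layout, column names, and sample rows are assumptions for illustration, and the XML parsing step is assumed to have already produced one dictionary per variant.

```python
import sqlite3

# Assumed schema: one row per variant per release, so older releases
# remain queryable and classification history is preserved.
SCHEMA = """
CREATE TABLE IF NOT EXISTS variants (
    variation_id INTEGER NOT NULL,
    release TEXT NOT NULL,
    gene TEXT,
    chrom TEXT,
    pos INTEGER,
    clinical_significance TEXT,
    review_status TEXT,
    PRIMARY KEY (variation_id, release)
);
CREATE INDEX IF NOT EXISTS idx_gene ON variants (gene);
CREATE INDEX IF NOT EXISTS idx_position ON variants (chrom, pos);
CREATE INDEX IF NOT EXISTS idx_clinsig ON variants (clinical_significance);
CREATE INDEX IF NOT EXISTS idx_review ON variants (review_status);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def load_release(conn, release, rows):
    """Insert parsed variants for one release; reloading a release is safe."""
    conn.executemany(
        "INSERT OR REPLACE INTO variants VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(r["variation_id"], release, r["gene"], r["chrom"], r["pos"],
          r["clinical_significance"], r["review_status"]) for r in rows],
    )
    conn.commit()


# Placeholder rows standing in for parsed XML records.
rows = [
    {"variation_id": 1, "gene": "BRCA1", "chrom": "17", "pos": 100,
     "clinical_significance": "Pathogenic",
     "review_status": "reviewed by expert panel"},
    {"variation_id": 2, "gene": "TP53", "chrom": "17", "pos": 200,
     "clinical_significance": "Uncertain significance",
     "review_status": "criteria provided, single submitter"},
]
conn = open_db()
load_release(conn, "2024-01", rows)
print(conn.execute(
    "SELECT gene, clinical_significance FROM variants WHERE release = ?",
    ("2024-01",)).fetchall())
```

Keying rows on both the variation ID and the release label is one simple form of version tracking; a monthly job would call `load_release` once per new archive.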

Important Limitations and Considerations


Data Quality


  • Not all submissions have equal weight - Check review status (star ratings)
  • Conflicting interpretations exist - Require manual evaluation
  • Historical submissions may be outdated - Newer data may be more accurate
  • VUS classification is not a clinical diagnosis - Means insufficient evidence

Scope Limitations


  • Not for direct clinical diagnosis - Always involve a genetics professional
  • Population-specific - Variant frequencies vary by ancestry
  • Incomplete coverage - Not all genes or variants are well-studied
  • Version dependencies - Coordinate genome build (GRCh37/GRCh38) across analyses

Technical Limitations


  • VCF files exclude large variants - Variants >10kb not in VCF format
  • Rate limits on API - 3 req/sec without key, 10 req/sec with API key
  • File sizes - Full XML releases are multi-GB compressed files
  • No real-time updates - Website updated weekly, FTP monthly/weekly

Resources


Reference Documentation


This skill includes comprehensive reference documentation:
  • `references/api_reference.md` - Complete E-utilities API documentation with examples for esearch, esummary, efetch, and elink; includes rate limits, authentication, and Python/Biopython code samples
  • `references/clinical_significance.md` - Detailed guide to interpreting clinical significance classifications, review status star ratings, conflict resolution, and best practices for variant interpretation
  • `references/data_formats.md` - Documentation for XML, VCF, and tab-delimited file formats; FTP directory structure, processing examples, and format selection guidance

External Resources


Contact


For questions about ClinVar or data submission: clinvar@ncbi.nlm.nih.gov