tooluniverse-variant-analysis

Variant Analysis and Annotation

Production-ready VCF processing and variant annotation skill combining local bioinformatics computation with ToolUniverse database integration. Designed to answer bioinformatics analysis questions about VCF data, mutation classification, variant filtering, and clinical annotation.

When to Use This Skill

Triggers:
  • User provides a VCF file (SNV/indel or SV) and asks questions about its contents
  • Questions about variant allele frequency (VAF) filtering
  • Mutation type classification queries (missense, nonsense, synonymous, etc.)
  • Structural variant interpretation requests (deletions, duplications, CNVs)
  • Variant annotation requests (ClinVar, gnomAD, CADD, dbSNP)
  • CNV pathogenicity assessment using ClinGen dosage sensitivity
  • Cohort comparison questions
  • Population frequency filtering (SNVs or SVs)
  • Intronic/intergenic variant filtering
  • Gene dosage sensitivity queries
Example Questions:
  • "What fraction of variants with VAF < 0.3 are annotated as missense mutations?"
  • "After filtering intronic/intergenic variants, how many non-reference variants remain?"
  • "What is the clinical significance of this deletion affecting BRCA1?"
  • "Which dosage-sensitive genes overlap this 500kb duplication on chr17?"
  • "How many variants have clinical significance annotations?"
  • "Compare variant counts between samples"

Core Capabilities

| Capability | Description |
|---|---|
| VCF Parsing | Pure Python + cyvcf2 parsers. VCF 4.x, gzipped, multi-sample, SNV/indel/SV |
| Mutation Classification | Maps SO terms, SnpEff ANN, VEP CSQ, GATK Funcotator to standard types |
| VAF Extraction | Handles AF, AD, AO/RO, NR/NV, INFO AF formats |
| Filtering | VAF, depth, quality, PASS, variant type, mutation type, consequence, chromosome, SV size |
| Statistics | Ti/Tv ratio, per-sample VAF/depth stats, mutation type distribution, SV size distribution |
| Annotation | MyVariant.info (aggregates ClinVar, dbSNP, gnomAD, CADD, SIFT, PolyPhen) |
| SV/CNV Analysis | gnomAD SV population frequencies, DGVa/dbVar known SVs, ClinGen dosage sensitivity |
| Clinical Interpretation | ACMG/ClinGen CNV pathogenicity classification using haploinsufficiency/triplosensitivity scores |
| DataFrame | Convert to pandas for advanced analytics |
| Reporting | Markdown reports with tables and statistics, SV clinical reports |

Workflow Overview

Input VCF File (SNVs/indels or SVs)
    |
    v
Phase 1: Parse VCF
    |-- Pure Python parser (any VCF 4.x)
    |-- cyvcf2 parser (faster, C-based)
    |-- Extract: CHROM, POS, REF, ALT, QUAL, FILTER, INFO, FORMAT, samples
    |-- Extract per-sample: GT, VAF, depth
    |-- Extract annotations from INFO (ANN, CSQ, FUNCOTATION)
    |-- Detect variant class: SNV/indel vs SV/CNV
    |
    v
Phase 2: Classify Variants
    |-- Variant type: SNV, INS, DEL, MNV, COMPLEX, SV
    |-- Mutation type: missense, nonsense, synonymous, frameshift, splice, etc.
    |-- Impact: HIGH, MODERATE, LOW, MODIFIER
    |-- SV type: DEL, DUP, INV, BND, CNV (if structural variant)
    |
    v
Phase 3: Apply Filters
    |-- VAF range (min/max)
    |-- Read depth minimum
    |-- Quality threshold
    |-- PASS only
    |-- Variant/mutation type inclusion/exclusion
    |-- Consequence exclusion (intronic, intergenic)
    |-- Population frequency range
    |-- Chromosome selection
    |-- SV size range (for structural variants)
    |
    v
Phase 4: Compute Statistics
    |-- Variant type distribution
    |-- Mutation type distribution
    |-- Impact distribution
    |-- Chromosome distribution
    |-- Ti/Tv ratio (for SNVs)
    |-- Per-sample VAF/depth stats
    |-- Gene mutation counts
    |-- SV size distribution (for structural variants)
    |
    v
Phase 5: Annotate with ToolUniverse (optional)
    |-- MyVariant.info: ClinVar, dbSNP, gnomAD, CADD, SIFT, PolyPhen
    |-- dbSNP: Population frequencies, gene associations
    |-- gnomAD: Basic variant metadata (use MyVariant.info for allele frequencies)
    |-- Ensembl VEP: Consequence prediction
    |
    v
Phase 6: Generate Report / Answer Question
    |-- Markdown report with tables
    |-- Direct answer to specific question
    |-- DataFrame for downstream analysis
    |
    v
Phase 7: Structural Variant & CNV Analysis (if SV/CNV detected)
    |-- Annotate with gnomAD SV population frequencies
    |-- Query DGVa/dbVar for known SVs (Ensembl)
    |-- Identify affected genes
    |-- Query ClinGen dosage sensitivity (HI/TS scores)
    |-- Classify pathogenicity (Pathogenic/Likely Pathogenic/VUS/Benign)
    |-- Generate SV clinical report with ACMG/ClinGen guidelines

Phase Summaries

Phase 1: VCF Parsing

Use pandas for:
  • Reading VCF as structured data
  • Quick exploratory analysis
  • When you need to manipulate columns and rows
Use python_implementation tools for:
  • Production parsing with annotation extraction
  • Multi-sample VCF handling
  • VAF extraction from FORMAT fields
  • Large file streaming
Key functions:
python
vcf_data = parse_vcf("input.vcf")           # Pure Python (always works)
vcf_data = parse_vcf_cyvcf2("input.vcf")    # Fast C-based (if installed)
df = variants_to_dataframe(vcf_data.variants, sample="TUMOR")  # For pandas
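To make the "VAF extraction from FORMAT fields" step concrete, here is a minimal sketch of deriving a per-sample VAF from the AD (allelic depths) field of one VCF data line. It is not the skill's parser: `vaf_from_ad` and the example record are hypothetical, and only the first ALT allele is considered.

```python
def vaf_from_ad(vcf_line: str, sample_index: int = 0):
    """Return (chrom, pos, ref, first_alt, vaf) for one VCF data line.

    VAF is computed as alt_depth / (ref_depth + alt_depth) from the AD field.
    vaf is None when AD is missing or the combined depth is zero.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")                      # e.g. GT:AD:DP
    sample_values = fields[9 + sample_index].split(":")
    sample = dict(zip(format_keys, sample_values))

    vaf = None
    ad = sample.get("AD")
    if ad and ad != ".":
        depths = [int(d) for d in ad.split(",") if d != "."]
        if len(depths) >= 2 and (depths[0] + depths[1]) > 0:
            vaf = depths[1] / (depths[0] + depths[1])
    return chrom, int(pos), ref, alt.split(",")[0], vaf


# Hypothetical single-sample record: 30 reference reads, 10 alternate reads.
record = "chr1\t12345\trs1\tA\tG\t50\tPASS\tDP=40\tGT:AD:DP\t0/1:30,10:40"
print(vaf_from_ad(record))  # ('chr1', 12345, 'A', 'G', 0.25)
```

Other callers encode the same information differently (AF, AO/RO, NR/NV), so a real extractor would try each representation in turn.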

Phase 2: Variant Classification

Automatic classification from annotations:
  • SnpEff ANN field
  • VEP CSQ field
  • GATK Funcotator FUNCOTATION field
  • Standard INFO keys: EFFECT, EFF, TYPE
Mutation types supported: missense, nonsense, synonymous, frameshift, splice_site, splice_region, inframe_insertion, inframe_deletion, intronic, intergenic, UTR_5, UTR_3, upstream, downstream, stop_lost, start_lost
See references/mutation_classification_guide.md for full details
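As an illustration of annotation-based classification, the sketch below reads the first SnpEff ANN entry from an INFO string and maps its Sequence Ontology term to one of the mutation types listed above. The mapping is deliberately partial, and `SO_TO_MUTATION_TYPE` and `classify_from_ann` are hypothetical names rather than the skill's own functions.

```python
# Partial, illustrative mapping from SO terms to simplified mutation types.
SO_TO_MUTATION_TYPE = {
    "missense_variant": "missense",
    "stop_gained": "nonsense",
    "synonymous_variant": "synonymous",
    "frameshift_variant": "frameshift",
    "splice_acceptor_variant": "splice_site",
    "splice_donor_variant": "splice_site",
    "splice_region_variant": "splice_region",
    "intron_variant": "intronic",
    "intergenic_region": "intergenic",   # spelling used by SnpEff
    "intergenic_variant": "intergenic",  # spelling used by VEP
    "stop_lost": "stop_lost",
    "start_lost": "start_lost",
}


def classify_from_ann(info: str) -> str:
    """Classify a variant from the first ANN entry of a VCF INFO string."""
    for item in info.split(";"):
        if item.startswith("ANN="):
            first_entry = item[4:].split(",")[0]
            parts = first_entry.split("|")
            if len(parts) < 2:
                return "unknown"
            # The annotation field may join several SO terms with '&'.
            for term in parts[1].split("&"):
                if term in SO_TO_MUTATION_TYPE:
                    return SO_TO_MUTATION_TYPE[term]
            return "unknown"
    return "unknown"


# Illustrative INFO value; the ANN sub-fields are pipe-separated.
info = "DP=40;ANN=G|missense_variant|MODERATE|TP53|ENSG00000141510|transcript|ENST00000269305|protein_coding"
print(classify_from_ann(info))  # missense
```

Taking only the first ANN entry is a simplification; annotators usually list one entry per transcript, ordered by severity.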

Phase 3: Filtering

Common filtering patterns:
python
# Somatic-like variants
criteria = FilterCriteria(
    min_vaf=0.05,
    max_vaf=0.95,
    min_depth=20,
    pass_only=True,
    exclude_consequences=["intronic", "intergenic", "upstream", "downstream"]
)

# High-confidence germline
criteria = FilterCriteria(
    min_vaf=0.25,
    min_depth=30,
    pass_only=True,
    chromosomes=["1", "2", ..., "22", "X", "Y"]
)

# Rare pathogenic candidates
criteria = FilterCriteria(
    min_depth=20,
    pass_only=True,
    mutation_types=["missense", "nonsense", "frameshift"]
)
See references/vcf_filtering.md for all filter options
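To show how criteria like these could be evaluated, here is a small self-contained sketch. `SimpleCriteria` and `passes` are hypothetical stand-ins that cover only a subset of the options; they are not the skill's `FilterCriteria` or `filter_variants`. The sketch assumes that a variant with a missing VAF or depth fails any threshold set on that field.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimpleCriteria:
    """Hypothetical stand-in for FilterCriteria with a subset of its options."""
    min_vaf: Optional[float] = None
    max_vaf: Optional[float] = None
    min_depth: Optional[int] = None
    pass_only: bool = False
    exclude_consequences: List[str] = field(default_factory=list)


def passes(variant: dict, c: SimpleCriteria) -> bool:
    """Return True if the variant satisfies every active criterion."""
    if c.pass_only and variant.get("filter") != "PASS":
        return False
    vaf = variant.get("vaf")
    if c.min_vaf is not None and (vaf is None or vaf < c.min_vaf):
        return False
    if c.max_vaf is not None and (vaf is None or vaf > c.max_vaf):
        return False
    depth = variant.get("depth")
    if c.min_depth is not None and (depth is None or depth < c.min_depth):
        return False
    if variant.get("mutation_type") in c.exclude_consequences:
        return False
    return True


# Made-up variants for illustration only.
variants = [
    {"filter": "PASS", "vaf": 0.32, "depth": 45, "mutation_type": "missense"},
    {"filter": "PASS", "vaf": 0.02, "depth": 60, "mutation_type": "missense"},    # VAF too low
    {"filter": "PASS", "vaf": 0.40, "depth": 50, "mutation_type": "intronic"},    # excluded consequence
    {"filter": "LowQual", "vaf": 0.50, "depth": 80, "mutation_type": "nonsense"}, # not PASS
]
somatic_like = SimpleCriteria(
    min_vaf=0.05, max_vaf=0.95, min_depth=20, pass_only=True,
    exclude_consequences=["intronic", "intergenic", "upstream", "downstream"],
)
passing = [v for v in variants if passes(v, somatic_like)]
print(len(passing))  # 1
```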

Phase 4: Statistics

Use pandas for:
  • Complex aggregations (groupby, pivot tables)
  • Custom statistical tests
  • Data exploration
Use python_implementation for:
  • Standard variant statistics (Ti/Tv, type distribution)
  • Per-sample VAF/depth summary
  • Quick mutation type counts
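For reference, the Ti/Tv ratio mentioned above is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base substitutions). A minimal sketch, with `ti_tv_ratio` as a hypothetical helper name:

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")


def ti_tv_ratio(ref_alt_pairs):
    """Compute transitions / transversions over (REF, ALT) pairs.

    Non-SNV records (indels, MNVs, non-ACGT alleles) are skipped.
    Returns None when there are no transversions.
    """
    ti = tv = 0
    for ref, alt in ref_alt_pairs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None


# Three transitions, one transversion, one deletion (ignored).
pairs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
print(ti_tv_ratio(pairs))  # 3.0
```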

Phase 5: ToolUniverse Annotation

When to use ToolUniverse annotation tools:
  1. ClinVar clinical significance: Use MyVariant.info or dbSNP tools
  2. Population frequencies: Use MyVariant.info (aggregates gnomAD, ExAC, 1000G)
  3. Pathogenicity scores: Use MyVariant.info (aggregates CADD, SIFT, PolyPhen)
  4. Consequence prediction: Use Ensembl VEP tools
Best practices:
  • Annotate variants with rsIDs first (most reliable)
  • Use MyVariant.info for batch annotation (aggregates multiple sources)
  • Limit to top variants (max_annotate=50-100) to respect rate limits
  • Query dbSNP/gnomAD directly for specific use cases
Key tools:
  • MyVariant_query_variants: Batch annotation (ClinVar, dbSNP, gnomAD, CADD)
  • dbsnp_get_variant_by_rsid: Population frequencies
  • gnomad_get_variant: Basic variant metadata
  • EnsemblVEP_annotate_rsid: Consequence prediction
See references/annotation_guide.md for detailed examples
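The best practices above (rsID-bearing variants first, a cap on how many variants are annotated) can be sketched without touching any API. `select_for_annotation` and its `batch_size` parameter are hypothetical; the sketch only decides which variants to send and in what groups, and makes no ToolUniverse calls.

```python
def select_for_annotation(variants, max_annotate=50, batch_size=25):
    """Pick up to max_annotate variants, rsID-bearing ones first, in batches."""
    def has_rsid(v):
        return str(v.get("id", "")).startswith("rs")

    with_rsid = [v for v in variants if has_rsid(v)]
    without_rsid = [v for v in variants if not has_rsid(v)]
    selected = (with_rsid + without_rsid)[:max_annotate]
    return [selected[i:i + batch_size] for i in range(0, len(selected), batch_size)]


# 60 made-up variants; every second one carries an rsID.
variants = [{"id": f"rs{i}"} if i % 2 == 0 else {"id": "."} for i in range(60)]
batches = select_for_annotation(variants, max_annotate=50, batch_size=25)
print([len(b) for b in batches])  # [25, 25]
```

Each batch would then be passed to whichever annotation tool is appropriate, with a pause between batches if rate limits are a concern.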

Phase 6: Report Generation

Report includes:
  1. Summary Statistics (total variants, type counts, Ti/Tv)
  2. Mutation Type Distribution (table with counts and percentages)
  3. Impact Distribution
  4. Chromosome Distribution
  5. VAF Distribution (per-sample)
  6. Clinical Significance
  7. Top Mutated Genes
  8. Variant Annotations (ClinVar-annotated variants)
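As a sketch of what one report section might look like, the following builds a Markdown table of mutation type counts and percentages. The function name and column headings are assumptions, not the skill's actual report format.

```python
from collections import Counter


def mutation_type_table(mutation_types):
    """Render a Markdown table of counts and percentages, most common first."""
    counts = Counter(mutation_types)
    total = sum(counts.values())
    lines = ["| Mutation Type | Count | Percent |", "|---|---|---|"]
    for name, n in counts.most_common():
        lines.append(f"| {name} | {n} | {100 * n / total:.1f}% |")
    return "\n".join(lines)


print(mutation_type_table(["missense", "missense", "synonymous", "missense"]))
# | Mutation Type | Count | Percent |
# |---|---|---|
# | missense | 3 | 75.0% |
# | synonymous | 1 | 25.0% |
```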

Phase 7: Structural Variant & CNV Analysis

When VCF contains SV calls (SVTYPE=DEL/DUP/INV/BND):
  1. Identify affected genes (from VCF annotation or coordinate overlap)
  2. Query ClinGen dosage sensitivity:
    python
    clingen = ClinGen_dosage_by_gene(gene_symbol="BRCA1")
    # Returns: haploinsufficiency_score, triplosensitivity_score
  3. Check population frequency:
    python
    gnomad_sv = gnomad_get_sv_by_gene(gene_symbol="BRCA1")
    # Returns: SVs with AF, AC, AN
  4. Classify pathogenicity:
    • Pathogenic: Deletion + HI score = 3, AF < 0.0001
    • Likely Pathogenic: Deletion + HI score = 2, AF < 0.001
    • VUS: HI/TS score = 0-1, AF 0.001-0.01
    • Benign: AF > 0.01
ClinGen dosage score interpretation:
  • 3: Sufficient evidence for dosage pathogenicity (HIGH impact)
  • 2: Some evidence (MODERATE impact)
  • 1: Little evidence (LOW impact)
  • 0: No evidence (MINIMAL impact)
  • 40: Dosage sensitivity unlikely
See references/sv_cnv_analysis.md for full SV workflow
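The classification rules in step 4 can be written down as a small decision function. This is a simplified sketch of those four rules only, not a full ACMG/ClinGen evaluation: `classify_sv` is a hypothetical name, and every case the rules do not cover (for example duplications, or a deletion with HI score 3 but AF between 0.0001 and 0.01) is assumed to fall back to VUS.

```python
def classify_sv(sv_type: str, hi_score: int, af: float) -> str:
    """Apply the simplified rules: AF first, then deletion + HI score."""
    if af > 0.01:
        return "Benign"
    if sv_type == "DEL" and hi_score == 3 and af < 0.0001:
        return "Pathogenic"
    if sv_type == "DEL" and hi_score == 2 and af < 0.001:
        return "Likely Pathogenic"
    # Assumption: anything not matched above is reported as VUS.
    return "VUS"


print(classify_sv("DEL", 3, 0.00001))  # Pathogenic
print(classify_sv("DEL", 2, 0.0005))   # Likely Pathogenic
print(classify_sv("DUP", 3, 0.00001))  # VUS
print(classify_sv("DEL", 3, 0.05))     # Benign
```

Checking the frequency rule first means a common variant is never escalated by a high dosage score, which matches the intent of the Benign rule.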

Answering BixBench Questions

Pattern 1: VAF + Mutation Type Fraction

Question: "What fraction of variants with VAF < X are annotated as Y mutations?"
python
result = answer_vaf_mutation_fraction(
    vcf_path="input.vcf",
    max_vaf=0.3,
    mutation_type="missense",
    sample="TUMOR"
)
# Returns: fraction, total_below_vaf, matching_mutation_type
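The arithmetic behind this pattern is a ratio of two counts. The sketch below shows it on already-parsed variants; `vaf_mutation_fraction` is a hypothetical helper that mirrors the returned keys above, not the implementation of `answer_vaf_mutation_fraction`. Variants without a VAF are assumed to be excluded from the denominator.

```python
def vaf_mutation_fraction(variants, max_vaf, mutation_type):
    """Fraction of variants with VAF < max_vaf that have the given mutation type."""
    below = [v for v in variants if v["vaf"] is not None and v["vaf"] < max_vaf]
    matching = [v for v in below if v["mutation_type"] == mutation_type]
    return {
        "fraction": len(matching) / len(below) if below else 0.0,
        "total_below_vaf": len(below),
        "matching_mutation_type": len(matching),
    }


# Made-up variants for illustration only.
variants = [
    {"vaf": 0.10, "mutation_type": "missense"},
    {"vaf": 0.20, "mutation_type": "synonymous"},
    {"vaf": 0.25, "mutation_type": "missense"},
    {"vaf": 0.29, "mutation_type": "nonsense"},
    {"vaf": 0.60, "mutation_type": "missense"},  # above the threshold
    {"vaf": None, "mutation_type": "missense"},  # no VAF available
]
print(vaf_mutation_fraction(variants, max_vaf=0.3, mutation_type="missense"))
# {'fraction': 0.5, 'total_below_vaf': 4, 'matching_mutation_type': 2}
```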

Pattern 2: Cohort Comparison

Question: "What is the difference in mutation frequency between cohorts?"
python
result = answer_cohort_comparison(
    vcf_paths=["cohort1.vcf", "cohort2.vcf"],
    mutation_type="missense",
    cohort_names=["Treatment", "Control"]
)
# Returns: cohorts, frequency_difference
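One plausible reading of "mutation frequency" is the share of a cohort's variants that have the given mutation type. Under that assumption, a comparison could be sketched as follows; `compare_cohorts` is hypothetical, and the skill's own definition of frequency may differ.

```python
def mutation_frequency(mutation_types, target):
    """Share of variants in one cohort that have the target mutation type."""
    if not mutation_types:
        return 0.0
    return mutation_types.count(target) / len(mutation_types)


def compare_cohorts(cohorts, target):
    """Compare two cohorts given as {name: [mutation_type, ...]}."""
    freqs = {name: mutation_frequency(types, target) for name, types in cohorts.items()}
    first, second = list(freqs)[:2]
    return {"cohorts": freqs, "frequency_difference": freqs[first] - freqs[second]}


# Made-up mutation type lists for two cohorts.
cohorts = {
    "Treatment": ["missense", "missense", "synonymous", "nonsense"],
    "Control": ["missense", "synonymous", "synonymous", "intronic"],
}
print(compare_cohorts(cohorts, "missense"))
# {'cohorts': {'Treatment': 0.5, 'Control': 0.25}, 'frequency_difference': 0.25}
```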

Pattern 3: Filter and Count

Question: "After filtering X, how many Y remain?"
python
result = answer_non_reference_after_filter(
    vcf_path="input.vcf",
    exclude_intronic_intergenic=True
)
# Returns: total_input, non_reference, remaining

---

ToolUniverse Tools Reference

SNV/Indel Annotation

| Tool | When to Use | Parameters | Response |
|---|---|---|---|
| MyVariant_query_variants | Batch annotation | query (rsID/HGVS) | ClinVar, dbSNP, gnomAD, CADD |
| dbsnp_get_variant_by_rsid | Population frequencies | rsid | Frequencies, clinical significance |
| gnomad_get_variant | gnomAD metadata | variant_id (CHR-POS-REF-ALT) | Basic variant info |
| EnsemblVEP_annotate_rsid | Consequence prediction | variant_id (rsID) | Transcript impact |
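The CHR-POS-REF-ALT identifier expected by gnomad_get_variant can be built directly from VCF columns. A minimal sketch, assuming the tool follows the gnomAD convention of chromosome names without a "chr" prefix; `to_gnomad_variant_id` is a hypothetical helper and the coordinates are illustrative.

```python
def to_gnomad_variant_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Format VCF fields as a CHR-POS-REF-ALT identifier."""
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]  # drop the 'chr' prefix (assumed convention)
    return f"{chrom}-{pos}-{ref}-{alt}"


print(to_gnomad_variant_id("chr17", 43094464, "G", "A"))  # 17-43094464-G-A
```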

Structural Variant Annotation

| Tool | When to Use | Parameters | Response |
|---|---|---|---|
| gnomad_get_sv_by_gene | SV population frequency | gene_symbol | SVs with AF, AC, AN |
| gnomad_get_sv_by_region | Regional SV search | chrom, start, end | SVs in region |
| ClinGen_dosage_by_gene | Dosage sensitivity | gene_symbol | HI/TS scores, disease |
| ClinGen_dosage_region_search | Dosage-sensitive genes in region | chromosome, start, end | All genes with HI/TS scores |
| ensembl_get_structural_variants | Known SVs from DGVa/dbVar | chrom, start, end, species | Clinical significance |
See references/annotation_guide.md for detailed tool usage examples
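When genes have to be identified by coordinate overlap rather than by a region-search tool, the check reduces to an interval-overlap test. The sketch below uses placeholder gene names and coordinates; `overlapping_genes` is a hypothetical helper, and all intervals are assumed to be inclusive and on the same chromosome.

```python
def overlapping_genes(sv_start: int, sv_end: int, genes):
    """Return names of genes whose [start, end] interval overlaps the SV."""
    return [
        name
        for name, start, end in genes
        if start <= sv_end and end >= sv_start
    ]


# Placeholder genes as (name, start, end); not real coordinates.
genes = [
    ("GENE_A", 1_000, 5_000),
    ("GENE_B", 480_000, 520_000),
    ("GENE_C", 600_000, 650_000),
]
print(overlapping_genes(100_000, 600_000, genes))  # ['GENE_B', 'GENE_C']
```

Each overlapping gene could then be looked up with ClinGen_dosage_by_gene to obtain its HI/TS scores.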

Common Use Patterns

Pattern 1: Quick VCF Summary

Parse VCF, compute statistics, generate report.
python
report = variant_analysis_pipeline("input.vcf", output_file="report.md")

Pattern 2: Filtered Analysis

Parse VCF, apply multi-criteria filter, compute statistics on filtered set.
python
report = variant_analysis_pipeline(
    vcf_path="input.vcf",
    filters=FilterCriteria(min_vaf=0.1, min_depth=20, pass_only=True),
    output_file="filtered_report.md"
)

Pattern 3: Annotated Report

Parse VCF, annotate top variants with ClinVar/gnomAD/CADD, generate clinical report.
python
report = variant_analysis_pipeline(
    vcf_path="input.vcf",
    annotate=True,
    max_annotate=50,
    output_file="annotated_report.md"
)

Pattern 4: BixBench Question Answering

Parse VCF, apply specific filters, compute targeted statistics to answer precise questions.
python
result = answer_vaf_mutation_fraction(
    vcf_path="input.vcf",
    max_vaf=0.3,
    mutation_type="missense"
)

Pattern 5: Cohort Comparison

Parse multiple VCFs, compare mutation frequencies across cohorts.
python
result = answer_cohort_comparison(
    vcf_paths=["cohort1.vcf", "cohort2.vcf"],
    mutation_type="missense"
)

When to Use pandas vs python_implementation

Use pandas when:
  • You need to read VCF as a flat table
  • You want to do custom aggregations (groupby, pivot)
  • You need to join with other data
  • You're doing exploratory data analysis
  • You want to export to CSV/Excel
Use python_implementation when:
  • You need production-grade VCF parsing
  • You need to extract INFO annotations (ANN, CSQ)
  • You need per-sample VAF/depth extraction
  • You need to classify mutation types
  • You need standard variant statistics (Ti/Tv)
  • You need to integrate with ToolUniverse annotation
Best approach: Use python_implementation for parsing/classification, then convert to DataFrame for custom analysis:
python
# Parse and classify
vcf_data = parse_vcf("input.vcf")
passing, failing = filter_variants(vcf_data.variants, criteria)

# Convert to DataFrame for custom analysis
df = variants_to_dataframe(passing, sample="TUMOR")

# Now use pandas
missense_high_vaf = df[(df['mutation_type'] == 'missense') & (df['vaf'] >= 0.3)]

---

Limitations

  • VCF annotation required for mutation classification: If VCF has no ANN/CSQ/FUNCOTATION in INFO, mutation types will be "unknown" until ToolUniverse annotation is applied
  • Multi-allelic variants: Parser takes first ALT allele for type classification
  • ToolUniverse annotation rate: API-based, limited to ~100 variants per batch by default to respect rate limits
  • gnomAD tool: Returns basic metadata only (not full allele frequencies); use MyVariant.info for gnomAD AF
  • Large VCFs: Pure Python parser streams line-by-line; cyvcf2 is recommended for files with >100K variants

Reference Documentation

  • references/vcf_filtering.md: Complete filter options and examples
  • references/mutation_classification_guide.md: Detailed mutation type classification rules
  • references/annotation_guide.md: ToolUniverse annotation workflows with examples
  • references/sv_cnv_analysis.md: Complete SV/CNV interpretation workflow

Utility Scripts

  • scripts/parse_vcf.py: Standalone VCF parsing script
  • scripts/filter_variants.py: Command-line variant filtering
  • scripts/annotate_variants.py: Batch variant annotation

Quick Start

See QUICK_START.md for:
  • Python SDK examples (pipeline, question functions, individual tools)
  • MCP conversational examples
  • Common recipes (somatic analysis, clinical screening, population frequency)
  • Expected output formats
  • Troubleshooting guide