Production-ready VCF processing, variant annotation, mutation analysis, and structural variant (SV/CNV) interpretation for bioinformatics questions. Parses VCF files (streaming, large files), classifies mutation types (missense, nonsense, synonymous, frameshift, splice, intronic, intergenic) and structural variants (deletions, duplications, inversions, translocations), applies VAF/depth/quality/consequence filters, annotates with ClinVar/dbSNP/gnomAD/CADD via ToolUniverse, interprets SV/CNV clinical significance using ClinGen dosage sensitivity scores, computes variant statistics, and generates reports. Solves questions like "What fraction of variants with VAF < 0.3 are missense?", "How many non-reference variants remain after filtering intronic/intergenic?", "What is the pathogenicity of this deletion affecting BRCA1?", or "Which dosage-sensitive genes overlap this CNV?". Use when processing VCF files, annotating variants, filtering by VAF/depth/consequence, classifying mutations, interpreting structural variants, assessing CNV pathogenicity, comparing cohorts, or answering variant analysis questions.
npx skill4agent add mims-harvard/tooluniverse tooluniverse-variant-analysis

| Capability | Description |
|---|---|
| VCF Parsing | Pure Python + cyvcf2 parsers. VCF 4.x, gzipped, multi-sample, SNV/indel/SV |
| Mutation Classification | Maps SO terms, SnpEff ANN, VEP CSQ, GATK Funcotator to standard types |
| VAF Extraction | Handles AF, AD, AO/RO, NR/NV, INFO AF formats |
| Filtering | VAF, depth, quality, PASS, variant type, mutation type, consequence, chromosome, SV size |
| Statistics | Ti/Tv ratio, per-sample VAF/depth stats, mutation type distribution, SV size distribution |
| Annotation | MyVariant.info (aggregates ClinVar, dbSNP, gnomAD, CADD, SIFT, PolyPhen) |
| SV/CNV Analysis | gnomAD SV population frequencies, DGVa/dbVar known SVs, ClinGen dosage sensitivity |
| Clinical Interpretation | ACMG/ClinGen CNV pathogenicity classification using haploinsufficiency/triplosensitivity scores |
| DataFrame | Convert to pandas for advanced analytics |
| Reporting | Markdown reports with tables and statistics, SV clinical reports |
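Two of the capabilities above reduce to small calculations. The sketch below is not the skill's own implementation; `vaf_from_ad` and `ti_tv_ratio` are hypothetical helper names. It assumes the FORMAT/AD field lists the reference depth first, followed by one depth per ALT allele, and it counts only single-base A/C/G/T substitutions for Ti/Tv.

```python
from typing import Iterable, Optional, Tuple

BASES = {"A", "C", "G", "T"}
# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def vaf_from_ad(ad: str, alt_index: int = 0) -> Optional[float]:
    """Compute VAF from a FORMAT/AD string such as "30,10" (ref,alt1,...)."""
    try:
        depths = [int(x) for x in ad.split(",")]
    except ValueError:  # missing values such as "."
        return None
    total = sum(depths)
    if total == 0 or alt_index + 1 >= len(depths):
        return None
    return depths[alt_index + 1] / total


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Transition/transversion ratio over (REF, ALT) pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, MNVs, symbolic alleles, non-variants
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None  # undefined when there are no transversions


print(vaf_from_ad("30,10"))  # 0.25
print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]))  # 2.0
```

Returning `None` rather than raising keeps records with missing depths from interrupting a streaming pass over a large file.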

Input VCF File (SNVs/indels or SVs)
|
v
Phase 1: Parse VCF
|-- Pure Python parser (any VCF 4.x)
|-- cyvcf2 parser (faster, C-based)
|-- Extract: CHROM, POS, REF, ALT, QUAL, FILTER, INFO, FORMAT, samples
|-- Extract per-sample: GT, VAF, depth
|-- Extract annotations from INFO (ANN, CSQ, FUNCOTATION)
|-- Detect variant class: SNV/indel vs SV/CNV
|
v
Phase 2: Classify Variants
|-- Variant type: SNV, INS, DEL, MNV, COMPLEX, SV
|-- Mutation type: missense, nonsense, synonymous, frameshift, splice, etc.
|-- Impact: HIGH, MODERATE, LOW, MODIFIER
|-- SV type: DEL, DUP, INV, BND, CNV (if structural variant)
|
v
Phase 3: Apply Filters
|-- VAF range (min/max)
|-- Read depth minimum
|-- Quality threshold
|-- PASS only
|-- Variant/mutation type inclusion/exclusion
|-- Consequence exclusion (intronic, intergenic)
|-- Population frequency range
|-- Chromosome selection
|-- SV size range (for structural variants)
|
v
Phase 4: Compute Statistics
|-- Variant type distribution
|-- Mutation type distribution
|-- Impact distribution
|-- Chromosome distribution
|-- Ti/Tv ratio (for SNVs)
|-- Per-sample VAF/depth stats
|-- Gene mutation counts
|-- SV size distribution (for structural variants)
|
v
Phase 5: Annotate with ToolUniverse (optional)
|-- MyVariant.info: ClinVar, dbSNP, gnomAD, CADD, SIFT, PolyPhen
|-- dbSNP: Population frequencies, gene associations
|-- gnomAD: Population allele frequencies
|-- Ensembl VEP: Consequence prediction
|
v
Phase 6: Generate Report / Answer Question
|-- Markdown report with tables
|-- Direct answer to specific question
|-- DataFrame for downstream analysis
|
v
Phase 7: Structural Variant & CNV Analysis (if SV/CNV detected)
|-- Annotate with gnomAD SV population frequencies
|-- Query DGVa/dbVar for known SVs (Ensembl)
|-- Identify affected genes
|-- Query ClinGen dosage sensitivity (HI/TS scores)
|-- Classify pathogenicity (Pathogenic/Likely Pathogenic/VUS/Benign)
|-- Generate SV clinical report with ACMG/ClinGen guidelines

vcf_data = parse_vcf("input.vcf") # Pure Python (always works)
vcf_data = parse_vcf_cyvcf2("input.vcf") # Fast C-based (if installed)
df = variants_to_dataframe(vcf_data.variants, sample="TUMOR") # For pandas

# Somatic-like variants
criteria = FilterCriteria(
    min_vaf=0.05, max_vaf=0.95,
    min_depth=20, pass_only=True,
    exclude_consequences=["intronic", "intergenic", "upstream", "downstream"]
)

# High-confidence germline
criteria = FilterCriteria(
    min_vaf=0.25, min_depth=30, pass_only=True,
    chromosomes=[str(n) for n in range(1, 23)] + ["X", "Y"]  # autosomes 1-22 plus X and Y
)

# Rare pathogenic candidates
criteria = FilterCriteria(
    min_depth=20, pass_only=True,
    mutation_types=["missense", "nonsense", "frameshift"]
)

MyVariant_query_variants
dbsnp_get_variant_by_rsid
gnomad_get_variant
EnsemblVEP_annotate_rsid

clingen = ClinGen_dosage_by_gene(gene_symbol="BRCA1")
# Returns: haploinsufficiency_score, triplosensitivity_score

gnomad_sv = gnomad_get_sv_by_gene(gene_symbol="BRCA1")
# Returns: SVs with AF, AC, AN

result = answer_vaf_mutation_fraction(
    vcf_path="input.vcf",
    max_vaf=0.3,
    mutation_type="missense",
    sample="TUMOR"
)
# Returns: fraction, total_below_vaf, matching_mutation_type

result = answer_cohort_comparison(
    vcf_paths=["cohort1.vcf", "cohort2.vcf"],
    mutation_type="missense",
    cohort_names=["Treatment", "Control"]
)
# Returns: cohorts, frequency_difference

result = answer_non_reference_after_filter(
    vcf_path="input.vcf",
    exclude_intronic_intergenic=True
)
# Returns: total_input, non_reference, remaining

| Tool | When to Use | Parameters | Response |
|---|---|---|---|
| MyVariant_query_variants | Batch annotation | | ClinVar, dbSNP, gnomAD, CADD |
| dbsnp_get_variant_by_rsid | Population frequencies | | Frequencies, clinical significance |
| gnomad_get_variant | gnomAD metadata | | Basic variant info |
| EnsemblVEP_annotate_rsid | Consequence prediction | | Transcript impact |

| Tool | When to Use | Parameters | Response |
|---|---|---|---|
| gnomad_get_sv_by_gene | SV population frequency | gene_symbol | SVs with AF, AC, AN |
| | Regional SV search | | SVs in region |
| ClinGen_dosage_by_gene | Dosage sensitivity | gene_symbol | HI/TS scores, disease |
| | Dosage-sensitive genes in region | | All genes with HI/TS scores |
| | Known SVs from DGVa/dbVar | | Clinical significance |
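
The parameter details for these tools are not shown here. As a hedged illustration of the input a MyVariant.info lookup works with, the sketch below builds the HGVS-style genomic identifiers that MyVariant.info uses to key variants. `snv_hgvs_id` is a hypothetical helper that handles single-base substitutions only, and whether `MyVariant_query_variants` accepts these identifiers directly is an assumption to check against the tool's documentation. Coordinates must match the genome assembly being queried.

```python
def snv_hgvs_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build an HGVS-style genomic ID (e.g. chr7:g.140453136A>T) for an SNV."""
    ref, alt = ref.upper(), alt.upper()
    if len(ref) != 1 or len(alt) != 1 or ref == alt:
        raise ValueError("only single-base substitutions are handled here")
    # Accept both "7" and "chr7" spellings of the contig name.
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return f"chr{name}:g.{pos}{ref}>{alt}"


# (CHROM, POS, REF, ALT) tuples as they would come out of a parsed VCF.
variants = [("7", 140453136, "A", "T"), ("chr1", 35367, "g", "a")]
ids = [snv_hgvs_id(*v) for v in variants]
print(ids)  # ['chr7:g.140453136A>T', 'chr1:g.35367G>A']
```

Indels need different HGVS forms (`del`, `ins`, `delins`) and are deliberately left out of this sketch.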
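
To show how the HI/TS scores feed into interpretation, here is a deliberately simplified screening sketch. It is not the ACMG/ClinGen CNV scoring framework and not the skill's classifier: it only flags overlapped genes whose relevant ClinGen score is 3 (sufficient evidence), checking haploinsufficiency for deletions and triplosensitivity for duplications. `dosage_sensitive_hits` is a hypothetical name, the dictionary keys follow the `ClinGen_dosage_by_gene` return fields listed earlier, and the gene names and scores are placeholders rather than real ClinGen records.

```python
from typing import Dict, List

SUFFICIENT_EVIDENCE = 3  # ClinGen's highest dosage-sensitivity evidence score


def dosage_sensitive_hits(sv_type: str,
                          gene_scores: Dict[str, Dict[str, int]]) -> List[str]:
    """Return overlapped genes whose relevant ClinGen score equals 3.

    Deletions are checked against haploinsufficiency (HI) scores and
    duplications against triplosensitivity (TS) scores.
    """
    key = {
        "DEL": "haploinsufficiency_score",
        "DUP": "triplosensitivity_score",
    }.get(sv_type.upper())
    if key is None:
        return []  # inversions, breakends, etc. are not simple dosage changes
    return sorted(
        gene for gene, scores in gene_scores.items()
        if scores.get(key) == SUFFICIENT_EVIDENCE
    )


# Placeholder scores for illustration only.
overlapped = {
    "GENE_A": {"haploinsufficiency_score": 3, "triplosensitivity_score": 0},
    "GENE_B": {"haploinsufficiency_score": 1, "triplosensitivity_score": 3},
    "GENE_C": {"haploinsufficiency_score": 0, "triplosensitivity_score": 0},
}
print(dosage_sensitive_hits("DEL", overlapped))  # ['GENE_A']
print(dosage_sensitive_hits("DUP", overlapped))  # ['GENE_B']
```

A hit from this screen is a reason to look closer, not a classification; a full assessment also weighs population frequency, gene content, and inheritance.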

report = variant_analysis_pipeline("input.vcf", output_file="report.md")

report = variant_analysis_pipeline(
    vcf_path="input.vcf",
    filters=FilterCriteria(min_vaf=0.1, min_depth=20, pass_only=True),
    output_file="filtered_report.md"
)

report = variant_analysis_pipeline(
    vcf_path="input.vcf",
    annotate=True,
    max_annotate=50,
    output_file="annotated_report.md"
)

result = answer_vaf_mutation_fraction(
    vcf_path="input.vcf",
    max_vaf=0.3,
    mutation_type="missense"
)

result = answer_cohort_comparison(
    vcf_paths=["cohort1.vcf", "cohort2.vcf"],
    mutation_type="missense"
)

# Parse and classify
vcf_data = parse_vcf("input.vcf")
# Filter with the same thresholds as the filtered-report example above
criteria = FilterCriteria(min_vaf=0.1, min_depth=20, pass_only=True)
passing, failing = filter_variants(vcf_data.variants, criteria)
# Convert to DataFrame for custom analysis
df = variants_to_dataframe(passing, sample="TUMOR")
# Now use pandas
missense_high_vaf = df[(df['mutation_type'] == 'missense') & (df['vaf'] >= 0.3)]
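
The VAF/mutation-type question from the description can also be worked through by hand. Below is a minimal sketch in plain Python, assuming each variant is a dict with the same `vaf` and `mutation_type` keys as the DataFrame columns above. `fraction_below_vaf` is a hypothetical name, and the rows are made-up illustration data.

```python
from typing import Dict, List, Optional


def fraction_below_vaf(rows: List[Dict], max_vaf: float = 0.3,
                       mutation_type: str = "missense") -> Optional[float]:
    """Fraction of variants with VAF < max_vaf that have the given mutation type."""
    # Variants without a usable VAF cannot be placed on either side of the cutoff.
    low_vaf = [r for r in rows if r.get("vaf") is not None and r["vaf"] < max_vaf]
    if not low_vaf:
        return None  # no denominator, so the fraction is undefined
    matching = sum(1 for r in low_vaf if r.get("mutation_type") == mutation_type)
    return matching / len(low_vaf)


rows = [
    {"vaf": 0.10, "mutation_type": "missense"},
    {"vaf": 0.25, "mutation_type": "synonymous"},
    {"vaf": 0.28, "mutation_type": "missense"},
    {"vaf": 0.29, "mutation_type": "nonsense"},
    {"vaf": 0.45, "mutation_type": "missense"},
    {"vaf": None, "mutation_type": "missense"},
]
print(fraction_below_vaf(rows))  # 0.5
```

The denominator is the set of variants below the VAF cutoff, not all variants, which is the usual point of confusion with this kind of question.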