tooluniverse-variant-analysis

Variant Analysis and Annotation

A production-ready skill for VCF processing and variant annotation that combines local bioinformatics computation with ToolUniverse database integration. It is designed to answer bioinformatics analysis questions about VCF data, mutation classification, variant filtering, and clinical annotation.

When to Use This Skill

Triggers:

User provides a VCF file (SNV/indel or SV) and asks questions about its contents
Questions about variant allele frequency (VAF) filtering
Mutation type classification queries (missense, nonsense, synonymous, etc.)
Structural variant interpretation requests (deletions, duplications, CNVs)
Variant annotation requests (ClinVar, gnomAD, CADD, dbSNP)
CNV pathogenicity assessment using ClinGen dosage sensitivity
Cohort comparison questions
Population frequency filtering (SNVs or SVs)
Intronic/intergenic variant filtering
Gene dosage sensitivity queries

Example Questions:

"What fraction of variants with VAF < 0.3 are annotated as missense mutations?"
"After filtering intronic/intergenic variants, how many non-reference variants remain?"
"What is the clinical significance of this deletion affecting BRCA1?"
"Which dosage-sensitive genes overlap this 500kb duplication on chr17?"
"How many variants have clinical significance annotations?"
"Compare variant counts between samples"

Core Capabilities

Capability	Description
VCF Parsing	Pure Python + cyvcf2 parsers. VCF 4.x, gzipped, multi-sample, SNV/indel/SV
Mutation Classification	Maps SO terms, SnpEff ANN, VEP CSQ, GATK Funcotator to standard types
VAF Extraction	Handles AF, AD, AO/RO, NR/NV, INFO AF formats
Filtering	VAF, depth, quality, PASS, variant type, mutation type, consequence, chromosome, SV size
Statistics	Ti/Tv ratio, per-sample VAF/depth stats, mutation type distribution, SV size distribution
Annotation	MyVariant.info (aggregates ClinVar, dbSNP, gnomAD, CADD, SIFT, PolyPhen)
SV/CNV Analysis	gnomAD SV population frequencies, DGVa/dbVar known SVs, ClinGen dosage sensitivity
Clinical Interpretation	ACMG/ClinGen CNV pathogenicity classification using haploinsufficiency/triplosensitivity scores
DataFrame	Convert to pandas for advanced analytics
Reporting	Markdown reports with tables and statistics, SV clinical reports
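
As a rough illustration of the VAF Extraction row, the sketch below derives a VAF from one sample's FORMAT fields. The function name `extract_vaf` and the order in which fields are tried are assumptions made for this example, not the skill's actual implementation. The NV/NR branch assumes Platypus-style variant-read and total-read counts, and multi-valued fields use the first ALT allele only.

```python
from typing import Optional


def extract_vaf(sample: dict) -> Optional[float]:
    """Hypothetical helper: derive a VAF from a dict of FORMAT key -> raw string."""

    def number_at(raw: str, index: int = 0) -> Optional[float]:
        # FORMAT values may be comma-separated (per allele) or "." when missing.
        parts = raw.split(",")
        if index >= len(parts) or parts[index] in ("", "."):
            return None
        return float(parts[index])

    # 1. A per-sample AF value, if the caller emitted one.
    if "AF" in sample:
        af = number_at(sample["AF"])
        if af is not None:
            return af

    # 2. AD = ref_depth,alt_depth[,...]: first ALT depth over the sum of all depths.
    if "AD" in sample:
        depths = [number_at(sample["AD"], i) for i in range(sample["AD"].count(",") + 1)]
        if len(depths) >= 2 and None not in depths and sum(depths) > 0:
            return depths[1] / sum(depths)

    # 3. AO/RO (FreeBayes-style): alt observations over alt + ref observations.
    if "AO" in sample and "RO" in sample:
        ao, ro = number_at(sample["AO"]), number_at(sample["RO"])
        if ao is not None and ro is not None and ao + ro > 0:
            return ao / (ao + ro)

    # 4. NV/NR (assumed Platypus-style): variant reads over total reads.
    if "NV" in sample and "NR" in sample:
        nv, nr = number_at(sample["NV"]), number_at(sample["NR"])
        if nv is not None and nr is not None and nr > 0:
            return nv / nr

    return None  # no usable field found
```

Returning `None` rather than 0.0 keeps "VAF unknown" distinct from "VAF is zero", which matters when a VAF filter is applied later.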

Workflow Overview

Input VCF File (SNVs/indels or SVs)
    |
    v
Phase 1: Parse VCF
    |-- Pure Python parser (any VCF 4.x)
    |-- cyvcf2 parser (faster, C-based)
    |-- Extract: CHROM, POS, REF, ALT, QUAL, FILTER, INFO, FORMAT, samples
    |-- Extract per-sample: GT, VAF, depth
    |-- Extract annotations from INFO (ANN, CSQ, FUNCOTATION)
    |-- Detect variant class: SNV/indel vs SV/CNV
    |
    v
Phase 2: Classify Variants
    |-- Variant type: SNV, INS, DEL, MNV, COMPLEX, SV
    |-- Mutation type: missense, nonsense, synonymous, frameshift, splice, etc.
    |-- Impact: HIGH, MODERATE, LOW, MODIFIER
    |-- SV type: DEL, DUP, INV, BND, CNV (if structural variant)
    |
    v
Phase 3: Apply Filters
    |-- VAF range (min/max)
    |-- Read depth minimum
    |-- Quality threshold
    |-- PASS only
    |-- Variant/mutation type inclusion/exclusion
    |-- Consequence exclusion (intronic, intergenic)
    |-- Population frequency range
    |-- Chromosome selection
    |-- SV size range (for structural variants)
    |
    v
Phase 4: Compute Statistics
    |-- Variant type distribution
    |-- Mutation type distribution
    |-- Impact distribution
    |-- Chromosome distribution
    |-- Ti/Tv ratio (for SNVs)
    |-- Per-sample VAF/depth stats
    |-- Gene mutation counts
    |-- SV size distribution (for structural variants)
    |
    v
Phase 5: Annotate with ToolUniverse (optional)
    |-- MyVariant.info: ClinVar, dbSNP, gnomAD, CADD, SIFT, PolyPhen
    |-- dbSNP: Population frequencies, gene associations
    |-- gnomAD: Basic variant metadata (use MyVariant.info for allele frequencies)
    |-- Ensembl VEP: Consequence prediction
    |
    v
Phase 6: Generate Report / Answer Question
    |-- Markdown report with tables
    |-- Direct answer to specific question
    |-- DataFrame for downstream analysis
    |
    v
Phase 7: Structural Variant & CNV Analysis (if SV/CNV detected)
    |-- Annotate with gnomAD SV population frequencies
    |-- Query DGVa/dbVar for known SVs (Ensembl)
    |-- Identify affected genes
    |-- Query ClinGen dosage sensitivity (HI/TS scores)
    |-- Classify pathogenicity (Pathogenic/Likely Pathogenic/VUS/Benign)
    |-- Generate SV clinical report with ACMG/ClinGen guidelines

Phase Summaries

Phase 1: VCF Parsing

Use pandas for:

Reading VCF as structured data
Quick exploratory analysis
Column and row manipulation

Use python_implementation tools for:

Production parsing with annotation extraction
Multi-sample VCF handling
VAF extraction from FORMAT fields
Large file streaming

Key functions:

python

vcf_data = parse_vcf("input.vcf")           # Pure Python (always works)
vcf_data = parse_vcf_cyvcf2("input.vcf")    # Fast C-based (if installed)
df = variants_to_dataframe(vcf_data.variants, sample="TUMOR")  # For pandas
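
To make the line-by-line parsing idea concrete, here is a minimal sketch of a streaming reader. It is not the skill's `parse_vcf`; the names `open_vcf` and `iter_vcf_records` and the shape of the returned dict are assumptions chosen for illustration. It covers only the fixed VCF columns, INFO key/value pairs, and per-sample FORMAT fields.

```python
import gzip
from typing import Dict, Iterable, Iterator, List


def open_vcf(path: str):
    """Open a plain or gzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def iter_vcf_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Hypothetical streaming reader: yields one dict per VCF data line."""
    sample_names: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample_names = fields[9:]  # sample columns follow FORMAT
            continue
        info = {}
        if fields[7] != ".":
            for item in fields[7].split(";"):
                key, sep, value = item.partition("=")
                info[key] = value if sep else True  # flags carry no value
        record = {
            "chrom": fields[0],
            "pos": int(fields[1]),
            "id": fields[2],
            "ref": fields[3],
            "alt": fields[4].split(","),
            "qual": None if fields[5] == "." else float(fields[5]),
            "filter": fields[6],
            "info": info,
            "samples": {},
        }
        if len(fields) > 9:
            keys = fields[8].split(":")
            for name, raw in zip(sample_names, fields[9:]):
                record["samples"][name] = dict(zip(keys, raw.split(":")))
        yield record
```

Because the reader is a generator over any iterable of lines, a file handle from `open_vcf("input.vcf.gz")` can be passed in directly and records are produced without loading the whole file into memory.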

Phase 2: Variant Classification

Automatic classification from annotations:

SnpEff ANN field
VEP CSQ field
GATK Funcotator FUNCOTATION field
Standard INFO keys: EFFECT, EFF, TYPE

Mutation types supported: missense, nonsense, synonymous, frameshift, splice_site, splice_region, inframe_insertion, inframe_deletion, intronic, intergenic, UTR_5, UTR_3, upstream, downstream, stop_lost, start_lost

See references/mutation_classification_guide.md for full details
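
The mapping from annotation terms to the mutation types listed above might look roughly like the following sketch for a SnpEff `ANN` value. The lookup table and the function name `classify_ann` are assumptions for illustration; the skill's real rules live in the referenced guide and may differ. The sketch reads only the first annotation entry and takes the first recognised term when several are joined with `&`.

```python
# Assumed mapping from Sequence Ontology terms to the mutation type labels above.
SO_TO_MUTATION_TYPE = {
    "missense_variant": "missense",
    "stop_gained": "nonsense",
    "synonymous_variant": "synonymous",
    "frameshift_variant": "frameshift",
    "splice_acceptor_variant": "splice_site",
    "splice_donor_variant": "splice_site",
    "splice_region_variant": "splice_region",
    "inframe_insertion": "inframe_insertion",
    "conservative_inframe_insertion": "inframe_insertion",
    "disruptive_inframe_insertion": "inframe_insertion",
    "inframe_deletion": "inframe_deletion",
    "conservative_inframe_deletion": "inframe_deletion",
    "disruptive_inframe_deletion": "inframe_deletion",
    "intron_variant": "intronic",
    "intergenic_region": "intergenic",
    "intergenic_variant": "intergenic",
    "5_prime_UTR_variant": "UTR_5",
    "3_prime_UTR_variant": "UTR_3",
    "upstream_gene_variant": "upstream",
    "downstream_gene_variant": "downstream",
    "stop_lost": "stop_lost",
    "start_lost": "start_lost",
}


def classify_ann(ann_value: str) -> dict:
    """Hypothetical helper: classify a variant from the first SnpEff ANN entry."""
    # ANN entries are comma-separated; each entry is pipe-separated:
    # Allele | Annotation | Annotation_Impact | Gene_Name | ...
    first = ann_value.split(",")[0].split("|")
    if len(first) < 4:
        return {"mutation_type": "unknown", "impact": None, "gene": None}
    terms = first[1].split("&")  # combined effects are joined with "&"
    mutation_type = next(
        (SO_TO_MUTATION_TYPE[t] for t in terms if t in SO_TO_MUTATION_TYPE),
        "unknown",
    )
    return {
        "mutation_type": mutation_type,
        "impact": first[2] or None,
        "gene": first[3] or None,
    }
```

Falling back to `"unknown"` mirrors the behaviour described for VCFs that carry no usable annotation.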

Phase 3: Filtering

Common filtering patterns:

python

# Somatic-like variants
criteria = FilterCriteria(
    min_vaf=0.05,
    max_vaf=0.95,
    min_depth=20,
    pass_only=True,
    exclude_consequences=["intronic", "intergenic", "upstream", "downstream"]
)

# High-confidence germline
criteria = FilterCriteria(
    min_vaf=0.25,
    min_depth=30,
    pass_only=True,
    chromosomes=["1", "2", ..., "22", "X", "Y"]
)

# Rare pathogenic candidates
criteria = FilterCriteria(
    min_depth=20,
    pass_only=True,
    mutation_types=["missense", "nonsense", "frameshift"]
)

**See references/vcf_filtering.md for all filter options**
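
For readers who want to see how such criteria could be applied, the following is a minimal sketch of a multi-criteria filter over plain dicts. `SimpleCriteria` and `split_variants` are hypothetical stand-ins that mirror a subset of the `FilterCriteria` arguments shown above; they are not the skill's implementation. Two assumptions are baked in: a variant with a missing VAF or depth fails any VAF or depth threshold, and only a FILTER value of exactly `PASS` counts as passing.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SimpleCriteria:
    """Hypothetical, reduced stand-in for the skill's FilterCriteria."""
    min_vaf: Optional[float] = None
    max_vaf: Optional[float] = None
    min_depth: Optional[int] = None
    pass_only: bool = False
    exclude_consequences: List[str] = field(default_factory=list)


def passes(variant: dict, c: SimpleCriteria) -> bool:
    vaf, depth = variant.get("vaf"), variant.get("depth")
    # Assumption: missing VAF/depth fails the corresponding threshold.
    if c.min_vaf is not None and (vaf is None or vaf < c.min_vaf):
        return False
    if c.max_vaf is not None and (vaf is None or vaf > c.max_vaf):
        return False
    if c.min_depth is not None and (depth is None or depth < c.min_depth):
        return False
    # Assumption: only an explicit "PASS" is accepted.
    if c.pass_only and variant.get("filter") != "PASS":
        return False
    if variant.get("mutation_type") in c.exclude_consequences:
        return False
    return True


def split_variants(variants: List[dict], c: SimpleCriteria) -> Tuple[List[dict], List[dict]]:
    """Return (passing, failing) so that nothing is silently dropped."""
    passing = [v for v in variants if passes(v, c)]
    failing = [v for v in variants if not passes(v, c)]
    return passing, failing
```

Returning both lists makes it easy to report how many variants each filter run removed.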

Phase 4: Statistics

Use pandas for:

Complex aggregations (groupby, pivot tables)
Custom statistical tests
Data exploration

Use python_implementation for:

Standard variant statistics (Ti/Tv, type distribution)
Per-sample VAF/depth summary
Quick mutation type counts
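
As an example of a standard variant statistic, the Ti/Tv ratio can be computed from (REF, ALT) pairs as sketched below. The function name `ti_tv_ratio` is hypothetical. Anything that is not a single-base A/C/G/T substitution is skipped, and `None` is returned when there are no transversions.

```python
from typing import Iterable, Optional, Tuple

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Hypothetical helper: transition/transversion ratio over (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # indels, MNVs, symbolic alleles, and no-ops are ignored
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None


print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]))  # 2.0
```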

Phase 5: ToolUniverse Annotation

When to use ToolUniverse annotation tools:

ClinVar clinical significance: Use MyVariant.info or dbSNP tools
Population frequencies: Use MyVariant.info (aggregates gnomAD, ExAC, 1000G)
Pathogenicity scores: Use MyVariant.info (aggregates CADD, SIFT, PolyPhen)
Consequence prediction: Use Ensembl VEP tools

Best practices:

Annotate variants with rsIDs first (most reliable)
Use MyVariant.info for batch annotation (aggregates multiple sources)
Limit to top variants (max_annotate=50-100) to respect rate limits
Query dbSNP/gnomAD directly for specific use cases
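
The first and third practices above (rsIDs first, capped batch size) can be sketched as a small query builder. The helper names `to_hgvs_snv` and `build_queries` are hypothetical, and nothing here calls a ToolUniverse tool. The `chr<N>:g.<pos><ref>><alt>` identifier follows the genomic HGVS form used by MyVariant.info for single-base substitutions. Indels are deliberately skipped in this sketch, and the coordinates must match whichever genome assembly the service is queried against.

```python
from typing import List, Optional


def to_hgvs_snv(chrom: str, pos: int, ref: str, alt: str) -> Optional[str]:
    """Build a genomic HGVS id for a single-base substitution, else None."""
    if len(ref) != 1 or len(alt) != 1:
        return None  # indels need a different HGVS form; not handled here
    chrom = chrom if chrom.startswith("chr") else f"chr{chrom}"
    return f"{chrom}:g.{pos}{ref}>{alt}"


def build_queries(variants: List[dict], max_annotate: int = 50) -> List[str]:
    """Hypothetical helper: rsID queries first, then HGVS ids, capped in size."""
    with_rsid: List[str] = []
    without_rsid: List[str] = []
    for v in variants:
        vid = v.get("id") or ""
        if vid.startswith("rs"):
            with_rsid.append(vid)
        else:
            hgvs = to_hgvs_snv(v["chrom"], v["pos"], v["ref"], v["alt"])
            if hgvs is not None:
                without_rsid.append(hgvs)
    return (with_rsid + without_rsid)[:max_annotate]
```

The resulting strings are what would be passed, one batch at a time, as the `query` argument of the batch annotation tool.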

Key tools:

`MyVariant_query_variants`: Batch annotation (ClinVar, dbSNP, gnomAD, CADD)
`dbsnp_get_variant_by_rsid`: Population frequencies
`gnomad_get_variant`: Basic variant metadata
`EnsemblVEP_annotate_rsid`: Consequence prediction

See references/annotation_guide.md for detailed examples

Phase 6: Report Generation

Report includes:

Summary Statistics (total variants, type counts, Ti/Tv)
Mutation Type Distribution (table with counts and percentages)
Impact Distribution
Chromosome Distribution
VAF Distribution (per-sample)
Clinical Significance
Top Mutated Genes
Variant Annotations (ClinVar-annotated variants)

Phase 7: Structural Variant & CNV Analysis

When VCF contains SV calls (SVTYPE=DEL/DUP/INV/BND):

Identify affected genes (from VCF annotation or coordinate overlap)

Query ClinGen dosage sensitivity:

python

clingen = ClinGen_dosage_by_gene(gene_symbol="BRCA1")
# Returns: haploinsufficiency_score, triplosensitivity_score

Check population frequency:

python

gnomad_sv = gnomad_get_sv_by_gene(gene_symbol="BRCA1")
# Returns: SVs with AF, AC, AN

Classify pathogenicity:
- Pathogenic: Deletion + HI score = 3, AF < 0.0001
- Likely Pathogenic: Deletion + HI score = 2, AF < 0.001
- VUS: HI/TS score = 0-1, AF 0.001-0.01
- Benign: AF > 0.01

ClinGen dosage score interpretation:

3: Sufficient evidence for dosage pathogenicity (HIGH impact)
2: Some evidence (MODERATE impact)
1: Little evidence (LOW impact)
0: No evidence (MINIMAL impact)
40: Dosage sensitivity unlikely
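
The classification rules listed above can be expressed as a small decision function. This is only a sketch of those four rules, not a full ACMG/ClinGen scoring implementation. The name `classify_cnv`, the order in which rules are checked, and the choice to treat a missing frequency as 0.0 (absent from the population database) are assumptions. Any combination the rules do not cover, including duplications with a high TS score, falls through to VUS here.

```python
from typing import Optional


def classify_cnv(sv_type: str, hi_score: Optional[int],
                 allele_frequency: Optional[float]) -> str:
    """Hypothetical helper applying the four rules above, in a fixed order."""
    # Assumption: no frequency record means the SV was not observed.
    af = 0.0 if allele_frequency is None else allele_frequency

    if af > 0.01:
        return "Benign"  # common in the population
    if sv_type == "DEL" and hi_score == 3 and af < 0.0001:
        return "Pathogenic"
    if sv_type == "DEL" and hi_score == 2 and af < 0.001:
        return "Likely Pathogenic"
    return "VUS"  # everything the rules above do not cover


print(classify_cnv("DEL", 3, 0.00001))  # Pathogenic
```

Checking the population frequency first means a common deletion is reported as benign even when it overlaps a gene with a high haploinsufficiency score, which is one reasonable reading of the rules but not the only one.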

See references/sv_cnv_analysis.md for full SV workflow

Answering BixBench Questions

Pattern 1: VAF + Mutation Type Fraction

Question: "What fraction of variants with VAF < X are annotated as Y mutations?"

python

result = answer_vaf_mutation_fraction(
    vcf_path="input.vcf",
    max_vaf=0.3,
    mutation_type="missense",
    sample="TUMOR"
)

Returns: fraction, total_below_vaf, matching_mutation_type
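
The arithmetic behind this kind of answer is simple enough to sketch over plain dicts. The function below is a hypothetical re-creation, not the skill's `answer_vaf_mutation_fraction`; it assumes each variant already carries `vaf` and `mutation_type` keys, uses a strict `<` comparison to match the wording of the question, and excludes variants with no VAF from the denominator.

```python
from typing import List, Optional


def vaf_mutation_fraction(variants: List[dict], max_vaf: float,
                          mutation_type: str) -> dict:
    """Hypothetical helper: fraction of low-VAF variants with a given type."""
    below = [v for v in variants
             if v.get("vaf") is not None and v["vaf"] < max_vaf]
    matching = [v for v in below if v.get("mutation_type") == mutation_type]
    fraction: Optional[float] = len(matching) / len(below) if below else None
    return {
        "fraction": fraction,
        "total_below_vaf": len(below),
        "matching_mutation_type": len(matching),
    }
```

The denominator is the number of variants below the VAF threshold, not the total number of variants in the file; mixing up the two is a common source of wrong answers for this question type.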

Pattern 2: Cohort Comparison

Question: "What is the difference in mutation frequency between cohorts?"

python

result = answer_cohort_comparison(
    vcf_paths=["cohort1.vcf", "cohort2.vcf"],
    mutation_type="missense",
    cohort_names=["Treatment", "Control"]
)

Returns: cohorts, frequency_difference
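
A comparison of this kind might be computed as in the sketch below. "Mutation frequency" is assumed here to mean the fraction of a cohort's variants that carry the given mutation type; the skill may define it differently (for example, as a fraction of samples), so treat both the definition and the name `cohort_frequency_difference` as illustrative assumptions.

```python
from typing import Dict, List, Optional


def cohort_frequency_difference(cohorts: Dict[str, List[dict]],
                                mutation_type: str) -> dict:
    """Hypothetical helper: per-cohort frequency and first-minus-second difference."""
    frequencies: Dict[str, float] = {}
    for name, variants in cohorts.items():
        hits = sum(1 for v in variants if v.get("mutation_type") == mutation_type)
        frequencies[name] = hits / len(variants) if variants else 0.0

    names = list(cohorts)  # dicts preserve insertion order
    difference: Optional[float] = None
    if len(names) == 2:
        difference = frequencies[names[0]] - frequencies[names[1]]
    return {"cohorts": frequencies, "frequency_difference": difference}
```

The difference is signed (first cohort minus second), so swapping the cohort order flips the sign.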

Pattern 3: Filter and Count

Question: "After filtering X, how many Y remain?"

python

result = answer_non_reference_after_filter(
    vcf_path="input.vcf",
    exclude_intronic_intergenic=True
)

Returns: total_input, non_reference, remaining
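
One way to arrive at these three counts is sketched below. The helpers `is_non_reference` and `count_remaining` are hypothetical; the sketch assumes a genotype is non-reference when at least one called allele is not `0`, and that the non-reference check is applied before the consequence filter. The skill's own function may order or define these steps differently.

```python
import re
from typing import List, Sequence


def is_non_reference(gt: str) -> bool:
    """True if any called allele in a GT string such as '0/1' or '1|1' is non-zero."""
    alleles = re.split(r"[/|]", gt)
    return any(a not in ("0", ".", "") for a in alleles)


def count_remaining(variants: List[dict], sample: str,
                    excluded: Sequence[str] = ("intronic", "intergenic")) -> dict:
    """Hypothetical helper: count non-reference variants left after exclusion."""
    non_ref = [v for v in variants
               if is_non_reference(v.get("genotypes", {}).get(sample, "./."))]
    remaining = [v for v in non_ref if v.get("mutation_type") not in excluded]
    return {
        "total_input": len(variants),
        "non_reference": len(non_ref),
        "remaining": len(remaining),
    }
```

Missing genotypes (`./.`) are treated as not non-reference, so they never contribute to the final count.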

---

ToolUniverse Tools Reference

SNV/Indel Annotation

Tool	When to Use	Parameters	Response
`MyVariant_query_variants`	Batch annotation	`query` (rsID/HGVS)	ClinVar, dbSNP, gnomAD, CADD
`dbsnp_get_variant_by_rsid`	Population frequencies	`rsid`	Frequencies, clinical significance
`gnomad_get_variant`	gnomAD metadata	`variant_id` (CHR-POS-REF-ALT)	Basic variant info
`EnsemblVEP_annotate_rsid`	Consequence prediction	`variant_id` (rsID)	Transcript impact

Structural Variant Annotation

Tool	When to Use	Parameters	Response
`gnomad_get_sv_by_gene`	SV population frequency	`gene_symbol`	SVs with AF, AC, AN
`gnomad_get_sv_by_region`	Regional SV search	`chrom` , `start` , `end`	SVs in region
`ClinGen_dosage_by_gene`	Dosage sensitivity	`gene_symbol`	HI/TS scores, disease
`ClinGen_dosage_region_search`	Dosage-sensitive genes in region	`chromosome` , `start` , `end`	All genes with HI/TS scores
`ensembl_get_structural_variants`	Known SVs from DGVa/dbVar	`chrom` , `start` , `end` , `species`	Clinical significance

See references/annotation_guide.md for detailed tool usage examples
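
When gene coordinates are already available locally, finding the genes affected by an SV reduces to an interval-overlap test, sketched below. The function name `genes_overlapping_sv` and the gene record layout (`symbol`, `chrom`, `start`, `end`) are assumptions for illustration; coordinates are treated as closed intervals on the same assembly, and a leading `chr` prefix is ignored when comparing chromosome names.

```python
from typing import List


def _norm_chrom(chrom: str) -> str:
    """Strip an optional 'chr' prefix so '17' and 'chr17' compare equal."""
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def genes_overlapping_sv(chrom: str, start: int, end: int,
                         genes: List[dict]) -> List[str]:
    """Hypothetical helper: symbols of genes whose span overlaps [start, end]."""
    hits = []
    for gene in genes:
        same_chrom = _norm_chrom(gene["chrom"]) == _norm_chrom(chrom)
        # Two closed intervals overlap when each starts before the other ends.
        if same_chrom and gene["start"] <= end and gene["end"] >= start:
            hits.append(gene["symbol"])
    return hits
```

The returned symbols can then be looked up one by one for dosage sensitivity, or the region can be passed to the region search tool listed in the table above.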

Common Use Patterns

Pattern 1: Quick VCF Summary

Parse VCF, compute statistics, generate report.

python

report = variant_analysis_pipeline("input.vcf", output_file="report.md")

Pattern 2: Filtered Analysis

Parse VCF, apply multi-criteria filter, compute statistics on filtered set.

python

report = variant_analysis_pipeline(
    vcf_path="input.vcf",
    filters=FilterCriteria(min_vaf=0.1, min_depth=20, pass_only=True),
    output_file="filtered_report.md"
)

Pattern 3: Annotated Report

Parse VCF, annotate top variants with ClinVar/gnomAD/CADD, generate clinical report.

python

report = variant_analysis_pipeline(
    vcf_path="input.vcf",
    annotate=True,
    max_annotate=50,
    output_file="annotated_report.md"
)

Pattern 4: BixBench Question Answering

Parse VCF, apply specific filters, compute targeted statistics to answer precise questions.

python

result = answer_vaf_mutation_fraction(
    vcf_path="input.vcf",
    max_vaf=0.3,
    mutation_type="missense"
)

Pattern 5: Cohort Comparison

Parse multiple VCFs, compare mutation frequencies across cohorts.

python

result = answer_cohort_comparison(
    vcf_paths=["cohort1.vcf", "cohort2.vcf"],
    mutation_type="missense"
)

When to Use pandas vs python_implementation

Use pandas when:

You need to read VCF as a flat table
You want to do custom aggregations (groupby, pivot)
You need to join with other data
You're doing exploratory data analysis
You want to export to CSV/Excel

Use python_implementation when:

You need production-grade VCF parsing
You need to extract INFO annotations (ANN, CSQ)
You need per-sample VAF/depth extraction
You need to classify mutation types
You need standard variant statistics (Ti/Tv)
You need to integrate with ToolUniverse annotation

Best approach: Use python_implementation for parsing/classification, then convert to DataFrame for custom analysis:

python

# Parse and classify
vcf_data = parse_vcf("input.vcf")
passing, failing = filter_variants(vcf_data.variants, criteria)

# Convert to DataFrame for custom analysis
df = variants_to_dataframe(passing, sample="TUMOR")

# Now use pandas
missense_high_vaf = df[(df['mutation_type'] == 'missense') & (df['vaf'] >= 0.3)]

---

Limitations

VCF annotation required for mutation classification: If VCF has no ANN/CSQ/FUNCOTATION in INFO, mutation types will be "unknown" until ToolUniverse annotation is applied
Multi-allelic variants: Parser takes first ALT allele for type classification
ToolUniverse annotation rate: API-based, limited to ~100 variants per batch by default to respect rate limits
gnomAD tool: Returns basic metadata only (not full allele frequencies); use MyVariant.info for gnomAD AF
Large VCFs: Pure Python parser streams line-by-line; cyvcf2 is recommended for files with >100K variants

Reference Documentation

references/vcf_filtering.md: Complete filter options and examples
references/mutation_classification_guide.md: Detailed mutation type classification rules
references/annotation_guide.md: ToolUniverse annotation workflows with examples
references/sv_cnv_analysis.md: Complete SV/CNV interpretation workflow

Utility Scripts

scripts/parse_vcf.py: Standalone VCF parsing script
scripts/filter_variants.py: Command-line variant filtering
scripts/annotate_variants.py: Batch variant annotation

Quick Start

See QUICK_START.md for:

Python SDK examples (pipeline, question functions, individual tools)
MCP conversational examples
Common recipes (somatic analysis, clinical screening, population frequency)
Expected output formats
Troubleshooting guide
