Pysam

Overview

Pysam is a Python module for reading, manipulating, and writing genomic datasets. Read/write SAM/BAM/CRAM alignment files, VCF/BCF variant files, and FASTA/FASTQ sequences with a Pythonic interface to htslib. Query tabix-indexed files, perform pileup analysis for coverage, and execute samtools/bcftools commands.

When to Use This Skill

This skill should be used when:
  • Working with sequencing alignment files (BAM/CRAM)
  • Analyzing genetic variants (VCF/BCF)
  • Extracting reference sequences or gene regions
  • Processing raw sequencing data (FASTQ)
  • Calculating coverage or read depth
  • Implementing bioinformatics analysis pipelines
  • Quality control of sequencing data
  • Variant calling and annotation workflows

Quick Start

Installation

```bash
uv pip install pysam
```

Basic Examples

**Read alignment file:**
```python
import pysam

# Open BAM file and fetch reads in region
samfile = pysam.AlignmentFile("example.bam", "rb")
for read in samfile.fetch("chr1", 1000, 2000):
    print(f"{read.query_name}: {read.reference_start}")
samfile.close()
```

**Read variant file:**
```python
# Open VCF file and iterate variants
vcf = pysam.VariantFile("variants.vcf")
for variant in vcf:
    print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}")
vcf.close()
```

**Query reference sequence:**
```python
# Open FASTA and extract sequence
fasta = pysam.FastaFile("reference.fasta")
sequence = fasta.fetch("chr1", 1000, 2000)
print(sequence)
fasta.close()
```

Core Capabilities

1. Alignment File Operations (SAM/BAM/CRAM)

Use the `AlignmentFile` class to work with aligned sequencing reads. This is appropriate for analyzing mapping results, calculating coverage, extracting reads, or quality control.
Common operations:
  • Open and read BAM/SAM/CRAM files
  • Fetch reads from specific genomic regions
  • Filter reads by mapping quality, flags, or other criteria
  • Write filtered or modified alignments
  • Calculate coverage statistics
  • Perform pileup analysis (base-by-base coverage)
  • Access read sequences, quality scores, and alignment information
Reference: See `references/alignment_files.md` for detailed documentation on:
  • Opening and reading alignment files
  • AlignedSegment attributes and methods
  • Region-based fetching with `fetch()`
  • Pileup analysis for coverage
  • Writing and creating BAM files
  • Coordinate systems and indexing
  • Performance optimization tips
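
The filtering and writing operations listed above could be combined roughly as follows. This is a minimal sketch rather than the documented approach: the helper names `passes_filters` and `write_filtered`, the MAPQ threshold of 30, and the choice of flags to exclude are assumptions made for illustration.

```python
def passes_filters(read, min_mapq=30):
    """Return True for mapped, primary, non-duplicate reads with MAPQ >= min_mapq."""
    if read.is_unmapped or read.is_secondary or read.is_supplementary:
        return False
    if read.is_duplicate:
        return False
    return read.mapping_quality >= min_mapq


def write_filtered(in_path, out_path, contig, start, end, min_mapq=30):
    """Copy reads overlapping [start, end) that pass the filters into a new BAM.

    Hypothetical helper; requires an indexed BAM because it uses fetch().
    """
    import pysam  # imported here so passes_filters stays usable on its own

    kept = 0
    with pysam.AlignmentFile(in_path, "rb") as src, \
            pysam.AlignmentFile(out_path, "wb", template=src) as dst:
        for read in src.fetch(contig, start, end):
            if passes_filters(read, min_mapq):
                dst.write(read)
                kept += 1
    return kept
```

Keeping the predicate separate from the I/O makes it easy to reuse the same criteria when counting or piling up reads.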

2. Variant File Operations (VCF/BCF)

Use the `VariantFile` class to work with genetic variants from variant calling pipelines. This is appropriate for variant analysis, filtering, annotation, or population genetics.
Common operations:
  • Read and write VCF/BCF files
  • Query variants in specific regions
  • Access variant information (position, alleles, quality)
  • Extract genotype data for samples
  • Filter variants by quality, allele frequency, or other criteria
  • Annotate variants with additional information
  • Subset samples or regions
Reference: See `references/variant_files.md` for detailed documentation on:
  • Opening and reading variant files
  • VariantRecord attributes and methods
  • Accessing INFO and FORMAT fields
  • Working with genotypes and samples
  • Creating and writing VCF files
  • Filtering and subsetting variants
  • Multi-sample VCF operations
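
As an illustration of quality-based filtering, the sketch below keeps records whose QUAL meets a threshold and whose FILTER column is empty or PASS. The helper names, the threshold of 30, and the PASS-only policy are assumptions for this example, not rules stated in the reference documents.

```python
def passes_variant_filters(record, min_qual=30.0):
    """Keep records with QUAL >= min_qual whose FILTER is empty or PASS."""
    if record.qual is None or record.qual < min_qual:
        return False
    filters = list(record.filter.keys())
    return not filters or filters == ["PASS"]


def write_filtered_variants(in_path, out_path, min_qual=30.0):
    """Write the passing records of in_path to out_path, reusing the input header."""
    import pysam  # imported here so the predicate above has no hard dependency

    kept = 0
    with pysam.VariantFile(in_path) as src, \
            pysam.VariantFile(out_path, "w", header=src.header) as dst:
        for record in src:
            if passes_variant_filters(record, min_qual):
                dst.write(record)
                kept += 1
    return kept
```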

3. Sequence File Operations (FASTA/FASTQ)

Use `FastaFile` for random access to reference sequences and `FastxFile` for reading raw sequencing data. This is appropriate for extracting gene sequences, validating variants against reference, or processing raw reads.
Common operations:
  • Query reference sequences by genomic coordinates
  • Extract sequences for genes or regions of interest
  • Read FASTQ files with quality scores
  • Validate variant reference alleles
  • Calculate sequence statistics
  • Filter reads by quality or length
  • Convert between FASTA and FASTQ formats
Reference: See `references/sequence_files.md` for detailed documentation on:
  • FASTA file access and indexing
  • Extracting sequences by region
  • Handling reverse complement for genes
  • Reading FASTQ files sequentially
  • Quality score conversion and filtering
  • Working with tabix-indexed files (BED, GTF, GFF)
  • Common sequence processing patterns
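
Two of the operations above, reverse-complementing a sequence and filtering FASTQ reads by length and quality, might look like the sketch below. The function names and default thresholds are illustrative assumptions; the quality array is decoded with the default Phred+33 offset.

```python
# Translation table for complementing DNA bases (N is left as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def mean_quality(qualities):
    """Arithmetic mean of a sequence of Phred scores (0.0 if empty)."""
    return sum(qualities) / len(qualities) if qualities else 0.0


def filter_fastq(path, min_length=50, min_mean_quality=20.0):
    """Yield (name, sequence, quality string) for reads passing both thresholds."""
    import pysam  # imported here so the pure helpers work without pysam

    with pysam.FastxFile(path) as fastq:
        for entry in fastq:
            quals = entry.get_quality_array()  # None for FASTA entries
            if quals is None:
                continue
            if (len(entry.sequence) >= min_length
                    and mean_quality(quals) >= min_mean_quality):
                yield entry.name, entry.sequence, entry.quality
```

A minus-strand gene sequence would then be obtained by passing the string returned from `FastaFile.fetch()` through `reverse_complement`.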

4. Integrated Bioinformatics Workflows

Pysam excels at integrating multiple file types for comprehensive genomic analyses. Common workflows combine alignment files, variant files, and reference sequences.
Common workflows:
  • Calculate coverage statistics for specific regions
  • Validate variants against aligned reads
  • Annotate variants with coverage information
  • Extract sequences around variant positions
  • Filter alignments or variants based on multiple criteria
  • Generate coverage tracks for visualization
  • Quality control across multiple data types
Reference: See `references/common_workflows.md` for detailed examples of:
  • Quality control workflows (BAM statistics, reference consistency)
  • Coverage analysis (per-base coverage, low coverage detection)
  • Variant analysis (annotation, filtering by read support)
  • Sequence extraction (variant contexts, gene sequences)
  • Read filtering and subsetting
  • Integration patterns (BAM+VCF, VCF+BED, etc.)
  • Performance optimization for complex workflows
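
One way to annotate variants with coverage information is to count the reads overlapping each variant site. The sketch below is an assumption-laden illustration: `depth_at_variants` is a hypothetical helper, it expects already-open `AlignmentFile` and `VariantFile` handles, and it counts any read overlapping the site (including reads that span it with a deletion or splice), which is cruder than a pileup-based depth.

```python
def depth_at_variants(bam, vcf):
    """Yield (chrom, 1-based pos, read count) for every record in vcf.

    bam: an open, indexed pysam.AlignmentFile
    vcf: an open pysam.VariantFile (or any iterable of records)
    """
    for record in vcf:
        # record.start is 0-based, so [start, start + 1) is the variant's first base.
        n_reads = bam.count(
            contig=record.chrom,
            start=record.start,
            stop=record.start + 1,
            read_callback="all",  # skip unmapped, secondary, QC-fail, duplicate reads
        )
        yield record.chrom, record.pos, n_reads
```

Typical use would open both files with `pysam.AlignmentFile("sample.bam", "rb")` and `pysam.VariantFile("variants.vcf")` (placeholder paths) and iterate over the generator.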

Key Concepts

Coordinate Systems

Critical: Pysam uses 0-based, half-open coordinates (Python convention):
  • Start positions are 0-based (first base is position 0)
  • End positions are exclusive (not included in the range)
  • Region 1000-2000 includes bases 1000-1999 (1000 bases total)
Exception: Region strings in `fetch()` follow the samtools convention (1-based, inclusive):
```python
samfile.fetch("chr1", 999, 2000)      # 0-based, half-open: 0-based positions 999-1999
samfile.fetch("chr1:1000-2000")       # 1-based, inclusive string: the same 1001 bases
```
VCF files: The file format uses 1-based coordinates (exposed as `VariantRecord.pos`), but `VariantRecord.start` is 0-based.
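
The conversion between the two conventions is a one-position shift of the start coordinate. The helpers below are a small sketch of that arithmetic; their names are hypothetical, and they assume a region string that carries both a start and an end.

```python
def region_to_interval(region):
    """Convert a 1-based inclusive 'contig:start-end' string
    to a 0-based half-open (contig, start, end) tuple."""
    contig, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    # Only the start shifts: the inclusive 1-based end equals the exclusive 0-based end.
    return contig, start - 1, end


def interval_to_region(contig, start, end):
    """Convert a 0-based half-open interval to a 1-based inclusive region string."""
    return f"{contig}:{start + 1}-{end}"


print(region_to_interval("chr1:1000-2000"))   # ('chr1', 999, 2000)
print(interval_to_region("chr1", 999, 2000))  # chr1:1000-2000
```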

Indexing Requirements

Random access to specific genomic regions requires index files:
  • BAM files: Require `.bai` index (create with `pysam.index()`)
  • CRAM files: Require `.crai` index
  • FASTA files: Require `.fai` index (create with `pysam.faidx()`)
  • VCF.gz files: Require `.tbi` tabix index (create with `pysam.tabix_index()`)
  • BCF files: Require `.csi` index
Without an index, use `fetch(until_eof=True)` for sequential reading.
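
A pipeline can check for the index before attempting random access. The sketch below is one possible arrangement: `expected_index` and `ensure_index` are hypothetical helpers, the extension table simply mirrors the list above, and it assumes the index sits next to the data file with the suffix appended (other naming schemes exist and are not handled).

```python
import os

# Index suffix expected next to each file type, mirroring the list above.
INDEX_SUFFIXES = {
    ".bam": ".bai",
    ".cram": ".crai",
    ".fasta": ".fai",
    ".fa": ".fai",
    ".vcf.gz": ".tbi",
    ".bcf": ".csi",
}


def expected_index(path):
    """Return the index path assumed for a data file, e.g. x.bam -> x.bam.bai."""
    for extension, suffix in INDEX_SUFFIXES.items():
        if path.endswith(extension):
            return path + suffix
    raise ValueError(f"no known index type for {path!r}")


def ensure_index(path):
    """Create the index if it is missing, for the types with a creator listed above."""
    index_path = expected_index(path)
    if os.path.exists(index_path):
        return index_path

    import pysam  # only needed when an index actually has to be built

    if path.endswith(".bam"):
        pysam.index(path)
    elif path.endswith((".fa", ".fasta")):
        pysam.faidx(path)
    elif path.endswith(".vcf.gz"):
        pysam.tabix_index(path, preset="vcf")
    else:
        # CRAM and BCF: the text above names no creation call, so none is guessed here.
        raise NotImplementedError(f"index creation not sketched for {path!r}")
    return index_path
```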

File Modes

Specify format when opening files:
  • `"rb"` - Read BAM (binary)
  • `"r"` - Read SAM (text)
  • `"rc"` - Read CRAM
  • `"wb"` - Write BAM
  • `"w"` - Write SAM
  • `"wc"` - Write CRAM
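
When file names are only known at run time, the mode string can be derived from the extension. This is a convenience sketch, not something pysam provides: `alignment_mode` is a hypothetical helper and it assumes files carry conventional `.sam`, `.bam`, or `.cram` extensions.

```python
import os

# Format flag appended to "r" or "w", following the table above.
_FORMAT_FLAGS = {".sam": "", ".bam": "b", ".cram": "c"}


def alignment_mode(path, write=False):
    """Return the AlignmentFile mode string for a path based on its extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in _FORMAT_FLAGS:
        raise ValueError(f"unrecognized alignment extension: {extension!r}")
    return ("w" if write else "r") + _FORMAT_FLAGS[extension]


print(alignment_mode("sample.bam"))               # rb
print(alignment_mode("sample.cram", write=True))  # wc
```

The result can be passed directly as the second argument of `pysam.AlignmentFile`.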

Performance Considerations

  1. Always use indexed files for random access operations
  2. Use `pileup()` for column-wise analysis instead of repeated fetch operations
  3. Use `count()` for counting instead of iterating and counting manually
  4. Process regions in parallel when analyzing independent genomic regions
  5. Close files explicitly to free resources
  6. Use `until_eof=True` for sequential processing without an index
  7. Avoid multiple iterators unless necessary (use `multiple_iterators=True` if needed)
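
Points 3 and 4 can be combined by splitting a contig into windows and letting each worker process call `count()` on its own file handle. The following is a rough sketch under stated assumptions: the function names, the window size, and the worker count are illustrative, and a read that spans a window boundary is counted in both windows.

```python
from concurrent.futures import ProcessPoolExecutor


def make_windows(contig_length, window_size):
    """Split [0, contig_length) into 0-based half-open windows."""
    return [
        (start, min(start + window_size, contig_length))
        for start in range(0, contig_length, window_size)
    ]


def count_window(job):
    """Count reads overlapping one window; each worker opens its own handle."""
    bam_path, contig, start, end = job
    import pysam  # imported in the worker so make_windows has no dependency

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return bam.count(contig=contig, start=start, stop=end)


def parallel_counts(bam_path, contig, contig_length,
                    window_size=1_000_000, workers=4):
    """Return {(start, end): read count} for every window on the contig."""
    windows = make_windows(contig_length, window_size)
    jobs = [(bam_path, contig, start, end) for start, end in windows]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(zip(windows, pool.map(count_window, jobs)))
```

Opening the file inside the worker avoids sharing a single handle across processes, and the `with` block closes it as soon as the window is done.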

Common Pitfalls

  1. Coordinate confusion: Remember which contexts use 0-based and which use 1-based coordinates
  2. Missing indices: Many operations require index files, so create them first
  3. Partial overlaps: `fetch()` returns every read that overlaps the region, including reads that extend past its boundaries, not just those fully contained
  4. Iterator scope: Keep pileup iterator references alive to avoid "PileupProxy accessed after iterator finished" errors
  5. Quality score editing: Assigning `query_sequence` invalidates `query_qualities`, so copy the qualities before changing the sequence and reassign them afterward
  6. Stream limitations: Only stdin/stdout are supported for streaming, not arbitrary Python file objects
  7. Thread safety: Although the GIL is released during I/O, thread safety has not been fully validated
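
Pitfalls 3 and 5 each have a short workaround, sketched below. Both helper names are hypothetical. `trim_start` is meant for unaligned reads only, because an aligned read would also need its CIGAR and start position updated, which this sketch does not attempt.

```python
def fully_contained(read, start, end):
    """True if the read's alignment lies entirely within [start, end) (0-based)."""
    if read.reference_end is None:  # unmapped reads have no alignment end
        return False
    return read.reference_start >= start and read.reference_end <= end


def trim_start(read, n_bases):
    """Drop the first n_bases of an unaligned read, keeping qualities in sync."""
    qualities = read.query_qualities  # take the qualities BEFORE editing the sequence
    read.query_sequence = read.query_sequence[n_bases:]
    if qualities is not None:
        read.query_qualities = qualities[n_bases:]
```

`fully_contained` can be applied as a filter over the reads returned by `fetch()` when boundary-spanning reads should be excluded.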

Command-Line Tools

Pysam provides access to samtools and bcftools commands:
```python
import pysam

# Sort BAM file
pysam.samtools.sort("-o", "sorted.bam", "input.bam")

# Index BAM
pysam.samtools.index("sorted.bam")

# View specific region
pysam.samtools.view("-b", "-o", "region.bam", "input.bam", "chr1:1000-2000")

# BCF tools
pysam.bcftools.view("-O", "z", "-o", "output.vcf.gz", "input.vcf")
```

**Error handling:**
```python
try:
    pysam.samtools.sort("-o", "output.bam", "input.bam")
except pysam.SamtoolsError as e:
    print(f"Error: {e}")
```

Resources

references/

Detailed documentation for each major capability:
  • alignment_files.md - Complete guide to SAM/BAM/CRAM operations, including AlignmentFile class, AlignedSegment attributes, fetch operations, pileup analysis, and writing alignments
  • variant_files.md - Complete guide to VCF/BCF operations, including VariantFile class, VariantRecord attributes, genotype handling, INFO/FORMAT fields, and multi-sample operations
  • sequence_files.md - Complete guide to FASTA/FASTQ operations, including FastaFile and FastxFile classes, sequence extraction, quality score handling, and tabix-indexed file access
  • common_workflows.md - Practical examples of integrated bioinformatics workflows combining multiple file types, including quality control, coverage analysis, variant validation, and sequence extraction

Getting Help

For detailed information on specific operations, refer to the appropriate reference document:
  • Working with BAM files or calculating coverage → `alignment_files.md`
  • Analyzing variants or genotypes → `variant_files.md`
  • Extracting sequences or processing FASTQ → `sequence_files.md`
  • Complex workflows integrating multiple file types → `common_workflows.md`
Official documentation: https://pysam.readthedocs.io/