pysam

Overview

Pysam is a Python module for reading, manipulating, and writing genomic datasets. Read/write SAM/BAM/CRAM alignment files, VCF/BCF variant files, and FASTA/FASTQ sequences with a Pythonic interface to htslib. Query tabix-indexed files, perform pileup analysis for coverage, and execute samtools/bcftools commands.

When to Use This Skill

This skill should be used when:

Working with sequencing alignment files (BAM/CRAM)
Analyzing genetic variants (VCF/BCF)
Extracting reference sequences or gene regions
Processing raw sequencing data (FASTQ)
Calculating coverage or read depth
Implementing bioinformatics analysis pipelines
Quality control of sequencing data
Variant calling and annotation workflows

Quick Start

Installation

```bash
uv pip install pysam
```

Basic Examples

**Read alignment file:**
```python
import pysam

# Open BAM file and fetch reads in region
samfile = pysam.AlignmentFile("example.bam", "rb")
for read in samfile.fetch("chr1", 1000, 2000):
    print(f"{read.query_name}: {read.reference_start}")
samfile.close()
```

**Read variant file:**
```python
# Open VCF file and iterate variants
vcf = pysam.VariantFile("variants.vcf")
for variant in vcf:
    print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}")
vcf.close()
```

**Query reference sequence:**
```python
# Open FASTA and extract sequence
fasta = pysam.FastaFile("reference.fasta")
sequence = fasta.fetch("chr1", 1000, 2000)
print(sequence)
fasta.close()
```

Core Capabilities

1. Alignment File Operations (SAM/BAM/CRAM)

Use the `AlignmentFile` class to work with aligned sequencing reads. This is appropriate for analyzing mapping results, calculating coverage, extracting reads, or quality control.

Common operations:

Open and read BAM/SAM/CRAM files
Fetch reads from specific genomic regions
Filter reads by mapping quality, flags, or other criteria
Write filtered or modified alignments
Calculate coverage statistics
Perform pileup analysis (base-by-base coverage)
Access read sequences, quality scores, and alignment information

Reference: See `references/alignment_files.md` for detailed documentation on:

Opening and reading alignment files
AlignedSegment attributes and methods
Region-based fetching with `fetch()`
Pileup analysis for coverage
Writing and creating BAM files
Coordinate systems and indexing
Performance optimization tips

2. Variant File Operations (VCF/BCF)

Use the `VariantFile` class to work with genetic variants from variant calling pipelines. This is appropriate for variant analysis, filtering, annotation, or population genetics.

Common operations:

Read and write VCF/BCF files
Query variants in specific regions
Access variant information (position, alleles, quality)
Extract genotype data for samples
Filter variants by quality, allele frequency, or other criteria
Annotate variants with additional information
Subset samples or regions

Reference: See `references/variant_files.md` for detailed documentation on:

Opening and reading variant files
VariantRecord attributes and methods
Accessing INFO and FORMAT fields
Working with genotypes and samples
Creating and writing VCF files
Filtering and subsetting variants
Multi-sample VCF operations

3. Sequence File Operations (FASTA/FASTQ)

Use `FastaFile` for random access to reference sequences and `FastxFile` for reading raw sequencing data. This is appropriate for extracting gene sequences, validating variants against the reference, or processing raw reads.

Common operations:

Query reference sequences by genomic coordinates
Extract sequences for genes or regions of interest
Read FASTQ files with quality scores
Validate variant reference alleles
Calculate sequence statistics
Filter reads by quality or length
Convert between FASTA and FASTQ formats

Reference: See `references/sequence_files.md` for detailed documentation on:

FASTA file access and indexing
Extracting sequences by region
Handling reverse complement for genes
Reading FASTQ files sequentially
Quality score conversion and filtering
Working with tabix-indexed files (BED, GTF, GFF)
Common sequence processing patterns

4. Integrated Bioinformatics Workflows

Pysam excels at integrating multiple file types for comprehensive genomic analyses. Common workflows combine alignment files, variant files, and reference sequences.

Common workflows:

Calculate coverage statistics for specific regions
Validate variants against aligned reads
Annotate variants with coverage information
Extract sequences around variant positions
Filter alignments or variants based on multiple criteria
Generate coverage tracks for visualization
Quality control across multiple data types

Reference: See `references/common_workflows.md` for detailed examples of:

Quality control workflows (BAM statistics, reference consistency)
Coverage analysis (per-base coverage, low coverage detection)
Variant analysis (annotation, filtering by read support)
Sequence extraction (variant contexts, gene sequences)
Read filtering and subsetting
Integration patterns (BAM+VCF, VCF+BED, etc.)
Performance optimization for complex workflows

Key Concepts

Coordinate Systems

Critical: Pysam uses 0-based, half-open coordinates (Python convention):

Start positions are 0-based (first base is position 0)
End positions are exclusive (not included in the range)
Region 1000-2000 includes bases 1000-1999 (1000 bases total)

Exception: Region strings in `fetch()` follow samtools convention (1-based, inclusive):

```python
samfile.fetch("chr1", 999, 2000)      # 0-based, half-open: positions 999-1999
samfile.fetch("chr1:1000-2000")       # 1-based string: positions 1000-2000, the same bases as above
```

VCF files: Use 1-based coordinates in the file format (and `VariantRecord.pos` is 1-based), but `VariantRecord.start` is 0-based.
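
To make the two conventions concrete, here is a minimal sketch of converting between them in plain Python. The helper names `region_to_interval` and `interval_to_region` are hypothetical and are not part of pysam.

```python
def region_to_interval(region):
    """Convert a 1-based, inclusive 'contig:start-end' string
    into a (contig, start, end) tuple that is 0-based and half-open."""
    contig, span = region.rsplit(":", 1)
    start, end = span.replace(",", "").split("-")
    # Only the start shifts: a 1-based inclusive end equals a 0-based exclusive end.
    return contig, int(start) - 1, int(end)


def interval_to_region(contig, start, end):
    """Convert a 0-based, half-open interval into a samtools-style region string."""
    return f"{contig}:{start + 1}-{end}"


interval = region_to_interval("chr1:1000-2000")
region = interval_to_region("chr1", 999, 2000)
n_bases = interval[2] - interval[1]
print(interval, region, n_bases)
```

In half-open form the interval length is simply `end - start`, which is one reason the 0-based convention is convenient for arithmetic.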

Indexing Requirements

Random access to specific genomic regions requires index files:

BAM files: Require `.bai` index (create with `pysam.index()`)
CRAM files: Require `.crai` index
FASTA files: Require `.fai` index (create with `pysam.faidx()`)
VCF.gz files: Require `.tbi` tabix index (create with `pysam.tabix_index()`)
BCF files: Require `.csi` index

Without an index, use `fetch(until_eof=True)` for sequential reading.

File Modes

Specify format when opening files:

`"rb"` - Read BAM (binary)
`"r"` - Read SAM (text)
`"rc"` - Read CRAM
`"wb"` - Write BAM
`"w"` - Write SAM
`"wc"` - Write CRAM

Performance Considerations

Always use indexed files for random access operations
Use `pileup()` for column-wise analysis instead of repeated fetch operations
Use `count()` for counting instead of iterating and counting manually
Process regions in parallel when analyzing independent genomic regions
Close files explicitly to free resources
Use `until_eof=True` for sequential processing without index
Avoid multiple iterators unless necessary (use `multiple_iterators=True` if needed)

Common Pitfalls

Coordinate confusion: Remember 0-based vs 1-based systems in different contexts
Missing indices: Many operations require index files, so create them first
Partial overlaps: `fetch()` returns reads overlapping region boundaries, not just those fully contained
Iterator scope: Keep pileup iterator references alive to avoid "PileupProxy accessed after iterator finished" errors
Quality score editing: Assigning `query_sequence` invalidates `query_qualities`, so copy the qualities first and reassign them afterwards
Stream limitations: Only stdin/stdout are supported for streaming, not arbitrary Python file objects
Thread safety: While the GIL is released during I/O, comprehensive thread-safety hasn't been fully validated

Command-Line Tools

Pysam provides access to samtools and bcftools commands:

```python
import pysam

# Sort BAM file
pysam.samtools.sort("-o", "sorted.bam", "input.bam")

# Index BAM
pysam.samtools.index("sorted.bam")

# View specific region (region queries need the sorted, indexed BAM)
pysam.samtools.view("-b", "-o", "region.bam", "sorted.bam", "chr1:1000-2000")

# BCF tools
pysam.bcftools.view("-O", "z", "-o", "output.vcf.gz", "input.vcf")
```

**Error handling:**
```python
try:
    pysam.samtools.sort("-o", "output.bam", "input.bam")
except pysam.SamtoolsError as e:
    print(f"Error: {e}")
```


Resources

references/

Detailed documentation for each major capability:

alignment_files.md - Complete guide to SAM/BAM/CRAM operations, including AlignmentFile class, AlignedSegment attributes, fetch operations, pileup analysis, and writing alignments
variant_files.md - Complete guide to VCF/BCF operations, including VariantFile class, VariantRecord attributes, genotype handling, INFO/FORMAT fields, and multi-sample operations
sequence_files.md - Complete guide to FASTA/FASTQ operations, including FastaFile and FastxFile classes, sequence extraction, quality score handling, and tabix-indexed file access
common_workflows.md - Practical examples of integrated bioinformatics workflows combining multiple file types, including quality control, coverage analysis, variant validation, and sequence extraction

Getting Help

For detailed information on specific operations, refer to the appropriate reference document:

Working with BAM files or calculating coverage → `alignment_files.md`
Analyzing variants or genotypes → `variant_files.md`
Extracting sequences or processing FASTQ → `sequence_files.md`
Complex workflows integrating multiple file types → `common_workflows.md`

Official documentation: https://pysam.readthedocs.io/
