seqpro


Python/Rust package for fast biological-sequence processing. Python+NumPy+Numba for hot loops, a small Rust extension (`src/kshuffle.rs`) for graph-algorithm ops, and an awkward-backed `Ragged` array for variable-length batches. Imported as `import seqpro as sp`.

When to use


  • Encoding/decoding DNA, RNA, or protein sequences (OHE, integer tokens, padding).
  • Sequence augmentation: reverse complement, k-mer shuffle, jitter, random draws.
  • Sequence stats: GC content, nucleotide composition, length.
  • Variable-length batches (e.g. peaks, transcripts of different sizes) → `sp.Ragged`.
  • Genomic interval I/O: `sp.bed`, `sp.gtf`.

Conventions (load these into working memory)


  • Public API: see `python/seqpro/__init__.py` for the full export list. Re-read it before assuming a symbol exists.
  • Input types (`SeqType` in `python/seqpro/_utils.py`): str, bytes, nested str lists, or `ndarray` with dtype `str_`/`object_`/`bytes_`/`uint8`. `sp.cast_seqs(...)` normalizes string-like inputs to `|S1` bytes arrays; `uint8` (OHE) is left untouched.
  • Canonical in-memory dtypes: `|S1` for string sequences, `uint8` for one-hot.
  • Axis arguments are required and explicit: most functions take `length_axis` and (for OHE) `ohe_axis` as positional/keyword ints. Negative indices are allowed. `check_axes()` validates and raises early; don't catch and paper over.
  • No Python loops over sequences in library code. Hot paths use NumPy, Numba kernels in `_numba.py`, or the Rust extension. If you're tempted to write a `for` over residues, look for an existing vectorized op or a Numba kernel first.
  • Alphabets are singletons: `sp.DNA`, `sp.RNA`, `sp.AA`. Construct custom ones via `sp.NucleotideAlphabet` / `sp.AminoAlphabet` (`python/seqpro/alphabets/_alphabets.py`).
  • Transforms (`python/seqpro/transforms/`) wrap functional ops as callables; use these in data pipelines instead of inline lambdas.
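
To make the dtype and axis conventions concrete, here is a NumPy-only sketch of what an `|S1` array and a `uint8` one-hot array look like in memory. It deliberately does not call seqpro: the helper names `to_s1` and `one_hot` are hypothetical, and the real `sp.cast_seqs` / `sp.ohe` may differ in details.

```python
import numpy as np

ALPHABET = np.frombuffer(b"ACGT", dtype="S1")  # illustrative DNA alphabet


def to_s1(seqs):
    """Hypothetical helper: equal-length strings -> |S1 array of shape (n_seqs, length)."""
    fixed = np.array(seqs, dtype="S")  # one fixed-width bytes item per sequence
    return fixed.view("S1").reshape(len(seqs), -1)


def one_hot(s1, alphabet=ALPHABET):
    """Hypothetical helper: append a one-hot axis as the last axis (dtype uint8)."""
    codes = s1.view(np.uint8)  # same memory, one byte per residue
    return (codes[..., None] == alphabet.view(np.uint8)).astype(np.uint8)


s1 = to_s1(["ACGT", "TTGA"])   # shape (2, 4), dtype |S1
ohe = one_hot(s1)              # shape (2, 4, 4), dtype uint8
length_axis = -2               # in the one-hot array, length sits just before the OHE axis
decoded = ALPHABET[ohe.argmax(axis=-1)]  # back to |S1 via the alphabet
```

Negative axes keep the same code valid when extra leading batch dimensions are added, which is one reason to pass axes explicitly.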

Quick reference


| Task | Call | Notes |
|---|---|---|
| Normalize input | `sp.cast_seqs(x)` | string-like input → `\|S1` bytes array; `uint8` left untouched |
| One-hot encode | `sp.ohe(x, alphabet, length_axis=-1)` | last axis added for OHE dim |
| Decode OHE | `sp.decode_ohe(x, alphabet, ohe_axis=-1)` | |
| Tokenize / detokenize | `sp.tokenize` / `sp.decode_tokens` | integer ids |
| Pad | `sp.pad_seqs(x, pad_val, length=...)` | |
| Reverse complement | `sp.reverse_complement(x, alphabet, length_axis=-1)` | works on str/bytes/OHE |
| K-mer shuffle | `sp.k_shuffle(x, k, length_axis=-1, seed=...)` | calls Rust `_k_shuffle` |
| Jitter | `sp.jitter(x, max_jitter, length_axis=-1)` | |
| Random sequences | `sp.random_seqs(shape, alphabet, seed=...)` | |
| GC content | `sp.gc_content(x, length_axis=-1)` | |
| Nucleotide content | `sp.nucleotide_content(x, alphabet, length_axis=-1)` | |
| Coverage binning | `sp.bin_coverage(arr, bin_width, length_axis)` | |
| BED / GTF I/O | `sp.bed.read_bedlike(...)`, `sp.gtf.read_gtf(...)` | polars/pyranges-backed |

For exact signatures and kwargs, read the docstring directly (`sp.<fn>?` in a REPL, or open the source; files are short).
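
As a rough illustration of what two of the operations above compute, the following NumPy-only sketch implements reverse complement and GC content on an `|S1` array. The function names are hypothetical and this is not seqpro's implementation; it only assumes a plain DNA alphabet and the `length_axis` convention described earlier.

```python
import numpy as np

# 256-entry lookup table: each byte maps to its DNA complement, other bytes to themselves.
COMPLEMENT = np.arange(256, dtype=np.uint8)
for base, comp in zip(b"ACGT", b"TGCA"):  # four-entry setup loop over the alphabet only
    COMPLEMENT[base] = comp


def reverse_complement_sketch(seqs, length_axis=-1):
    """Reverse along length_axis, then complement every base by table lookup."""
    codes = seqs.view(np.uint8)
    return COMPLEMENT[np.flip(codes, axis=length_axis)].view("S1")


def gc_content_sketch(seqs, length_axis=-1):
    """Fraction of G/C residues along length_axis."""
    codes = seqs.view(np.uint8)
    is_gc = (codes == ord("G")) | (codes == ord("C"))
    return is_gc.mean(axis=length_axis)


seqs = np.frombuffer(b"AACGGGCT", dtype="S1").reshape(2, 4)  # rows: AACG, GGCT
rc = reverse_complement_sketch(seqs)   # rows: CGTT, AGCC
gc = gc_content_sketch(seqs)           # per-row GC fraction
```

Both functions operate on the whole batch at once, with no Python loop over sequences or residues.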

`Ragged`: variable-length sequence batches


`sp.Ragged` (in `python/seqpro/rag/_array.py`) is the canonical container for batches where sequences differ in length. It is a thin subclass of `ak.Array` with exactly one ragged dimension, plus zero-copy access to the underlying flat NumPy buffer and offsets.

Mental model


A `Ragged` has three things:
  • `data`: a flat contiguous `NDArray` of shape `(total_elements, *fixed_trailing_dims)`. Zero-copy access via `rag.data`.
  • `offsets`: an `int64` array. Shape `(N+1,)` (contiguous, the common case) or `(2, N)` starts/stops (after some slices). Access via `rag.offsets`.
  • `shape`: a tuple like `(batch, None, ohe_dim)` where exactly one entry is `None`; that entry is the ragged axis. `rag.rag_dim` gives its index.
`rag.lengths` derives segment lengths from offsets (cheap, returns an `ndarray`).
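
The relationship between the flat buffer, offsets, and lengths can be sketched with plain NumPy. This mirrors the mental model only; it does not construct a `Ragged`, and the variable names are illustrative.

```python
import numpy as np

data = np.frombuffer(b"ACGTACGTACG", dtype="S1")  # flat buffer, 11 residues
lengths = np.array([4, 3, 4])

# Contiguous layout: offsets has shape (N+1,); segment i is data[offsets[i]:offsets[i+1]].
offsets = np.zeros(len(lengths) + 1, dtype=np.int64)
np.cumsum(lengths, out=offsets[1:])

# Slicing per segment (a loop here only to display the segments).
segments = [data[offsets[i]:offsets[i + 1]].tobytes() for i in range(len(lengths))]

# Lengths are recovered from offsets with a single diff.
recovered = np.diff(offsets)

# Starts/stops layout, shape (2, N): selecting segments 2 and 0 (in that order)
# only rearranges the offsets; the flat buffer is untouched.
starts_stops = np.stack([offsets[:-1], offsets[1:]])[:, [2, 0]]
```

The `(2, N)` form is why code that assumes `offsets[i+1]` is the end of segment `i` can break after slicing.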

Construction


python
import numpy as np, seqpro as sp

# From lengths (most common: you have a flat data buffer and per-segment lengths)
data = np.frombuffer(b"ACGTACGTACG", dtype="S1")
lengths = np.array([4, 3, 4])
rag = sp.rag.Ragged.from_lengths(data, lengths)  # shape (3, None)

# From explicit offsets
offsets = np.array([0, 4, 7, 11], dtype=np.int64)
rag = sp.rag.Ragged.from_offsets(data, shape=(3, None), offsets=offsets)

# Empty with known shape
rag = sp.rag.Ragged.empty((10, None, 4), dtype=np.uint8)  # batch of 10 OHE seqs

`Ragged.empty(shape, dtype)` requires exactly one `None` in `shape`. Trailing fixed dims (e.g. the OHE axis) go after the `None`.
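
The "exactly one `None`" rule can be expressed as a small check. The helper `find_rag_dim` below is hypothetical and written only to illustrate the rule; seqpro's own validation may behave differently.

```python
def find_rag_dim(shape):
    """Hypothetical validator: return the index of the single None in shape."""
    none_positions = [i for i, dim in enumerate(shape) if dim is None]
    if len(none_positions) != 1:
        raise ValueError(
            f"expected exactly one None in shape, got {len(none_positions)}: {shape}"
        )
    return none_positions[0]


ohe_batch_shape = (10, None, 4)  # 10 sequences, ragged length, 4-channel one-hot
rag_dim = find_rag_dim(ohe_batch_shape)
trailing_dims = ohe_batch_shape[rag_dim + 1:]  # fixed dims after the ragged axis
```

Shapes such as `(10, 4)` (no ragged axis) or `(None, None)` (two ragged axes) would raise under this check.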

Working with `Ragged`: do this, not that


| Task | Do | Don't |
|---|---|---|
| Bulk numeric op on the flat data | `rag.data[:] = ...` or `rag.data.view(...)` (zero-copy) | Iterate `for seq in rag:` |
| Apply a `np.ufunc` | Just call it: `np.exp(rag)`, dispatched via awkward behavior to return a `Ragged` | Manually unpack and rebuild |
| Reinterpret bytes/dtype | `rag.view(np.uint8)` | `np.asarray(rag).view(...)` (loses ragged structure) |
| Reshape non-ragged axes | `rag.reshape(batch, None, k, 4)` | Touch `rag.data.shape` directly |
| Drop a size-1 axis | `rag.squeeze(axis)` (returns `ndarray` if collapses to 1D) | |
| Densify to NumPy | `rag.to_numpy()` (pads/raises per `allow_missing`) | Loop and stack |
| Strip to plain awkward | `rag.to_ak()` | |
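
The reasoning behind the "Do" column can be shown on a bare flat buffer with NumPy alone. This sketch contrasts one vectorized call with a per-segment loop, and pads a ragged batch into a dense array with a boolean mask. It is an illustration of the idea, not how `rag.to_numpy()` is implemented; the pad value and variable names are assumptions.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dtype=np.float32)  # flat buffer
lengths = np.array([3, 1, 2])
offsets = np.concatenate([[0], np.cumsum(lengths)])

# Do: one vectorized op over the whole flat buffer.
scaled = np.exp(data)

# Don't: loop over segments and rebuild (same values, more Python overhead).
looped = np.concatenate(
    [np.exp(data[s:e]) for s, e in zip(offsets[:-1], offsets[1:])]
)

# Densify without a loop: a boolean mask marks the valid cells of the padded array.
pad_value = 0.0  # assumed pad value for this sketch
mask = np.arange(lengths.max()) < lengths[:, None]   # shape (batch, max_len)
dense = np.full(mask.shape, pad_value, dtype=data.dtype)
dense[mask] = data  # row-major fill order matches segment order in the flat buffer
```

Elementwise ops never need the offsets at all, which is why working on the flat buffer is both simpler and faster than iterating.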

Record-layout `Ragged` (multi-field)


Build by `ak.zip`-ing existing `Ragged`s that share offsets. The result is already a `Ragged`, so no manual wrap is needed (it's registered via `ak.behavior`):
python
import awkward as ak
import numpy as np
import seqpro as sp

# Example flat buffers; both fields use the same per-segment lengths.
lengths = np.array([4, 3, 4])
seq_flat = np.frombuffer(b"ACGTACGTACG", dtype="S1")
score_flat = np.ones(lengths.sum(), dtype=np.float32)

seq_rag   = sp.rag.Ragged.from_lengths(seq_flat,   lengths)  # |S1
score_rag = sp.rag.Ragged.from_lengths(score_flat, lengths)  # f4
batch = ak.zip({"seq": seq_rag, "score": score_rag})         # → Ragged (record layout)
assert isinstance(batch, sp.rag.Ragged)

batch["score"].data[:] *= 2.0   # zero-copy mutation of the flat score buffer
The two inputs must share offsets (same `lengths` / same `offsets` array); that is what makes the result a single-ragged-dim record. Mixing mismatched offsets falls back to a plain `ak.Array`.
  • `rag.dtype` returns a NumPy structured dtype (e.g. `[("seq","S1"),("score","f4")]`), purely as a descriptor; memory is SoA, not AoS.
  • `rag.data` and `rag.parts` return dicts keyed by field name, not single arrays. Always type-check before indexing.
  • `rag["field"]` gives zero-copy single-field access and shares the parent's offsets object. Its `.data` is the flat NumPy buffer for that field.
  • `view`, `apply`, and `to_numpy` are not defined on record layouts; operate per-field.
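
The SoA (struct-of-arrays) point can be illustrated without awkward or seqpro: each field lives in its own flat buffer, all fields share one offsets array, and a structured dtype merely names the fields. The `parts` dict here is a stand-in built by hand, not the object returned by `rag.parts`.

```python
import numpy as np

lengths = np.array([4, 3, 4])
offsets = np.concatenate([[0], np.cumsum(lengths)])  # one offsets array for every field

# Struct-of-arrays: each field is its own flat, homogeneous buffer.
parts = {
    "seq": np.frombuffer(b"ACGTACGTACG", dtype="S1"),
    "score": np.ones(lengths.sum(), dtype=np.float32),
}

# A structured dtype can describe the fields without implying interleaved storage.
descriptor = np.dtype([(name, buf.dtype) for name, buf in parts.items()])

# In-place update of one field's flat buffer; per-segment slices see the change.
parts["score"][:] *= 2.0
second_segment_scores = parts["score"][offsets[1]:offsets[2]]
```

Because the fields are separate buffers, a per-field operation touches only that field's memory, which is the practical consequence of operating per-field on record layouts.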

Awkward interop — what you can rely on


`Ragged` registers itself with `ak.behavior` so most awkward APIs Just Work:
  • NumPy ufuncs (`np.add`, `np.exp`, etc.) on a `Ragged` return a `Ragged`.
  • `ak.zip`, slicing, `ak.fields`, etc. work and produce `Ragged` when the result still has exactly one ragged dim. If a slice produces zero or >1 ragged dims, you get a plain `ak.Array` back.
  • `ak.to_packed(rag)` is the canonical way to materialize a contiguous buffer (used internally before round-tripping through the Rust extension and other zero-copy paths).
  • Don't rely on awkward features outside this contract: strings (use `|S1` bytes), union types, and >1 ragged dim are unsupported by design.
When in doubt, read `python/seqpro/rag/_array.py`. It's ~800 lines and the docstrings are the source of truth. `_gufuncs.py` and `_utils.py` in the same dir are small.

Common pitfalls


  • Offsets layout drifts after slicing. `rag.offsets` may become `(2, N)` starts/stops instead of `(N+1,)`. Check `rag.is_contiguous` / call `ak.to_packed(rag)` before any code that assumes `(N+1,)`.
  • `rag.data` on a record layout is a dict. Code like `rag.data.shape` will fail; branch on `isinstance(rag.data, dict)` or use `rag.parts` and inspect.
  • `Ragged` must have exactly one `None` in `shape`. Constructing from data whose ragged structure doesn't match raises in `__init__`. Use `from_lengths` / `from_offsets` when in doubt.
  • The Rust k-shuffle expects contiguous `uint8` with the last axis as sequence length. `sp.k_shuffle` handles this for you; if calling `seqpro._k_shuffle` directly, ensure layout.
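
For the first pitfall, it may help to see what "packing" means in terms of raw arrays. The sketch below turns a `(2, N)` starts/stops layout into a contiguous buffer with `(N+1,)` offsets using NumPy only. It shows the idea behind packing under the layout described above and is not awkward's implementation of `ak.to_packed`.

```python
import numpy as np

data = np.frombuffer(b"ACGTACGTACG", dtype="S1")
# Starts/stops layout, shape (2, N): here segments 2 and 0 of a 3-segment batch.
starts, stops = np.array([[7, 0], [11, 4]], dtype=np.int64)

lengths = stops - starts
packed_offsets = np.concatenate([[0], np.cumsum(lengths)])  # shape (N+1,)

# For every output position, compute the matching source index in the old buffer.
shift = np.repeat(starts - packed_offsets[:-1], lengths)
source_index = shift + np.arange(packed_offsets[-1])
packed_data = data[source_index]  # contiguous copy, segments in their new order
```

After packing, segment `i` is again `packed_data[packed_offsets[i]:packed_offsets[i+1]]`, so code that assumes the `(N+1,)` layout is valid.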

Where to look (don't memorize — read the source)


| Need | File |
|---|---|
| Public surface | `python/seqpro/__init__.py` |
| Input casting / axis helpers | `python/seqpro/_utils.py` |
| OHE / tokens / padding | `python/seqpro/_encoders.py` |
| Augmentations | `python/seqpro/_modifiers.py` |
| Stats | `python/seqpro/_analyzers.py` |
| Alphabets | `python/seqpro/alphabets/_alphabets.py` |
| Ragged | `python/seqpro/rag/_array.py` |
| Transforms (pipeline objects) | `python/seqpro/transforms/` |
| BED/GTF | `python/seqpro/bed.py`, `gtf.py` |
| Rust k-shuffle | `src/kshuffle.rs` |
| Tests as usage examples | `tests/` |
| Rendered docs | `site/` (built from `docs/`) |

Don'ts


  • Don't write Python `for` loops over residues or positions in library code. Look for a vectorized op, a Numba kernel in `_numba.py`, or extend one.
  • Don't assume an axis; always pass `length_axis` (and `ohe_axis` where relevant) explicitly.
  • Don't reach into `Ragged` internals (`_parts`, `__init__` shortcuts) from user code; use `data`, `offsets`, `parts`, `from_lengths`, `from_offsets`, `empty`.
  • Don't introduce strings into `Ragged`. ASCII bytes (`|S1`) only.
  • Don't add a feature or change a public signature without updating this skill; see CLAUDE.md.