seqpro

Python/Rust package for fast biological-sequence processing. Python+NumPy+Numba for hot loops, a small Rust extension (`src/kshuffle.rs`) for graph-algorithm ops, and an awkward-backed `Ragged` array for variable-length batches. Imported as `import seqpro as sp`.

When to use

- Encoding/decoding DNA, RNA, or protein sequences (OHE, integer tokens, padding).
- Sequence augmentation: reverse complement, k-mer shuffle, jitter, random draws.
- Sequence stats: GC content, nucleotide composition, length.
- Variable-length batches (e.g. peaks, transcripts of different sizes) → `sp.Ragged`.
- Genomic interval I/O: `sp.bed`, `sp.gtf`.

Conventions (load these into working memory)

- Public API: see `python/seqpro/__init__.py` for the full export list. Re-read it before assuming a symbol exists.
- Input types (`SeqType` in `python/seqpro/_utils.py`): str, bytes, nested str lists, or `ndarray` with dtype `str_`/`object_`/`bytes_`/`uint8`. `sp.cast_seqs(...)` normalizes string-like inputs to `|S1` bytes arrays; `uint8` (OHE) is left untouched.
- Canonical in-memory dtypes: `|S1` for string sequences, `uint8` for one-hot.
- Axis arguments are required and explicit: most functions take `length_axis` and (for OHE) `ohe_axis` as positional/keyword ints. Negative indices are allowed. `check_axes()` validates and raises early; don't catch and paper over.
- No Python loops over sequences in library code. Hot paths use NumPy, Numba kernels in `_numba.py`, or the Rust extension. If you're tempted to write a `for` over residues, look for an existing vectorized op or a Numba kernel first.
- Alphabets are singletons: `sp.DNA`, `sp.RNA`, `sp.AA`. Construct custom ones via `sp.NucleotideAlphabet` / `sp.AminoAlphabet` (`python/seqpro/alphabets/_alphabets.py`).
- Transforms (`python/seqpro/transforms/`) wrap functional ops as callables; use these in data pipelines instead of inline lambdas.
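
To make the `|S1` convention and explicit axes above concrete, here is a minimal NumPy-only sketch. `to_s1` is a hypothetical helper written for this example; it is not `sp.cast_seqs` and says nothing about how seqpro implements normalization. It assumes ASCII input and equal-length sequences.

```python
import numpy as np

def to_s1(seqs):
    """Hypothetical helper: equal-length ASCII strings -> 2-D |S1 array."""
    if isinstance(seqs, str):
        seqs = [seqs]
    n = len(seqs)
    # One byte per residue, then restore the (batch, length) shape.
    flat = np.frombuffer("".join(seqs).encode("ascii"), dtype="S1")
    return flat.reshape(n, -1)

batch = to_s1(["ACGT", "TTGA"])   # shape (2, 4), dtype |S1
codes = batch.view(np.uint8)      # same memory as ASCII codes (a reinterpretation, not one-hot)
length_axis = -1                  # negative axes count from the end
seq_len = batch.shape[length_axis]
```

Keeping sequences as one byte per residue is what makes the `view` reinterpretation free: both arrays share a single buffer.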

Quick reference

| Task | Call | Notes |
| --- | --- | --- |
| Normalize input | `sp.cast_seqs(x)` | → `\|S1` bytes array |
| One-hot encode | `sp.ohe(x, alphabet, length_axis=-1)` | a new last axis is added for the OHE dim |
| Decode OHE | `sp.decode_ohe(x, alphabet, ohe_axis=-1)` | |
| Tokenize / detokenize | `sp.tokenize` / `sp.decode_tokens` | integer ids |
| Pad | `sp.pad_seqs(x, pad_val, length=...)` | |
| Reverse complement | `sp.reverse_complement(x, alphabet, length_axis=-1)` | works on str/bytes/OHE |
| K-mer shuffle | `sp.k_shuffle(x, k, length_axis=-1, seed=...)` | calls Rust `_k_shuffle` |
| Jitter | `sp.jitter(x, max_jitter, length_axis=-1)` | |
| Random sequences | `sp.random_seqs(shape, alphabet, seed=...)` | |
| GC content | `sp.gc_content(x, length_axis=-1)` | |
| Nucleotide content | `sp.nucleotide_content(x, alphabet, length_axis=-1)` | |
| Coverage binning | `sp.bin_coverage(arr, bin_width, length_axis)` | |
| BED / GTF I/O | `sp.bed.read_bedlike(...)`, `sp.gtf.read_gtf(...)` | polars/pyranges-backed |

For exact signatures and kwargs, read the docstring directly (`sp.<fn>?` in a REPL) or open the source; the files are short.
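
As a rough picture of what one-hot encoding and decoding with an explicit OHE axis compute, the sketch below uses plain NumPy. `ohe_sketch` and `decode_ohe_sketch` are hypothetical names for this example, the alphabet is hard-coded to `ACGT`, and residues outside the alphabet are assumed to encode as all-zero rows; none of this is taken from seqpro's source.

```python
import numpy as np

ALPHABET = np.frombuffer(b"ACGT", dtype="S1")

def ohe_sketch(seqs):
    """Compare every residue with every letter; the OHE dim becomes a new last axis."""
    return (seqs[..., None] == ALPHABET).astype(np.uint8)

def decode_ohe_sketch(ohe, unknown=b"N"):
    """Invert ohe_sketch along the last axis; all-zero rows decode to `unknown`."""
    decoded = ALPHABET[ohe.argmax(axis=-1)]      # fancy indexing returns a fresh array
    decoded[ohe.sum(axis=-1) == 0] = unknown
    return decoded

seqs = np.frombuffer(b"ACGTNA", dtype="S1")
ohe = ohe_sketch(seqs)                           # shape (6, 4), dtype uint8
roundtrip = decode_ohe_sketch(ohe)
```

Both functions are broadcast comparisons and reductions, so the same code works unchanged on a batch of shape `(batch, length)`.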

`Ragged`: variable-length sequence batches

`sp.Ragged` (in `python/seqpro/rag/_array.py`) is the canonical container for batches where sequences differ in length. It is a thin subclass of `ak.Array` with exactly one ragged dimension, plus zero-copy access to the underlying flat NumPy buffer and offsets.

Mental model

`Ragged` has three things:

- `data`: a flat contiguous `NDArray` of shape `(total_elements, *fixed_trailing_dims)`. Zero-copy access via `rag.data`.
- `offsets`: an `int64` array. Shape `(N+1,)` (contiguous, the common case) or `(2, N)` starts/stops (after some slices). Access via `rag.offsets`.
- `shape`: a tuple like `(batch, None, ohe_dim)` where exactly one entry is `None`; that's the ragged axis. `rag.rag_dim` gives its index.

`rag.lengths` derives segment lengths from offsets (cheap, returns an `ndarray`).
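
The relationship between the flat buffer, offsets, and lengths can be sketched with NumPy alone. This only illustrates the layout described above; it does not construct a `Ragged` and does not touch seqpro.

```python
import numpy as np

data = np.frombuffer(b"ACGTACGTACG", dtype="S1")   # flat buffer, 11 elements
lengths = np.array([4, 3, 4], dtype=np.int64)

# Contiguous (N+1,) offsets: segment i spans data[offsets[i]:offsets[i + 1]].
offsets = np.zeros(len(lengths) + 1, dtype=np.int64)
np.cumsum(lengths, out=offsets[1:])

second = data[offsets[1]:offsets[2]]               # a view into data, no copy
recovered_lengths = np.diff(offsets)               # lengths back from offsets

# The (2, N) starts/stops form of the same layout.
starts_stops = np.stack([offsets[:-1], offsets[1:]])
```

Because each segment is a slice of one buffer, bulk operations can run on `data` once instead of once per sequence.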

Construction

```python
import numpy as np, seqpro as sp

# From lengths (most common: you have a flat data buffer and per-segment lengths)
data = np.frombuffer(b"ACGTACGTACG", dtype="S1")
lengths = np.array([4, 3, 4])
rag = sp.rag.Ragged.from_lengths(data, lengths)  # shape (3, None)

# From explicit offsets
offsets = np.array([0, 4, 7, 11], dtype=np.int64)
rag = sp.rag.Ragged.from_offsets(data, shape=(3, None), offsets=offsets)

# Empty with known shape
rag = sp.rag.Ragged.empty((10, None, 4), dtype=np.uint8)  # batch of 10 OHE seqs
```

`Ragged.empty(shape, dtype)` requires exactly one `None` in `shape`. Trailing fixed dims (e.g. the OHE axis) go after the `None`.

Working with `Ragged`: do this, not that

| Task | Do | Don't |
| --- | --- | --- |
| Bulk numeric op on the flat data | `rag.data[:] = ...` or `rag.data.view(...)` (zero-copy) | Iterate `for seq in rag:` |
| Apply a `np.ufunc` | Just call it: `np.exp(rag)`, dispatched via awkward behavior to return a `Ragged` | Manually unpack and rebuild |
| Reinterpret bytes/dtype | `rag.view(np.uint8)` | `np.asarray(rag).view(...)` (loses ragged structure) |
| Reshape non-ragged axes | `rag.reshape(batch, None, k, 4)` | Touch `rag.data.shape` directly |
| Drop a size-1 axis | `rag.squeeze(axis)` (returns `ndarray` if it collapses to 1D) | |
| Densify to NumPy | `rag.to_numpy()` (pads/raises per `allow_missing`) | Loop and stack |
| Strip to plain awkward | `rag.to_ak()` | |
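
To show why densifying needs no loop-and-stack, here is one way to pad a flat buffer plus contiguous offsets into a dense array with a boolean mask. `densify_sketch` is a hypothetical helper for this example and is not how `rag.to_numpy()` is implemented; it assumes `(N+1,)` offsets and 1-D data.

```python
import numpy as np

def densify_sketch(data, offsets, pad_val):
    """Pad ragged segments to a dense (N, max_len) array without a Python loop."""
    lengths = np.diff(offsets)
    max_len = int(lengths.max(initial=0))
    # mask[i, j] is True where row i holds a real element at position j.
    mask = np.arange(max_len) < lengths[:, None]
    out = np.full((len(lengths), max_len), pad_val, dtype=data.dtype)
    # True cells are filled in row-major order, which matches segment order.
    out[mask] = data[offsets[0]:offsets[-1]]
    return out

data = np.frombuffer(b"ACGTACGTACG", dtype="S1")
offsets = np.array([0, 4, 7, 11], dtype=np.int64)
dense = densify_sketch(data, offsets, b"N")        # rows: ACGT, ACGN, TACG
```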

Record-layout `Ragged` (multi-field)

Build by `ak.zip`-ing existing `Ragged`s that share offsets. The result is already a `Ragged`; no manual wrap is needed (it's registered via `ak.behavior`).

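```python
import awkward as ak
import numpy as np, seqpro as sp

# Illustrative inputs: 11 elements split into segments of length 4, 3, 4.
seq_flat = np.frombuffer(b"ACGTACGTACG", dtype="S1")
score_flat = np.ones(11, dtype=np.float32)
lengths = np.array([4, 3, 4])

seq_rag   = sp.rag.Ragged.from_lengths(seq_flat,   lengths)  # |S1
score_rag = sp.rag.Ragged.from_lengths(score_flat, lengths)  # f4
batch = ak.zip({"seq": seq_rag, "score": score_rag})         # → Ragged (record layout)
assert isinstance(batch, sp.rag.Ragged)

batch["score"].data[:] *= 2.0   # zero-copy mutation of the flat score buffer
```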

The two inputs must share offsets (same `lengths` / same `offsets` array); that's what makes the result a single-ragged-dim record. Mixing mismatched offsets falls back to a plain `ak.Array`.

- `rag.dtype` returns a NumPy structured dtype (e.g. `[("seq","S1"),("score","f4")]`), purely as a descriptor; memory is struct-of-arrays (SoA), not array-of-structs (AoS).
- `rag.data` and `rag.parts` return dicts keyed by field name, not single arrays. Always type-check before indexing.
- `rag["field"]` gives zero-copy single-field access and shares the parent's offsets object. Its `.data` is the flat NumPy buffer for that field.
- `view`, `apply`, and `to_numpy` are not defined on record layouts; operate per-field.
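
The SoA layout and the dict-versus-array check can be pictured with plain NumPy. In this sketch the `fields` dict stands in for what the text above says `rag.data` returns on a record layout, and `flat_buffer` is a hypothetical helper; neither is seqpro code.

```python
import numpy as np

offsets = np.array([0, 4, 7, 11], dtype=np.int64)        # shared by every field
fields = {                                               # SoA: one flat buffer per field
    "seq": np.frombuffer(b"ACGTACGTACG", dtype="S1"),
    "score": np.arange(11, dtype=np.float32),
}
descriptor = np.dtype([("seq", "S1"), ("score", "f4")])  # describes the fields, owns no data

def flat_buffer(data, field=None):
    """Hypothetical accessor: branch on dict vs. array before indexing."""
    if isinstance(data, dict):
        if field is None:
            raise TypeError("record layout: pass a field name")
        return data[field]
    return data

scores = flat_buffer(fields, "score")
scores *= 2.0                                            # in-place, per-field operation
second_scores = scores[offsets[1]:offsets[2]]            # segment 1 of the score field
```

Operating per field keeps each buffer homogeneous and contiguous, which is the property a structured (AoS) array would give up.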

Awkward interop — what you can rely on

`Ragged` registers itself with `ak.behavior` so most awkward APIs Just Work:

- NumPy ufuncs (`np.add`, `np.exp`, etc.) on a `Ragged` return a `Ragged`.
- `ak.zip`, slicing, `ak.fields`, etc. work and produce `Ragged` when the result still has exactly one ragged dim. If a slice produces zero or >1 ragged dims, you get a plain `ak.Array` back.
- `ak.to_packed(rag)` is the canonical way to materialize a contiguous buffer (used internally before round-tripping through the Rust extension and other zero-copy paths).
- Don't rely on awkward features outside this contract: strings (use `|S1` bytes), union types, and >1 ragged dim are unsupported by design.

When in doubt, read `python/seqpro/rag/_array.py`; it's ~800 lines and the docstrings are the source of truth. `_gufuncs.py` and `_utils.py` in the same dir are small.

Common pitfalls

- Offsets layout drifts after slicing. `rag.offsets` may become `(2, N)` starts/stops instead of `(N+1,)`. Check `rag.is_contiguous` / call `ak.to_packed(rag)` before any code that assumes `(N+1,)`.
- `rag.data` on a record layout is a dict. Code like `rag.data.shape` will fail; branch on `isinstance(rag.data, dict)` or use `rag.parts` and inspect.
- `Ragged` must have exactly one `None` in `shape`. Constructing from data whose ragged structure doesn't match raises in `__init__`. Use `from_lengths` / `from_offsets` when in doubt.
- The Rust k-shuffle expects contiguous `uint8` with the last axis as sequence length. `sp.k_shuffle` handles this for you; if calling `seqpro._k_shuffle` directly, ensure layout.
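
To illustrate the first pitfall, the sketch below shows what packing means for a `(2, N)` starts/stops layout: gather the referenced elements into a fresh contiguous buffer and rebuild `(N+1,)` offsets. `pack_sketch` is a hypothetical NumPy-only helper written for this example; in real code the text above points to `ak.to_packed(rag)`.

```python
import numpy as np

def pack_sketch(data, starts_stops):
    """Gather (2, N) starts/stops segments into a contiguous buffer with (N+1,) offsets."""
    starts, stops = starts_stops
    lengths = stops - starts
    new_offsets = np.zeros(len(lengths) + 1, dtype=np.int64)
    np.cumsum(lengths, out=new_offsets[1:])
    # Output position p in segment i reads from starts[i] + (p - new_offsets[i]).
    shift = np.repeat(starts - new_offsets[:-1], lengths)
    src = shift + np.arange(new_offsets[-1])
    return data[src], new_offsets

data = np.frombuffer(b"ACGTACGTACG", dtype="S1")
# A reordered, partial selection: segment "TACG" first, then "ACG".
starts_stops = np.array([[7, 0], [11, 3]], dtype=np.int64)
packed, new_offsets = pack_sketch(data, starts_stops)
```

After packing, code that assumes segment `i` lives at `packed[new_offsets[i]:new_offsets[i + 1]]` is valid again.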

Where to look (don't memorize — read the source)

| Need | File |
| --- | --- |
| Public surface | `python/seqpro/__init__.py` |
| Input casting / axis helpers | `python/seqpro/_utils.py` |
| OHE / tokens / padding | `python/seqpro/_encoders.py` |
| Augmentations | `python/seqpro/_modifiers.py` |
| Stats | `python/seqpro/_analyzers.py` |
| Alphabets | `python/seqpro/alphabets/_alphabets.py` |
| Ragged | `python/seqpro/rag/_array.py` |
| Transforms (pipeline objects) | `python/seqpro/transforms/` |
| BED/GTF | `python/seqpro/bed.py`, `gtf.py` |
| Rust k-shuffle | `src/kshuffle.rs` |
| Tests as usage examples | `tests/` |
| Rendered docs | `site/` (built from `docs/`) |

Don'ts

- Don't write Python `for` loops over residues or positions in library code. Look for a vectorized op, a Numba kernel in `_numba.py`, or extend one.
- Don't assume an axis; always pass `length_axis` (and `ohe_axis` where relevant) explicitly.
- Don't reach into `Ragged` internals (`_parts`, `__init__` shortcuts) from user code; use `data`, `offsets`, `parts`, `from_lengths`, `from_offsets`, `empty`.
- Don't introduce strings into `Ragged`. ASCII bytes (`|S1`) only.
- Don't add a feature or change a public signature without updating this skill; see CLAUDE.md.
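
As an example of replacing a per-residue loop with vectorized operations and an explicit `length_axis`, the sketch below computes a reverse complement and GC fraction with NumPy only. The `*_sketch` functions and the lookup table are hypothetical illustrations for DNA in `|S1` form; they are not `sp.reverse_complement` or `sp.gc_content`.

```python
import numpy as np

# Lookup table indexed by ASCII code; bytes without a complement map to themselves.
COMPLEMENT = np.arange(256, dtype=np.uint8)
COMPLEMENT[np.frombuffer(b"ACGT", dtype=np.uint8)] = np.frombuffer(b"TGCA", dtype=np.uint8)

def reverse_complement_sketch(seqs, length_axis):
    """Complement via table lookup, then flip along the sequence axis."""
    codes = np.ascontiguousarray(seqs).view(np.uint8)
    return np.flip(COMPLEMENT[codes], axis=length_axis).view("S1")

def gc_content_sketch(seqs, length_axis):
    """Fraction of G or C residues along the sequence axis."""
    is_gc = (seqs == b"G") | (seqs == b"C")
    return is_gc.mean(axis=length_axis)

seqs = np.frombuffer(b"AACGTTGC", dtype="S1").reshape(2, 4)   # rows: AACG, TTGC
rc = reverse_complement_sketch(seqs, length_axis=-1)          # rows: CGTT, GCAA
gc = gc_content_sketch(seqs, length_axis=-1)
```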
