seqpro
Python/Rust package for fast biological-sequence processing. It uses Python+NumPy+Numba for hot loops, a small Rust extension (`src/kshuffle.rs`) for graph-algorithm ops, and an awkward-backed `Ragged` array for variable-length batches. Imported as `import seqpro as sp`.

When to use
- Encoding/decoding DNA, RNA, or protein sequences (OHE, integer tokens, padding).
- Sequence augmentation: reverse complement, k-mer shuffle, jitter, random draws.
- Sequence stats: GC content, nucleotide composition, length.
- Variable-length batches (e.g. peaks, transcripts of different sizes) → `sp.Ragged`.
- Genomic interval I/O: `sp.bed`, `sp.gtf`.
Conventions (load these into working memory)
- Public API: see `python/seqpro/__init__.py` for the full export list. Re-read it before assuming a symbol exists.
- Input types (`SeqType` in `python/seqpro/_utils.py`): str, bytes, nested str lists, or `ndarray` with dtype `str_`/`object_`/`bytes_`/`uint8`. `sp.cast_seqs(...)` normalizes string-like inputs to `|S1` bytes arrays; `uint8` (OHE) is left untouched.
- Canonical in-memory dtypes: `|S1` for string sequences, `uint8` for one-hot.
- Axis arguments are required and explicit: most functions take `length_axis` and (for OHE) `ohe_axis` as positional/keyword ints. Negative indices are allowed. `check_axes()` validates and raises early; don't catch the error and paper over it.
- No Python loops over sequences in library code. Hot paths use NumPy, Numba kernels in `_numba.py`, or the Rust extension. If you're tempted to write a `for` over residues, look for an existing vectorized op or a Numba kernel first.
- Alphabets are singletons: `sp.DNA`, `sp.RNA`, `sp.AA`. Construct custom ones via `sp.NucleotideAlphabet`/`sp.AminoAlphabet` (`python/seqpro/alphabets/_alphabets.py`).
- Transforms (`python/seqpro/transforms/`) wrap functional ops as callables. Use these in data pipelines instead of inline lambdas.
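To make the dtype and axis conventions concrete, here is a minimal NumPy-only sketch of the two canonical in-memory forms. `to_s1` and `one_hot` are hypothetical helpers written for this illustration. They are not seqpro functions, and they make no claim about how `sp.cast_seqs` or the library's one-hot encoder is implemented.

```python
import numpy as np

def to_s1(seqs):
    """Hypothetical helper: cast a list of equal-length str sequences to |S1."""
    arr = np.array(seqs, dtype="S")                 # shape (n,), dtype |S<L>
    return arr.view("S1").reshape(*arr.shape, -1)   # shape (n, L), dtype |S1

def one_hot(s1, alphabet=b"ACGT"):
    """Hypothetical helper: append a one-hot axis as the last axis (uint8)."""
    letters = np.frombuffer(alphabet, dtype="S1")   # shape (A,)
    return (s1[..., None] == letters).astype(np.uint8)

seqs = to_s1(["ACGT", "TTGA"])      # |S1, shape (2, 4)
ohe = one_hot(seqs)                 # uint8, shape (2, 4, 4)
length_axis, ohe_axis = -2, -1      # stated explicitly, never assumed
```

Naming both axes up front is the point of the convention: once the one-hot axis exists, the length axis is no longer the last axis.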
Quick reference
| Task | Call | Notes |
|---|---|---|
| Normalize input | `sp.cast_seqs(...)` | → `\|S1` bytes array |
| One-hot encode | | last axis added for OHE dim |
| Decode OHE | | |
| Tokenize / detokenize | | integer ids |
| Pad | | |
| Reverse complement | | works on str/bytes/OHE |
| K-mer shuffle | `sp.k_shuffle` | calls Rust `seqpro._k_shuffle` |
| Jitter | | |
| Random sequences | | |
| GC content | | |
| Nucleotide content | | |
| Coverage binning | | |
| BED / GTF I/O | `sp.bed`, `sp.gtf` | polars/pyranges-backed |
For exact signatures and kwargs, read the docstring directly (`sp.<fn>?` in a REPL, or open the source; the files are short).
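As a rough picture of what two of the operations in the table compute, the sketch below does a reverse complement and a GC-content calculation on an `|S1` array in plain NumPy. The helper names, the lookup table, and the signatures are assumptions made for illustration. The real seqpro calls may differ, so check their docstrings.

```python
import numpy as np

# 256-entry lookup table: each byte maps to its DNA complement, others unchanged.
COMP = np.arange(256, dtype=np.uint8)
COMP[np.frombuffer(b"ACGTacgt", dtype=np.uint8)] = np.frombuffer(b"TGCAtgca", dtype=np.uint8)

def reverse_complement(s1, length_axis):
    """Hypothetical helper: reverse along length_axis, then complement each byte."""
    flipped = np.flip(s1, axis=length_axis)
    return COMP[flipped.view(np.uint8)].view("S1")

def gc_content(s1, length_axis):
    """Hypothetical helper: fraction of G or C along length_axis."""
    is_gc = (s1 == b"G") | (s1 == b"C")
    return is_gc.mean(axis=length_axis)

seqs = np.frombuffer(b"ACGTTTGA", dtype="S1").reshape(2, 4)   # "ACGT", "TTGA"
rc = reverse_complement(seqs, length_axis=-1)
gc = gc_content(seqs, length_axis=-1)
```

Both helpers are fully vectorized and take the length axis explicitly, which matches the conventions above.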
`Ragged`: variable-length sequence batches

`sp.Ragged` (`python/seqpro/rag/_array.py`) is backed by `ak.Array`.

Mental model
A `Ragged` has three things:
- `data`: a flat contiguous `NDArray` of shape `(total_elements, *fixed_trailing_dims)`. Zero-copy access via `rag.data`.
- `offsets`: an `int64` array. Shape `(N+1,)` (contiguous, the common case) or `(2, N)` starts/stops (after some slices). Access via `rag.offsets`.
- `shape`: a tuple like `(batch, None, ohe_dim)` where exactly one entry is `None`; that is the ragged axis. `rag.rag_dim` gives its index.

`rag.lengths` returns the per-segment lengths as an `ndarray`.
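The relationship between the flat buffer and the two offsets layouts can be sketched with NumPy alone. This is an illustration of the data model described above, not the `Ragged` implementation. The variable names are chosen for the example.

```python
import numpy as np

data = np.frombuffer(b"ACGTACGTACG", dtype="S1")      # flat buffer, 11 elements
offsets = np.array([0, 4, 7, 11], dtype=np.int64)     # contiguous (N+1,) layout

starts, stops = offsets[:-1], offsets[1:]
lengths = stops - starts                               # per-segment lengths

# Segment i is data[starts[i]:stops[i]]. Picking segments 2 and 0 without
# copying data needs explicit (2, N) starts/stops: they are no longer adjacent.
picked = np.stack([starts[[2, 0]], stops[[2, 0]]])     # shape (2, 2)

# Loop used only to display the result, not as a processing pattern.
segments = [data[a:b].tobytes() for a, b in picked.T]
```

This is why a slice can turn `(N+1,)` offsets into `(2, N)` starts/stops: the selected segments still point into the same buffer, but they no longer tile it in order.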
Construction
```python
import numpy as np, seqpro as sp

# From lengths (most common: you have a flat data buffer and per-segment lengths)
data = np.frombuffer(b"ACGTACGTACG", dtype="S1")
lengths = np.array([4, 3, 4])
rag = sp.rag.Ragged.from_lengths(data, lengths)  # shape (3, None)

# From explicit offsets
offsets = np.array([0, 4, 7, 11], dtype=np.int64)
rag = sp.rag.Ragged.from_offsets(data, shape=(3, None), offsets=offsets)

# Empty with known shape
rag = sp.rag.Ragged.empty((10, None, 4), dtype=np.uint8)  # batch of 10 OHE seqs
```

`Ragged.empty(shape, dtype)` requires exactly one `None` in `shape`. Trailing fixed dims (e.g. the OHE axis) go after the `None`.
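The `from_lengths` and `from_offsets` examples above describe the same layout, because contiguous offsets are the cumulative sum of the lengths. The sketch below shows that conversion and the "exactly one `None`" shape rule in plain NumPy. `lengths_to_offsets` and `check_ragged_shape` are hypothetical helpers for illustration, not part of seqpro.

```python
import numpy as np

def lengths_to_offsets(lengths):
    """Hypothetical helper: contiguous (N+1,) int64 offsets from lengths."""
    offsets = np.zeros(len(lengths) + 1, dtype=np.int64)
    np.cumsum(lengths, out=offsets[1:])
    return offsets

def check_ragged_shape(shape):
    """Hypothetical check: exactly one None, marking the ragged axis."""
    ragged = [i for i, dim in enumerate(shape) if dim is None]
    if len(ragged) != 1:
        raise ValueError(f"expected exactly one None in shape, got {shape}")
    return ragged[0]

offsets = lengths_to_offsets(np.array([4, 3, 4]))
rag_dim = check_ragged_shape((10, None, 4))
```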
Working with `Ragged`: do this, not that

| Task | Do | Don't |
|---|---|---|
| Bulk numeric op on the flat data | | Iterate |
| Apply a | Just call it: | Manually unpack and rebuild |
| Reinterpret bytes/dtype | | |
| Reshape non-ragged axes | | Touch |
| Drop a size-1 axis | | |
| Densify to NumPy | | Loop and stack |
| Strip to plain awkward | | |
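To show why "loop and stack" is unnecessary for densifying, here is a vectorized padding sketch over a flat buffer and contiguous `(N+1,)` offsets. `pad_to_dense` is a hypothetical helper, not seqpro's densify method. It assumes at least one segment and offsets that start at 0 and cover the whole buffer.

```python
import numpy as np

def pad_to_dense(data, offsets, pad_value):
    """Hypothetical helper: (N, max_len) array from flat data and (N+1,) offsets."""
    lengths = np.diff(offsets)
    dense = np.full((len(lengths), int(lengths.max())), pad_value, dtype=data.dtype)
    mask = np.arange(dense.shape[1]) < lengths[:, None]   # True where a real element goes
    dense[mask] = data                                     # row-major fill matches data order
    return dense

data = np.frombuffer(b"ACGTACGTACG", dtype="S1")
offsets = np.array([0, 4, 7, 11], dtype=np.int64)
dense = pad_to_dense(data, offsets, pad_value=b"N")        # rows: ACGT, ACGN, TACG
```

The boolean mask places every element in one assignment, so the cost does not grow with a Python-level loop over sequences.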
Record-layout `Ragged` (multi-field)

Build by `ak.zip`-ing existing `Ragged`s that share offsets. The result is already a `Ragged`; no manual wrap is needed (it's registered via `ak.behavior`):

```python
import awkward as ak
import numpy as np
import seqpro as sp

# Illustrative inputs: two flat buffers described by the same per-segment lengths.
seq_flat = np.frombuffer(b"ACGTACGTACG", dtype="S1")
score_flat = np.arange(11, dtype="f4")
lengths = np.array([4, 3, 4])

seq_rag = sp.rag.Ragged.from_lengths(seq_flat, lengths)      # |S1
score_rag = sp.rag.Ragged.from_lengths(score_flat, lengths)  # f4
batch = ak.zip({"seq": seq_rag, "score": score_rag})         # → Ragged (record layout)
assert isinstance(batch, sp.rag.Ragged)
batch["score"].data[:] *= 2.0  # zero-copy mutation of the flat score buffer
```

The two inputs must share offsets (same `lengths` / same `offsets` array); that is what makes the result a single-ragged-dim record. Mixing mismatched offsets falls back to a plain `ak.Array`.

- `rag.dtype` returns a NumPy structured dtype (e.g. `[("seq","S1"),("score","f4")]`), purely as a descriptor; memory is SoA, not AoS.
- `rag.data` and `rag.parts` return dicts keyed by field name, not single arrays. Always type-check before indexing.
- `rag["field"]` gives zero-copy single-field access and shares the parent's offsets object. Its `.data` is the flat NumPy buffer for that field.
- `view`, `apply`, and `to_numpy` are not defined on record layouts; operate per-field.
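The SoA point in the list above can be pictured without seqpro or awkward: one offsets array, one flat buffer per field, and a structured dtype that only describes the fields. The sketch also shows the "type-check before indexing" branch. All names here, including `flat_buffers`, are hypothetical and only stand in for the record layout described above.

```python
import numpy as np

offsets = np.array([0, 4, 7, 11], dtype=np.int64)       # one offsets array, shared
fields = {                                               # SoA: one flat buffer per field
    "seq": np.frombuffer(b"ACGTACGTACG", dtype="S1"),
    "score": np.linspace(0.0, 1.0, 11, dtype="f4"),
}
# Structured dtype used purely as a descriptor; no interleaved (AoS) memory exists.
descriptor = np.dtype([(name, buf.dtype) for name, buf in fields.items()])

def flat_buffers(data):
    """Hypothetical helper: normalize 'array or dict of arrays' to a dict."""
    return data if isinstance(data, dict) else {"": data}

# Operate per field: scale float fields, leave byte fields alone.
scaled = {name: buf * 2 if buf.dtype.kind == "f" else buf
          for name, buf in flat_buffers(fields).items()}
```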
Awkward interop — what you can rely on
`Ragged` is registered via `ak.behavior`, so:
- NumPy ufuncs (`np.add`, `np.exp`, etc.) on a `Ragged` return a `Ragged`.
- `ak.zip`, slicing, `ak.fields`, etc. work and produce `Ragged` when the result still has exactly one ragged dim. If a slice produces zero or >1 ragged dims, you get a plain `ak.Array` back.
- `ak.to_packed(rag)` is the canonical way to materialize a contiguous buffer (used internally before round-tripping through the Rust extension and other zero-copy paths).
- Don't rely on awkward features outside this contract: strings (use `|S1` bytes), union types, and >1 ragged dim are unsupported by design.

When in doubt, read `python/seqpro/rag/_array.py`; it's ~800 lines and the docstrings are the source of truth. `_gufuncs.py` and `_utils.py` in the same dir are small.
Common pitfalls
- Offsets layout drifts after slicing. `rag.offsets` may become `(2, N)` starts/stops instead of `(N+1,)`. Check `rag.is_contiguous` / call `ak.to_packed(rag)` before any code that assumes `(N+1,)`.
- `rag.data` on a record layout is a dict. Code like `rag.data.shape` will fail; branch on `isinstance(rag.data, dict)` or use `rag.parts` and inspect.
- `Ragged` must have exactly one `None` in `shape`. Constructing from data whose ragged structure doesn't match raises in `__init__`. Use `from_lengths`/`from_offsets` when in doubt.
- The Rust k-shuffle expects contiguous `uint8` with the last axis as sequence length. `sp.k_shuffle` handles this for you; if calling `seqpro._k_shuffle` directly, ensure the layout yourself.
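The first pitfall is easier to reason about with a concrete picture of what packing has to do. The sketch below turns `(2, N)`-style starts/stops into a contiguous buffer with `(N+1,)` offsets using NumPy only. `pack` is a hypothetical helper that illustrates the idea; it is not how `ak.to_packed` is implemented.

```python
import numpy as np

def pack(data, starts, stops):
    """Hypothetical helper: gather segments into a contiguous buffer with (N+1,) offsets."""
    lengths = stops - starts
    offsets = np.zeros(len(lengths) + 1, dtype=np.int64)
    np.cumsum(lengths, out=offsets[1:])
    # Index of every kept element: its segment's start plus its position in the segment.
    within = np.arange(offsets[-1]) - np.repeat(offsets[:-1], lengths)
    index = np.repeat(starts, lengths) + within
    return data[index], offsets

data = np.frombuffer(b"ACGTACGTACG", dtype="S1")
starts = np.array([7, 0], dtype=np.int64)     # segments 2 and 0 of the original batch
stops = np.array([11, 4], dtype=np.int64)
packed, offsets = pack(data, starts, stops)
```

After packing, code that assumes `offsets[i]:offsets[i + 1]` addresses segment `i` is valid again, at the cost of one copy of the selected elements.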
Where to look (don't memorize — read the source)
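| Need | File |
|---|---|
| Public surface | `python/seqpro/__init__.py` |
| Input casting / axis helpers | `python/seqpro/_utils.py` |
| OHE / tokens / padding | |
| Augmentations | |
| Stats | |
| Alphabets | `python/seqpro/alphabets/_alphabets.py` |
| Ragged | `python/seqpro/rag/_array.py` |
| Transforms (pipeline objects) | `python/seqpro/transforms/` |
| BED/GTF | |
| Rust k-shuffle | `src/kshuffle.rs` |
| Tests as usage examples | |
| Rendered docs | |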
Don'ts
- Don't write Python `for` loops over residues or positions in library code. Look for a vectorized op, a Numba kernel in `_numba.py`, or extend one.
- Don't assume an axis; always pass `length_axis` (and `ohe_axis` where relevant) explicitly.
- Don't reach into `Ragged` internals (`_parts`, `__init__` shortcuts) from user code; use `data`, `offsets`, `parts`, `from_lengths`, `from_offsets`, `empty`.
- Don't introduce strings into `Ragged`. ASCII bytes (`|S1`) only.
- Don't add a feature or change a public signature without updating this skill; see CLAUDE.md.