mamba-architecture

Mamba - Selective State Space Models

Quick start

Mamba is a state-space model architecture that achieves O(n) (linear-time) complexity for sequence modeling.
Installation:

```bash
# Install causal-conv1d (optional, for efficiency)
pip install "causal-conv1d>=1.4.0"

# Install Mamba
pip install mamba-ssm

# Or both together
pip install "mamba-ssm[causal-conv1d]"
```

**Prerequisites**: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+

**Basic usage** (Mamba block):
```python
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")

model = Mamba(
    d_model=dim,      # Model dimension
    d_state=16,       # SSM state dimension
    d_conv=4,         # Conv1d kernel size
    expand=2          # Expansion factor
).to("cuda")

y = model(x)  # O(n) complexity!
assert y.shape == x.shape
```
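The O(n) claim comes from the recurrent view: the block carries a fixed-size hidden state and performs one update per token, rather than attending over all previous tokens. A minimal NumPy sketch of a plain diagonal SSM scan (illustrative only: no input-dependent selectivity, and nothing like the library's fused CUDA kernels):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Naive diagonal state-space scan: h_t = A*h_{t-1} + B*x_t, y_t = sum(C*h_t).

    x: (length, dim) input sequence
    A, B, C: (dim, d_state) per-channel dynamics, input, and output maps.
    Runs in O(length) time with a state whose size is independent of length.
    """
    length, dim = x.shape
    d_state = A.shape[1]
    h = np.zeros((dim, d_state))       # fixed-size state, never grows with length
    y = np.empty((length, dim))
    for t in range(length):            # one O(1)-state update per token
        h = A * h + B * x[t][:, None]  # elementwise recurrence (diagonal A)
        y[t] = (C * h).sum(axis=1)     # per-channel readout
    return y

rng = np.random.default_rng(0)
length, dim, d_state = 64, 16, 8
x = rng.standard_normal((length, dim))
A = np.full((dim, d_state), 0.9)       # stable decay
B = rng.standard_normal((dim, d_state)) * 0.1
C = rng.standard_normal((dim, d_state)) * 0.1
y = ssm_scan(x, A, B, C)
assert y.shape == x.shape
```

Mamba additionally makes A, B, and C functions of the input (the "selective" part), but the per-token cost stays constant.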

Common workflows

Workflow 1: Language model with Mamba-2

Complete LM with generation:
```python
import torch
from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Configure Mamba-2 LM
config = MambaConfig(
    d_model=1024,           # Hidden dimension
    n_layer=24,             # Number of layers
    vocab_size=50277,       # Vocabulary size
    ssm_cfg=dict(
        layer="Mamba2",     # Use Mamba-2
        d_state=128,        # Larger state for Mamba-2
        headdim=64,         # Head dimension
        ngroups=1           # Number of groups
    )
)
model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)

# Generate text
input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long)
output = model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7,
    top_p=0.9
)
```

Workflow 2: Use pretrained Mamba models

Load from HuggingFace:
```python
import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Load pretrained model
model_name = "state-spaces/mamba-2.8b"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # Use compatible tokenizer
model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16)

# Generate
prompt = "The future of AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output_ids = model.generate(
    input_ids=input_ids,
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2
)
generated_text = tokenizer.decode(output_ids[0])
print(generated_text)
```

**Available models**:
- `state-spaces/mamba-130m`
- `state-spaces/mamba-370m`
- `state-spaces/mamba-790m`
- `state-spaces/mamba-1.4b`
- `state-spaces/mamba-2.8b`

Workflow 3: Mamba-1 vs Mamba-2

Mamba-1 (smaller state):
```python
from mamba_ssm import Mamba

model = Mamba(
    d_model=256,
    d_state=16,      # Smaller state dimension
    d_conv=4,
    expand=2
).to("cuda")
```
Mamba-2 (multi-head, larger state):
```python
from mamba_ssm import Mamba2

model = Mamba2(
    d_model=256,
    d_state=128,     # Larger state dimension
    d_conv=4,
    expand=2,
    headdim=64,      # Head dimension for multi-head
    ngroups=1        # Parallel groups
).to("cuda")
```
Key differences:
  • State size: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128)
  • Architecture: Mamba-2 has a multi-head structure
  • Normalization: Mamba-2 uses RMSNorm
  • Distributed: Mamba-2 supports tensor parallelism
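The state-size difference can be quantified with back-of-envelope arithmetic. This sketch assumes the usual Mamba layout: an inner width of expand * d_model, an SSM state of d_state elements per inner channel, plus the short conv1d buffer; counts are illustrative, not exact internals of either class:

```python
def ssm_state_elems(d_model, d_state, d_conv, expand):
    """Approximate per-layer recurrent state size (elements) for one sequence."""
    d_inner = expand * d_model   # expanded inner width
    ssm = d_inner * d_state      # SSM hidden state h
    conv = d_inner * d_conv      # rolling conv1d buffer
    return ssm + conv

mamba1 = ssm_state_elems(d_model=256, d_state=16, d_conv=4, expand=2)
mamba2 = ssm_state_elems(d_model=256, d_state=128, d_conv=4, expand=2)
print(mamba1, mamba2)  # → 10240 67584
```

So at these settings Mamba-2 carries roughly 6-7× more recurrent state per layer, which is what buys its extra capacity; the cost is still constant per token.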

Workflow 4: Benchmark vs Transformers

Generation speed comparison:
```bash
# Benchmark Mamba
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "state-spaces/mamba-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

# Benchmark Transformer
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "EleutherAI/pythia-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
```

**Expected results**:
- **Mamba**: 5× faster inference
- **Memory**: No KV cache needed
- **Scaling**: Linear with sequence length
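The "no KV cache" point is easy to make concrete. A Transformer stores keys and values for every past token, so its cache grows with context length; a Mamba layer carries only a fixed-size state. The shapes below are rough stand-ins for ~2.8B-class models (32 layers / d_model 2560 for the Transformer, 64 thinner layers for Mamba), not exact checkpoint configs:

```python
def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per=2):
    """FP16 KV cache: 2 tensors (K and V) of shape (seq_len, d_model) per layer."""
    return 2 * n_layers * d_model * seq_len * bytes_per

def mamba_state_bytes(n_layers, d_model, d_state=128, d_conv=4, expand=2, bytes_per=2):
    """Fixed recurrent state per layer, independent of sequence length."""
    d_inner = expand * d_model
    return n_layers * d_inner * (d_state + d_conv) * bytes_per

seq = 100_000
kv = kv_cache_bytes(n_layers=32, d_model=2560, seq_len=seq)
ssm = mamba_state_bytes(n_layers=64, d_model=2560)
print(f"KV cache at {seq} tokens: {kv/1e9:.1f} GB; Mamba state: {ssm/1e6:.0f} MB")
```

Under these assumptions, at 100K tokens the Transformer cache is tens of GB while the Mamba state stays under 100 MB, and the gap widens linearly with context.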

When to use vs alternatives

Use Mamba when:
  • Need long sequences (100K+ tokens)
  • Want faster inference than Transformers
  • Memory-constrained (no KV cache)
  • Building streaming applications
  • Linear scaling important
Advantages:
  • O(n) complexity: Linear vs quadratic
  • 5× faster inference: No attention overhead
  • No KV cache: Lower memory usage
  • Million-token sequences: Hardware-efficient
  • Streaming: Constant memory per token
Use alternatives instead:
  • Transformers: Need best-in-class performance, have compute
  • RWKV: Want RNN+Transformer hybrid
  • RetNet: Need retention-based architecture
  • Hyena: Want convolution-based approach
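The streaming bullet deserves a sketch: because the state carries forward, each incoming token is an O(1) update, so an arbitrarily long stream needs no growing buffer. A toy stand-in in pure Python (an exponential moving average in place of the real selective-SSM update; the class name is invented for illustration):

```python
class StreamingSSM:
    """Toy recurrent stream: fixed-size state, one O(1) update per incoming token."""

    def __init__(self, dim, decay=0.9):
        self.decay = decay
        self.state = [0.0] * dim         # memory footprint never grows

    def step(self, token_vec):
        """Fold one token into the state and emit the per-token output."""
        self.state = [self.decay * h + (1 - self.decay) * x
                      for h, x in zip(self.state, token_vec)]
        return list(self.state)

stream = StreamingSSM(dim=4)
for _ in range(1000):                    # arbitrarily long stream...
    out = stream.step([1.0, 0.0, -1.0, 0.5])
assert len(stream.state) == 4            # ...same state size throughout
```

A Transformer serving the same stream would either re-attend over a growing window or truncate it; the recurrent formulation sidesteps both.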

Common issues

Issue: CUDA out of memory
Reduce batch size or use gradient checkpointing:
```python
model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
model.gradient_checkpointing_enable()  # Enable checkpointing
```
Issue: Slow installation
Install binary wheels (not source):
```bash
pip install mamba-ssm --no-build-isolation
```
Issue: Missing causal-conv1d
Install separately:
```bash
pip install "causal-conv1d>=1.4.0"
```
Issue: Model not loading from HuggingFace
Use `MambaLMHeadModel.from_pretrained` (not `AutoModel`):
```python
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")
```

Advanced topics

Selective SSM: See references/selective-ssm.md for mathematical formulation, state-space equations, and how selectivity enables O(n) complexity.
Mamba-2 architecture: See references/mamba2-details.md for multi-head structure, tensor parallelism, and distributed training setup.
Performance optimization: See references/performance.md for hardware-aware design, CUDA kernels, and memory efficiency techniques.

Hardware requirements

  • GPU: NVIDIA with CUDA 11.6+
  • VRAM:
    • 130M model: 2GB
    • 370M model: 4GB
    • 790M model: 8GB
    • 1.4B model: 14GB
    • 2.8B model: 28GB (FP16)
  • Inference: 5× faster than Transformers
  • Memory: No KV cache (lower than Transformers)
Performance (vs Transformers):
  • Speed: 5× faster inference
  • Memory: 50% less (no KV cache)
  • Scaling: Linear vs quadratic

Resources
