mamba-architecture

Mamba - Selective State Space Models

Quick start

Mamba is a state-space model architecture that achieves O(n) (linear-time) complexity for sequence modeling.
Installation:

```bash
# Install causal-conv1d (optional, for efficiency)
pip install "causal-conv1d>=1.4.0"

# Install Mamba
pip install mamba-ssm

# Or both together
pip install "mamba-ssm[causal-conv1d]"
```

**Prerequisites**: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+

**Basic usage** (Mamba block):
```python
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")

model = Mamba(
    d_model=dim,      # Model dimension
    d_state=16,       # SSM state dimension
    d_conv=4,         # Conv1d kernel size
    expand=2          # Expansion factor
).to("cuda")

y = model(x)  # O(n) complexity!
assert y.shape == x.shape
```
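The O(n) claim comes from the recurrent view: the block carries a fixed-size hidden state and performs one update per token, rather than attending over all previous tokens. A minimal NumPy sketch of a plain diagonal SSM scan (illustrative only: no input-dependent selectivity, and nothing like the library's fused CUDA kernels):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Naive diagonal state-space scan: h_t = A*h_{t-1} + B*x_t, y_t = sum(C*h_t).

    x: (length, dim) input sequence
    A, B, C: (dim, d_state) per-channel dynamics, input, and output maps.
    Runs in O(length) time with a state whose size is independent of length.
    """
    length, dim = x.shape
    d_state = A.shape[1]
    h = np.zeros((dim, d_state))       # fixed-size state, never grows with length
    y = np.empty((length, dim))
    for t in range(length):            # one O(1)-state update per token
        h = A * h + B * x[t][:, None]  # elementwise recurrence (diagonal A)
        y[t] = (C * h).sum(axis=1)     # per-channel readout
    return y

rng = np.random.default_rng(0)
length, dim, d_state = 64, 16, 8
x = rng.standard_normal((length, dim))
A = np.full((dim, d_state), 0.9)       # stable decay
B = rng.standard_normal((dim, d_state)) * 0.1
C = rng.standard_normal((dim, d_state)) * 0.1
y = ssm_scan(x, A, B, C)
assert y.shape == x.shape
```

Mamba additionally makes A, B, and C functions of the input (the "selective" part), but the per-token cost stays constant.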

Common workflows

Workflow 1: Language model with Mamba-2

Complete LM with generation:
```python
import torch
from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Configure Mamba-2 LM
config = MambaConfig(
    d_model=1024,           # Hidden dimension
    n_layer=24,             # Number of layers
    vocab_size=50277,       # Vocabulary size
    ssm_cfg=dict(
        layer="Mamba2",     # Use Mamba-2
        d_state=128,        # Larger state for Mamba-2
        headdim=64,         # Head dimension
        ngroups=1           # Number of groups
    )
)
model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)

# Generate text
input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long)
output = model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7,
    top_p=0.9
)
```

Workflow 2: Use pretrained Mamba models

Load from HuggingFace:
```python
import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Load pretrained model
model_name = "state-spaces/mamba-2.8b"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # Use compatible tokenizer
model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16)

# Generate
prompt = "The future of AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output_ids = model.generate(
    input_ids=input_ids,
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2
)
generated_text = tokenizer.decode(output_ids[0])
print(generated_text)
```

**Available models**:
- `state-spaces/mamba-130m`
- `state-spaces/mamba-370m`
- `state-spaces/mamba-790m`
- `state-spaces/mamba-1.4b`
- `state-spaces/mamba-2.8b`

Workflow 3: Mamba-1 vs Mamba-2

Mamba-1 (smaller state):
```python
from mamba_ssm import Mamba

model = Mamba(
    d_model=256,
    d_state=16,      # Smaller state dimension
    d_conv=4,
    expand=2
).to("cuda")
```
Mamba-2 (multi-head, larger state):
```python
from mamba_ssm import Mamba2

model = Mamba2(
    d_model=256,
    d_state=128,     # Larger state dimension
    d_conv=4,
    expand=2,
    headdim=64,      # Head dimension for multi-head
    ngroups=1        # Parallel groups
).to("cuda")
```
Key differences:
  • State size: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128)
  • Architecture: Mamba-2 has a multi-head structure
  • Normalization: Mamba-2 uses RMSNorm
  • Distributed: Mamba-2 supports tensor parallelism
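The state-size difference can be quantified with back-of-envelope arithmetic. This sketch assumes the usual Mamba layout: an inner width of expand * d_model, an SSM state of d_state elements per inner channel, plus the short conv1d buffer; counts are illustrative, not exact internals of either class:

```python
def ssm_state_elems(d_model, d_state, d_conv, expand):
    """Approximate per-layer recurrent state size (elements) for one sequence."""
    d_inner = expand * d_model   # expanded inner width
    ssm = d_inner * d_state      # SSM hidden state h
    conv = d_inner * d_conv      # rolling conv1d buffer
    return ssm + conv

mamba1 = ssm_state_elems(d_model=256, d_state=16, d_conv=4, expand=2)
mamba2 = ssm_state_elems(d_model=256, d_state=128, d_conv=4, expand=2)
print(mamba1, mamba2)  # → 10240 67584
```

So at these settings Mamba-2 carries roughly 6-7× more recurrent state per layer, which is what buys its extra capacity; the cost is still constant per token.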

Workflow 4: Benchmark vs Transformers

Generation speed comparison:
```bash
# Benchmark Mamba
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "state-spaces/mamba-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

# Benchmark Transformer
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "EleutherAI/pythia-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
```

**Expected results**:
- **Mamba**: 5× faster inference
- **Memory**: No KV cache needed
- **Scaling**: Linear with sequence length
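The "no KV cache" point is easy to make concrete. A Transformer stores keys and values for every past token, so its cache grows with context length; a Mamba layer carries only a fixed-size state. The shapes below are rough stand-ins for ~2.8B-class models (32 layers / d_model 2560 for the Transformer, 64 thinner layers for Mamba), not exact checkpoint configs:

```python
def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per=2):
    """FP16 KV cache: 2 tensors (K and V) of shape (seq_len, d_model) per layer."""
    return 2 * n_layers * d_model * seq_len * bytes_per

def mamba_state_bytes(n_layers, d_model, d_state=128, d_conv=4, expand=2, bytes_per=2):
    """Fixed recurrent state per layer, independent of sequence length."""
    d_inner = expand * d_model
    return n_layers * d_inner * (d_state + d_conv) * bytes_per

seq = 100_000
kv = kv_cache_bytes(n_layers=32, d_model=2560, seq_len=seq)
ssm = mamba_state_bytes(n_layers=64, d_model=2560)
print(f"KV cache at {seq} tokens: {kv/1e9:.1f} GB; Mamba state: {ssm/1e6:.0f} MB")
```

Under these assumptions, at 100K tokens the Transformer cache is tens of GB while the Mamba state stays under 100 MB, and the gap widens linearly with context.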

When to use vs alternatives

Use Mamba when:
  • Need long sequences (100K+ tokens)
  • Want faster inference than Transformers
  • Memory-constrained (no KV cache)
  • Building streaming applications
  • Linear scaling important
Advantages:
  • O(n) complexity: Linear vs quadratic
  • 5× faster inference: No attention overhead
  • No KV cache: Lower memory usage
  • Million-token sequences: Hardware-efficient
  • Streaming: Constant memory per token
Use alternatives instead:
  • Transformers: Need best-in-class performance, have compute
  • RWKV: Want RNN+Transformer hybrid
  • RetNet: Need retention-based architecture
  • Hyena: Want convolution-based approach
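The streaming bullet deserves a sketch: because the state carries forward, each incoming token is an O(1) update, so an arbitrarily long stream needs no growing buffer. A toy stand-in in pure Python (an exponential moving average in place of the real selective-SSM update; the class name is invented for illustration):

```python
class StreamingSSM:
    """Toy recurrent stream: fixed-size state, one O(1) update per incoming token."""

    def __init__(self, dim, decay=0.9):
        self.decay = decay
        self.state = [0.0] * dim         # memory footprint never grows

    def step(self, token_vec):
        """Fold one token into the state and emit the per-token output."""
        self.state = [self.decay * h + (1 - self.decay) * x
                      for h, x in zip(self.state, token_vec)]
        return list(self.state)

stream = StreamingSSM(dim=4)
for _ in range(1000):                    # arbitrarily long stream...
    out = stream.step([1.0, 0.0, -1.0, 0.5])
assert len(stream.state) == 4            # ...same state size throughout
```

A Transformer serving the same stream would either re-attend over a growing window or truncate it; the recurrent formulation sidesteps both.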

Common issues

Issue: CUDA out of memory
Reduce batch size or use gradient checkpointing:
```python
model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
model.gradient_checkpointing_enable()  # Enable checkpointing
```
Issue: Slow installation
Install binary wheels (not source):
```bash
pip install mamba-ssm --no-build-isolation
```
Issue: Missing causal-conv1d
Install separately:
```bash
pip install "causal-conv1d>=1.4.0"
```
Issue: Model not loading from HuggingFace
Use `MambaLMHeadModel.from_pretrained` (not `AutoModel`):
```python
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")
```

Advanced topics

Selective SSM: See references/selective-ssm.md for mathematical formulation, state-space equations, and how selectivity enables O(n) complexity.
Mamba-2 architecture: See references/mamba2-details.md for multi-head structure, tensor parallelism, and distributed training setup.
Performance optimization: See references/performance.md for hardware-aware design, CUDA kernels, and memory efficiency techniques.

Hardware requirements

  • GPU: NVIDIA with CUDA 11.6+
  • VRAM:
    • 130M model: 2GB
    • 370M model: 4GB
    • 790M model: 8GB
    • 1.4B model: 14GB
    • 2.8B model: 28GB (FP16)
  • Inference: 5× faster than Transformers
  • Memory: No KV cache (lower than Transformers)
Performance (vs Transformers):
  • Speed: 5× faster inference
  • Memory: 50% less (no KV cache)
  • Scaling: Linear vs quadratic

Resources
