mamba-architecture
Mamba - Selective State Space Models
Quick start
Mamba is a state-space model architecture achieving O(n) linear complexity for sequence modeling.
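The linear cost comes from the underlying recurrence: a fixed-size state is updated once per token. A sketch in standard SSM notation, as in the Mamba paper (with discretized parameters written with bars):

```latex
% Discretized state-space recurrence; a fixed-size state h_t means
% O(1) work per token and O(n) over the whole sequence.
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t
```

Mamba's selectivity makes $\bar{B}$, $C$, and the discretization step functions of the input $x_t$, while keeping this O(n) recurrence.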
Installation:
```bash
# Install causal-conv1d (optional, for efficiency)
pip install "causal-conv1d>=1.4.0"

# Install Mamba
pip install mamba-ssm

# Or both together
pip install "mamba-ssm[causal-conv1d]"
```
**Prerequisites**: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+
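A quick way to confirm the installed wheels match your environment; a minimal sketch (the `__version__` attribute is assumed and may vary by release):

```python
# Minimal install check: the imports fail fast on a CUDA/PyTorch mismatch.
import torch
import mamba_ssm

print(torch.__version__, torch.version.cuda)
print(getattr(mamba_ssm, "__version__", "unknown"))  # assumption: attribute exists in recent releases
```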
**Basic usage** (Mamba block):
```python
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")

model = Mamba(
    d_model=dim,  # Model dimension
    d_state=16,   # SSM state dimension
    d_conv=4,     # Conv1d kernel size
    expand=2,     # Expansion factor
).to("cuda")

y = model(x)  # O(n) complexity!
assert y.shape == x.shape
```
Common workflows
Workflow 1: Language model with Mamba-2
Complete LM with generation:
```python
import torch
from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Configure Mamba-2 LM
config = MambaConfig(
    d_model=1024,      # Hidden dimension
    n_layer=24,        # Number of layers
    vocab_size=50277,  # Vocabulary size
    ssm_cfg=dict(
        layer="Mamba2",  # Use Mamba-2
        d_state=128,     # Larger state for Mamba-2
        headdim=64,      # Head dimension
        ngroups=1,       # Number of groups
    ),
)
model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)

# Generate text
input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long)
output = model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7,
    top_p=0.9,
)
```
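As a quick sanity check on the block above, a short follow-up sketch (plain PyTorch; assumes `generate` returned the token tensor, as the decode step in Workflow 2 implies):

```python
# Continues from the block above.
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.0f}M")
print(output.shape)  # expected (1, 100): batch 1, max_length tokens including the prompt
```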
Workflow 2: Use pretrained Mamba models
Load from HuggingFace:
```python
import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Load pretrained model
model_name = "state-spaces/mamba-2.8b"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # Use compatible tokenizer
model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16)

# Generate
prompt = "The future of AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output_ids = model.generate(
    input_ids=input_ids,
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
generated_text = tokenizer.decode(output_ids[0])
print(generated_text)
```

**Available models**:
- `state-spaces/mamba-130m`
- `state-spaces/mamba-370m`
- `state-spaces/mamba-790m`
- `state-spaces/mamba-1.4b`
- `state-spaces/mamba-2.8b`

Workflow 3: Mamba-1 vs Mamba-2
Mamba-1 (smaller state):
```python
from mamba_ssm import Mamba

model = Mamba(
    d_model=256,
    d_state=16,  # Smaller state dimension
    d_conv=4,
    expand=2,
).to("cuda")
```

Mamba-2 (multi-head, larger state):
```python
from mamba_ssm import Mamba2

model = Mamba2(
    d_model=256,
    d_state=128,  # Larger state dimension
    d_conv=4,
    expand=2,
    headdim=64,  # Head dimension for multi-head
    ngroups=1,   # Parallel groups
).to("cuda")
```

Key differences:
- State size: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128)
- Architecture: Mamba-2 has a multi-head structure
- Normalization: Mamba-2 uses RMSNorm
- Distributed: Mamba-2 supports tensor parallelism
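To make the size difference concrete, a small comparison sketch using only the constructors shown above (exact counts vary by package version):

```python
import torch
from mamba_ssm import Mamba, Mamba2

def n_params(module: torch.nn.Module) -> int:
    """Total parameter count of a module."""
    return sum(p.numel() for p in module.parameters())

m1 = Mamba(d_model=256, d_state=16, d_conv=4, expand=2).to("cuda")
m2 = Mamba2(d_model=256, d_state=128, d_conv=4, expand=2, headdim=64, ngroups=1).to("cuda")
print(f"Mamba-1 block: {n_params(m1):,} parameters")
print(f"Mamba-2 block: {n_params(m2):,} parameters")
```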
Workflow 4: Benchmark vs Transformers
Generation speed comparison:
```bash
# Benchmark Mamba
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "state-spaces/mamba-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

# Benchmark Transformer
python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "EleutherAI/pythia-2.8b" \
    --prompt "The future of machine learning is" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
```
**Expected results**:
- **Mamba**: Up to 5× faster inference
- **Memory**: No KV cache needed
- **Scaling**: Linear with sequence length
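For a rough tokens-per-second number without the benchmark script, a minimal timing sketch (assumes the model, tokenizer, and `input_ids` from Workflow 2, and that `generate` returns the token tensor):

```python
import time
import torch

torch.cuda.synchronize()  # don't let queued GPU work skew the clock
start = time.time()
out = model.generate(input_ids=input_ids, max_length=200, temperature=0.7, top_p=0.9)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```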
When to use vs alternatives
Use Mamba when:
- Need long sequences (100K+ tokens)
- Want faster inference than Transformers
- Memory-constrained (no KV cache)
- Building streaming applications
- Linear scaling important
Advantages:
- O(n) complexity: Linear vs quadratic
- 5× faster inference: No attention overhead
- No KV cache: Lower memory usage
- Million-token sequences: Hardware-efficient
- Streaming: Constant memory per token (see the sketch below)
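For the streaming case, a hedged sketch of stepwise decoding with a single Mamba block. It assumes `mamba_ssm.utils.generation.InferenceParams`, which the package uses internally for cached inference; the exact API may vary by version:

```python
import torch
from mamba_ssm import Mamba
from mamba_ssm.utils.generation import InferenceParams  # assumed import path

model = Mamba(d_model=16, d_state=16, d_conv=4, expand=2).to("cuda")
params = InferenceParams(max_seqlen=1024, max_batch_size=1)

# Feed one token at a time; the recurrent state lives in `params`,
# so memory stays constant no matter how many steps run.
for _ in range(8):
    x_t = torch.randn(1, 1, 16, device="cuda")
    y_t = model(x_t, inference_params=params)
    params.seqlen_offset += 1  # advance the stream position
```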
Use alternatives instead:
- Transformers: Need best-in-class performance, have compute
- RWKV: Want RNN+Transformer hybrid
- RetNet: Need retention-based architecture
- Hyena: Want convolution-based approach
Common issues
Issue: CUDA out of memory
Reduce the batch size or enable gradient checkpointing:
```python
model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
model.gradient_checkpointing_enable()  # Enable checkpointing
```

Issue: Slow installation
Install prebuilt binary wheels instead of compiling from source:
```bash
pip install mamba-ssm --no-build-isolation
```

Issue: Missing causal-conv1d
Install it separately:
```bash
pip install "causal-conv1d>=1.4.0"
```

Issue: Model not loading from HuggingFace
Use `MambaLMHeadModel.from_pretrained` (not `AutoModel`):
```python
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")
```

Advanced topics
Selective SSM: See references/selective-ssm.md for mathematical formulation, state-space equations, and how selectivity enables O(n) complexity.
Mamba-2 architecture: See references/mamba2-details.md for multi-head structure, tensor parallelism, and distributed training setup.
Performance optimization: See references/performance.md for hardware-aware design, CUDA kernels, and memory efficiency techniques.
Hardware requirements
- GPU: NVIDIA with CUDA 11.6+
- VRAM:
  - 130M model: 2GB
  - 370M model: 4GB
  - 790M model: 8GB
  - 1.4B model: 14GB
  - 2.8B model: 28GB (FP16)
- Inference: 5× faster than Transformers
- Memory: No KV cache (lower than Transformers)
Performance (vs Transformers):
- Speed: 5× faster inference
- Memory: 50% less (no KV cache)
- Scaling: Linear vs quadratic
Resources
- Paper (Mamba-1): https://arxiv.org/abs/2312.00752 (Dec 2023)
- Paper (Mamba-2): https://arxiv.org/abs/2405.21060 (May 2024)
- GitHub: https://github.com/state-spaces/mamba ⭐ 13,000+
- Models: https://huggingface.co/state-spaces
- Docs: Repository README and wiki