turboquant-pytorch
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseTurboQuant PyTorch
TurboQuant PyTorch
Skill by ara.so — Daily 2026 Skills collection.
From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for compressing LLM KV caches. Achieves 5x compression at 3-bit with 99.5% attention fidelity via two-stage vector quantization.
由ara.so提供的技能 — 2026每日技能合集。
从零开始基于PyTorch实现Google的TurboQuant(ICLR 2026),用于压缩LLM的KV缓存。通过两阶段向量量化,在3比特精度下实现5倍压缩率,同时保持99.5%的注意力保真度。
What It Does
功能介绍
TurboQuant compresses LLM key-value caches to 2–4 bits per coordinate:
- Stage 1: Random orthogonal rotation + Lloyd-Max scalar quantization (MSE-optimal)
- Stage 2: QJL residual correction — 1-bit sign projection that makes inner product estimates unbiased
Result: attention scores remain accurate even when individual vectors look quite different from originals. The algorithm preserves inner products, not vector fidelity.
Compression ratios at 8K context on Qwen2.5-3B (289 MB FP16 baseline):
- 4-bit → 76 MB (3.8x)
- 3-bit → 58 MB (5.0x) ← practical sweet spot
- 2-bit → 40 MB (7.3x)
TurboQuant可将LLM的键值(KV)缓存压缩至每坐标2–4比特:
- 第一阶段:随机正交旋转 + Lloyd-Max标量量化(MSE最优)
- 第二阶段:QJL残差校正——1比特符号投影,使内积估计无偏
效果:即使单个向量与原始向量差异较大,注意力分数仍能保持准确。该算法保留的是内积,而非向量保真度。
在Qwen2.5-3B模型8K上下文长度下的压缩比(FP16基线为289 MB):
- 4比特 → 76 MB(3.8倍)
- 3比特 → 58 MB(5.0倍)← 实用最优选择
- 2比特 → 40 MB(7.3倍)
Installation
安装方法
bash
git clone https://github.com/tonbistudio/turboquant-pytorch
cd turboquant-pytorch
pip install -r requirements.txtbash
git clone https://github.com/tonbistudio/turboquant-pytorch
cd turboquant-pytorch
pip install -r requirements.txtFor CUDA PyTorch:
针对CUDA版本的PyTorch:
pip install torch --index-url https://download.pytorch.org/whl/cu128
**requirements.txt includes:**
- `torch>=2.0`
- `scipy` (Lloyd-Max codebook computation)
- `transformers`, `accelerate`, `bitsandbytes` (only for real model validation)pip install torch --index-url https://download.pytorch.org/whl/cu128
**requirements.txt包含:**
- `torch>=2.0`
- `scipy`(用于Lloyd-Max码本计算)
- `transformers`, `accelerate`, `bitsandbytes`(仅用于真实模型验证)Project Structure
项目结构
turboquant/
__init__.py # Package exports
lloyd_max.py # Lloyd-Max optimal scalar quantizer
turboquant.py # Core: TurboQuantMSE, TurboQuantProd, TurboQuantKVCache
compressors.py # Production compressors for real model tensors
test_turboquant.py # Synthetic validation tests
validate.py # Real model (Qwen2.5-3B) validationturboquant/
__init__.py # 包导出文件
lloyd_max.py # Lloyd-Max最优标量量化器
turboquant.py # 核心模块:TurboQuantMSE, TurboQuantProd, TurboQuantKVCache
compressors.py # 针对真实模型张量的生产级压缩器
test_turboquant.py # 合成验证测试
validate.py # 真实模型(Qwen2.5-3B)验证Key Commands
核心命令
bash
undefinedbash
undefinedRun synthetic algorithm validation (no GPU required, but GPU enables speed benchmark)
运行合成算法验证(无需GPU,GPU可加速基准测试)
python -m turboquant.test_turboquant
python -m turboquant.test_turboquant
Run real model validation on Qwen2.5-3B-Instruct
在Qwen2.5-3B-Instruct上运行真实模型验证
Requires CUDA GPU with ≥6GB VRAM; downloads ~2GB model on first run
需要≥6GB显存的CUDA GPU;首次运行会下载约2GB的模型
python -m turboquant.validate
undefinedpython -m turboquant.validate
undefinedCore API
核心API
Lloyd-Max Codebook
Lloyd-Max码本
python
from turboquant.lloyd_max import build_lloyd_max_codebookpython
from turboquant.lloyd_max import build_lloyd_max_codebookBuild optimal scalar quantizer codebook for d-dimensional rotated unit vectors
为d维旋转单位向量构建最优标量量化器码本
Returns (boundaries, centroids) for the given bit-width
返回给定比特宽度对应的(边界值,质心)
boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)
undefinedboundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)
undefinedStage 1: MSE Quantization (TurboQuantMSE)
第一阶段:MSE量化(TurboQuantMSE)
python
from turboquant.turboquant import TurboQuantMSEpython
from turboquant.turboquant import TurboQuantMSEInitialize for head_dim=128, 3-bit quantization
针对head_dim=128、3比特量化进行初始化
tq_mse = TurboQuantMSE(dim=128, bits=3)
tq_mse = TurboQuantMSE(dim=128, bits=3)
Compress a batch of vectors: shape (batch, dim)
压缩一批向量:形状为(batch, dim)
keys = torch.randn(512, 128) # 512 key vectors
codes = tq_mse.quantize(keys) # integer codes, (512, 128)
reconstructed = tq_mse.dequantize(codes) # approximate keys, (512, 128)
undefinedkeys = torch.randn(512, 128) # 512个键向量
codes = tq_mse.quantize(keys) # 整数编码,形状(512, 128)
reconstructed = tq_mse.dequantize(codes) # 近似键向量,形状(512, 128)
undefinedStage 2: Unbiased Inner Product Estimation (TurboQuantProd)
第二阶段:无偏内积估计(TurboQuantProd)
python
from turboquant.turboquant import TurboQuantProdpython
from turboquant.turboquant import TurboQuantProdInitialize with QJL correction
启用QJL校正进行初始化
tq_prod = TurboQuantProd(dim=128, bits=3, proj_dim=64)
tq_prod = TurboQuantProd(dim=128, bits=3, proj_dim=64)
Compress key vectors (stores codes + QJL residual signs)
压缩键向量(存储编码+QJL残差符号)
compressed = tq_prod.compress(keys) # dict with 'codes', 'signs', 'residual_norms'
compressed = tq_prod.compress(keys) # 包含'codes'、'signs'、'residual_norms'的字典
Estimate inner products <query, key> for all keys — unbiased estimator
估计所有键与查询向量的内积<query, key>——无偏估计器
query = torch.randn(128)
scores = tq_prod.inner_product(query, compressed) # shape (512,)
undefinedquery = torch.randn(128)
scores = tq_prod.inner_product(query, compressed) # 形状(512,)
undefinedKV Cache Wrapper (TurboQuantKVCache)
KV缓存包装器(TurboQuantKVCache)
python
from turboquant.turboquant import TurboQuantKVCachepython
from turboquant.turboquant import TurboQuantKVCacheWrap a KV cache for a single attention head
为单个注意力头包装KV缓存
cache = TurboQuantKVCache(dim=128, bits=3, proj_dim=64)
cache = TurboQuantKVCache(dim=128, bits=3, proj_dim=64)
Add key/value vectors as tokens are generated
生成token时添加键/值向量
cache.append_key(new_key) # shape (dim,)
cache.append_value(new_val) # shape (dim,)
cache.append_key(new_key) # 形状(dim,)
cache.append_value(new_val) # 形状(dim,)
Compute attention scores for a query against all cached keys
计算查询向量与所有缓存键的注意力分数
query = torch.randn(128)
scores = cache.attention_scores(query) # shape (seq_len,), unbiased
query = torch.randn(128)
scores = cache.attention_scores(query) # 形状(seq_len,),无偏
Get values (MSE-reconstructed, used for weighted sum)
获取值(MSE重构,用于加权求和)
values = cache.get_values() # shape (seq_len, dim)
undefinedvalues = cache.get_values() # 形状(seq_len, dim)
undefinedProduction Compressors (for real model tensors)
生产级压缩器(针对真实模型张量)
python
from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSEpython
from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSEKey compressor — supports asymmetric attention score computation
键压缩器——支持非对称注意力分数计算
key_compressor = TurboQuantCompressorV2(dim=128, bits=3, proj_dim=64)
key_compressor = TurboQuantCompressorV2(dim=128, bits=3, proj_dim=64)
Compress all keys in a layer: shape (num_heads, seq_len, head_dim)
压缩某一层的所有键:形状(num_heads, seq_len, head_dim)
compressed_keys = key_compressor.compress(layer_keys)
compressed_keys = key_compressor.compress(layer_keys)
Compute attention scores directly from compressed keys (no decompress needed)
直接从压缩后的键计算注意力分数(无需解压缩)
query shape: (num_heads, head_dim)
query形状:(num_heads, head_dim)
scores = key_compressor.asymmetric_attention_scores(query, compressed_keys)
scores = key_compressor.asymmetric_attention_scores(query, compressed_keys)
scores shape: (num_heads, seq_len)
scores形状:(num_heads, seq_len)
Value compressor — MSE reconstruction (Stage 1 only, acceptable for values)
值压缩器——MSE重构(仅第一阶段,对值而言已足够)
val_compressor = TurboQuantCompressorMSE(dim=128, bits=3)
compressed_vals = val_compressor.compress(layer_values)
reconstructed_vals = val_compressor.decompress(compressed_vals)
undefinedval_compressor = TurboQuantCompressorMSE(dim=128, bits=3)
compressed_vals = val_compressor.compress(layer_values)
reconstructed_vals = val_compressor.decompress(compressed_vals)
undefinedCommon Patterns
常用模式
Pattern 1: Compress a Full Model's KV Cache
模式1:压缩完整模型的KV缓存
python
import torch
from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSE
def compress_kv_cache(kv_cache, head_dim=128, bits=3, proj_dim=64):
"""
kv_cache: list of (keys, values) per layer
keys/values shape: (num_heads, seq_len, head_dim)
Returns list of compressed (keys, values) per layer.
"""
key_comp = TurboQuantCompressorV2(dim=head_dim, bits=bits, proj_dim=proj_dim)
val_comp = TurboQuantCompressorMSE(dim=head_dim, bits=bits)
compressed = []
for layer_keys, layer_vals in kv_cache:
c_keys = key_comp.compress(layer_keys)
c_vals = val_comp.compress(layer_vals)
compressed.append((c_keys, c_vals))
return compressed, key_comp, val_comp
def run_attention_with_compressed_cache(query, compressed_keys, compressed_vals,
key_comp, val_comp):
"""
query: (num_heads, head_dim)
Returns: attention output (num_heads, head_dim)
"""
# Unbiased attention scores from compressed keys
scores = key_comp.asymmetric_attention_scores(query, compressed_keys)
# scores: (num_heads, seq_len)
attn_weights = torch.softmax(scores, dim=-1) # (num_heads, seq_len)
# Decompress values and compute weighted sum
values = val_comp.decompress(compressed_vals) # (num_heads, seq_len, head_dim)
output = torch.einsum('hs,hsd->hd', attn_weights, values)
return outputpython
import torch
from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSE
def compress_kv_cache(kv_cache, head_dim=128, bits=3, proj_dim=64):
"""
kv_cache:每层的(keys, values)列表
keys/values形状:(num_heads, seq_len, head_dim)
返回每层的压缩后(keys, values)列表。
"""
key_comp = TurboQuantCompressorV2(dim=head_dim, bits=bits, proj_dim=proj_dim)
val_comp = TurboQuantCompressorMSE(dim=head_dim, bits=bits)
compressed = []
for layer_keys, layer_vals in kv_cache:
c_keys = key_comp.compress(layer_keys)
c_vals = val_comp.compress(layer_vals)
compressed.append((c_keys, c_vals))
return compressed, key_comp, val_comp
def run_attention_with_compressed_cache(query, compressed_keys, compressed_vals,
key_comp, val_comp):
"""
query:(num_heads, head_dim)
返回:注意力输出(num_heads, head_dim)
"""
# 从压缩键获取无偏注意力分数
scores = key_comp.asymmetric_attention_scores(query, compressed_keys)
# scores:(num_heads, seq_len)
attn_weights = torch.softmax(scores, dim=-1) # (num_heads, seq_len)
# 解压缩值并计算加权和
values = val_comp.decompress(compressed_vals) # (num_heads, seq_len, head_dim)
output = torch.einsum('hs,hsd->hd', attn_weights, values)
return outputPattern 2: Validate Compression Quality
模式2:验证压缩质量
python
import torch
import torch.nn.functional as F
from turboquant.turboquant import TurboQuantProd
def measure_attention_fidelity(keys, queries, bits=3, proj_dim=64):
"""
Measure how well TurboQuant preserves attention distributions.
keys: (seq_len, head_dim)
queries: (num_queries, head_dim)
"""
dim = keys.shape[-1]
tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=proj_dim)
compressed = tq.compress(keys)
cosine_sims = []
top1_matches = []
for q in queries:
# True attention scores
true_scores = (keys @ q) # (seq_len,)
true_attn = torch.softmax(true_scores, dim=0)
# TurboQuant estimated scores
est_scores = tq.inner_product(q, compressed) # (seq_len,)
est_attn = torch.softmax(est_scores, dim=0)
# Cosine similarity of attention distributions
cos_sim = F.cosine_similarity(true_attn.unsqueeze(0),
est_attn.unsqueeze(0)).item()
cosine_sims.append(cos_sim)
# Top-1 match
top1_matches.append(true_attn.argmax() == est_attn.argmax())
return {
'mean_cosine_sim': sum(cosine_sims) / len(cosine_sims),
'top1_accuracy': sum(top1_matches) / len(top1_matches),
}python
import torch
import torch.nn.functional as F
from turboquant.turboquant import TurboQuantProd
def measure_attention_fidelity(keys, queries, bits=3, proj_dim=64):
"""
衡量TurboQuant对注意力分布的保留程度。
keys: (seq_len, head_dim)
queries: (num_queries, head_dim)
"""
dim = keys.shape[-1]
tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=proj_dim)
compressed = tq.compress(keys)
cosine_sims = []
top1_matches = []
for q in queries:
# 真实注意力分数
true_scores = (keys @ q) # (seq_len,)
true_attn = torch.softmax(true_scores, dim=0)
# TurboQuant估计的分数
est_scores = tq.inner_product(q, compressed) # (seq_len,)
est_attn = torch.softmax(est_scores, dim=0)
# 注意力分布的余弦相似度
cos_sim = F.cosine_similarity(true_attn.unsqueeze(0),
est_attn.unsqueeze(0)).item()
cosine_sims.append(cos_sim)
# Top-1匹配情况
top1_matches.append(true_attn.argmax() == est_attn.argmax())
return {
'mean_cosine_sim': sum(cosine_sims) / len(cosine_sims),
'top1_accuracy': sum(top1_matches) / len(top1_matches),
}Example usage
使用示例
keys = torch.randn(2048, 128)
keys = F.normalize(keys, dim=-1)
queries = torch.randn(100, 128)
queries = F.normalize(queries, dim=-1)
results = measure_attention_fidelity(keys, queries, bits=3)
print(f"Cosine similarity: {results['mean_cosine_sim']:.4f}")
print(f"Top-1 accuracy: {results['top1_accuracy']:.2%}")
undefinedkeys = torch.randn(2048, 128)
keys = F.normalize(keys, dim=-1)
queries = torch.randn(100, 128)
queries = F.normalize(queries, dim=-1)
results = measure_attention_fidelity(keys, queries, bits=3)
print(f"余弦相似度: {results['mean_cosine_sim']:.4f}")
print(f"Top-1准确率: {results['top1_accuracy']:.2%}")
undefinedPattern 3: Needle-in-Haystack Retrieval Test
模式3:大海捞针检索测试
python
import torch
import torch.nn.functional as F
from turboquant.turboquant import TurboQuantProd
def needle_in_haystack(seq_len=2048, dim=128, bits=3):
"""Test whether TurboQuant preserves nearest-neighbor ordering."""
tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=64)
# Build haystack of random unit vectors
haystack = F.normalize(torch.randn(seq_len, dim), dim=-1)
# Insert needle at random position
needle_idx = torch.randint(0, seq_len, (1,)).item()
query = F.normalize(torch.randn(dim), dim=0)
needle = query + 0.1 * torch.randn(dim) # Similar to query
needle = F.normalize(needle, dim=0)
haystack[needle_idx] = needle
# Compress
compressed = tq.compress(haystack)
# True nearest neighbor
true_scores = haystack @ query
true_best = true_scores.argmax().item()
# TurboQuant estimated nearest neighbor
est_scores = tq.inner_product(query, compressed)
est_best = est_scores.argmax().item()
return true_best == est_best, true_best, est_bestpython
import torch
import torch.nn.functional as F
from turboquant.turboquant import TurboQuantProd
def needle_in_haystack(seq_len=2048, dim=128, bits=3):
"""测试TurboQuant是否保留最近邻排序。"""
tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=64)
# 构建随机单位向量的"草堆"
haystack = F.normalize(torch.randn(seq_len, dim), dim=-1)
# 在随机位置插入"针"
needle_idx = torch.randint(0, seq_len, (1,)).item()
query = F.normalize(torch.randn(dim), dim=0)
needle = query + 0.1 * torch.randn(dim) # 与查询向量相似
needle = F.normalize(needle, dim=0)
haystack[needle_idx] = needle
# 压缩
compressed = tq.compress(haystack)
# 真实最近邻
true_scores = haystack @ query
true_best = true_scores.argmax().item()
# TurboQuant估计的最近邻
est_scores = tq.inner_product(query, compressed)
est_best = est_scores.argmax().item()
return true_best == est_best, true_best, est_bestRun multiple trials
多次运行测试
successes = sum(needle_in_haystack(seq_len=8192)[0] for _ in range(20))
print(f"Retrieval accuracy: {successes}/20")
undefinedsuccesses = sum(needle_in_haystack(seq_len=8192)[0] for _ in range(20))
print(f"检索准确率: {successes}/20")
undefinedPattern 4: Compute Memory Savings
模式4:计算内存节省
python
def estimate_memory_savings(num_layers, num_kv_heads, seq_len, head_dim,
bits, proj_dim=64):
"""
Estimate compressed KV cache size vs FP16 baseline.
"""
# FP16 baseline: 2 bytes per element
fp16_bytes = num_layers * 2 * num_kv_heads * seq_len * head_dim * 2
# Stage 1 codes: bits per element, packed into bytes
codes_bytes = (num_layers * 2 * num_kv_heads * seq_len * head_dim * bits) // 8
# Stage 2 signs (keys only): 1 bit per proj_dim element
signs_bytes = (num_layers * num_kv_heads * seq_len * proj_dim) // 8
# Residual norms: 1 float16 per vector (keys only)
norms_bytes = num_layers * num_kv_heads * seq_len * 2
total_compressed = codes_bytes + signs_bytes + norms_bytes
ratio = fp16_bytes / total_compressed
print(f"FP16 baseline: {fp16_bytes / 1e6:.1f} MB")
print(f"TurboQuant {bits}-bit: {total_compressed / 1e6:.1f} MB")
print(f"Compression ratio: {ratio:.1f}x")
return ratiopython
def estimate_memory_savings(num_layers, num_kv_heads, seq_len, head_dim,
bits, proj_dim=64):
"""
估算压缩后KV缓存与FP16基线的大小对比。
"""
# FP16基线:每个元素2字节
fp16_bytes = num_layers * 2 * num_kv_heads * seq_len * head_dim * 2
# 第一阶段编码:每个元素bits比特,打包为字节
codes_bytes = (num_layers * 2 * num_kv_heads * seq_len * head_dim * bits) // 8
# 第二阶段符号(仅键):每个proj_dim元素1比特
signs_bytes = (num_layers * num_kv_heads * seq_len * proj_dim) // 8
# 残差范数:每个向量1个float16(仅键)
norms_bytes = num_layers * num_kv_heads * seq_len * 2
total_compressed = codes_bytes + signs_bytes + norms_bytes
ratio = fp16_bytes / total_compressed
print(f"FP16基线: {fp16_bytes / 1e6:.1f} MB")
print(f"TurboQuant {bits}-bit: {total_compressed / 1e6:.1f} MB")
print(f"压缩比: {ratio:.1f}x")
return ratioQwen2.5-3B: 36 layers, 2 KV heads, head_dim=128
Qwen2.5-3B:36层,2个KV头,head_dim=128
estimate_memory_savings(
num_layers=36, num_kv_heads=2,
seq_len=8192, head_dim=128, bits=3
)
estimate_memory_savings(
num_layers=36, num_kv_heads=2,
seq_len=8192, head_dim=128, bits=3
)
FP16 baseline: 289.4 MB
FP16基线: 289.4 MB
TurboQuant 3-bit: 57.9 MB
TurboQuant 3-bit: 57.9 MB
Compression ratio: 5.0x
压缩比: 5.0x
undefinedundefinedAlgorithm Details
算法细节
Why Random Rotation?
为什么使用随机旋转?
Rotating by a random orthogonal matrix maps unit vectors to a space where each coordinate follows . This makes coordinates nearly independent with known distribution — enabling optimal per-coordinate scalar quantization (Lloyd-Max).
RN(0, 1/d)通过随机正交矩阵进行旋转,将单位向量映射到每个坐标服从分布的空间。这使得坐标几乎独立且分布已知——从而实现最优的逐坐标标量量化(Lloyd-Max)。
RN(0, 1/d)Why QJL for Keys but Not Values?
为什么键用QJL而值不用?
- Keys: Used in dot products with queries. Bias in inner product estimates directly corrupts attention weights. QJL correction is essential.
- Values: Used in weighted sums after softmax. Small per-vector MSE errors average out. Stage 1 MSE quantization is sufficient.
- 键:用于与查询向量进行点积。内积估计中的偏差会直接破坏注意力权重,QJL校正至关重要。
- 值:用于softmax后的加权求和。单个向量的微小MSE误差会被平均抵消,第一阶段的MSE量化已足够。
Choosing proj_dim
(QJL projection dimension)
proj_dim选择proj_dim
(QJL投影维度)
proj_dimHigher → lower variance in inner product estimates, but more memory:
proj_dimpython
undefinedproj_dimpython
undefinedRule of thumb: proj_dim = head_dim // 2 is a good default
经验法则:proj_dim = head_dim // 2是不错的默认值
head_dim=128 → proj_dim=64
head_dim=128 → proj_dim=64
head_dim=64 → proj_dim=32
head_dim=64 → proj_dim=32
head_dim=256 → proj_dim=128
head_dim=256 → proj_dim=128
undefinedundefinedBit-width Selection Guide
比特宽度选择指南
| Bits | Compression | Cosine Sim | Top-1 Match | Use Case |
|---|---|---|---|---|
| 4 | 3.8x | 0.999 | 87% | Quality-critical tasks |
| 3 | 5.0x | 0.995 | 82% | Recommended default |
| 2 | 7.3x | 0.988 | 66% | Extreme memory pressure |
| 比特数 | 压缩比 | 余弦相似度 | Top-1匹配率 | 使用场景 |
|---|---|---|---|---|
| 4 | 3.8x | 0.999 | 87% | 对质量要求极高的任务 |
| 3 | 5.0x | 0.995 | 82% | 推荐默认选项 |
| 2 | 7.3x | 0.988 | 66% | 内存极度紧张的场景 |
Troubleshooting
故障排除
scipybash
pip install scipyCUDA out of memory during :
validate.py- Requires ≥6GB VRAM for Qwen2.5-3B in 4-bit
- Reduce in the validation script or use a smaller model
seq_len
Inner product estimates have high variance:
- Increase (try
proj_diminstead ofhead_dim)head_dim // 2 - Check that input vectors are normalized before compressing
Codebook build is slow on first run:
- Lloyd-Max uses numerical integration (scipy) — this is expected
- Codebooks are precomputed once per combination; cache them:
(dim, bits)
python
import pickle构建码本时出现导入错误:
scipybash
pip install scipy运行时CUDA内存不足:
validate.py- 4比特精度下Qwen2.5-3B需要≥6GB显存
- 减小验证脚本中的或使用更小的模型
seq_len
内积估计方差过高:
- 增大(尝试使用
proj_dim而非head_dim)head_dim // 2 - 检查输入向量在压缩前是否已归一化
首次运行码本构建速度慢:
- Lloyd-Max使用数值积分(scipy)——这是正常现象
- 码本会针对每个组合预计算一次;可缓存码本:
(dim, bits)
python
import pickleSave codebook
保存码本
boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)
with open('codebook_128_3bit.pkl', 'wb') as f:
pickle.dump((boundaries, centroids), f)
boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)
with open('codebook_128_3bit.pkl', 'wb') as f:
pickle.dump((boundaries, centroids), f)
Load cached codebook
加载缓存的码本
with open('codebook_128_3bit.pkl', 'rb') as f:
boundaries, centroids = pickle.load(f)
**Attention fidelity lower than expected:**
- Ensure vectors are L2-normalized before compressing (`F.normalize(x, dim=-1)`)
- The compressors in `compressors.py` handle normalization internally; `TurboQuantProd` expects unit vectorswith open('codebook_128_3bit.pkl', 'rb') as f:
boundaries, centroids = pickle.load(f)
**注意力保真度低于预期:**
- 确保向量在压缩前已进行L2归一化(`F.normalize(x, dim=-1)`)
- `compressors.py`中的压缩器会自动处理归一化;`TurboQuantProd`要求输入为单位向量References
参考资料
- TurboQuant paper — ICLR 2026
- QJL paper — 1-bit residual correction technique
- PolarQuant — Related polar coordinate approach
- TurboQuant论文 — ICLR 2026
- QJL论文 — 1比特残差校正技术
- PolarQuant — 相关的极坐标方法 ",