turboquant-pytorch

TurboQuant PyTorch

Skill by ara.so — Daily 2026 Skills collection.
From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for compressing LLM KV caches. Achieves 5x compression at 3-bit with 99.5% attention fidelity via two-stage vector quantization.

What It Does

TurboQuant compresses LLM key-value caches to 2–4 bits per coordinate:
  • Stage 1: Random orthogonal rotation + Lloyd-Max scalar quantization (MSE-optimal)
  • Stage 2: QJL residual correction — 1-bit sign projection that makes inner product estimates unbiased
Result: attention scores remain accurate even when individual vectors look quite different from originals. The algorithm preserves inner products, not vector fidelity.
Compression ratios at 8K context on Qwen2.5-3B (289 MB FP16 baseline):
  • 4-bit → 76 MB (3.8x)
  • 3-bit → 58 MB (5.0x) ← practical sweet spot
  • 2-bit → 40 MB (7.3x)
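
To make "preserves inner products, not vector fidelity" concrete, here is a minimal sketch using the API documented below: per-vector reconstructions at 3 bits are visibly lossy, yet the estimated attention scores stay close.

```python
import torch
import torch.nn.functional as F
from turboquant.turboquant import TurboQuantMSE, TurboQuantProd

keys = F.normalize(torch.randn(512, 128), dim=-1)
query = F.normalize(torch.randn(128), dim=0)

# Stage 1 reconstruction error is noticeable at 3 bits per coordinate...
tq_mse = TurboQuantMSE(dim=128, bits=3)
recon = tq_mse.dequantize(tq_mse.quantize(keys))
print("reconstruction error:", (recon - keys).norm(dim=-1).mean().item())

# ...but inner products estimated from the compressed form remain accurate.
tq_prod = TurboQuantProd(dim=128, bits=3, proj_dim=64)
est = tq_prod.inner_product(query, tq_prod.compress(keys))
print("score error:", (est - keys @ query).abs().mean().item())
```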

Installation

```bash
git clone https://github.com/tonbistudio/turboquant-pytorch
cd turboquant-pytorch
pip install -r requirements.txt
```

For a CUDA build of PyTorch:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu128
```

**requirements.txt includes:**
- `torch>=2.0`
- `scipy` (Lloyd-Max codebook computation)
- `transformers`, `accelerate`, `bitsandbytes` (only for real model validation)

Project Structure

```
turboquant/
  __init__.py           # Package exports
  lloyd_max.py          # Lloyd-Max optimal scalar quantizer
  turboquant.py         # Core: TurboQuantMSE, TurboQuantProd, TurboQuantKVCache
  compressors.py        # Production compressors for real model tensors
  test_turboquant.py    # Synthetic validation tests
  validate.py           # Real model (Qwen2.5-3B) validation
```

Key Commands


Run synthetic algorithm validation (no GPU required; a GPU enables the speed benchmark):

```bash
python -m turboquant.test_turboquant
```

Run real model validation on Qwen2.5-3B-Instruct

Requires CUDA GPU with ≥6GB VRAM; downloads ~2GB model on first run

```bash
python -m turboquant.validate
```

Core API

Lloyd-Max Codebook

```python
from turboquant.lloyd_max import build_lloyd_max_codebook

# Build the optimal scalar quantizer codebook for d-dimensional rotated unit vectors.
# Returns (boundaries, centroids) for the given bit-width.
boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)
```
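
The codebook itself is just a pair of arrays; applying it is a bucketize-then-lookup. A minimal sketch, assuming `boundaries` and `centroids` come back as sorted 1-D arrays (convert with `torch.as_tensor` if they are NumPy):

```python
import torch
from turboquant.lloyd_max import build_lloyd_max_codebook

boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)
boundaries = torch.as_tensor(boundaries, dtype=torch.float32)  # 2**bits - 1 thresholds
centroids = torch.as_tensor(centroids, dtype=torch.float32)    # 2**bits levels

x = torch.randn(4, 128) / 128 ** 0.5        # roughly N(0, 1/d), as after rotation
codes = torch.bucketize(x, boundaries)      # integer codes in [0, 2**bits)
x_hat = centroids[codes]                    # quantized reconstruction
```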

Stage 1: MSE Quantization (TurboQuantMSE)

```python
import torch
from turboquant.turboquant import TurboQuantMSE

# Initialize for head_dim=128, 3-bit quantization
tq_mse = TurboQuantMSE(dim=128, bits=3)

# Compress a batch of vectors: shape (batch, dim)
keys = torch.randn(512, 128)              # 512 key vectors
codes = tq_mse.quantize(keys)             # integer codes, (512, 128)
reconstructed = tq_mse.dequantize(codes)  # approximate keys, (512, 128)
```

Stage 2: Unbiased Inner Product Estimation (TurboQuantProd)

```python
import torch
from turboquant.turboquant import TurboQuantProd

# Initialize with QJL correction
tq_prod = TurboQuantProd(dim=128, bits=3, proj_dim=64)

# Compress key vectors (stores codes + QJL residual signs)
keys = torch.randn(512, 128)         # e.g. the key batch from the Stage 1 example
compressed = tq_prod.compress(keys)  # dict with 'codes', 'signs', 'residual_norms'

# Estimate inner products <query, key> for all keys: an unbiased estimator
query = torch.randn(128)
scores = tq_prod.inner_product(query, compressed)  # shape (512,)
```
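
A quick empirical check of the "unbiased" claim (a sketch reusing the calls above): the signed error of the estimates should average out near zero, whereas a biased quantizer would drift in one direction.

```python
import torch
import torch.nn.functional as F
from turboquant.turboquant import TurboQuantProd

keys = F.normalize(torch.randn(512, 128), dim=-1)
tq_prod = TurboQuantProd(dim=128, bits=3, proj_dim=64)
compressed = tq_prod.compress(keys)

query = F.normalize(torch.randn(128), dim=0)
exact = keys @ query                             # (512,)
est = tq_prod.inner_product(query, compressed)   # (512,)

print(f"mean signed error: {(est - exact).mean():+.5f}")  # ~0 for an unbiased estimator
print(f"mean abs error:    {(est - exact).abs().mean():.5f}")
```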

KV Cache Wrapper (TurboQuantKVCache)

```python
import torch
from turboquant.turboquant import TurboQuantKVCache

# Wrap the KV cache of a single attention head
cache = TurboQuantKVCache(dim=128, bits=3, proj_dim=64)

# Add key/value vectors as tokens are generated
cache.append_key(new_key)    # shape (dim,)
cache.append_value(new_val)  # shape (dim,)

# Compute attention scores for a query against all cached keys
query = torch.randn(128)
scores = cache.attention_scores(query)  # shape (seq_len,), unbiased

# Get values (MSE-reconstructed, used for the weighted sum)
values = cache.get_values()  # shape (seq_len, dim)
```
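
Putting the calls together, a minimal single-head decode step might look like the sketch below, with random vectors standing in for the model's K/V projections:

```python
import torch
from turboquant.turboquant import TurboQuantKVCache

cache = TurboQuantKVCache(dim=128, bits=3, proj_dim=64)

# Simulate a short generation: one key/value pair per decoded token
for _ in range(16):
    cache.append_key(torch.randn(128))
    cache.append_value(torch.randn(128))

# One attention step against the compressed cache
query = torch.randn(128)
weights = torch.softmax(cache.attention_scores(query), dim=0)  # (seq_len,)
output = weights @ cache.get_values()                          # (dim,)
```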

Production Compressors (for real model tensors)

```python
from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSE

# Key compressor: supports asymmetric attention score computation
key_compressor = TurboQuantCompressorV2(dim=128, bits=3, proj_dim=64)

# Compress all keys in a layer: shape (num_heads, seq_len, head_dim)
compressed_keys = key_compressor.compress(layer_keys)

# Compute attention scores directly from compressed keys (no decompress needed).
# query shape: (num_heads, head_dim) -> scores shape: (num_heads, seq_len)
scores = key_compressor.asymmetric_attention_scores(query, compressed_keys)

# Value compressor: MSE reconstruction (Stage 1 only, acceptable for values)
val_compressor = TurboQuantCompressorMSE(dim=128, bits=3)
compressed_vals = val_compressor.compress(layer_values)
reconstructed_vals = val_compressor.decompress(compressed_vals)
```

Common Patterns

Pattern 1: Compress a Full Model's KV Cache

```python
import torch
from turboquant.compressors import TurboQuantCompressorV2, TurboQuantCompressorMSE

def compress_kv_cache(kv_cache, head_dim=128, bits=3, proj_dim=64):
    """
    kv_cache: list of (keys, values) per layer
              keys/values shape: (num_heads, seq_len, head_dim)
    Returns (compressed, key_comp, val_comp): per-layer compressed pairs plus
    the compressors, which are needed later for scoring and decompression.
    """
    key_comp = TurboQuantCompressorV2(dim=head_dim, bits=bits, proj_dim=proj_dim)
    val_comp = TurboQuantCompressorMSE(dim=head_dim, bits=bits)

    compressed = []
    for layer_keys, layer_vals in kv_cache:
        c_keys = key_comp.compress(layer_keys)
        c_vals = val_comp.compress(layer_vals)
        compressed.append((c_keys, c_vals))

    return compressed, key_comp, val_comp


def run_attention_with_compressed_cache(query, compressed_keys, compressed_vals,
                                        key_comp, val_comp):
    """
    query: (num_heads, head_dim)
    Returns: attention output (num_heads, head_dim)
    """
    # Unbiased attention scores from compressed keys
    scores = key_comp.asymmetric_attention_scores(query, compressed_keys)
    # scores: (num_heads, seq_len)

    attn_weights = torch.softmax(scores, dim=-1)  # (num_heads, seq_len)

    # Decompress values and compute weighted sum
    values = val_comp.decompress(compressed_vals)  # (num_heads, seq_len, head_dim)
    output = torch.einsum('hs,hsd->hd', attn_weights, values)
    return output
```
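
Continuing from the definitions above, a hypothetical run on synthetic tensors (shapes follow the docstrings; in real use `kv_cache` would come from the model's attention layers):

```python
num_layers, num_heads, seq_len, head_dim = 4, 2, 1024, 128

kv_cache = [
    (torch.randn(num_heads, seq_len, head_dim),   # keys
     torch.randn(num_heads, seq_len, head_dim))   # values
    for _ in range(num_layers)
]

compressed, key_comp, val_comp = compress_kv_cache(kv_cache)

query = torch.randn(num_heads, head_dim)
c_keys, c_vals = compressed[0]  # layer 0
out = run_attention_with_compressed_cache(query, c_keys, c_vals, key_comp, val_comp)
print(out.shape)  # torch.Size([2, 128])
```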

Pattern 2: Validate Compression Quality

```python
import torch
import torch.nn.functional as F
from turboquant.turboquant import TurboQuantProd

def measure_attention_fidelity(keys, queries, bits=3, proj_dim=64):
    """
    Measure how well TurboQuant preserves attention distributions.
    keys:    (seq_len, head_dim)
    queries: (num_queries, head_dim)
    """
    dim = keys.shape[-1]
    tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=proj_dim)

    compressed = tq.compress(keys)

    cosine_sims = []
    top1_matches = []

    for q in queries:
        # True attention scores
        true_scores = (keys @ q)  # (seq_len,)
        true_attn = torch.softmax(true_scores, dim=0)

        # TurboQuant estimated scores
        est_scores = tq.inner_product(q, compressed)  # (seq_len,)
        est_attn = torch.softmax(est_scores, dim=0)

        # Cosine similarity of attention distributions
        cos_sim = F.cosine_similarity(true_attn.unsqueeze(0),
                                       est_attn.unsqueeze(0)).item()
        cosine_sims.append(cos_sim)

        # Top-1 match
        top1_matches.append(true_attn.argmax() == est_attn.argmax())

    return {
        'mean_cosine_sim': sum(cosine_sims) / len(cosine_sims),
        'top1_accuracy': sum(top1_matches) / len(top1_matches),
    }
```

Example usage

```python
keys = F.normalize(torch.randn(2048, 128), dim=-1)
queries = F.normalize(torch.randn(100, 128), dim=-1)

results = measure_attention_fidelity(keys, queries, bits=3)
print(f"Cosine similarity: {results['mean_cosine_sim']:.4f}")
print(f"Top-1 accuracy: {results['top1_accuracy']:.2%}")
```

Pattern 3: Needle-in-Haystack Retrieval Test

```python
import torch
import torch.nn.functional as F
from turboquant.turboquant import TurboQuantProd

def needle_in_haystack(seq_len=2048, dim=128, bits=3):
    """Test whether TurboQuant preserves nearest-neighbor ordering."""
    tq = TurboQuantProd(dim=dim, bits=bits, proj_dim=64)

    # Build haystack of random unit vectors
    haystack = F.normalize(torch.randn(seq_len, dim), dim=-1)

    # Insert needle at random position
    needle_idx = torch.randint(0, seq_len, (1,)).item()
    query = F.normalize(torch.randn(dim), dim=0)
    needle = query + 0.1 * torch.randn(dim)  # Similar to query
    needle = F.normalize(needle, dim=0)
    haystack[needle_idx] = needle

    # Compress
    compressed = tq.compress(haystack)

    # True nearest neighbor
    true_scores = haystack @ query
    true_best = true_scores.argmax().item()

    # TurboQuant estimated nearest neighbor
    est_scores = tq.inner_product(query, compressed)
    est_best = est_scores.argmax().item()

    return true_best == est_best, true_best, est_best
```

Run multiple trials

```python
successes = sum(needle_in_haystack(seq_len=8192)[0] for _ in range(20))
print(f"Retrieval accuracy: {successes}/20")
```

Pattern 4: Compute Memory Savings

```python
def estimate_memory_savings(num_layers, num_kv_heads, seq_len, head_dim,
                             bits, proj_dim=64):
    """
    Estimate compressed KV cache size vs FP16 baseline.
    """
    # FP16 baseline: 2 bytes per element
    fp16_bytes = num_layers * 2 * num_kv_heads * seq_len * head_dim * 2

    # Stage 1 codes: bits per element, packed into bytes
    codes_bytes = (num_layers * 2 * num_kv_heads * seq_len * head_dim * bits) // 8

    # Stage 2 signs (keys only): 1 bit per proj_dim element
    signs_bytes = (num_layers * num_kv_heads * seq_len * proj_dim) // 8

    # Residual norms: 1 float16 per vector (keys only)
    norms_bytes = num_layers * num_kv_heads * seq_len * 2

    total_compressed = codes_bytes + signs_bytes + norms_bytes
    ratio = fp16_bytes / total_compressed

    print(f"FP16 baseline:      {fp16_bytes / 1e6:.1f} MB")
    print(f"TurboQuant {bits}-bit:  {total_compressed / 1e6:.1f} MB")
    print(f"Compression ratio:  {ratio:.1f}x")
    return ratio
```

Qwen2.5-3B: 36 layers, 2 KV heads, head_dim=128

```python
estimate_memory_savings(
    num_layers=36, num_kv_heads=2, seq_len=8192, head_dim=128, bits=3
)
```

Output:

```
FP16 baseline:      289.4 MB
TurboQuant 3-bit:   57.9 MB
Compression ratio:  5.0x
```

Algorithm Details

Why Random Rotation?

Rotating by a random orthogonal matrix `R` maps unit vectors to a space where each coordinate follows `N(0, 1/d)`. This makes the coordinates nearly independent with a known distribution, which enables optimal per-coordinate scalar quantization (Lloyd-Max).
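
A quick empirical check of that claim (a sketch; this uses the standard QR construction of a Haar-random orthogonal matrix, not necessarily the rotation this repo uses internally):

```python
import torch
import torch.nn.functional as F

d = 128
A = torch.randn(d, d)
Q, R = torch.linalg.qr(A)
Q = Q * torch.sign(torch.diagonal(R))   # sign fix for a Haar-uniform Q

x = F.normalize(torch.randn(10_000, d), dim=-1)  # random unit vectors
y = x @ Q.T                                      # rotated coordinates

print(y.mean().item(), y.std().item(), (1 / d) ** 0.5)  # std ≈ sqrt(1/d) ≈ 0.0884
```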

Why QJL for Keys but Not Values?

  • Keys: Used in dot products with queries. Bias in inner product estimates directly corrupts attention weights. QJL correction is essential.
  • Values: Used in weighted sums after softmax. Small per-vector MSE errors average out, as the sketch below illustrates. Stage 1 MSE quantization is sufficient.
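
A toy illustration of the second bullet, with synthetic noise standing in for quantization error: the error that leaks into the softmax-weighted output is scaled by roughly sqrt(sum(w_i^2)), far smaller than any single vector's error.

```python
import torch

seq_len, dim = 2048, 128
err = 0.05 * torch.randn(seq_len, dim)          # stand-in for per-vector MSE noise
weights = torch.softmax(torch.randn(seq_len), dim=0)

leaked = weights @ err                          # error reaching the attention output
print(err.norm(dim=-1).mean().item())           # per-vector error magnitude
print(leaked.norm().item())                     # far smaller after weighted averaging
```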

Choosing proj_dim (QJL projection dimension)

Higher `proj_dim` gives lower variance in inner product estimates, at the cost of more memory. Rule of thumb: `proj_dim = head_dim // 2` is a good default:
  • head_dim=128 → proj_dim=64
  • head_dim=64 → proj_dim=32
  • head_dim=256 → proj_dim=128
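
To pick `proj_dim` empirically on your own key distribution, a small sweep (a sketch reusing this repo's `TurboQuantProd`) shows the error spread shrinking as `proj_dim` grows:

```python
import torch
import torch.nn.functional as F
from turboquant.turboquant import TurboQuantProd

keys = F.normalize(torch.randn(1024, 128), dim=-1)
query = F.normalize(torch.randn(128), dim=0)
exact = keys @ query

for proj_dim in (32, 64, 128):
    tq = TurboQuantProd(dim=128, bits=3, proj_dim=proj_dim)
    est = tq.inner_product(query, tq.compress(keys))
    print(proj_dim, (est - exact).std().item())  # error spread should drop
```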

Bit-width Selection Guide

| Bits | Compression | Cosine Sim | Top-1 Match | Use Case |
|------|-------------|------------|-------------|----------|
| 4    | 3.8x        | 0.999      | 87%         | Quality-critical tasks |
| 3    | 5.0x        | 0.995      | 82%         | Recommended default |
| 2    | 7.3x        | 0.988      | 66%         | Extreme memory pressure |

Troubleshooting

**`scipy` import error when building codebooks:**

```bash
pip install scipy
```

**CUDA out of memory during `validate.py`:**
  • Requires ≥6GB VRAM for Qwen2.5-3B in 4-bit
  • Reduce `seq_len` in the validation script or use a smaller model

**Inner product estimates have high variance:**
  • Increase `proj_dim` (try `head_dim` instead of `head_dim // 2`)
  • Check that input vectors are normalized before compressing

**Codebook build is slow on first run:**
  • Lloyd-Max uses numerical integration (scipy); this is expected
  • Codebooks are precomputed once per `(dim, bits)` combination; cache them:

```python
import pickle
from turboquant.lloyd_max import build_lloyd_max_codebook

# Save codebook
boundaries, centroids = build_lloyd_max_codebook(dim=128, bits=3)
with open('codebook_128_3bit.pkl', 'wb') as f:
    pickle.dump((boundaries, centroids), f)

# Load cached codebook
with open('codebook_128_3bit.pkl', 'rb') as f:
    boundaries, centroids = pickle.load(f)
```

**Attention fidelity lower than expected:**
- Ensure vectors are L2-normalized before compressing (`F.normalize(x, dim=-1)`)
- The compressors in `compressors.py` handle normalization internally; `TurboQuantProd` expects unit vectors


References
