Speculative Decoding: Accelerating LLM Inference


When to Use This Skill


Use Speculative Decoding when you need to:
  • Speed up inference by 1.5-3.6× without quality loss
  • Reduce latency for real-time applications (chatbots, code generation)
  • Optimize throughput for high-volume serving
  • Deploy efficiently on limited hardware
  • Generate faster without changing model architecture
Key Techniques: Draft model speculative decoding, Medusa (multiple heads), Lookahead Decoding (Jacobi iteration)
Papers: Medusa (arXiv 2401.10774), Lookahead Decoding (ICML 2024), Speculative Decoding Survey (ACL 2024)

Installation


```bash
# Standard speculative decoding (transformers)
pip install transformers accelerate

# Medusa (multiple decoding heads)
git clone https://github.com/FasterDecoding/Medusa
cd Medusa
pip install -e .

# Lookahead Decoding
git clone https://github.com/hao-ai-lab/LookaheadDecoding
cd LookaheadDecoding
pip install -e .

# Optional: vLLM with speculative decoding
pip install vllm
```

Quick Start


Basic Speculative Decoding (Draft Model)


```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load target model (large, slow)
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
)

# Load draft model (small, fast)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

# Generate with speculative decoding
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Transformers 4.36+ supports assisted generation
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # Enable speculative decoding
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

Medusa (Multiple Decoding Heads)


```python
import torch
from transformers import AutoTokenizer
from medusa.model.medusa_model import MedusaModel

# Load Medusa-enhanced model
model = MedusaModel.from_pretrained(
    "FasterDecoding/medusa-vicuna-7b-v1.3",  # Pre-trained with Medusa heads
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("FasterDecoding/medusa-vicuna-7b-v1.3")

# Generate with Medusa (2-3× speedup)
prompt = "Write a Python function to calculate fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.medusa_generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    posterior_threshold=0.09,  # Acceptance threshold
    posterior_alpha=0.3,       # Tree construction parameter
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Lookahead Decoding (Jacobi Iteration)


```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from lookahead.lookahead_decoding import LookaheadDecoding

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Initialize lookahead decoding
lookahead = LookaheadDecoding(
    model=model,
    tokenizer=tokenizer,
    window_size=15,  # Lookahead window (W)
    ngram_size=5,    # N-gram size (N)
    guess_size=5,    # Number of parallel guesses
)

# Generate (1.5-2.3× speedup)
prompt = "Implement quicksort in Python:"
output = lookahead.generate(prompt, max_new_tokens=256)
print(output)
```

Core Concepts


1. Speculative Decoding (Draft Model)


Idea: Use a small draft model to generate candidate tokens, then have the large target model verify them all in parallel.
Algorithm:
  1. Draft model generates K tokens speculatively
  2. Target model evaluates all K tokens in parallel (single forward pass)
  3. Accept tokens where draft and target agree
  4. Reject first disagreement, continue from there
```python
import torch
import torch.nn.functional as F

def speculative_decode(target_model, draft_model, input_ids, K=4):
    """One round of speculative decoding (simplified sketch)."""
    # 1. Draft model generates K candidate tokens (keeping their distributions)
    draft = draft_model.generate(
        input_ids,
        max_new_tokens=K,
        do_sample=True,
        output_scores=True,
        return_dict_in_generate=True,
    )
    draft_tokens = draft.sequences[0, -K:]

    # 2. Target model scores the prompt + all K drafts in ONE forward pass
    target_logits = target_model(draft.sequences).logits[0, -K - 1 : -1]

    # 3. Accept token i with probability min(1, p_target / p_draft)
    accepted = []
    for i in range(K):
        p_draft = F.softmax(draft.scores[i][0], dim=-1)
        p_target = F.softmax(target_logits[i], dim=-1)
        tok = draft_tokens[i]
        if torch.rand(()).item() < min(1.0, (p_target[tok] / p_draft[tok]).item()):
            accepted.append(tok.item())
        else:
            break  # Reject: resample this position from the target distribution
    return accepted
```
Performance:
  • Speedup: 1.5-2× with good draft model
  • Zero quality loss (mathematically equivalent to target model)
  • Best when draft model is 5-10× smaller than target
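These numbers follow from the acceptance dynamics. In the standard speculative decoding analysis, if the target accepts each drafted token independently with rate α, the expected number of tokens produced per (expensive) target forward pass is the geometric sum (1 − α^(K+1)) / (1 − α). A quick sketch (the α values here are illustrative, not measured):

```python
def expected_tokens_per_pass(alpha: float, K: int) -> float:
    """Expected tokens per target forward pass when K tokens are drafted
    and each is accepted independently with probability alpha.
    Equals 1 + alpha + alpha^2 + ... + alpha^K."""
    return (1 - alpha ** (K + 1)) / (1 - alpha)

# Strong draft model (alpha ~ 0.8): drafting more tokens keeps paying off
print([round(expected_tokens_per_pass(0.8, K), 2) for K in (2, 4, 8)])
# → [2.44, 3.36, 4.33]

# Weak draft model (alpha ~ 0.4): gains saturate almost immediately
print([round(expected_tokens_per_pass(0.4, K), 2) for K in (2, 4, 8)])
# → [1.56, 1.65, 1.67]
```

This is why the 5-10× size rule matters: a draft too small (low α) wastes target passes on rejected tokens, while a draft too large erodes the speed advantage of drafting.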

2. Medusa (Multiple Decoding Heads)


Source: arXiv 2401.10774 (2024)
Innovation: Add multiple prediction heads to an existing model so it predicts future tokens itself, with no separate draft model.
Architecture:
Input → Base LLM (frozen) → Hidden State
                                ├→ Head 1 (predicts token t+1)
                                ├→ Head 2 (predicts token t+2)
                                ├→ Head 3 (predicts token t+3)
                                └→ Head 4 (predicts token t+4)
Training:
  • Medusa-1: Freeze base LLM, train only heads
    • 2.2× speedup, lossless
  • Medusa-2: Fine-tune base LLM + heads together
    • 2.3-3.6× speedup, better quality
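The heads themselves are just linear projections on the base model's final hidden state. A minimal sketch with toy dimensions (`hidden_size`, `vocab_size` below are placeholders, not a real model's config):

```python
import torch
import torch.nn as nn

hidden_size, vocab_size, num_heads = 64, 100, 4  # toy sizes for illustration

# Each Medusa head maps the last hidden state to a vocabulary distribution
heads = nn.ModuleList(
    nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(num_heads)
)

# Stand-in for the frozen base LLM's final hidden state at the last position
hidden_state = torch.randn(1, hidden_size)

# One pass over the heads yields greedy guesses for t+1 ... t+4 at once
guesses = [head(hidden_state).argmax(dim=-1).item() for head in heads]
print(guesses)  # four token ids, one per future position
```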
Tree-based Attention:
```
# Medusa constructs a tree of candidates
# Example: predict 2 steps ahead with top-2 per step
#
#              Root
#             /    \
#          T1a      T1b        (Step 1: 2 candidates)
#          / \      / \
#       T2a   T2b T2c  T2d     (Step 2: 4 candidates total)
#
# A single forward pass evaluates the entire tree!
```


**Advantages**:
- No separate draft model needed
- Minimal training (only heads)
- Compatible with any LLM


3. Lookahead Decoding (Jacobi Iteration)


Source: ICML 2024
Core idea: Reformulate autoregressive decoding as solving a system of equations, then solve it in parallel with Jacobi iteration.
Mathematical formulation:
```
Traditional:  y_t = f(x, y_1, ..., y_{t-1})                         (sequential)
Jacobi:       y_t^{(k+1)} = f(x, y_1^{(k)}, ..., y_{t-1}^{(k)})     (parallel)
```
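To see why the Jacobi view works, here is a toy deterministic "next-token" rule (purely illustrative — real LMs are neural, but the fixed-point argument is the same): updating every position in parallel reaches the sequential answer in at most T sweeps, because after sweep k the first k tokens are already exact.

```python
def f(x, prefix):
    """Toy deterministic next-token rule."""
    return (x + sum(prefix)) % 10

def sequential_decode(x, T):
    y = []
    for _ in range(T):
        y.append(f(x, y))  # classic autoregressive loop: T dependent steps
    return y

def jacobi_decode(x, T):
    y = [0] * T  # arbitrary initial guess for every position
    steps = 0
    while True:
        steps += 1
        y_new = [f(x, y[:t]) for t in range(T)]  # all positions in parallel
        if y_new == y:  # fixed point = the autoregressive solution
            return y, steps
        y = y_new

seq = sequential_decode(3, 6)
jac, steps = jacobi_decode(3, 6)
print(seq, jac == seq, steps)  # identical output; converged within T+1 sweeps
```

Lookahead decoding accelerates this further by caching the n-grams that appear along the Jacobi trajectories and verifying them against the model.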
Two branches:
  1. Lookahead Branch: Generate n-grams in parallel
    • Window size W: How many steps to look ahead
    • N-gram size N: How many past tokens to use
  2. Verification Branch: Verify promising n-grams
    • Match n-grams with generated tokens
    • Accept if first token matches
python
class LookaheadDecoding:
    def __init__(self, model, window_size=15, ngram_size=5):
        self.model = model
        self.W = window_size  # Lookahead window
        self.N = ngram_size   # N-gram size

    def generate_step(self, tokens):
        # Lookahead branch: Generate W × N candidates
        candidates = {}
        for w in range(1, self.W + 1):
            for n in range(1, self.N + 1):
                # Generate n-gram starting at position w
                ngram = self.generate_ngram(tokens, start=w, length=n)
                candidates[(w, n)] = ngram

        # Verification branch: Find matching n-grams
        verified = []
        for ngram in candidates.values():
            if ngram[0] == tokens[-1]:  # First token matches last input
                if self.verify(tokens, ngram):
                    verified.append(ngram)

        # Accept longest verified n-gram
        return max(verified, key=len) if verified else [self.model.generate_next(tokens)]
Performance:
  • Speedup: 1.5-2.3× (up to 3.6× for code generation)
  • No draft model or training needed
  • Works out-of-the-box with any model
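The n-gram pool behind the verification branch can be illustrated with a simplified word-level sketch (real lookahead decoding collects n-grams from Jacobi trajectories, not from past text; this only shows the lookup structure):

```python
def build_ngram_pool(tokens, N=4):
    """Map each token to the continuations (up to N-1 tokens) seen after it."""
    pool = {}
    for i, tok in enumerate(tokens[:-1]):
        pool.setdefault(tok, []).append(tokens[i + 1 : i + N])
    return pool

tokens = "the cat sat on the mat".split()
pool = build_ngram_pool(tokens)

# When the last generated token is "the", both observed continuations become
# draft candidates; the verification branch checks them in one forward pass
# and accepts the longest one the model agrees with.
print(pool["the"])  # [['cat', 'sat', 'on'], ['mat']]
```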

Method Comparison


| Method | Speedup | Training Needed | Draft Model | Quality Loss |
|--------|---------|-----------------|-------------|--------------|
| Draft Model Speculative | 1.5-2× | No | Yes (external) | None |
| Medusa | 2-3.6× | Minimal (heads only) | No (built-in heads) | None |
| Lookahead | 1.5-2.3× | None | No | None |
| Naive Batching | 1.2-1.5× | No | No | None |

Advanced Patterns


Training Medusa Heads


```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

# 1. Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3", torch_dtype=torch.float16
)

# 2. Add Medusa heads
num_heads = 4
medusa_heads = nn.ModuleList([
    nn.Linear(base_model.config.hidden_size, base_model.config.vocab_size, bias=False)
    for _ in range(num_heads)
])

# 3. Training loop (freeze base model for Medusa-1)
for param in base_model.parameters():
    param.requires_grad = False  # Freeze base

optimizer = torch.optim.Adam(medusa_heads.parameters(), lr=1e-3)

for batch in dataloader:  # dataloader: your tokenized training batches
    # Forward pass through the frozen base model
    hidden_states = base_model(**batch, output_hidden_states=True).hidden_states[-1]

    # Predict future tokens with each head
    loss = 0
    for i, head in enumerate(medusa_heads):
        logits = head(hidden_states)
        # Target: tokens shifted by (i+1) positions
        target = batch["input_ids"][:, i + 1 :]
        loss += F.cross_entropy(
            logits[:, : -(i + 1)].reshape(-1, logits.size(-1)),
            target.reshape(-1),
        )

    # Backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Hybrid: Speculative + Medusa


```python
# Use Medusa as the draft model for speculative decoding
draft_medusa = MedusaModel.from_pretrained("medusa-vicuna-7b")
target_model = AutoModelForCausalLM.from_pretrained("vicuna-33b")

# Draft generates multiple candidates with Medusa
draft_tokens = draft_medusa.medusa_generate(prompt, max_new_tokens=5)

# Target verifies in a single forward pass
outputs = target_model.generate(
    prompt,
    assistant_model=draft_medusa,  # Use Medusa as draft
    max_new_tokens=256,
)

# Combines benefits: Medusa speed + large model quality
```

Optimal Draft Model Selection


```python
def select_draft_model(target_model_size):
    """Select an optimal draft model size for speculative decoding."""
    # Rule of thumb: the draft should be 5-10× smaller than the target
    if target_model_size == "70B":
        return "7B"  # 10× smaller
    elif target_model_size == "33B":
        return "7B"  # ~5× smaller
    elif target_model_size == "13B":
        return "1B"  # 13× smaller
    else:
        return None  # Target too small; use Medusa/Lookahead instead

# Example
draft = select_draft_model("70B")
# Returns "7B" → use Llama-2-7b as draft for Llama-2-70b
```

Best Practices


1. Choose the Right Method


```python
# New deployment → Medusa (best overall speedup, no draft model)
if deploying_new_model:
    use_method = "Medusa"

# Existing deployment with a small model available → draft speculative
elif have_small_version_of_model:
    use_method = "Draft Model Speculative"

# Want zero training/setup → Lookahead
elif want_plug_and_play:
    use_method = "Lookahead Decoding"
```

2. Hyperparameter Tuning


Draft Model Speculative:
```python
# K = number of speculative tokens
K = 4  # Good default
K = 2  # Conservative (higher acceptance rate)
K = 8  # Aggressive (lower acceptance, but more tokens when accepted)
# Rule: larger K → more speedup IF the draft model is good
```

**Medusa**:
```python
# Posterior threshold (acceptance confidence)
posterior_threshold = 0.09  # Standard (from paper)
posterior_threshold = 0.05  # More conservative (slower, higher quality)
posterior_threshold = 0.15  # More aggressive (faster, may degrade quality)

# Tree depth (how many steps ahead)
medusa_choices = [[0], [0, 0], [0, 1], [0, 0, 0]]  # Depth 3 (standard)
```

**Lookahead**:
```python
# Window size W (lookahead distance), n-gram size N (generation context)
W, N = 15, 5  # 7B model (more resources)
W, N = 10, 5  # 13B model (moderate)
W, N = 7, 5   # 33B+ model (limited resources)
```

3. Production Deployment


```python
# vLLM with speculative decoding
from vllm import LLM, SamplingParams

# Initialize with draft model
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    speculative_model="meta-llama/Llama-2-7b-hf",  # Draft model
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

# Generate
prompts = ["Tell me about AI:", "Explain quantum physics:"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Resources


See Also


  • references/draft_model.md
    - Draft model selection and training
  • references/medusa.md
    - Medusa architecture and training
  • references/lookahead.md
    - Lookahead decoding implementation details