Speculative Decoding: Accelerating LLM Inference


When to Use This Skill


Use Speculative Decoding when you need to:
  • Speed up inference by 1.5-3.6× without quality loss
  • Reduce latency for real-time applications (chatbots, code generation)
  • Optimize throughput for high-volume serving
  • Deploy efficiently on limited hardware
  • Generate faster without changing model architecture
Key Techniques: Draft model speculative decoding, Medusa (multiple heads), Lookahead Decoding (Jacobi iteration)
Papers: Medusa (arXiv 2401.10774), Lookahead Decoding (ICML 2024), Speculative Decoding Survey (ACL 2024)

Installation


```bash
# Standard speculative decoding (transformers)
pip install transformers accelerate

# Medusa (multiple decoding heads)
git clone https://github.com/FasterDecoding/Medusa
cd Medusa
pip install -e .

# Lookahead Decoding
git clone https://github.com/hao-ai-lab/LookaheadDecoding
cd LookaheadDecoding
pip install -e .

# Optional: vLLM with speculative decoding
pip install vllm
```

Quick Start


Basic Speculative Decoding (Draft Model)


```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load target model (large, slow)
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
)

# Load draft model (small, fast)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

# Generate with speculative decoding
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Transformers 4.36+ supports assisted generation
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # Enable speculative decoding
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

Medusa (Multiple Decoding Heads)


```python
import torch
from transformers import AutoTokenizer
from medusa.model.medusa_model import MedusaModel

# Load Medusa-enhanced model
model = MedusaModel.from_pretrained(
    "FasterDecoding/medusa-vicuna-7b-v1.3",  # Pre-trained with Medusa heads
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("FasterDecoding/medusa-vicuna-7b-v1.3")

# Generate with Medusa (2-3× speedup)
prompt = "Write a Python function to calculate fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.medusa_generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    posterior_threshold=0.09,  # Acceptance threshold
    posterior_alpha=0.3,       # Tree construction parameter
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Lookahead Decoding (Jacobi Iteration)


```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from lookahead.lookahead_decoding import LookaheadDecoding

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Initialize lookahead decoding
lookahead = LookaheadDecoding(
    model=model,
    tokenizer=tokenizer,
    window_size=15,  # Lookahead window (W)
    ngram_size=5,    # N-gram size (N)
    guess_size=5,    # Number of parallel guesses
)

# Generate (1.5-2.3× speedup)
prompt = "Implement quicksort in Python:"
output = lookahead.generate(prompt, max_new_tokens=256)
print(output)
```

Core Concepts


1. Speculative Decoding (Draft Model)


Idea: Use a small draft model to generate candidate tokens, then have the large target model verify them all in parallel.
Algorithm:
  1. Draft model generates K tokens speculatively
  2. Target model evaluates all K tokens in parallel (single forward pass)
  3. Accept tokens where draft and target agree
  4. Reject first disagreement, continue from there
```python
import torch
import torch.nn.functional as F

def speculative_decode(target_model, draft_model, input_ids, K=4):
    """One round of speculative decoding (simplified sketch)."""
    # 1. Draft model generates K candidate tokens (keeping their distributions)
    draft = draft_model.generate(
        input_ids,
        max_new_tokens=K,
        do_sample=True,
        output_scores=True,
        return_dict_in_generate=True,
    )
    draft_tokens = draft.sequences[0, -K:]

    # 2. Target model scores the prompt + all K drafts in ONE forward pass
    target_logits = target_model(draft.sequences).logits[0, -K - 1 : -1]

    # 3. Accept token i with probability min(1, p_target / p_draft)
    accepted = []
    for i in range(K):
        p_draft = F.softmax(draft.scores[i][0], dim=-1)
        p_target = F.softmax(target_logits[i], dim=-1)
        tok = draft_tokens[i]
        if torch.rand(()).item() < min(1.0, (p_target[tok] / p_draft[tok]).item()):
            accepted.append(tok.item())
        else:
            break  # Reject: resample this position from the target distribution
    return accepted
```
Performance:
  • Speedup: 1.5-2× with good draft model
  • Zero quality loss (mathematically equivalent to target model)
  • Best when draft model is 5-10× smaller than target
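These numbers follow from the acceptance dynamics. In the standard speculative decoding analysis, if the target accepts each drafted token independently with rate α, the expected number of tokens produced per (expensive) target forward pass is the geometric sum (1 − α^(K+1)) / (1 − α). A quick sketch (the α values here are illustrative, not measured):

```python
def expected_tokens_per_pass(alpha: float, K: int) -> float:
    """Expected tokens per target forward pass when K tokens are drafted
    and each is accepted independently with probability alpha.
    Equals 1 + alpha + alpha^2 + ... + alpha^K."""
    return (1 - alpha ** (K + 1)) / (1 - alpha)

# Strong draft model (alpha ~ 0.8): drafting more tokens keeps paying off
print([round(expected_tokens_per_pass(0.8, K), 2) for K in (2, 4, 8)])
# → [2.44, 3.36, 4.33]

# Weak draft model (alpha ~ 0.4): gains saturate almost immediately
print([round(expected_tokens_per_pass(0.4, K), 2) for K in (2, 4, 8)])
# → [1.56, 1.65, 1.67]
```

This is why the 5-10× size rule matters: a draft too small (low α) wastes target passes on rejected tokens, while a draft too large erodes the speed advantage of drafting.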

2. Medusa (Multiple Decoding Heads)


Source: arXiv 2401.10774 (2024)
Innovation: Add multiple prediction heads to an existing model so it predicts future tokens itself, with no separate draft model.
Architecture:
Input → Base LLM (frozen) → Hidden State
                                ├→ Head 1 (predicts token t+1)
                                ├→ Head 2 (predicts token t+2)
                                ├→ Head 3 (predicts token t+3)
                                └→ Head 4 (predicts token t+4)
Training:
  • Medusa-1: Freeze base LLM, train only heads
    • 2.2× speedup, lossless
  • Medusa-2: Fine-tune base LLM + heads together
    • 2.3-3.6× speedup, better quality
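The heads themselves are just linear projections on the base model's final hidden state. A minimal sketch with toy dimensions (`hidden_size`, `vocab_size` below are placeholders, not a real model's config):

```python
import torch
import torch.nn as nn

hidden_size, vocab_size, num_heads = 64, 100, 4  # toy sizes for illustration

# Each Medusa head maps the last hidden state to a vocabulary distribution
heads = nn.ModuleList(
    nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(num_heads)
)

# Stand-in for the frozen base LLM's final hidden state at the last position
hidden_state = torch.randn(1, hidden_size)

# One pass over the heads yields greedy guesses for t+1 ... t+4 at once
guesses = [head(hidden_state).argmax(dim=-1).item() for head in heads]
print(guesses)  # four token ids, one per future position
```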
Tree-based Attention:
```
# Medusa constructs a tree of candidates
# Example: predict 2 steps ahead with top-2 per step
#
#              Root
#             /    \
#          T1a      T1b        (Step 1: 2 candidates)
#          / \      / \
#       T2a   T2b T2c  T2d     (Step 2: 4 candidates total)
#
# A single forward pass evaluates the entire tree!
```


**Advantages**:
- No separate draft model needed
- Minimal training (only heads)
- Compatible with any LLM


3. Lookahead Decoding (Jacobi Iteration)


Source: ICML 2024
Core idea: Reformulate autoregressive decoding as solving a system of equations, then solve it in parallel with Jacobi iteration.
Mathematical formulation:
```
Traditional:  y_t = f(x, y_1, ..., y_{t-1})                         (sequential)
Jacobi:       y_t^{(k+1)} = f(x, y_1^{(k)}, ..., y_{t-1}^{(k)})     (parallel)
```
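To see why the Jacobi view works, here is a toy deterministic "next-token" rule (purely illustrative — real LMs are neural, but the fixed-point argument is the same): updating every position in parallel reaches the sequential answer in at most T sweeps, because after sweep k the first k tokens are already exact.

```python
def f(x, prefix):
    """Toy deterministic next-token rule."""
    return (x + sum(prefix)) % 10

def sequential_decode(x, T):
    y = []
    for _ in range(T):
        y.append(f(x, y))  # classic autoregressive loop: T dependent steps
    return y

def jacobi_decode(x, T):
    y = [0] * T  # arbitrary initial guess for every position
    steps = 0
    while True:
        steps += 1
        y_new = [f(x, y[:t]) for t in range(T)]  # all positions in parallel
        if y_new == y:  # fixed point = the autoregressive solution
            return y, steps
        y = y_new

seq = sequential_decode(3, 6)
jac, steps = jacobi_decode(3, 6)
print(seq, jac == seq, steps)  # identical output; converged within T+1 sweeps
```

Lookahead decoding accelerates this further by caching the n-grams that appear along the Jacobi trajectories and verifying them against the model.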
Two branches:
  1. Lookahead Branch: Generate n-grams in parallel
    • Window size W: How many steps to look ahead
    • N-gram size N: How many past tokens to use
  2. Verification Branch: Verify promising n-grams
    • Match n-grams with generated tokens
    • Accept if first token matches
python
class LookaheadDecoding:
    def __init__(self, model, window_size=15, ngram_size=5):
        self.model = model
        self.W = window_size  # Lookahead window
        self.N = ngram_size   # N-gram size

    def generate_step(self, tokens):
        # Lookahead branch: Generate W × N candidates
        candidates = {}
        for w in range(1, self.W + 1):
            for n in range(1, self.N + 1):
                # Generate n-gram starting at position w
                ngram = self.generate_ngram(tokens, start=w, length=n)
                candidates[(w, n)] = ngram

        # Verification branch: Find matching n-grams
        verified = []
        for ngram in candidates.values():
            if ngram[0] == tokens[-1]:  # First token matches last input
                if self.verify(tokens, ngram):
                    verified.append(ngram)

        # Accept longest verified n-gram
        return max(verified, key=len) if verified else [self.model.generate_next(tokens)]
Performance:
  • Speedup: 1.5-2.3× (up to 3.6× for code generation)
  • No draft model or training needed
  • Works out-of-the-box with any model
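The n-gram pool behind the verification branch can be illustrated with a simplified word-level sketch (real lookahead decoding collects n-grams from Jacobi trajectories, not from past text; this only shows the lookup structure):

```python
def build_ngram_pool(tokens, N=4):
    """Map each token to the continuations (up to N-1 tokens) seen after it."""
    pool = {}
    for i, tok in enumerate(tokens[:-1]):
        pool.setdefault(tok, []).append(tokens[i + 1 : i + N])
    return pool

tokens = "the cat sat on the mat".split()
pool = build_ngram_pool(tokens)

# When the last generated token is "the", both observed continuations become
# draft candidates; the verification branch checks them in one forward pass
# and accepts the longest one the model agrees with.
print(pool["the"])  # [['cat', 'sat', 'on'], ['mat']]
```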

Method Comparison


| Method | Speedup | Training Needed | Draft Model | Quality Loss |
|--------|---------|-----------------|-------------|--------------|
| Draft Model Speculative | 1.5-2× | No | Yes (external) | None |
| Medusa | 2-3.6× | Minimal (heads only) | No (built-in heads) | None |
| Lookahead | 1.5-2.3× | None | No | None |
| Naive Batching | 1.2-1.5× | No | No | None |

Advanced Patterns


Training Medusa Heads


```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

# 1. Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3", torch_dtype=torch.float16
)

# 2. Add Medusa heads
num_heads = 4
medusa_heads = nn.ModuleList([
    nn.Linear(base_model.config.hidden_size, base_model.config.vocab_size, bias=False)
    for _ in range(num_heads)
])

# 3. Training loop (freeze base model for Medusa-1)
for param in base_model.parameters():
    param.requires_grad = False  # Freeze base

optimizer = torch.optim.Adam(medusa_heads.parameters(), lr=1e-3)

for batch in dataloader:  # dataloader: your tokenized training batches
    # Forward pass through the frozen base model
    hidden_states = base_model(**batch, output_hidden_states=True).hidden_states[-1]

    # Predict future tokens with each head
    loss = 0
    for i, head in enumerate(medusa_heads):
        logits = head(hidden_states)
        # Target: tokens shifted by (i+1) positions
        target = batch["input_ids"][:, i + 1 :]
        loss += F.cross_entropy(
            logits[:, : -(i + 1)].reshape(-1, logits.size(-1)),
            target.reshape(-1),
        )

    # Backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Hybrid: Speculative + Medusa


```python
# Use Medusa as the draft model for speculative decoding
draft_medusa = MedusaModel.from_pretrained("medusa-vicuna-7b")
target_model = AutoModelForCausalLM.from_pretrained("vicuna-33b")

# Draft generates multiple candidates with Medusa
draft_tokens = draft_medusa.medusa_generate(prompt, max_new_tokens=5)

# Target verifies in a single forward pass
outputs = target_model.generate(
    prompt,
    assistant_model=draft_medusa,  # Use Medusa as draft
    max_new_tokens=256,
)

# Combines benefits: Medusa speed + large model quality
```

Optimal Draft Model Selection


```python
def select_draft_model(target_model_size):
    """Select an optimal draft model size for speculative decoding."""
    # Rule of thumb: the draft should be 5-10× smaller than the target
    if target_model_size == "70B":
        return "7B"  # 10× smaller
    elif target_model_size == "33B":
        return "7B"  # ~5× smaller
    elif target_model_size == "13B":
        return "1B"  # 13× smaller
    else:
        return None  # Target too small; use Medusa/Lookahead instead

# Example
draft = select_draft_model("70B")
# Returns "7B" → use Llama-2-7b as draft for Llama-2-70b
```

Best Practices


1. Choose the Right Method


```python
# New deployment → Medusa (best overall speedup, no draft model)
if deploying_new_model:
    use_method = "Medusa"

# Existing deployment with a small model available → draft speculative
elif have_small_version_of_model:
    use_method = "Draft Model Speculative"

# Want zero training/setup → Lookahead
elif want_plug_and_play:
    use_method = "Lookahead Decoding"
```

2. Hyperparameter Tuning


Draft Model Speculative:
```python
# K = number of speculative tokens
K = 4  # Good default
K = 2  # Conservative (higher acceptance rate)
K = 8  # Aggressive (lower acceptance, but more tokens when accepted)
# Rule: larger K → more speedup IF the draft model is good
```

**Medusa**:
```python
# Posterior threshold (acceptance confidence)
posterior_threshold = 0.09  # Standard (from paper)
posterior_threshold = 0.05  # More conservative (slower, higher quality)
posterior_threshold = 0.15  # More aggressive (faster, may degrade quality)

# Tree depth (how many steps ahead)
medusa_choices = [[0], [0, 0], [0, 1], [0, 0, 0]]  # Depth 3 (standard)
```

**Lookahead**:
```python
# Window size W (lookahead distance), n-gram size N (generation context)
W, N = 15, 5  # 7B model (more resources)
W, N = 10, 5  # 13B model (moderate)
W, N = 7, 5   # 33B+ model (limited resources)
```

3. Production Deployment


```python
# vLLM with speculative decoding
from vllm import LLM, SamplingParams

# Initialize with draft model
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    speculative_model="meta-llama/Llama-2-7b-hf",  # Draft model
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

# Generate
prompts = ["Tell me about AI:", "Explain quantum physics:"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Resources


See Also


  • references/draft_model.md
    - Draft model selection and training
  • references/medusa.md
    - Medusa architecture and training
  • references/lookahead.md
    - Lookahead decoding implementation details