Fig. 1 | Speculative decoding animation. Top: the draft model quickly generates 4 candidate tokens. Bottom: the target model verifies all 4 in a single parallel pass. After verification, each token is accepted (green) or rejected (red).

Comparison: Standard vs. Speculative Decoding

Standard decoding
  Forward pass 1 → token 1
  Forward pass 2 → token 2
  Forward pass 3 → token 3
  Forward pass 4 → token 4
  Total: 4 sequential target passes

Speculative decoding
  Draft (fast):            4 tokens
  Target (1 parallel pass): 4 tokens verified
  Accepted:                3 tokens ✓
  Rejected & retried:      1 token
  Total: ~2× faster
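The draft-then-verify round in the comparison can be sketched in a few lines of Python. Note that `draft_model` and `target_model` below are hypothetical toy stand-ins (a random sampler and a deterministic rule, not real networks), and acceptance is shown as a greedy exact-match check rather than the full rejection-sampling rule:

```python
import random

VOCAB = list(range(10))

def draft_model(prefix):
    # Toy stand-in for the small, fast model: samples any token.
    return random.choice(VOCAB)

def target_model(prefix):
    # Toy stand-in for the large model's greedy choice given a prefix.
    return sum(prefix) % 10

def speculative_step(prefix, k=4):
    """One round: draft k tokens cheaply, then let the target accept the
    longest matching run and supply one correction token of its own."""
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)
    # Verification (conceptually a single parallel target pass):
    accepted, ctx = [], list(prefix)
    for t in drafted:
        if t == target_model(ctx):   # greedy acceptance check
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # After the first mismatch (or after all accepts), the target's own
    # next token comes for free, so every round emits at least 1 token.
    accepted.append(target_model(ctx))
    return accepted
```

Because acceptance here is a greedy exact match, the emitted tokens are identical to what the target alone would produce; only the number of expensive target calls changes.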

Key Insights

1. Draft + Target Paradigm: A small draft model (e.g., 7B) generates tokens quickly; a large target model (e.g., 70B) verifies them. The setup is asymmetric: drafting is cheap, the target is expensive. That asymmetry is the trick.
2. Parallel Verification: The key is that the target verifies all draft candidates simultaneously, in a single forward pass over the longer sequence. This is much faster than sampling the same tokens one by one.
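A minimal sketch of why one pass suffices, reusing a toy deterministic target (sum mod 10) as a hypothetical stand-in: a real transformer returns next-token predictions at every position of a sequence in a single forward pass, which is what the single `target_forward` call below represents:

```python
def target_forward(tokens):
    """Toy stand-in for ONE target forward pass over a whole sequence:
    returns the next-token prediction at every position i, as a real
    model does with its per-position logits."""
    return [sum(tokens[:i + 1]) % 10 for i in range(len(tokens))]

def verify(prefix, drafted):
    """Check all drafted tokens against one target pass over
    prefix + drafted, instead of len(drafted) sequential target calls."""
    seq = list(prefix) + list(drafted)
    preds = target_forward(seq)          # one pass, len(seq) predictions
    accepted = []
    for i, t in enumerate(drafted):
        # Prediction made at the position just before this drafted token.
        if preds[len(prefix) - 1 + i] == t:
            accepted.append(t)
        else:
            break                        # first mismatch ends the run
    return accepted
```

With this toy target, `verify([1, 2, 3], [6, 2, 9])` accepts the first two drafted tokens and rejects the third, returning `[6, 2]`.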
3. Acceptance Rate Is Critical: If the draft's distribution is close to the target's, most tokens are accepted (~80-90%) and the speedup is ~2-3×. If the draft is poor, rejections are frequent and the speedup drops to ~1.2×.
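The standard analysis (from the original speculative decoding paper, Leviathan et al., 2023) makes these numbers concrete: with per-token acceptance probability alpha and gamma drafted tokens per round, a round emits (1 - alpha^(gamma+1)) / (1 - alpha) tokens in expectation. A small calculator sketch; the cost ratio `c` (draft cost / target cost per token) is an assumed parameter, not a measured value:

```python
def expected_tokens(alpha, gamma):
    """Expected tokens emitted per draft-verify round, assuming each
    draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha, gamma, c):
    """Approximate wall-clock speedup over standard decoding.
    c = draft/target cost per token (assumed, e.g. ~0.1 for 7B vs 70B);
    one round costs gamma draft tokens plus one target pass."""
    return expected_tokens(alpha, gamma) / (gamma * c + 1)

# A well-matched draft vs. a poor one, with gamma=4 and c=0.1:
print(round(speedup(0.85, 4, 0.1), 2))  # ~2.6x, in the 2-3x range
print(round(speedup(0.40, 4, 0.1), 2))  # ~1.2x
```

This is where the ~2-3× vs. ~1.2× figures above come from: the acceptance rate alpha dominates the formula, while gamma and c only shift it at the margins.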
4. Practical Constraints: The draft model must closely match the target, otherwise rejections dominate. In practice the draft is therefore often a smaller checkpoint from the same model family, not a completely different model.
5. Latency vs. Throughput: Speculative decoding reduces latency, which matters for interactive use; it does not reduce the FLOPs required. Good fit: chat APIs and real-time applications. Poor fit: batch inference and maximum-throughput scenarios.
6. Allocation Decision: Under a limited GPU budget, speculative decoding is an engineering trade-off between latency improvement and added complexity; it is only worthwhile when latency is critical.