Fig. 1 | Speculative decoding animation. Top: the draft model quickly generates 4 candidate tokens. Bottom: the target model verifies all 4 in a single parallel pass. After verification, each token is accepted (green) or rejected (red).

Comparison: Standard vs. Speculative Decoding

Standard decoding
  Forward pass 1 → token 1
  Forward pass 2 → token 2
  Forward pass 3 → token 3
  Forward pass 4 → token 4
  Total: 4 sequential target passes

Speculative decoding
  Draft (fast):            4 tokens
  Target (1 parallel pass): 4 tokens verified
  Accepted:                3 tokens ✓
  Rejected & retried:      1 token
  Total: ~2× faster
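The draft-then-verify round in the comparison can be sketched in a few lines of Python. Note that `draft_model` and `target_model` below are hypothetical toy stand-ins (a random sampler and a deterministic rule, not real networks), and acceptance is shown as a greedy exact-match check rather than the full rejection-sampling rule:

```python
import random

VOCAB = list(range(10))

def draft_model(prefix):
    # Toy stand-in for the small, fast model: samples any token.
    return random.choice(VOCAB)

def target_model(prefix):
    # Toy stand-in for the large model's greedy choice given a prefix.
    return sum(prefix) % 10

def speculative_step(prefix, k=4):
    """One round: draft k tokens cheaply, then let the target accept the
    longest matching run and supply one correction token of its own."""
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)
    # Verification (conceptually a single parallel target pass):
    accepted, ctx = [], list(prefix)
    for t in drafted:
        if t == target_model(ctx):   # greedy acceptance check
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # After the first mismatch (or after all accepts), the target's own
    # next token comes for free, so every round emits at least 1 token.
    accepted.append(target_model(ctx))
    return accepted
```

Because acceptance here is a greedy exact match, the emitted tokens are identical to what the target alone would produce; only the number of expensive target calls changes.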

Key Insights

1. Draft + Target Paradigm: A small draft model (e.g., 7B) generates tokens quickly; a large target model (e.g., 70B) verifies them. The setup is asymmetric: drafting is cheap, the target is expensive. That asymmetry is the trick.
2. Parallel Verification: The key is that the target verifies all draft candidates simultaneously, in a single forward pass over the longer sequence. This is much faster than sampling the same tokens one by one.
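A minimal sketch of why one pass suffices, reusing a toy deterministic target (sum mod 10) as a hypothetical stand-in: a real transformer returns next-token predictions at every position of a sequence in a single forward pass, which is what the single `target_forward` call below represents:

```python
def target_forward(tokens):
    """Toy stand-in for ONE target forward pass over a whole sequence:
    returns the next-token prediction at every position i, as a real
    model does with its per-position logits."""
    return [sum(tokens[:i + 1]) % 10 for i in range(len(tokens))]

def verify(prefix, drafted):
    """Check all drafted tokens against one target pass over
    prefix + drafted, instead of len(drafted) sequential target calls."""
    seq = list(prefix) + list(drafted)
    preds = target_forward(seq)          # one pass, len(seq) predictions
    accepted = []
    for i, t in enumerate(drafted):
        # Prediction made at the position just before this drafted token.
        if preds[len(prefix) - 1 + i] == t:
            accepted.append(t)
        else:
            break                        # first mismatch ends the run
    return accepted
```

With this toy target, `verify([1, 2, 3], [6, 2, 9])` accepts the first two drafted tokens and rejects the third, returning `[6, 2]`.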
3. Acceptance Rate Is Critical: If the draft's distribution is close to the target's, most tokens are accepted (~80-90%) and the speedup is ~2-3×. If the draft is poor, rejections are frequent and the speedup drops to ~1.2×.
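The standard analysis (from the original speculative decoding paper, Leviathan et al., 2023) makes these numbers concrete: with per-token acceptance probability alpha and gamma drafted tokens per round, a round emits (1 - alpha^(gamma+1)) / (1 - alpha) tokens in expectation. A small calculator sketch; the cost ratio `c` (draft cost / target cost per token) is an assumed parameter, not a measured value:

```python
def expected_tokens(alpha, gamma):
    """Expected tokens emitted per draft-verify round, assuming each
    draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha, gamma, c):
    """Approximate wall-clock speedup over standard decoding.
    c = draft/target cost per token (assumed, e.g. ~0.1 for 7B vs 70B);
    one round costs gamma draft tokens plus one target pass."""
    return expected_tokens(alpha, gamma) / (gamma * c + 1)

# A well-matched draft vs. a poor one, with gamma=4 and c=0.1:
print(round(speedup(0.85, 4, 0.1), 2))  # ~2.6x, in the 2-3x range
print(round(speedup(0.40, 4, 0.1), 2))  # ~1.2x
```

This is where the ~2-3× vs. ~1.2× figures above come from: the acceptance rate alpha dominates the formula, while gamma and c only shift it at the margins.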
4. Practical Constraints: The draft model must closely match the target, otherwise rejections dominate. In practice the draft is therefore often a smaller checkpoint from the same model family, not a completely different model.
5. Latency vs. Throughput: Speculative decoding reduces latency, which matters for interactive use; it does not reduce the FLOPs required. Good fit: chat APIs and real-time applications. Poor fit: batch inference and maximum-throughput scenarios.
6. Allocation Decision: Under a limited GPU budget, speculative decoding is an engineering trade-off between latency improvement and added complexity; it is only worthwhile when latency is critical.