Speculative decoding uses a small, fast draft model to generate tokens that are verified in parallel by the larger target model, speeding up inference without losing quality.
It is an optimization aimed at the memory-bandwidth bottleneck of autoregressive decoding: generating each token one at a time requires re-loading the model weights for every step, leaving compute units underutilized. By verifying several draft tokens in a single forward pass of the target model, speculative decoding amortizes those weight loads. Reported speedups are commonly in the 2-3x range, depending on how often the target model accepts the draft's proposals.
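The accept/reject loop can be sketched with toy deterministic models. Everything here is hypothetical illustration: `target_next` and `draft_next` are stand-ins for the greedy next-token choice of a real target and draft model, and the "parallel" verification is simulated with plain function calls rather than a batched forward pass. The key invariant holds, though: the committed output is identical to what greedy decoding with the target alone would produce.

```python
K = 4  # number of draft tokens proposed per round (assumed hyperparameter)

def target_next(seq):
    # Toy stand-in for the large target model's greedy next token.
    return (sum(seq) * 7 + 3) % 50

def draft_next(seq):
    # Toy draft model: agrees with the target except when sum(seq) % 5 == 0,
    # mimicking a small model that is usually, but not always, right.
    t = target_next(seq)
    return (t + 1) % 50 if sum(seq) % 5 == 0 else t

def speculative_step(seq):
    """One round: draft K tokens, then verify them with the target model.

    With a real model the K + 1 target evaluations would be one batched
    forward pass; here they are sequential calls for clarity.
    Returns the tokens committed this round.
    """
    # 1. Draft model proposes K tokens autoregressively (cheap).
    draft = []
    for _ in range(K):
        draft.append(draft_next(seq + draft))

    # 2. Target model verifies the proposals left to right.
    committed = []
    for tok in draft:
        expected = target_next(seq + committed)
        if tok == expected:
            committed.append(tok)       # draft token verified: accept
        else:
            committed.append(expected)  # mismatch: take target's token, stop
            break
    else:
        # All K drafts accepted: the target's final logits yield a bonus token.
        committed.append(target_next(seq + committed))
    return committed

seq = [1, 2, 3]
out = speculative_step(seq)  # between 1 and K + 1 tokens per round
```

Each round thus commits between one token (immediate rejection) and K + 1 tokens (all drafts accepted plus the free bonus token), which is where the speedup comes from when the draft model's acceptance rate is high.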