Speculative Decoding Guide: Speed Up LLM Inference with Draft and Verifier Models

Imagine waiting for a massive language model to type out a response one word at a time, like a slow typewriter. That's standard autoregressive decoding. Now imagine a teammate who quickly scribbles a draft of the next few words, and a senior editor who glances at them and says, "Yes, yes, no, change this one." Suddenly, the typing speed jumps. That's exactly how speculative decoding works: it's an inference-time optimization technique that accelerates large language model (LLM) token generation without compromising output quality. First introduced by Google Research in 2022, it tackles one of the biggest bottlenecks in AI today: the slow, token-by-token pace of generation in large models.

How the Draft and Verifier System Works

To understand this, you have to realize that big models are slow because they run through billions of parameters for every single token. Speculative decoding breaks this cycle by using two different models working together. First, a Draft Model (a much smaller, faster, and cheaper model) predicts a sequence of tokens. Instead of just one, it might guess the next 3 to 12 tokens. Because it's small, it does this almost instantly. Then comes the Verifier Model (the target LLM). Instead of generating tokens one by one, it checks the entire draft in a single forward pass, testing whether each draft token matches what it would have predicted itself. If the draft is correct, the verifier accepts all of those tokens at once. If the draft hits a wrong word, the verifier keeps the correct part, substitutes its own token for the mistake, and the process starts over from there. The magic is that the final output is 100% identical to what the big model would have produced on its own; it's just significantly faster.
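
To make the control flow concrete, here is a minimal, greedy-acceptance sketch in plain Python. The names draft_next, target_next, and make_toy_model are illustrative stand-ins, not a library API, and a real engine would score the whole draft in one batched forward pass rather than token by token as simulated here.

```python
def make_toy_model(text, noise_every=0):
    """Return a next-token function that 'continues' a fixed string.

    noise_every > 0 makes the model wrong on every Nth call, mimicking an
    imperfect draft model. Purely illustrative.
    """
    calls = {"n": 0}

    def next_token(prefix):
        calls["n"] += 1
        tok = text[len(prefix) % len(text)]
        if noise_every and calls["n"] % noise_every == 0:
            return "?"  # deliberate mistake
        return tok

    return next_token


def speculative_decode(draft_next, target_next, prompt, k=4, max_new_tokens=16):
    """Greedy speculative decoding: draft k tokens cheaply, verify with the target."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2. Verifier checks the draft (a real engine does this in one pass).
        accepted = 0
        for i, tok in enumerate(draft):
            if target_next(tokens + draft[:i]) == tok:
                accepted += 1
            else:
                break

        # 3. Keep the accepted prefix, then append the verifier's own token at the
        #    first mismatch (or one bonus token if the whole draft was accepted).
        tokens += draft[:accepted]
        tokens.append(target_next(tokens))
    return "".join(tokens)


target = make_toy_model("hello world ")                   # the "big" model
draft = make_toy_model("hello world ", noise_every=7)     # imperfect "small" model
print(speculative_decode(draft, target, list("hello ")))
```

Note that even when the draft is rejected outright, step 3 still produces one token from the verifier, which is why the method never falls below the speed of standard decoding on a per-pass basis; the risk is only in wasted drafting work.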

Measuring Success: The Acceptance Rate

Not every draft is a winner. The efficiency of this system relies heavily on the Acceptance Rate, often denoted as $\alpha$: the probability that the verifier model agrees with the draft model's suggestions. If your draft model is well-aligned with the target model, you'll see high acceptance rates and, with them, large speedups. Structured tasks tend to help: in code generation, for example, draft models often do better because code follows predictable patterns. You might see an acceptance rate of 58% for Python code, while an open-ended creative writing task might drop to 32%. When the acceptance rate is too low (typically below 30%), the system can actually become slower than standard decoding, because you're burning compute on drafts that get rejected and effectively doing the work twice for nothing. Finding the right pair of models is the secret sauce to making this work in production.
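
You can turn $\alpha$ into a back-of-the-envelope speed estimate. Under the simplifying assumption that each of the K draft tokens is accepted independently with probability $\alpha$, the expected number of tokens produced per verifier pass is $(1 - \alpha^{K+1}) / (1 - \alpha)$, a standard result in the speculative decoding literature. A quick sketch, using the acceptance rates quoted above purely for illustration:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per verifier pass, assuming each of the k draft
    tokens is accepted independently with probability alpha (i.i.d. assumption),
    plus the verifier's own token at the end of the pass."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Illustrative numbers only, with k = 5 draft tokens per round:
print(round(expected_tokens_per_pass(0.58, 5), 2))  # ~2.29 tokens per target pass
print(round(expected_tokens_per_pass(0.32, 5), 2))  # ~1.47 tokens per target pass
```

The gap between those two numbers is the whole story: the same machinery yields a much bigger win on the task where the drafter guesses well.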

Different Flavors of Speculative Decoding

As the tech has evolved since 2022, we've moved beyond just pairing two separate models. Depending on your hardware and memory, you might choose a different approach.
Comparison of Speculative Decoding Implementations

Method | Mechanism | Typical Speedup | Main Advantage | Trade-off
Standard Draft-Target | Two separate models (e.g., T5-small → T5-XXL) | ~3x | High peak speed | Higher memory footprint
Self-Speculative | Layer-skipping within one model | ~1.99x | No extra memory needed | Slower than dual-model
Speculative Speculative (SSD) | Asynchronous hardware parallelization | Up to 5x | Extremely low latency | Requires separate GPUs

For those who can't afford to load two models into VRAM, Self-Speculative Decoding is a lifesaver. Introduced at ACL 2024, it doesn't use a separate model at all. Instead, it selectively skips intermediate layers of the original LLM to create a "fast version" of itself for drafting. This is a plug-and-play approach that requires zero extra training. If you're pushing for the absolute lowest latency, the Saguaro implementation of Speculative Speculative Decoding (SSD) is the current gold standard. It runs the drafter and verifier on separate hardware devices asynchronously, removing the sequential bottleneck and potentially hitting 5x speedups over standard autoregressive decoding.

Practical Implementation and Pitfalls

Setting this up isn't as simple as flipping a switch. Most developers use engines like vLLM or Text Generation Inference (TGI) to handle the heavy lifting. One of the biggest headaches is tuning the value of K (the number of draft tokens per round). You might think "more is better," but NVIDIA's research shows diminishing returns after K=8: set K too high and you spend too much time drafting tokens that are likely to be rejected, which drags down overall throughput. Another common trap is distribution drift. If your draft model was trained on different data than your target model, or your traffic shifts over time, the "agreement" between them drops. This is why some newer frameworks, like the Draft, Verify, and Improve (DVI) system, use online learning to let the verifier "teach" the drafter in real time, keeping the two in sync.
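
With vLLM, enabling this is mostly a configuration exercise. The sketch below shows the general shape only: the exact argument names vary between vLLM releases (newer versions group them under a speculative_config dict), and the model pair is an illustrative choice, not a recommendation.

```python
from vllm import LLM, SamplingParams

# Sketch only: argument names follow older vLLM releases and may differ in your
# version; newer releases expect a speculative_config dict instead. Model names
# are illustrative.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",                   # verifier / target model
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # small draft model
    num_speculative_tokens=5,                                 # the K discussed above
)

outputs = llm.generate(
    ["Write a Python function that reverses a string."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

In practice you would benchmark a few values of K (and a few draft models) against your own traffic, since the acceptance rate, not the configuration, is what ultimately decides the speedup.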

Real-World Impact on AI Costs

This isn't just a theoretical win; it's changing the economics of AI. Enterprise adoption is huge because faster inference means fewer GPU hours. AWS reported that customers using speculative decoding in Bedrock saw inference costs drop by 63%. For a real-time chat interface, a 5x speedup is the difference between a user feeling like they're talking to a human and feeling like they're waiting for a webpage to load. By reducing the time the GPU spends idling while waiting for the next token, companies can serve more users with the same amount of hardware. According to industry data, about 78% of enterprise LLM frameworks now include some form of speculative decoding in their stack.

Does speculative decoding change the quality of the AI's answer?

No. Because the verifier model (the big LLM) checks every single token proposed by the draft model, the final output is mathematically identical to what the big model would have generated on its own. There is zero quality loss.

What happens if the draft model is completely wrong?

If the verifier rejects the very first token proposed by the drafter, the system simply reverts to standard autoregressive decoding for that step. You lose a small amount of time on the failed draft, but the accuracy remains perfect.

Which GPU is best for this technique?

NVIDIA Ampere architecture or newer is generally recommended. For high-end implementations like SSD (Saguaro), having multiple GPUs is necessary to run the draft and verifier models in parallel across different devices.

Why is code generation faster than creative writing with this method?

Code is more structured and predictable. There are fewer ways to write a standard "for loop" in Python than there are ways to describe a sunset in a poem. This makes it much easier for the small draft model to guess correctly, leading to a higher acceptance rate.

Can I use this with any LLM?

Yes, but you need a compatible draft model. You can either use a separate smaller model (like TinyLlama for CodeLlama) or use self-speculative decoding, which works with the model you already have by skipping layers.
