Speculative decoding

Speculative decoding runs a small draft model alongside your main (target) model. The draft proposes several candidate tokens, and the target verifies them in a single forward pass. Every proposal the target accepts is a token you did not need a separate target step for, so one target step can yield more than one token. RunInfra's optimizer evaluates speculation as one of its techniques during an optimization run and ships it on the winning variant when it improves throughput on your workload.
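The propose-and-verify loop can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not RunInfra's implementation: both models decode greedily, a "model" is just a function from a token prefix to the next token, and the names `speculative_step`, `generate`, `toy_draft`, and `toy_target` are hypothetical. A real serving backend scores all drafted positions in one batched target pass; the sketch loops over them for readability.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Return the tokens produced by one target step (between 1 and k + 1)."""
    # 1. The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposal in order. (A real backend does this
    #    for all positions in a single forward pass.)
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First disagreement: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the same pass also yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> Tuple[List[int], int]:
    """Generate tokens and count how many target steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
        steps += 1
    return out[: len(prompt) + max_new_tokens], steps


# Toy models: the target counts upward modulo 10; the draft is wrong after a 4.
def toy_target(prefix: Sequence[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: Sequence[int]) -> int:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


tokens, steps = generate([0], toy_draft, toy_target, k=3, max_new_tokens=12)
```

With greedy verification the output is identical to what the target would produce on its own; the only thing that changes is how many target steps it takes to get there.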

When speculation helps

Speculation wins when the draft model agrees with the target often. That tends to be true for structured outputs (JSON, code, SQL), long-form generation, and multi-turn chat where the next tokens are reasonably predictable. It loses when the target model is already small (so the draft speedup is marginal), when the workload is latency-bound at batch size 1, or when the draft hit rate is low for the traffic shape.
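To get a feel for how agreement rate and draft cost trade off, here is a back-of-the-envelope estimate. It assumes each drafted token is accepted independently with the same probability, which real traffic only approximates, and it is not how the optimizer scores variants (the optimizer measures your workload). The function names and the `draft_cost_ratio` parameter are illustrative.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens per target step when each of `draft_len` drafted tokens
    is accepted independently with probability `acceptance_rate`."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Every drafted token is accepted, plus the extra token from the target pass.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def estimated_speedup(acceptance_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Rough speedup over plain decoding.

    draft_cost_ratio (assumption): cost of one draft forward pass divided by
    the cost of one target forward pass.
    """
    tokens = expected_tokens_per_step(acceptance_rate, draft_len)
    cost = draft_len * draft_cost_ratio + 1.0  # k draft passes + 1 target pass
    return tokens / cost


# Cheap draft that agrees often: speculation pays off.
good_case = estimated_speedup(acceptance_rate=0.8, draft_len=4, draft_cost_ratio=0.05)
# Small target (draft is relatively expensive) and low agreement: it does not.
bad_case = estimated_speedup(acceptance_rate=0.2, draft_len=4, draft_cost_ratio=0.3)
```

A result above 1.0 suggests a gain and a result below 1.0 suggests a loss, which matches the cases above: a high draft cost ratio (small target) or a low acceptance rate pushes the estimate under 1.0.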

Ask for it

Optimize this with speculative decoding

Or let the optimizer decide; it evaluates speculation automatically when you pick priority: throughput.

Known limitations

Applies to autoregressive models (text LLMs, vision-language, and autoregressive TTS), not embeddings, classifiers, or non-autoregressive audio.
Adds a small overhead to first-token latency on each generation. If your SLA is strict on first-token latency, measure before committing.
Availability depends on the serving backend behind your deployment. The optimizer picks compatible backends for you.
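If first-token latency matters for your SLA, one way to measure it is to time a streamed response with and without speculation. The helper below is a generic sketch: it works on any iterable of tokens and does not call a RunInfra API. `time_to_first_token` and its `clock` parameter are hypothetical names, and it assumes the request is sent when iteration starts, as with a lazy generator.

```python
import time
from typing import Callable, Iterable, Tuple


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a token stream and return (ttft_seconds, total_seconds, n_tokens)."""
    start = clock()
    first_at = None
    n_tokens = 0
    for _ in stream:
        if first_at is None:
            first_at = clock()  # timestamp of the first token's arrival
        n_tokens += 1
    end = clock()
    if first_at is None:
        raise ValueError("stream produced no tokens")
    return first_at - start, end - start, n_tokens


# Deterministic demo with a fake clock; in practice, pass your streaming
# response iterator and keep the default clock.
_ticks = iter([0.0, 0.25, 1.0])
ttft, total, n = time_to_first_token(iter(["a", "b", "c"]), clock=lambda: next(_ticks))
```

Run it over a representative sample of prompts for each variant and compare the distributions (for example p50 and p95) rather than a single request.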

Next steps

Optimization

Speculation is one technique among many the optimizer evaluates.

Instant Start

Cold-start caching, a separate concern from speculation.

Autoscaling

Replica budget, Flex vs Active.

Monitoring

Throughput after enabling speculation.
