Speculative decoding runs a small draft model alongside your main (target) model. The draft proposes several candidate tokens; the target verifies them in one forward pass. When the draft agrees with the target, you generate more than one token per target step. RunInfra’s optimizer evaluates speculation as one of its techniques during an optimization run and ships it on the winning variant when it improves throughput on your workload.
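The propose-and-verify loop described above can be sketched in a few lines. This is a minimal illustration, not RunInfra's implementation: it assumes greedy decoding, and `NextToken`, `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names invented for the example. A real serving backend scores all draft positions in one batched target pass; the sketch stands in for that with a plain loop.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Return the tokens produced by one target step (greedy verification)."""
    # 1. The draft proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The target checks each proposed position. In a real system this is
    #    a single batched forward pass; the loop here is only a stand-in.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            # First disagreement: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the same pass also yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 3) -> Tuple[List[int], int]:
    """Decode max_new_tokens tokens; also return the number of target steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


# Toy stand-ins: the target counts upward mod 10; the draft is wrong after a 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


tokens, target_steps = generate(toy_target, toy_draft, [0], max_new_tokens=20)
```

In this greedy form the output is identical to what the target would produce on its own; the only thing that changes is how many target steps it takes to get there.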
When speculation helps
Speculation wins when the draft model agrees with the target often. That tends to be true for structured outputs (JSON, code, SQL), long-form generation, and multi-turn chat where the next tokens are reasonably predictable. It loses when the target model is already small (so the draft speedup is marginal), when the workload is latency-bound at batch size 1, or when the draft hit rate is low for the traffic shape.
Ask for it
priority: throughput.
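As a rough way to reason about the "When speculation helps" conditions, the sketch below estimates how many tokens one target step yields for a given draft hit rate. It assumes an idealized model in which each draft token is accepted independently with the same probability, which real traffic only approximates; `expected_tokens_per_step`, `accept_rate`, and `k` are hypothetical names for this example, not RunInfra settings. The estimate ignores the cost of running the draft itself, so the realized throughput gain is lower.

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens per target step with k draft tokens per step.

    Assumes each draft token is accepted independently with probability
    accept_rate (an idealization). The sum 1 + a + a^2 + ... + a^k has the
    closed form (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if accept_rate == 1.0:
        # Every proposal is accepted, plus the extra token from the target.
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


# Compare a low and a high hit rate at the same draft length.
low = expected_tokens_per_step(0.3, 4)
high = expected_tokens_per_step(0.9, 4)
```

A hit rate near zero leaves you at about one token per step, which is the no-speculation baseline; that is why a low draft hit rate for your traffic shape makes speculation a net loss once draft cost is counted.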
Known limitations
- Applies to autoregressive text models only, not embeddings, classifiers, or audio.
- Adds a small overhead to first-token latency on each generation. If your SLA is strict on first-token latency, measure before committing.
- Availability depends on the serving backend behind your deployment. The optimizer picks compatible backends for you.
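To act on the first-token latency caveat above, one option is to time a streamed response with and without speculation and compare. The helper below is a generic sketch that works on any Python iterable of tokens; `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` is only a stand-in for whatever streaming client you use against your deployment.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (time_to_first_token_s, total_time_s, token_count)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            # Time from request start until the first token arrives.
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    if first is None:
        raise ValueError("stream produced no tokens")
    return first, total, count


def fake_stream(n: int, first_delay: float, step_delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client: sleeps, then yields."""
    time.sleep(first_delay)
    for i in range(n):
        if i > 0:
            time.sleep(step_delay)
        yield f"tok{i}"


ttft, total, n = measure_stream(fake_stream(5, first_delay=0.02, step_delay=0.001))
tokens_per_second = n / total
```

Run the same measurement on representative prompts against both variants, and look at the distribution of first-token times rather than a single sample before deciding.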
Next steps
Optimization
Speculation is one technique among many the optimizer evaluates.
Instant Start
Cold-start caching, orthogonal concern.
Autoscaling
Replica budget, Flex vs Active.
Monitoring
Throughput after enabling speculation.