Instant Start is RunInfra’s weight-caching layer. When a Flex replica spins up from zero, the weights are already staged in a regional cache so the replica becomes ready in under 2 seconds for most pipelines instead of pulling multi-gigabyte model files from Hugging Face or S3 each time. It is automatic on Pro+ deployments and runs underneath every Flex replica. You never configure it; you only see the cold-start time it produces.
What a cold start is actually doing
A cold start is the time between a new request arriving at the load balancer and the replica returning its first token. With Instant Start, that time breaks down into these steps:

| Step | Typical time | Notes |
|---|---|---|
| Allocate GPU + pull container image | 400-800 ms | Image is small (serving runtime only, no weights). |
| Pull weights from regional cache | 200-1500 ms | Depends on model size; weights stream off the cache to the GPU. |
| Load runtime (vLLM / SGLang / TRT-LLM) | 150-400 ms | Engine init, KV cache allocation. |
| Compile / warm forward pass | 100-300 ms | One short forward pass to JIT-warm CUDA graphs. |
| Total cold start | ~1-3 s | Most pipelines land in the lower half. |
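
As a rough check on the total, the per-step ranges in the table can be summed. The sketch below assumes the four steps run strictly one after another, which the table implies but does not state; the dictionary keys are illustrative names, not RunInfra identifiers.

```python
# Per-step cold-start ranges in milliseconds, copied from the table above.
# The key names are made up for this sketch.
STEPS_MS = {
    "allocate_gpu_and_pull_image": (400, 800),
    "pull_weights_from_regional_cache": (200, 1500),
    "load_runtime": (150, 400),
    "warm_forward_pass": (100, 300),
}


def cold_start_bounds_ms(steps):
    """Return (best_case, worst_case), assuming the steps run sequentially."""
    best = sum(low for low, _ in steps.values())
    worst = sum(high for _, high in steps.values())
    return best, worst


def weight_pull_share(steps, key="pull_weights_from_regional_cache"):
    """Fraction of the worst-case total that is spent pulling weights."""
    _, worst = cold_start_bounds_ms(steps)
    return steps[key][1] / worst


best_ms, worst_ms = cold_start_bounds_ms(STEPS_MS)
print(f"cold start: {best_ms} ms to {worst_ms} ms")
print(f"weight pull share of worst case: {weight_pull_share(STEPS_MS):.0%}")
```

Under that assumption the bounds come out at 850 ms and 3000 ms, which is consistent with the ~1-3 s total. The weight pull has the widest range of any step, so model size is the main driver of where a pipeline lands within it.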
The cache architecture
- The cache is regional. A pipeline deployed in `IAD` has its weights staged in `IAD`. A new replica in `IAD` pulls from local NVMe over the rack network. If you redeploy to a different region, that region’s cache warms from scratch.
- The cache is per-pipeline, not per-model. If two pipelines use the same base model, they each warm their own slot in the cache. Storage is cheap; this avoids cross-tenant cache poisoning.
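
One way to picture the two scoping rules is as a cache key made of the region and the pipeline id, with the base model deliberately left out. This is a hypothetical illustration, not RunInfra's actual key format; `CacheKey`, `cache_key`, and the example ids are all invented for the sketch.

```python
from typing import NamedTuple


class CacheKey(NamedTuple):
    """Hypothetical cache slot identity: one slot per region and pipeline."""
    region: str
    pipeline_id: str


def cache_key(region: str, pipeline_id: str, base_model: str) -> CacheKey:
    # base_model is intentionally ignored: slots are per-pipeline, not
    # per-model, so two pipelines never share a slot.
    return CacheKey(region=region, pipeline_id=pipeline_id)


# Two pipelines on the same base model in the same region get separate slots.
a = cache_key("IAD", "pipeline-a", "llama-3.1-70b")
b = cache_key("IAD", "pipeline-b", "llama-3.1-70b")
print(a != b)

# The same pipeline redeployed to another region (region code is made up)
# gets a new slot that has to warm from scratch.
c = cache_key("REGION-2", "pipeline-a", "llama-3.1-70b")
print(a != c)
```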
Eviction
The cache evicts on a least-recently-used policy:

- Hot pipelines (any request in the last ~48 hours) stay warm indefinitely.
- Cold pipelines (no requests for several days) eventually evict. The next cold start pays the full warm-up to repopulate.
- Custom-uploaded models evict on the same schedule. Re-warming pulls from the S3 bucket the upload was staged in.
A `cache: warm` chip on the deployment card means the next cold start takes the fast path.
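
For readers unfamiliar with least-recently-used eviction, here is a minimal toy version. It is a sketch under assumptions: the real eviction trigger (capacity, idle time, or both) is not documented here, so this version uses a fixed slot count, and `LRUWeightCache` is a made-up name.

```python
from collections import OrderedDict


class LRUWeightCache:
    """Toy least-recently-used cache keyed by pipeline id (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # assumed fixed slot count
        self._slots = OrderedDict()  # oldest entry first, newest last

    def is_warm(self, pipeline_id: str) -> bool:
        return pipeline_id in self._slots

    def touch(self, pipeline_id: str) -> bool:
        """Record a request. Return True if the weights were already staged."""
        if pipeline_id in self._slots:
            # A request makes this pipeline the most recently used.
            self._slots.move_to_end(pipeline_id)
            return True
        # Miss: stage the weights, then evict the least recently used slot
        # if the cache is over capacity.
        self._slots[pipeline_id] = "staged-weights"
        if len(self._slots) > self.capacity:
            self._slots.popitem(last=False)
        return False


cache = LRUWeightCache(capacity=2)
cache.touch("a")
cache.touch("b")
cache.touch("a")  # "a" is now more recent than "b"
cache.touch("c")  # over capacity: "b" is evicted
print(cache.is_warm("a"), cache.is_warm("b"), cache.is_warm("c"))
```

The pipeline that keeps receiving requests stays staged, while the one that went quiet is the first to be evicted, which matches the hot and cold behavior described above.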
Multi-GPU pipelines
For pipelines that need tensor parallelism across multiple GPUs (large models like Llama 3.1 70B on 2x H100), Instant Start stages weight shards in parallel. The shards land on each GPU directly via DMA off the regional NVMe, so the multi-GPU spin-up is bounded by the slowest shard, not the sum of all shards.

For pipelines with multiple distinct models on one GPU (voice agent: ASR + LLM + TTS), all three models are staged in one cache entry tied to the pipeline id. They co-load on the same spin-up.
When cold starts still bite
Even with Instant Start, the first request after a replica has gone fully idle still pays the cold-start cost listed above. That’s 1-3 seconds, which is fine for most APIs but not for sub-200 ms SLAs. If the cold-start cost is unacceptable:

- Switch to Active mode (Team+). Replicas stay warm 24/7 so no request ever pays a cold start. See Autoscaling.
- Raise `min_replicas` to 1. This keeps one replica warm even in Flex mode. You pay for one constant replica, but cold starts happen only on scale-out, not on scale-up from zero.
- Schedule warm pings. A request every 30 seconds from a cheap cron keeps the current replica from going idle. This is hacky; prefer the two options above.
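
If you do fall back to warm pings, the loop itself is small. The sketch below is one possible shape, not a RunInfra tool: `keep_warm` and `send_ping` are hypothetical names, and `send_ping` stands in for whatever cheap request you send to your own deployment endpoint (the endpoint and payload are not specified here). The 30-second default comes from the bullet above.

```python
import time


def keep_warm(send_ping, interval_s=30.0, max_pings=None, sleep=time.sleep):
    """Call send_ping() every interval_s seconds.

    send_ping is a caller-supplied function that issues one cheap request.
    max_pings=None runs forever; a number stops after that many attempts.
    Returns the count of pings that did not raise.
    """
    sent = 0
    succeeded = 0
    while max_pings is None or sent < max_pings:
        try:
            send_ping()
            succeeded += 1
        except Exception as exc:
            # A failed ping should not kill the loop; log it and try again later.
            print(f"warm ping failed: {exc}")
        sent += 1
        if max_pings is None or sent < max_pings:
            sleep(interval_s)
    return succeeded


# Demo with a stand-in ping and a no-op sleep so it finishes immediately.
calls = []
ok = keep_warm(lambda: calls.append("ping"), max_pings=3, sleep=lambda s: None)
print(ok, len(calls))
```

Passing `sleep` and `send_ping` in as arguments keeps the loop testable without a network or a real 30-second wait. In production you would leave `sleep` at its default and have `send_ping` issue the smallest request your pipeline accepts.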
Known limitations
- The very first deploy of a new pipeline always pays the full warm-up (one-time pull from Hugging Face or S3 into the regional cache).
- Custom model uploads take proportionally longer to first-warm because of the upload size.
- Cache state is per region. If your traffic moves regions, the new region warms cold.
- The cache does not warm proactively. A new deploy that gets zero traffic stays cold-cached but unverified until the first real request lands.
Next steps
- Autoscaling: Replica count, min/max, and scale-up math.
- Speculation: Steady-state throughput, not cold starts.
- Monitoring: Cold-start time per deployment.
- Deployments overview: How deployments fit together.