Instant Start - RunInfra

Instant Start is RunInfra’s weight-caching layer. When a Flex replica spins up from zero, its weights are already staged in a regional cache, so for most pipelines the replica becomes ready in under 2 seconds instead of pulling multi-gigabyte model files from Hugging Face or S3 each time. It is automatic on Core deployments and runs underneath every Flex replica. You never configure it; you only see the cold-start time it produces.

What a cold start is actually doing

A cold start is the time between a new request arriving at the load balancer and the replica returning its first token. With Instant Start, that time breaks down into these steps:

Step	Typical time	Notes
Allocate GPU + pull container image	400-800 ms	Image is small (serving runtime only, no weights).
Pull weights from regional cache	200-1500 ms	Depends on model size; weights stream off the cache to the GPU.
Load runtime (vLLM / SGLang / TRT-LLM)	150-400 ms	Engine init, KV cache allocation.
Compile / warm forward pass	100-300 ms	One short forward pass to JIT-warm CUDA graphs.
Total cold start	~1-3 s	Most pipelines land in the lower half.

Without Instant Start, “pull weights” would be 30-120 seconds (downloading from Hugging Face or S3 to the host). Instant Start replaces that with a regional cache pull off NVMe-backed storage, which is the headline win.
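As a quick check on the arithmetic, the sketch below sums the per-step ranges from the table and compares them with the uncached case. The numbers are copied from this page; the dictionary keys and function names are illustrative only and are not part of any RunInfra API.

```python
# Step ranges in milliseconds, copied from the table above: (low, high).
STEPS_MS = {
    "allocate_gpu_and_pull_image": (400, 800),
    "pull_weights_from_regional_cache": (200, 1500),
    "load_runtime": (150, 400),
    "warm_forward_pass": (100, 300),
}

# Range quoted in the text for an uncached weight download (30-120 s).
UNCACHED_WEIGHT_PULL_MS = (30_000, 120_000)


def total_range_ms(steps):
    """Sum per-step (low, high) ranges into one overall (low, high) range."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


def without_instant_start(steps):
    """Same steps, with the cache pull swapped for an uncached download."""
    swapped = dict(steps)
    swapped["pull_weights_from_regional_cache"] = UNCACHED_WEIGHT_PULL_MS
    return total_range_ms(swapped)


if __name__ == "__main__":
    print("with Instant Start (ms):", total_range_ms(STEPS_MS))
    print("without Instant Start (ms):", without_instant_start(STEPS_MS))
```

The cached total works out to 850-3000 ms, which matches the ~1-3 s row; with an uncached download the weight pull dominates every other step combined.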

The cache architecture

[Hugging Face / S3]
        |
        |   (one-time pull on first deploy)
        v
[Regional weight cache]  (NVMe, replicated per region)
        |
        |   (every cold start)
        v
[Replica GPU host]  ---> [GPU VRAM]

Two things to notice:

The cache is regional. A pipeline deployed in IAD has its weights staged in IAD. A new replica in IAD pulls from local NVMe over the rack network. If you redeploy to a different region, that region’s cache warms from scratch.
The cache is per-pipeline, not per-model. If two pipelines use the same base model, they each warm their own slot in the cache. Storage is cheap; this avoids cross-tenant cache poisoning.
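One way to picture the two properties above is a cache keyed by region and pipeline ID rather than by model name. This is a conceptual sketch only: `CacheKey`, `RegionalWeightCache`, the pipeline IDs, and the second region code `XYZ` are hypothetical names chosen for illustration, and nothing here describes how RunInfra actually implements the cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheKey:
    """Hypothetical key: one cache slot per (region, pipeline), not per model."""
    region: str
    pipeline_id: str


class RegionalWeightCache:
    """Toy model of the regional cache, for illustration only."""

    def __init__(self):
        self._slots = {}  # CacheKey -> name of the staged model

    def stage(self, region, pipeline_id, model_name):
        """Record that a pipeline's weights are staged in a region."""
        self._slots[CacheKey(region, pipeline_id)] = model_name

    def is_warm(self, region, pipeline_id):
        """True if this pipeline has a slot in this region."""
        return CacheKey(region, pipeline_id) in self._slots


if __name__ == "__main__":
    cache = RegionalWeightCache()
    cache.stage("IAD", "pipeline-a", "some-base-model")
    # Same base model, different pipeline: it needs its own slot.
    print(cache.is_warm("IAD", "pipeline-b"))
    # Same pipeline, different region: that region warms from scratch.
    print(cache.is_warm("XYZ", "pipeline-a"))
```

Because the model name is a value rather than part of the key, two pipelines that share a base model never share a slot, which is the isolation property the second point describes.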

Eviction

The cache evicts on a least-recently-used policy:

Hot pipelines (any request in the last ~48 hours) stay warm for as long as that remains true.
Cold pipelines (no requests for several days) eventually evict. The next cold start pays the full warm-up to repopulate.
Custom-uploaded models evict on the same schedule. Re-warming pulls from the S3 bucket the upload was staged in.
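For readers unfamiliar with least-recently-used eviction, here is a minimal sketch of the general technique. It assumes a fixed-capacity cache, which is a simplification: the time-based thresholds described above are not modeled, and `LruWeightCache` is a hypothetical class, not RunInfra code.

```python
from collections import OrderedDict


class LruWeightCache:
    """Toy fixed-capacity LRU cache keyed by pipeline ID (illustrative)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def touch(self, pipeline_id):
        """Record a request. Returns True on a cache hit, False on a miss."""
        if pipeline_id in self._entries:
            # Hit: mark this pipeline as most recently used.
            self._entries.move_to_end(pipeline_id)
            return True
        # Miss: stage the pipeline, then evict the least recently used
        # entry if the cache is now over capacity.
        self._entries[pipeline_id] = True
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
        return False

    def is_warm(self, pipeline_id):
        """True if the pipeline currently has an entry in the cache."""
        return pipeline_id in self._entries


if __name__ == "__main__":
    cache = LruWeightCache(capacity=2)
    for pid in ["a", "b", "a", "c"]:
        cache.touch(pid)
    # "b" was used least recently, so it is the one evicted.
    print({pid: cache.is_warm(pid) for pid in ["a", "b", "c"]})
```

The point to take away is that recent traffic, not deployment age, decides what stays cached: the repeated request to "a" protects it, while "b" is evicted despite being staged later.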

You can see whether a pipeline is currently cache-warm from the deployment dashboard. A cache: warm chip on the deployment card means the next cold start is the fast path.

Multi-GPU pipelines

For pipelines that need tensor parallelism across multiple GPUs (large models like Llama 3.1 70B on 2x H100), Instant Start stages weight shards in parallel. The shards land on each GPU directly via DMA off the regional NVMe, so the multi-GPU spin-up is bounded by the slowest shard, not the sum of all shards.

For pipelines with multiple distinct models on one GPU (voice agent: ASR + LLM + TTS), all three models are staged in one cache entry tied to the pipeline ID. They co-load on the same spin-up.
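The "slowest shard, not the sum" claim for tensor-parallel pipelines can be written as a one-line comparison. The shard timings below are made-up numbers for illustration, not measurements, and the function names are hypothetical.

```python
def sequential_stage_ms(shard_ms):
    """Staging time if shards were loaded one after another: the sum."""
    return sum(shard_ms)


def parallel_stage_ms(shard_ms):
    """Staging time if every shard loads concurrently: the slowest shard."""
    return max(shard_ms)


if __name__ == "__main__":
    # Hypothetical per-GPU shard load times for a 2-GPU pipeline, in ms.
    shards = [700, 900]
    print("sequential:", sequential_stage_ms(shards))
    print("parallel:  ", parallel_stage_ms(shards))
```

Under this idealized model, adding GPUs does not lengthen staging as long as no single shard gets slower; real systems may see some contention on shared storage or network links, which the sketch ignores.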

When cold starts still bite

Even with Instant Start, the first request after a pipeline has scaled to zero still pays the cold-start cost listed above. That’s 1-3 seconds, which is fine for most APIs but not for sub-200 ms SLAs. If that cost is unacceptable:

Switch to Active mode (Core). Replicas stay warm 24/7 so no request ever pays a cold start. See Autoscaling.
Raise min_replicas to 1. This keeps one replica warm even in Flex mode. You pay for one always-on replica, but cold starts then happen only on scale-out, never on scale-up from zero.
Schedule warm pings. A request every 30 seconds from a cheap cron keeps the current replica from going idle. This is hacky; prefer the two options above.
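If you do fall back to warm pings, a small script along these lines is one way to do it. This is a sketch under assumptions: the endpoint URL is a placeholder, the helper names are hypothetical, and whether a given request counts as traffic that keeps a replica from idling depends on your deployment, so point it at a cheap real request rather than relying on this example.

```python
import time
import urllib.request

# Placeholder URL (assumption): replace with a cheap request to your endpoint.
ENDPOINT = "https://example.invalid/your-pipeline"
PING_INTERVAL_S = 30  # interval suggested in the text above


def http_ping(url=ENDPOINT, timeout_s=10):
    """Send one GET request; return True if the endpoint answered with 2xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers URLError, HTTPError, and socket timeouts.
        return False


def run_warm_pings(ping, count, interval_s=PING_INTERVAL_S, sleep=time.sleep):
    """Call ping() `count` times, sleeping between calls.

    Returns the number of pings that succeeded. `ping` and `sleep` are
    injectable so the loop can be exercised without network or waiting.
    """
    succeeded = 0
    for i in range(count):
        if ping():
            succeeded += 1
        if i < count - 1:
            sleep(interval_s)
    return succeeded


if __name__ == "__main__":
    # Ten pings at 30-second spacing covers roughly five minutes.
    print(run_warm_pings(http_ping, count=10))
```

A cron job or scheduler that invokes a single `http_ping()` per run works just as well; the loop form is shown only to keep the example self-contained.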

Instant Start caches model weights, not KV cache. A cold replica always starts with an empty KV cache. If your workload has multi-turn prompt reuse (system prompts, RAG context, long conversations), keep replicas warm so the KV cache builds up instead of being rebuilt from scratch on every cold start.

Known limitations

The very first deploy of a new pipeline always pays the full warm-up (one-time pull from Hugging Face or S3 into the regional cache).
Custom model uploads take proportionally longer to first-warm because of the upload size.
Cache state is per region. If your traffic moves regions, the new region warms cold.
The cache does not warm proactively. A new deploy that gets zero traffic stays cold-cached but unverified until the first real request lands.

Next steps

Autoscaling

Replica count, min/max, and scale-up math.

Speculation

Steady-state throughput, not cold starts.

Monitoring

Cold-start time per deployment.

Deployments overview

How deployments fit together.
