RunInfra deployments autoscale. Traffic arrives, replicas spin up; traffic drops, replicas shed. You set the floor and ceiling, and RunInfra moves between them based on load.
Modes
- Flex (Pro+)
- Active (Team+)
Flex scales to zero when idle. A new replica spins up on demand when a request arrives, and Instant Start keeps the spin-up fast.
- Pay per token only when requests are being processed.
- No charges while idle.
- Best for development, bursty traffic, cost-sensitive workloads.
How replica count is decided
The scaler tracks two signals at the load balancer:
- Concurrent in-flight requests per replica (the primary signal)
- Queue depth (requests waiting because every replica is busy)

It adds a replica when either condition holds:
- Average in-flight per replica exceeds the target concurrency, OR
- Queue depth exceeds a threshold (roughly: the queue has been holding requests for more than a couple of seconds)
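Assuming the two conditions above are the triggers for adding a replica, the decision can be sketched roughly as follows. This is an illustration, not RunInfra's scaler: `LoadSnapshot`, `should_scale_out`, and the 2-second `queue_wait_threshold_s` default are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class LoadSnapshot:
    """Hypothetical view of what the load balancer reports at one instant."""
    in_flight: int          # requests currently being processed across all replicas
    replicas: int           # replicas currently serving traffic
    oldest_queued_s: float  # wait time of the oldest queued request (0 if queue is empty)


def should_scale_out(
    snap: LoadSnapshot,
    target_concurrency: float,
    max_replicas: int,
    queue_wait_threshold_s: float = 2.0,  # assumed stand-in for "a couple of seconds"
) -> bool:
    """Return True when either scale-out condition holds and there is room to grow."""
    if snap.replicas >= max_replicas:
        return False  # at the ceiling: requests queue instead of adding replicas
    if snap.replicas == 0:
        # Scaled to zero: any waiting request needs a first replica.
        return snap.oldest_queued_s > 0
    avg_in_flight = snap.in_flight / snap.replicas
    return (
        avg_in_flight > target_concurrency
        or snap.oldest_queued_s > queue_wait_threshold_s
    )
```

The two conditions are joined with `or`, so a long-waiting queue can trigger scale-out even when average concurrency still looks healthy.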
Queue depth + 503
If traffic spikes past your `max_replicas` ceiling, new requests queue. The queue is short (about 2 seconds of headroom by default). After that, requests start returning 503 Service Unavailable with a Retry-After header.
If you see 503s:
- Raise `max_replicas` if the spike is real and you want to serve it.
- Lower `max_replicas` if you want a hard ceiling on cost during a runaway prompt loop.
- Smooth traffic upstream with a queue (Vercel Queues, SQS, your own buffer) if the spike is brief.
Sustained 503s under ordinary load usually mean `max_replicas` is too low.
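On the client side, a 503 with a Retry-After header is a signal to wait and try again. A minimal sketch of that pattern follows, assuming the caller supplies a `send` function that returns `(status, headers, body)`. `retry_delay_seconds`, `call_with_retry`, the 30-second cap, and the 0.5-second backoff base are illustrative choices, not part of any RunInfra SDK.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def retry_delay_seconds(retry_after: Optional[str], attempt: int, cap_s: float = 30.0) -> float:
    """Honor Retry-After when it parses; otherwise fall back to exponential backoff."""
    if retry_after:
        value = retry_after.strip()
        if value.isdigit():  # delta-seconds form, e.g. "2"
            return min(float(value), cap_s)
        try:  # HTTP-date form
            when = parsedate_to_datetime(value)
            wait = (when - datetime.now(timezone.utc)).total_seconds()
            return min(max(wait, 0.0), cap_s)
        except (TypeError, ValueError):
            pass  # unparseable header: use backoff below
    return min(0.5 * (2 ** attempt), cap_s)


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns something other than 503 or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 503 or attempt == max_attempts - 1:
            return status, headers, body
        # Assumes the headers mapping uses the canonical "Retry-After" spelling.
        sleep(retry_delay_seconds(headers.get("Retry-After"), attempt))
    raise ValueError("max_attempts must be at least 1")
```

Keeping `max_attempts` small matters here: unbounded retries against a deployment that is already at its ceiling only lengthen the queue.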
Cost vs latency math
The trade-off behind autoscaling is simple but worth knowing.

| Knob | Cost impact | Latency impact |
|---|---|---|
| Raise `min_replicas` from 0 to 1 (Flex) | +1 replica-hour cost, always | Eliminates cold start on first request |
| Switch Flex to Active | +base fee per warm replica-hour | Eliminates cold start completely |
| Raise `max_replicas` | More headroom during spikes | Tail latency drops; no 503s on bursts |
| Higher target concurrency | Fewer replicas needed at same QPS | Per-request latency rises (batching) |
- `min_replicas: 0`, `max_replicas: 4` is reasonable. Cold start happens on the very first request of a quiet period; bursts scale to 4 and queue from there.
- `min_replicas: 1`, `max_replicas: 4` removes the cold start at a cost of one warm replica 24/7.
- Active mode with `min_replicas: 2`, `max_replicas: 6` removes cold starts entirely and gives more burst headroom; you pay for 2 warm replicas.
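To put rough numbers on these trade-offs, Little's law (average in-flight requests = arrival rate × average latency) gives a back-of-the-envelope replica count, and the warm floor is plain multiplication. The helper names below are hypothetical, and pricing is deliberately left out because it depends on your plan.

```python
import math


def replicas_needed(qps: float, avg_latency_s: float, target_concurrency: float) -> int:
    """Estimate replicas from Little's law: in-flight = arrival rate x latency."""
    in_flight = qps * avg_latency_s
    return math.ceil(in_flight / target_concurrency)


def warm_replica_hours(min_replicas: int, hours: float = 24 * 30) -> float:
    """Replica-hours held warm by the floor over a period (default: a 30-day month)."""
    return min_replicas * hours


# 20 requests/s at 1.5 s average latency keeps about 30 requests in flight.
print(replicas_needed(20, 1.5, target_concurrency=8))   # 4
print(replicas_needed(20, 1.5, target_concurrency=16))  # 2
print(warm_replica_hours(1))                            # 720
```

The estimate treats latency as fixed. In practice, raising target concurrency also raises per-request latency, so the real saving is smaller than the arithmetic suggests.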
Knobs
- Mode: Flex scales to zero when idle. Active keeps a warm floor that never drops below `min_replicas`.
- `min_replicas`: The floor. Flex default is 0, Active default is at least 1. Raising this cuts cold starts but costs constant GPU time.
- `max_replicas`: The ceiling. Caps spend during a spike and keeps a bad prompt loop from scaling a deployment into the triple digits. Traffic beyond what the ceiling can serve queues briefly, then gets a hard 503.
- Target concurrency: Per-replica in-flight target. Set automatically based on the optimization run. Override only if you know your workload tolerates higher batching.
- Scale-down delay: How long average in-flight must stay below target before scaling down. Defaults: Flex 30 s, Active 120 s.
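The scale-down rule described above (load must stay below target for a full window before a replica is shed) can be sketched as a small timer. This is an assumed model of the behavior, not RunInfra's implementation; `ScaleDownTimer` and the dictionary keys are hypothetical, and only the 30 s and 120 s defaults come from the text.

```python
from typing import Optional

# Defaults quoted above; the keys are illustrative labels, not API values.
SCALE_DOWN_DELAY_S = {"flex": 30.0, "active": 120.0}


class ScaleDownTimer:
    """Tracks how long average in-flight has stayed below the target."""

    def __init__(self, delay_s: float) -> None:
        self.delay_s = delay_s
        self._below_since: Optional[float] = None

    def observe(self, now_s: float, avg_in_flight: float, target_concurrency: float) -> bool:
        """Record one sample; return True once load has been below target for delay_s."""
        if avg_in_flight >= target_concurrency:
            self._below_since = None  # any busy sample resets the clock
            return False
        if self._below_since is None:
            self._below_since = now_s
        return now_s - self._below_since >= self.delay_s
```

A caller would still clamp the result so the replica count never drops below `min_replicas`. The longer Active default trades a little extra warm time for fewer scale-down and scale-up cycles.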
Best practices
- Start with Flex unless you know you need Active. Flex handles 80%+ of workloads correctly.
- Set `max_replicas` deliberately. It is your spend cap. Pick a number that matches the spike you actually want to serve.
- Don’t set `min_replicas` higher than the average steady-state replica count. That’s just wasted warm capacity.
- Watch queue depth, not just latency. Queue depth is the leading indicator; latency is the lagging indicator.
- Re-optimize after a major traffic-pattern change. The target concurrency the optimizer picked for 1k requests/day is not the right one for 1M requests/day.
Known limitations
- Active mode requires Team plan or higher.
- Scale-out has a cold-start cost per new replica. Instant Start minimizes it; see Instant Start.
- Scale-to-zero requires a recent idle window. Traffic bursts close together keep the current replica warm.
- `max_replicas` is a hard ceiling per workspace, not per region. Cross-region scale-out requires explicit configuration.
Next steps
- Instant Start: How cold starts stay fast on Flex.
- Speculation: Speculative decoding for throughput.
- Deployments overview: The shape of a RunInfra deployment.
- Monitoring: Watch latency, replica count, and queue depth.