Autoscaling - RunInfra

RunInfra deployments autoscale. Traffic arrives, replicas spin up; traffic drops, replicas shed. You set the floor and ceiling, and RunInfra moves between them based on load.

Modes

Flex (Core)

Scales to zero when idle. A new replica spins up on demand when a request arrives. Instant Start keeps the spin-up fast.

Pay per token only when requests are being processed.
No charges while idle.
Best for development, bursty traffic, cost-sensitive workloads.

Active (Core)

Keeps a warm floor that never drops below min_replicas, so requests do not wait on a cold start. You pay a base fee per warm replica-hour.

How replica count is decided

The scaler tracks two signals at the load balancer:

Concurrent in-flight requests per replica (the primary signal)
Queue depth (requests waiting because every replica is busy)

It scales up when either:

Average in-flight per replica exceeds the target concurrency, OR
Queue depth exceeds a threshold (roughly, when requests have been waiting in the queue for more than a couple of seconds)

It scales down when in-flight has been below the target for a sustained window (~30 seconds for Flex, ~2 minutes for Active so you don’t churn during normal traffic dips).
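The scale-up and scale-down rules above can be sketched as a small decision function. This is only an illustration, not RunInfra's scaler: the names `ScalerState` and `decide`, the use of the oldest queued request's age as the queue signal, and the 2-second limit are assumptions chosen to match the rough figures in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalerState:
    # Timestamp (seconds) when load first dropped below target; None if it has not.
    below_target_since: Optional[float] = None


def decide(avg_in_flight: float, target: float, oldest_queued_age_s: float,
           now: float, state: ScalerState,
           queue_age_limit_s: float = 2.0,      # assumed "couple of seconds"
           scale_down_window_s: float = 30.0    # ~30 s for Flex, ~120 s for Active
           ) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' for one evaluation tick."""
    # Scale up when either signal fires.
    if avg_in_flight > target or oldest_queued_age_s > queue_age_limit_s:
        state.below_target_since = None
        return "scale_up"
    # Scale down only after load has stayed below target for the whole window.
    if avg_in_flight < target:
        if state.below_target_since is None:
            state.below_target_since = now
        if now - state.below_target_since >= scale_down_window_s:
            return "scale_down"
        return "hold"
    # Exactly at target: nothing to do, and the below-target timer resets.
    state.below_target_since = None
    return "hold"
```

The longer window for Active corresponds to passing `scale_down_window_s=120.0`, which makes the function hold through short dips instead of shedding replicas.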

target_concurrency_per_replica = workload-specific
                                 (chat: ~32, embeddings: ~256, image-gen: ~1-4)

needed_replicas = ceil( total_in_flight / target_concurrency_per_replica )

clamped_replicas = clamp(needed_replicas, min_replicas, max_replicas)

The target concurrency is set automatically based on what the optimization run found. Embedding pipelines tolerate huge batches (256+ in-flight per L40S). Chat LLMs do better with smaller batches (16-32 per L40S). Image generation runs ~1-4 per GPU depending on resolution.
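The replica formula above translates directly into a few lines of Python. This is a minimal sketch of the arithmetic only; the function name `clamped_replicas` is hypothetical and the example numbers are illustrative.

```python
import math


def clamped_replicas(total_in_flight: int, target_concurrency: int,
                     min_replicas: int, max_replicas: int) -> int:
    """needed = ceil(in_flight / target), then clamp to [min, max]."""
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    needed = math.ceil(total_in_flight / target_concurrency)
    return max(min_replicas, min(needed, max_replicas))


# Chat-style workload with a target of 32 in-flight requests per replica:
print(clamped_replicas(70, 32, 0, 10))    # 70 / 32 = 2.19, rounded up to 3
print(clamped_replicas(1000, 32, 0, 4))   # demand exceeds the ceiling, so 4
print(clamped_replicas(0, 32, 0, 4))      # idle Flex deployment, so 0
```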

Queue depth + 503

If traffic spikes past your max_replicas ceiling, new requests queue. The queue is short (about 2 seconds of headroom by default). After that, requests start returning 503 Service Unavailable with a Retry-After header. If you see 503s:

Raise max_replicas if the spike is real and you want to serve it.
Lower max_replicas if you want a hard ceiling on cost during a runaway prompt loop.
Smooth traffic upstream with a queue (Vercel Queues, SQS, your own buffer) if the spike is brief.

The queue depth is visible in the Monitoring dashboard alongside replica count and latency. If queue depth is consistently non-zero, your max_replicas is too low.
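On the client side, a 503 with Retry-After is usually worth one or more delayed retries. The helper below is a sketch of how a caller might turn that header into a wait time; `retry_delay_seconds`, the 30-second cap, and the exponential fallback are assumptions, not part of any RunInfra SDK. It accepts both header forms that HTTP allows (a number of seconds or an HTTP date).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay_seconds(retry_after: Optional[str], attempt: int,
                        now: Optional[datetime] = None,
                        cap: float = 30.0) -> float:
    """Seconds to wait before retrying a 503, honoring Retry-After if usable."""
    # Fallback when the header is missing or unparseable: 0.5 s, 1 s, 2 s, ...
    fallback = min(cap, 0.5 * (2 ** attempt))
    if not retry_after:
        return fallback
    value = retry_after.strip()
    if value.isdigit():                      # delay-seconds form, e.g. "2"
        return min(cap, float(value))
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return fallback
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return min(cap, max(0.0, (when - now).total_seconds()))
```

A caller would sleep for the returned value and resend, giving up after a bounded number of attempts so that retries do not themselves become a runaway loop.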

Cost vs latency math

The trade-off behind autoscaling is simple but worth knowing.

Knob	Cost impact	Latency impact
Raise `min_replicas` from 0 to 1 (Flex)	One warm replica billed around the clock	Eliminates cold start on first request
Switch Flex to Active	+base fee per warm replica-hour	Eliminates cold start completely
Raise `max_replicas`	Higher possible spend during spikes	More headroom: tail latency drops and bursts are less likely to hit 503s
Higher target concurrency	Fewer replicas needed at same QPS	Per-request latency rises (batching)

For a Flex pipeline at 5 RPS average with bursts to 30 RPS:

min_replicas: 0, max_replicas: 4 is reasonable. Cold start happens on the very first request of a quiet period; bursts scale to 4 and queue from there.
min_replicas: 1, max_replicas: 4 removes the cold start at a cost of one warm replica 24/7.
Active mode with min_replicas: 2, max_replicas: 6 removes cold starts entirely and gives more burst headroom; you pay for 2 warm replicas.
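To sanity-check a ceiling like the ones above, request rate can be converted to in-flight requests with Little's law (in-flight = arrival rate x mean latency) and then divided by the target concurrency. The latency and target values below are assumptions picked for illustration; substitute the numbers from your own Monitoring dashboard.

```python
import math


def replicas_for_rps(rps: float, mean_latency_s: float,
                     target_concurrency: int) -> int:
    """Estimate replicas needed at a given request rate via Little's law."""
    in_flight = rps * mean_latency_s          # average concurrent requests
    return math.ceil(in_flight / target_concurrency)


# Assumed for illustration: 4 s mean latency, target concurrency of 32.
ASSUMED_LATENCY_S = 4.0
ASSUMED_TARGET = 32

print(replicas_for_rps(5, ASSUMED_LATENCY_S, ASSUMED_TARGET))    # 20 in flight, 1 replica
print(replicas_for_rps(30, ASSUMED_LATENCY_S, ASSUMED_TARGET))   # 120 in flight, 4 replicas
```

Under these assumed values the 30 RPS burst needs exactly 4 replicas, so a ceiling of 4 leaves no slack; a slower workload or a lower target concurrency would push the burst into the queue.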

Knobs

mode

'flex' | 'active'

default:"flex"

Flex scales to zero when idle. Active keeps a warm floor that never drops below min_replicas.

min_replicas

number

The floor. Flex default is 0, Active default is at least 1. Raising this cuts cold starts but costs constant GPU time.

max_replicas

number

The ceiling. Caps spend during a spike and keeps a bad prompt loop from scaling a deployment into the triple digits. Once every replica up to this ceiling is busy and the short queue fills, further requests return 503.

target_concurrency

number

Per-replica in-flight target. Set automatically based on the optimization run. Override only if you know your workload tolerates higher batching.

scale_down_window_seconds

number

How long average in-flight must stay below target before scaling down. Defaults: Flex 30 s, Active 120 s.
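The knobs and their defaults can be summarized as a small resolver that fills in mode-dependent defaults and rejects inconsistent values. This is a sketch using a plain dict, not RunInfra's actual configuration format: the function name `resolve_autoscaling` and the choice to require an explicit `max_replicas` are assumptions.

```python
def resolve_autoscaling(config: dict) -> dict:
    """Fill mode-dependent defaults and validate the autoscaling knobs."""
    mode = config.get("mode", "flex")
    if mode not in ("flex", "active"):
        raise ValueError("mode must be 'flex' or 'active'")

    # Defaults described above: Flex floor 0 / 30 s, Active floor 1 / 120 s.
    default_min = 0 if mode == "flex" else 1
    default_window = 30 if mode == "flex" else 120

    min_replicas = config.get("min_replicas", default_min)
    if "max_replicas" not in config:
        # Assumption: force an explicit ceiling, since it acts as the spend cap.
        raise ValueError("max_replicas must be set explicitly")
    max_replicas = config["max_replicas"]

    if mode == "active" and min_replicas < 1:
        raise ValueError("Active mode needs min_replicas of at least 1")
    if not 0 <= min_replicas <= max_replicas:
        raise ValueError("require 0 <= min_replicas <= max_replicas")

    return {
        "mode": mode,
        "min_replicas": min_replicas,
        "max_replicas": max_replicas,
        # None means "keep the value chosen by the optimization run".
        "target_concurrency": config.get("target_concurrency"),
        "scale_down_window_seconds": config.get(
            "scale_down_window_seconds", default_window),
    }
```

For example, `resolve_autoscaling({"mode": "active", "max_replicas": 6})` yields a floor of 1 and a 120-second scale-down window without either being spelled out.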

Sizing persistence

Worker counts requested at deploy time persist across restart, start, and GPU changes. An explicit count on the request always wins; otherwise the deployment keeps its original sizing; otherwise plan defaults apply. Worker counts are capped at 32 per deployment, and plan or operational caps can be lower; see the replica budgets in the deployments overview.
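The precedence order reads naturally as a fallback chain. The sketch below is illustrative only: `resolve_worker_count` is a hypothetical name, and clamping an over-cap request (rather than rejecting it) is an assumption, since the text does not say which happens.

```python
from typing import Optional

HARD_CAP = 32  # per-deployment cap stated above


def resolve_worker_count(requested: Optional[int],
                         persisted: Optional[int],
                         plan_default: int,
                         plan_cap: Optional[int] = None) -> int:
    """Explicit request wins, then original sizing, then the plan default."""
    if requested is not None:
        count = requested
    elif persisted is not None:
        count = persisted
    else:
        count = plan_default
    # Plan or operational caps may be lower than the hard cap.
    cap = HARD_CAP if plan_cap is None else min(HARD_CAP, plan_cap)
    return min(count, cap)  # assumption: clamp instead of rejecting


print(resolve_worker_count(8, 4, 2))          # explicit request wins: 8
print(resolve_worker_count(None, 4, 2))       # falls back to original sizing: 4
print(resolve_worker_count(None, None, 2))    # falls back to plan default: 2
```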

Best practices

Start with Flex unless you know you need Active. Flex handles 80%+ of workloads correctly.
Set max_replicas deliberately. It is your spend cap. Pick a number that matches the spike you actually want to serve.
Don’t set min_replicas higher than the average steady-state replica count. That’s just wasted warm capacity.
Watch queue depth, not just latency. Queue depth is the leading indicator; latency is the lagging indicator.
Re-optimize after a major traffic-pattern change. The target concurrency the optimizer picked for 1k requests/day is not the right one for 1M requests/day.

Known limitations

Active mode requires a paid Core plan.
Scale-out has a cold-start cost per new replica. Instant Start minimizes it; see Instant Start.
Scale-to-zero only happens after a sustained idle window. Traffic bursts that arrive close together keep the current replica warm.
max_replicas is a hard ceiling per workspace, not per region. Cross-region scale-out requires explicit configuration.

Next steps

Instant Start

How cold starts stay fast on Flex.

Speculation

Speculative decoding for throughput.

Deployments overview

The shape of a RunInfra deployment.

Monitoring

Watch latency, replica count, and queue depth.

