RunInfra deployments autoscale. Traffic arrives, replicas spin up; traffic drops, replicas shed. You set the floor and ceiling, and RunInfra moves between them based on load.

Modes

Flex (the default mode) scales to zero when idle. A new replica spins up on demand when a request arrives, and Instant Start keeps the spin-up fast.
  • Pay per token only when requests are being processed.
  • No charges while idle.
  • Best for development, bursty traffic, cost-sensitive workloads.
Active keeps a warm floor: the replica count never drops below min_replicas, so requests do not hit a cold start. It adds a base fee per warm replica-hour and requires the Team plan or higher.

How replica count is decided

The scaler tracks two signals at the load balancer:
  1. Concurrent in-flight requests per replica (the primary signal)
  2. Queue depth (requests waiting because every replica is busy)
It scales up when either:
  • Average in-flight per replica exceeds the target concurrency, OR
  • Queue depth exceeds a threshold (roughly: queue holding requests for more than a couple of seconds)
It scales down when in-flight has stayed below the target for a sustained window: about 30 seconds for Flex and about 2 minutes for Active. The longer Active window keeps replicas from churning during normal traffic dips.
target_concurrency_per_replica = workload-specific
                                 (chat: ~32, embeddings: ~256, image-gen: ~1-4)

needed_replicas = ceil( total_in_flight / target_concurrency_per_replica )

clamped_replicas = clamp(needed_replicas, min_replicas, max_replicas)
The target concurrency is set automatically based on what the optimization run found. Embedding pipelines tolerate huge batches (256+ in-flight per L40S). Chat LLMs do better with smaller batches (16-32 per L40S). Image generation runs ~1-4 per GPU depending on resolution.
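The rules above can be sketched in a few lines of Python. This is an illustration of the documented formula and triggers, not RunInfra's scaler code: the function names are hypothetical, and `queue_threshold_seconds=2.0` is an assumed stand-in for "a couple of seconds".

```python
import math

# Scale-down windows from the text above (seconds).
SCALE_DOWN_WINDOW = {"flex": 30.0, "active": 120.0}


def needed_replicas(total_in_flight: int, target_concurrency: int,
                    min_replicas: int, max_replicas: int) -> int:
    """ceil(total_in_flight / target), clamped to [min_replicas, max_replicas]."""
    needed = math.ceil(total_in_flight / target_concurrency)
    return max(min_replicas, min(needed, max_replicas))


def should_scale_up(total_in_flight: int, replicas: int, target_concurrency: int,
                    oldest_queued_seconds: float,
                    queue_threshold_seconds: float = 2.0) -> bool:
    """True when either scale-up signal fires."""
    if replicas == 0:
        # Scaled to zero: any arriving request needs a replica.
        return total_in_flight > 0 or oldest_queued_seconds > 0
    over_target = total_in_flight / replicas > target_concurrency
    queue_backed_up = oldest_queued_seconds > queue_threshold_seconds
    return over_target or queue_backed_up


def should_scale_down(seconds_below_target: float, mode: str) -> bool:
    """True once in-flight has stayed below target for the mode's window."""
    return seconds_below_target >= SCALE_DOWN_WINDOW[mode]


# Chat workload, target concurrency 32, Flex with max_replicas 4:
print(needed_replicas(100, 32, 0, 4))  # ceil(3.125) = 4
print(needed_replicas(200, 32, 0, 4))  # ceil(6.25) = 7, clamped to 4
```

Note that the second call is the case where requests start to queue: the formula asks for 7 replicas but the ceiling holds the deployment at 4.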

Queue depth + 503

If traffic spikes past your max_replicas ceiling, new requests queue. The queue is short (about 2 seconds of headroom by default). After that, requests start returning 503 Service Unavailable with a Retry-After header. If you see 503s:
  • Raise max_replicas if the spike is real and you want to serve it.
  • Keep max_replicas where it is, or lower it, if the 503s come from a runaway prompt loop and you want a hard ceiling on cost.
  • Smooth traffic upstream with a queue (Vercel Queues, SQS, your own buffer) if the spike is brief.
The queue depth is visible in the Monitoring dashboard alongside replica count and latency. If queue depth is consistently non-zero, your max_replicas is too low.
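On the client side, a 503 with a Retry-After header is a signal to wait and try again. A minimal sketch of that loop, assuming a hypothetical `send` callable that performs one request and returns `(status, headers, body)`; it is not part of any RunInfra SDK, and the 1-second fallback delay is an arbitrary choice:

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Parse Retry-After as delay-seconds or an HTTP date; fall back to default."""
    value = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


def call_with_retry(send: Callable[[], Response], max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it returns a non-503 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = send()
    for _ in range(max_attempts - 1):
        if response[0] != 503:
            break
        sleep(retry_after_seconds(response[1]))
        response = send()
    return response
```

Capping the attempts matters here: if the 503s come from a deliberate max_replicas ceiling, unbounded retries only add to the load that triggered them.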

Cost vs latency math

The trade-off behind autoscaling is simple but worth knowing.
| Knob | Cost impact | Latency impact |
| --- | --- | --- |
| Raise min_replicas from 0 to 1 (Flex) | +1 replica-hour cost, always | Eliminates cold start on first request |
| Switch Flex to Active | +base fee per warm replica-hour | Eliminates cold start completely |
| Raise max_replicas | More headroom during spikes | Tail latency drops; no 503s on bursts |
| Higher target concurrency | Fewer replicas needed at same QPS | Per-request latency rises (batching) |
For a Flex pipeline at 5 RPS average with bursts to 30 RPS:
  • min_replicas: 0, max_replicas: 4 is reasonable. A cold start happens on the first request after a quiet period; bursts scale up to 4 replicas, and anything beyond that queues.
  • min_replicas: 1, max_replicas: 4 removes the cold start at a cost of one warm replica 24/7.
  • Active mode with min_replicas: 2, max_replicas: 6 removes cold starts entirely and gives more burst headroom; you pay for 2 warm replicas.
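The example above is stated in requests per second, while the scaler works on in-flight requests. Little's law (in-flight = arrival rate × average latency) connects the two. A rough sketch, assuming a 4-second average request latency and a target concurrency of 32; both numbers are illustrative assumptions, not RunInfra measurements, and the helper names are hypothetical:

```python
import math


def in_flight(rps: float, latency_seconds: float) -> float:
    """Little's law: average concurrent requests = arrival rate * latency."""
    return rps * latency_seconds


def replicas_for(rps: float, latency_seconds: float, target_concurrency: int) -> int:
    """Replicas needed to keep in-flight per replica at or below the target."""
    return math.ceil(in_flight(rps, latency_seconds) / target_concurrency)


def warm_replica_hours_per_month(min_replicas: int, days: int = 30) -> int:
    """Replica-hours billed for the warm floor alone, regardless of traffic."""
    return min_replicas * 24 * days


ASSUMED_LATENCY_S = 4.0   # assumption for illustration
ASSUMED_TARGET = 32       # assumption: chat-style workload

print(replicas_for(5, ASSUMED_LATENCY_S, ASSUMED_TARGET))    # 20 in-flight -> 1
print(replicas_for(30, ASSUMED_LATENCY_S, ASSUMED_TARGET))   # 120 in-flight -> 4
print(warm_replica_hours_per_month(1))                       # 720
```

Under these assumptions the 30 RPS burst needs exactly 4 replicas, so max_replicas: 4 serves it with no spare headroom. With faster requests the same ceiling would leave room; with slower ones the burst would queue.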

Knobs

mode ('flex' | 'active', default: "flex")
  Flex scales to zero when idle. Active keeps a warm floor that never drops below min_replicas.
min_replicas (number)
  The floor. Flex default is 0, Active default is at least 1. Raising this cuts cold starts but costs constant GPU time.
max_replicas (number)
  The ceiling. Caps spend during a spike and keeps a bad prompt loop from scaling a deployment into the triple digits. Once replicas reach this ceiling and the short queue fills, further requests get a 503.
target_concurrency (number)
  Per-replica in-flight target. Set automatically based on the optimization run. Override only if you know your workload tolerates higher batching.
scale_down_window_seconds (number)
  How long average in-flight must stay below target before scaling down. Defaults: Flex 30 s, Active 120 s.
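To show how the knobs and their defaults fit together, here is a small Python model of the settings. The `AutoscalingConfig` class is hypothetical and only mirrors the field names and defaults documented above; it is not the RunInfra SDK or its config schema, and the validation rules are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Per-mode defaults taken from the knob descriptions above.
MIN_REPLICAS_DEFAULT = {"flex": 0, "active": 1}
SCALE_DOWN_DEFAULT = {"flex": 30, "active": 120}


@dataclass
class AutoscalingConfig:
    max_replicas: int                                # the ceiling, set deliberately
    mode: str = "flex"
    min_replicas: Optional[int] = None               # None -> per-mode default
    target_concurrency: Optional[int] = None         # None -> chosen by the optimizer
    scale_down_window_seconds: Optional[int] = None  # None -> per-mode default

    def __post_init__(self) -> None:
        if self.mode not in MIN_REPLICAS_DEFAULT:
            raise ValueError("mode must be 'flex' or 'active'")
        if self.min_replicas is None:
            self.min_replicas = MIN_REPLICAS_DEFAULT[self.mode]
        if self.scale_down_window_seconds is None:
            self.scale_down_window_seconds = SCALE_DOWN_DEFAULT[self.mode]
        # Assumed sanity checks; the real service may validate differently.
        if self.min_replicas < 0 or self.max_replicas < 1:
            raise ValueError("replica bounds must be non-negative, with max >= 1")
        if self.min_replicas > self.max_replicas:
            raise ValueError("min_replicas cannot exceed max_replicas")
        if self.target_concurrency is not None and self.target_concurrency < 1:
            raise ValueError("target_concurrency must be at least 1")


flex = AutoscalingConfig(max_replicas=4)
active = AutoscalingConfig(max_replicas=6, mode="active", min_replicas=2)
print(flex.min_replicas, flex.scale_down_window_seconds)      # 0 30
print(active.min_replicas, active.scale_down_window_seconds)  # 2 120
```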

Best practices

  • Start with Flex unless you know you need Active. Flex handles 80%+ of workloads correctly.
  • Set max_replicas deliberately. It is your spend cap. Pick a number that matches the spike you actually want to serve.
  • Don’t set min_replicas higher than the average steady-state replica count. That’s just wasted warm capacity.
  • Watch queue depth, not just latency. Queue depth is the leading indicator; latency is the lagging indicator.
  • Re-optimize after a major traffic-pattern change. The target concurrency the optimizer picked for 1k requests/day is not the right one for 1M requests/day.

Known limitations

  • Active mode requires Team plan or higher.
  • Scale-out has a cold-start cost per new replica. Instant Start minimizes it; see Instant Start.
  • Scale-to-zero only triggers after a full idle window. Traffic bursts that arrive close together keep the current replica warm.
  • max_replicas is a hard ceiling per workspace, not per region. Cross-region scale-out requires explicit configuration.

Next steps

  • Instant Start: How cold starts stay fast on Flex.
  • Speculation: Speculative decoding for throughput.
  • Deployments overview: The shape of a RunInfra deployment.
  • Monitoring: Watch latency, replica count, and queue depth.