An optimization run produces ranked variants of your pipeline across GPUs, quantization schemes, and serving backends, measured against the priority and constraints you set. You pick the winner; RunInfra redeploys in seconds.
What a run does
Every stage is automatic. You only interact at the start (priority + constraints) and at the end (pick a variant).
Priorities
One dimension ranked first. Other dimensions are reported but do not drive ranking.

| Priority | Best for |
|---|---|
| latency | Chatbots, real-time UX |
| throughput | Batch processing, high-volume APIs |
| cost | Budget-sensitive workloads |
| quality | Accuracy-critical tasks |
| balanced | No strong preference |
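
To make the ranking rule concrete, here is a minimal sketch of filter-then-rank: hard constraints (covered in the next section) remove variants first, then a single priority orders whatever survives. The `Variant` fields, the constraint parameter names, and the sample numbers are all hypothetical; this is not RunInfra's schema or API. How `balanced` combines dimensions is not documented here, so the sketch leaves it out.

```python
from dataclasses import dataclass

# Hypothetical variant record; field names are illustrative only.
@dataclass(frozen=True)
class Variant:
    name: str
    latency_ms: float       # lower is better
    throughput_tps: float   # higher is better
    cost_per_hour: float    # lower is better
    quality: float          # 0 to 1, higher is better

# One sort key per priority, negated where higher is better
# so that an ascending sort puts the best variant first.
SORT_KEYS = {
    "latency": lambda v: v.latency_ms,
    "throughput": lambda v: -v.throughput_tps,
    "cost": lambda v: v.cost_per_hour,
    "quality": lambda v: -v.quality,
}

def rank(variants, priority, max_latency_ms=None,
         max_cost_per_hour=None, min_quality=None):
    """Drop variants that violate a hard constraint, then sort by the priority."""
    survivors = [
        v for v in variants
        if (max_latency_ms is None or v.latency_ms <= max_latency_ms)
        and (max_cost_per_hour is None or v.cost_per_hour <= max_cost_per_hour)
        and (min_quality is None or v.quality >= min_quality)
    ]
    return sorted(survivors, key=SORT_KEYS[priority])

# Made-up benchmark numbers for illustration.
variants = [
    Variant("a", latency_ms=900, throughput_tps=40, cost_per_hour=0.8, quality=0.90),
    Variant("b", latency_ms=200, throughput_tps=120, cost_per_hour=2.5, quality=0.92),
    Variant("c", latency_ms=350, throughput_tps=90, cost_per_hour=1.6, quality=0.88),
]
ranked = rank(variants, "cost", max_latency_ms=500)
print([v.name for v in ranked])  # → ['c', 'b']
```

Variant `a` is the cheapest but breaks the latency limit, so it is removed before sorting and never appears at the top of the cost ranking.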
Constraints
Hard filters. Variants that violate them are dropped before ranking so you never see a “cheap but too slow” false positive at the top.
Variant types
The optimizer searches these categories on Hugging Face automatically. You can pin a specific type but usually shouldn’t.

| Type | What it is | Plan | When it wins |
|---|---|---|---|
| AWQ | 4-bit activation-aware quant | All | Most 7-70B models; best default |
| GPTQ | 3/4/8-bit calibration-based | Pro+ | When AWQ isn’t available at the size you need |
| FP8 | 8-bit floating point | Pro+ | H100/H200 when quality matters |
| TensorRT-LLM | NVIDIA compiled engine | Team+ | Max throughput on NVIDIA GPUs, 1.4x over vLLM |
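
The Plan column of the table above can be read as a lookup from plan to available variant types. The sketch below assumes that "All" means Starter and up, that "Pro+" and "Team+" mean that plan and every higher one, and that the plan order is Starter, Pro, Team as listed on this page. The function name is hypothetical, and any tier above Team is not modeled.

```python
# Plan tiers in ascending order (assumed from the limits listed on this page).
PLAN_ORDER = ["Starter", "Pro", "Team"]

# Minimum plan per variant type, read off the Plan column of the table.
MIN_PLAN = {
    "AWQ": "Starter",       # "All"
    "GPTQ": "Pro",          # "Pro+"
    "FP8": "Pro",           # "Pro+"
    "TensorRT-LLM": "Team", # "Team+"
}

def types_for_plan(plan):
    """Return the variant types available on a plan, in table order."""
    level = PLAN_ORDER.index(plan)
    return [t for t, min_plan in MIN_PLAN.items()
            if PLAN_ORDER.index(min_plan) <= level]

print(types_for_plan("Pro"))  # → ['AWQ', 'GPTQ', 'FP8']
```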
Versions and rollback
Each run creates a pipeline version you can compare and roll back to. Rollback is a metadata change; no re-benchmarking needed.
Versions are scoped to one pipeline. Rolling back does not affect running deployments until you explicitly redeploy the reverted version.
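
One way to picture the rollback semantics is as two separate pointers: one for the version the pipeline currently points at, and one for the version the live endpoint serves. The sketch below is a hypothetical in-memory model, not RunInfra's implementation; the class, method, and variant names are illustrative, and it assumes each run's winner becomes the current version.

```python
from dataclasses import dataclass, field

@dataclass
class Pipeline:
    """Hypothetical model of one pipeline's version history."""
    versions: list = field(default_factory=list)  # winning variant per run
    current: int = -1    # index of the version the pipeline points at
    deployed: int = -1   # index of the version the live endpoint serves

    def record_run(self, winner):
        # Each optimization run appends a new version (assumed to become current).
        self.versions.append(winner)
        self.current = len(self.versions) - 1

    def rollback(self, index):
        # Metadata only: move the pointer; nothing is re-benchmarked or redeployed.
        if not 0 <= index < len(self.versions):
            raise IndexError(f"no version {index}")
        self.current = index

    def redeploy(self):
        # Only an explicit redeploy changes what the endpoint serves.
        self.deployed = self.current

p = Pipeline()
p.record_run("awq-l40s")
p.redeploy()
p.record_run("fp8-h100")
p.redeploy()
p.rollback(0)
print(p.versions[p.current], p.versions[p.deployed])  # → awq-l40s fp8-h100
p.redeploy()
print(p.versions[p.deployed])  # → awq-l40s
```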
Known limitations
- Optimization sessions by plan: 3/mo Starter, 20/mo Pro, 100/seat/mo Team. Overage is $2.50 each on Pro and Team.
- Max concurrent runs per workspace: 1 on Starter, 3 on Pro, 10 on Team.
- Typical run time: 2 to 5 minutes. Vision models and 70B+ targets take longer.
- Models above 200 GB on disk are not auto-cached in Instant Start; expect longer cold starts on the winning variant.
- Quality scoring uses a standardized eval set. If your task diverges from the eval distribution, the quality score may not reflect your use case. See the Models page for per-family eval info.
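
The session quotas and overage price above lend themselves to a quick cost check. The sketch below uses only the figures from the list; the function name is hypothetical, and it assumes a Team quota is pooled across seats. The page lists no overage price for Starter, so the sketch returns `None` there rather than guessing.

```python
from decimal import Decimal

# Figures copied from the list above.
INCLUDED_SESSIONS = {"Starter": 3, "Pro": 20, "Team": 100}  # Team is per seat
OVERAGE_PRICE = {"Pro": Decimal("2.50"), "Team": Decimal("2.50")}

def overage_cost(plan, sessions_used, seats=1):
    """Monthly overage in dollars, or None if the plan has no listed overage price."""
    # Assumption: Team seats share one pooled quota of 100 sessions per seat.
    included = INCLUDED_SESSIONS[plan] * (seats if plan == "Team" else 1)
    extra = max(0, sessions_used - included)
    if extra == 0:
        return Decimal("0.00")
    if plan not in OVERAGE_PRICE:
        return None  # Starter: no overage price is documented
    return extra * OVERAGE_PRICE[plan]

print(overage_cost("Pro", 26))             # → 15.00
print(overage_cost("Team", 310, seats=3))  # → 25.00
print(overage_cost("Starter", 5))          # → None
```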
Common questions
Why are some variants missing from the results?
A variant is excluded if: a pre-quantized checkpoint does not exist on Hugging Face for that (model, quant) pair, the variant violates a hard constraint, or the variant requires a GPU tier or variant type that your plan does not unlock (e.g. TensorRT-LLM requires Team+).
Do optimization sessions cost tokens?
Session count is independent of token use. Each run consumes a fixed amount of GPU time (averaged across variants) and counts as one session. Token billing only applies to playground and deployed-endpoint traffic.
Can I re-optimize an existing deployment without downtime?
Yes. Run optimization against the pipeline (the deployment keeps serving the old variant), pick a new winner, and click Redeploy. RunInfra rolls replicas one at a time with connection draining; requests in flight complete on the old replica, and new ones land on the new variant.
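
The roll described above can be sketched as a small simulation. It only models the ordering guarantee (drain first, then swap, one replica at a time); the `Replica` class and `roll` function are hypothetical, and waiting for in-flight requests is reduced to bookkeeping. How RunInfra keeps capacity available while a replica drains is not covered here.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    variant: str
    in_flight: int = 0  # requests currently being served

def roll(replicas, new_variant):
    """Swap replicas one at a time; return the variant that served each drained request."""
    completed_on = []
    for replica in replicas:
        # Drain: the replica takes no new requests, and its in-flight
        # requests finish on the variant it was already running.
        completed_on.extend([replica.variant] * replica.in_flight)
        replica.in_flight = 0
        # Swap: only after draining does the replica come back on the new variant.
        replica.variant = new_variant
    return completed_on

replicas = [Replica("old", in_flight=2), Replica("old", in_flight=1)]
done = roll(replicas, "new")
print(done)                           # → ['old', 'old', 'old']
print([r.variant for r in replicas])  # → ['new', 'new']
```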
Why did the agent skip FP8?
Either your target GPU does not support FP8 (L4, L40S) or your constraints ruled it out on cost: FP8 needs H100+, which raises per-hour GPU cost even if per-token cost is similar.
How is quality measured?
Quality is measured on a fixed eval set of canonical tasks (general QA, multi-turn chat consistency, basic reasoning). The quality score is a normalized composite from 0 to 1. For task-specific evaluation, use your own test set in the playground and compare variants manually.
Does optimization touch my deployed endpoint?
Never. Optimization runs happen on isolated benchmark hardware; your deployment serves the previously-chosen variant the whole time. You decide when and whether to switch.
Can I optimize custom fine-tuned models?
Yes, on Team+. Upload the model first (see Custom models), then run optimization against it like any catalog model.
Next steps
- Deploy your pipeline: Turn the winning variant into a live endpoint.
- Speculation: Further 1.5 to 3x throughput via speculative decoding.
- GPUs and pricing: Which GPU tiers the optimizer will consider.
- Debug bad results: Redirect the optimizer when results don’t fit.