An optimization run produces ranked variants of your pipeline across GPUs, quantization schemes, and serving backends, measured against the priority and constraints you set. You pick the winner; RunInfra redeploys in seconds.

What a run does

Every stage is automatic. You only interact at the start (priority + constraints) and at the end (pick a variant).

Priorities

One dimension ranked first. Other dimensions are reported but do not drive ranking.
Priority | Best for
latency | Chatbots, real-time UX
throughput | Batch processing, high-volume APIs
cost | Budget-sensitive workloads
quality | Accuracy-critical tasks
balanced | No strong preference
Optimize for latency, keep cost under $0.005 per request

Constraints

Hard filters. Variants that violate them are dropped before ranking so you never see a “cheap but too slow” false positive at the top.
Max latency 200ms, min quality 0.9, budget under $500/month
Combine a priority and multiple constraints in one prompt. The agent reads commas as “and”, so a variant must satisfy every constraint to be ranked.
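The filter-then-rank behavior described above can be sketched in a few lines. This is a minimal illustration, not RunInfra's implementation: the `Variant` fields, the `select` function, and the sample numbers are all hypothetical, and only three of the five priorities are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    # Hypothetical record; field names are assumptions, not RunInfra's schema.
    name: str
    latency_ms: float
    cost_per_request: float
    quality: float  # 0 to 1, higher is better


def select(variants, priority, max_latency_ms=None, min_quality=None,
           max_cost_per_request=None):
    """Drop variants that violate any hard constraint, then rank by one priority."""
    def passes(v):
        # Every supplied constraint must hold (AND semantics).
        return (
            (max_latency_ms is None or v.latency_ms <= max_latency_ms)
            and (min_quality is None or v.quality >= min_quality)
            and (max_cost_per_request is None
                 or v.cost_per_request <= max_cost_per_request)
        )

    survivors = [v for v in variants if passes(v)]
    sort_keys = {
        "latency": lambda v: v.latency_ms,
        "cost": lambda v: v.cost_per_request,
        "quality": lambda v: -v.quality,  # higher quality ranks first
    }
    return sorted(survivors, key=sort_keys[priority])


# Made-up sample data for illustration only.
candidates = [
    Variant("awq-l4", latency_ms=180, cost_per_request=0.002, quality=0.91),
    Variant("gptq-l4", latency_ms=350, cost_per_request=0.001, quality=0.92),
    Variant("fp8-h100", latency_ms=90, cost_per_request=0.004, quality=0.95),
]
ranked = select(candidates, "cost", max_latency_ms=200, min_quality=0.9)
print([v.name for v in ranked])
```

In this sample the cheapest candidate (`gptq-l4`) exceeds the latency limit, so it is removed before ranking and never appears at the top of a cost-ranked list.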

Variant types

The optimizer searches these categories on Hugging Face automatically. You can pin a specific type but usually shouldn’t.
Type | What it is | Plan | When it wins
AWQ | 4-bit activation-aware quant | All | Most 7-70B models; best default
GPTQ | 3/4/8-bit calibration-based | Pro+ | When AWQ isn’t available at the size you need
FP8 | 8-bit floating point | Pro+ | H100/H200 when quality matters
TensorRT-LLM | NVIDIA compiled engine | Team+ | Max throughput on NVIDIA GPUs, 1.4x over vLLM
If you have a preference:
Search for an AWQ version of this model

Versions and rollback

Each run creates a pipeline version you can compare and roll back to. Rollback is a metadata change; no re-benchmarking needed.
Compare version 1 and version 2
Go back to version 1
Versions are scoped to one pipeline. Rolling back does not affect running deployments until you explicitly redeploy the reverted version.
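One way to picture the rule above is to keep two separate pointers: the version selected in the pipeline and the version actually serving traffic. The sketch below is an assumed data model for illustration only; the `Pipeline` class, its fields, and the choice that a new run becomes the selected version are hypothetical, not RunInfra's API.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    # Hypothetical model: version n is stored at index n - 1.
    versions: list = field(default_factory=list)
    current: int = 0   # version selected in the pipeline (0 = none yet)
    deployed: int = 0  # version serving traffic (0 = nothing deployed)

    def record_run(self, winner: str) -> int:
        """Each optimization run appends a new version (assumed to become current)."""
        self.versions.append(winner)
        self.current = len(self.versions)
        return self.current

    def rollback(self, version: int) -> None:
        """Metadata-only change: move the pointer, re-run no benchmarks."""
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"unknown version {version}")
        self.current = version

    def redeploy(self) -> None:
        """Only an explicit redeploy changes what is serving traffic."""
        self.deployed = self.current


p = Pipeline()
p.record_run("awq")   # version 1
p.redeploy()
p.record_run("fp8")   # version 2
p.redeploy()
p.rollback(1)
state_after_rollback = (p.current, p.deployed)
p.redeploy()
state_after_redeploy = (p.current, p.deployed)
```

After the rollback the pipeline points at version 1 while version 2 is still serving; the two only agree again once `redeploy()` is called.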

Known limitations

Optimization is best-effort. If no variant satisfies your constraints, the run returns zero winners with a recommendation to relax a constraint (usually cost or latency). Failed runs do not consume a session against your budget.
  • Optimization sessions by plan: 3/mo Starter, 20/mo Pro, 100/seat/mo Team. Overage is $2.50 each on Pro and Team.
  • Max concurrent runs per workspace: 1 on Starter, 3 on Pro, 10 on Team.
  • Typical run time: 2 to 5 minutes. Vision models and 70B+ targets take longer.
  • Models above 200 GB on disk are not auto-cached in Instant Start; expect longer cold starts on the winning variant.
  • Quality scoring uses a standardized eval set. If your task diverges from the eval distribution, the quality score may not reflect your use case. See the Models page for per-family eval info.
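The session allowances and overage price listed above lend themselves to a quick estimate. The helper below is a rough sketch using only the figures in this section; the function name, its parameters, and the decision to return `None` for Starter (no overage price is listed for it) are assumptions.

```python
# Figures taken from the limits above; the Team allowance is per seat.
INCLUDED_SESSIONS = {"starter": 3, "pro": 20, "team": 100}
OVERAGE_USD = {"pro": 2.50, "team": 2.50}  # no Starter overage price is listed


def monthly_overage(plan, total_runs, failed_runs=0, seats=1):
    """Return the estimated overage in USD, or None if the plan lists no overage price."""
    included = INCLUDED_SESSIONS[plan] * (seats if plan == "team" else 1)
    # Failed runs do not consume a session, so they are subtracted first.
    billable = total_runs - failed_runs
    extra = max(0, billable - included)
    if extra == 0:
        return 0.0
    if plan not in OVERAGE_USD:
        return None
    return extra * OVERAGE_USD[plan]


pro_cost = monthly_overage("pro", total_runs=25, failed_runs=2)
team_cost = monthly_overage("team", total_runs=250, seats=2)
print(pro_cost, team_cost)
```

For example, 25 runs on Pro with 2 failures leaves 23 billable sessions, 3 over the allowance of 20, for an estimated $7.50.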

Common questions

A variant is excluded if: a pre-quantized checkpoint does not exist on Hugging Face for that (model, quant) pair, the variant violates a hard constraint, or the variant requires a GPU tier or variant type above what your plan unlocks (e.g. TensorRT-LLM is Team+).
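The three exclusion rules can be expressed as a single check. This is a hedged sketch: the plan ordering (Starter below Pro below Team), the mapping of “All” to Starter, and the `exclusion_reason` helper are assumptions made for illustration, with the per-type plan requirements taken from the variant types table.

```python
# Assumed ordering of plans; "All" in the variant table is treated as Starter and up.
PLAN_RANK = {"starter": 0, "pro": 1, "team": 2}
MIN_PLAN = {"AWQ": "starter", "GPTQ": "pro", "FP8": "pro", "TensorRT-LLM": "team"}


def exclusion_reason(variant_type, plan, checkpoint_exists, violates_constraint):
    """Return why a variant is excluded, or None if it stays in the ranking."""
    if not checkpoint_exists:
        return "no pre-quantized checkpoint on Hugging Face"
    if violates_constraint:
        return "violates a hard constraint"
    if PLAN_RANK[plan] < PLAN_RANK[MIN_PLAN[variant_type]]:
        return "not unlocked by plan"
    return None


reason_trt_on_pro = exclusion_reason("TensorRT-LLM", "pro", True, False)
reason_awq_on_starter = exclusion_reason("AWQ", "starter", True, False)
```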
Session count is independent of token use. Each run consumes a fixed amount of GPU time (averaged across variants) and counts as one session. Token billing only applies to playground and deployed-endpoint traffic.
Yes, a pipeline that is already deployed can be re-optimized. Run optimization against the pipeline (the deployment keeps serving the old variant), pick a new winner, and click Redeploy. RunInfra rolls replicas one at a time with connection draining; requests in flight complete on the old replica, and new ones land on the new variant.
When FP8 is not selected, either your target GPU does not support FP8 (L4, L40S) or your constraint set penalized cost; FP8 needs H100+, which raises per-hour GPU cost even if per-token cost is similar.
Quality is scored on a fixed eval set of canonical tasks (general QA, multi-turn chat consistency, basic reasoning). The quality score is a composite normalized to the range 0 to 1. For task-specific evaluation, use your own test set in the playground and compare variants manually.
Optimization never affects a live deployment. Runs happen on isolated benchmark hardware; your deployment serves the previously chosen variant the whole time. You decide when and whether to switch.
Custom models can be optimized on Team+. Upload the model first (Custom models), then run optimization against it like any catalog model.

Next steps

Deploy your pipeline

Turn the winning variant into a live endpoint.

Speculation

Further 1.5 to 3x throughput via speculative decoding.

GPUs and pricing

Which GPU tiers the optimizer will consider.

Debug bad results

Redirect the optimizer when results don’t fit.