Optimization - RunInfra

An optimization run produces ranked variants of your pipeline across GPUs, quantization schemes, and serving backends, measured against the priority and constraints you set. You pick the winner; RunInfra redeploys in seconds.

What a run does

Every stage is automatic. You only interact at the start (priority + constraints) and at the end (pick a variant).

Deploying the winning variant

Deploying an optimized variant preserves the artifact that won the run. RunInfra carries the selected pipeline files, model revision, serving backend, quantization, chat template, cache settings, speculative decoding settings, and version identity into the deployment request. If the saved artifact snapshot no longer matches the deployable pipeline files, deployment stops before GPU provisioning. This keeps the live endpoint from serving a weaker stock configuration when you expected the optimized build. After a custom image build, the deployment record stores a receipt with the artifact identity and image metadata. Use this receipt when auditing whether a running endpoint came from the selected optimized artifact.
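
The snapshot check described above can be pictured as a fingerprint comparison that runs before any GPU is requested. This is a minimal sketch, not RunInfra's implementation: `fingerprint`, `ArtifactMismatch`, `check_before_provisioning`, and the sample file contents are all hypothetical names and data chosen for illustration.

```python
import hashlib


def fingerprint(files: dict[str, bytes]) -> str:
    """Hash file names and contents in a stable (sorted) order."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()


class ArtifactMismatch(Exception):
    """Raised when the deployable files differ from the saved snapshot."""


def check_before_provisioning(snapshot_fingerprint: str, files: dict[str, bytes]) -> None:
    # Hypothetical pre-flight step: fail here, before GPU provisioning starts.
    if fingerprint(files) != snapshot_fingerprint:
        raise ArtifactMismatch("pipeline files no longer match the saved artifact snapshot")


# Illustrative data only: the file saved when the variant won the run.
saved = {"pipeline.yaml": b"backend: vllm\nquantization: fp8\n"}
snapshot = fingerprint(saved)
check_before_provisioning(snapshot, saved)  # matches, so nothing is raised

# The same file after someone dropped the quantization setting.
drifted = {"pipeline.yaml": b"backend: vllm\n"}
try:
    check_before_provisioning(snapshot, drifted)
except ArtifactMismatch as exc:
    print(f"deployment stopped: {exc}")
```

Failing early is the point of the design: a mismatch costs nothing to detect, while a silently deployed stock configuration would cost GPU time and serve the wrong build.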

Priorities

One dimension is ranked first. The other dimensions are reported but do not drive the ranking.

Priority	Best for
`latency`	Chatbots, real-time UX
`throughput`	Batch processing, high-volume APIs
`cost`	Budget-sensitive workloads
`quality`	Accuracy-critical tasks
`balanced`	No strong preference

Optimize for latency, keep cost under $0.005 per request

Constraints

Hard filters. Variants that violate them are dropped before ranking so you never see a “cheap but too slow” false positive at the top.

Max latency 200ms, min quality 0.9, budget under $500/month

Combine priority + multiple constraints in one prompt. The agent parses commas as “and” and applies every constraint as an AND filter.
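
The filter-then-rank order matters: constraints remove variants first, and the priority only sorts what is left. A rough sketch of that order, assuming a `latency` priority and made-up measurements (the `Variant` type, the `rank` function, and every number below are illustrative, not RunInfra output):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    latency_ms: float
    quality: float
    cost_per_request: float


def rank(variants, max_latency_ms, min_quality, max_cost_per_request):
    # Hard filters first: a variant must satisfy every constraint (AND).
    eligible = [
        v for v in variants
        if v.latency_ms <= max_latency_ms
        and v.quality >= min_quality
        and v.cost_per_request <= max_cost_per_request
    ]
    # Priority "latency": among the survivors, the lowest latency ranks first.
    return sorted(eligible, key=lambda v: v.latency_ms)


candidates = [
    Variant("baseline", 240.0, 0.95, 0.0060),  # too slow and too expensive
    Variant("fp8", 150.0, 0.93, 0.0040),
    Variant("int4", 120.0, 0.88, 0.0030),      # fastest, but below min quality
    Variant("trt", 180.0, 0.94, 0.0045),
]

ranked = rank(candidates, max_latency_ms=200, min_quality=0.9, max_cost_per_request=0.005)
print([v.name for v in ranked])  # ['fp8', 'trt']
```

Note that `int4` has the best latency yet never appears in the ranking, which is the "false positive at the top" the hard filters are meant to prevent.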

Variant types

The optimizer searches these categories automatically when the selected model, modality, runtime, GPU, and plan support them. You can pin a specific type, but it is usually better to let the optimizer rank all compatible options.

Type	What it is	Plan	When it wins
4-bit LLM variants	AWQ, GPTQ INT4, W4A16 INT4, and similar runtime-supported paths	Varies by plan and backend	Reducing VRAM while keeping acceptable output quality
FP8	FP8 dynamic, calibrated, or weight-only paths	Core	Compatible GPU/runtime pairs where quality evidence stays within limits
Advanced quantization	INT8, NF4, HQQ, NVFP4, and related methods	Varies by plan, GPU, and backend	Specialized model/runtime combinations
TensorRT-LLM	NVIDIA compiled runtime	Core	High-throughput NVIDIA deployments when the compiled runtime wins in benchmarks

If you have a preference:

Search for a 4-bit variant of this model

Versions and rollback

Each run creates a pipeline version you can compare and roll back to. Rollback is a metadata change; no re-benchmarking needed.

Compare version 1 and version 2
Go back to version 1

Versions are scoped to one pipeline. Rolling back does not affect running deployments until you explicitly redeploy the reverted version.
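
One way to picture why rollback is cheap and why it leaves deployments alone: the pipeline and the live endpoint each hold their own pointer to a version. The `Pipeline` class below is a hypothetical model for illustration, not RunInfra's data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Pipeline:
    versions: dict = field(default_factory=dict)
    active_version: Optional[int] = None    # what the pipeline currently points at
    deployed_version: Optional[int] = None  # what the live endpoint serves

    def add_version(self, config: dict) -> int:
        number = len(self.versions) + 1
        self.versions[number] = config
        self.active_version = number
        return number

    def rollback(self, number: int) -> None:
        # Metadata change only: move the pointer, benchmark nothing.
        if number not in self.versions:
            raise KeyError(f"unknown version {number}")
        self.active_version = number

    def redeploy(self) -> None:
        # The endpoint changes only on an explicit redeploy.
        self.deployed_version = self.active_version


pipeline = Pipeline()
pipeline.add_version({"quantization": "none"})
pipeline.add_version({"quantization": "fp8"})
pipeline.redeploy()   # endpoint now serves version 2
pipeline.rollback(1)  # pipeline points at version 1, endpoint untouched
print(pipeline.active_version, pipeline.deployed_version)  # 1 2
pipeline.redeploy()
print(pipeline.active_version, pipeline.deployed_version)  # 1 1
```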

Compare models, engines, and techniques

Comparisons are measured, not advisory. Ask for a comparison in chat and RunInfra runs real benchmarks before declaring a winner.

Compare Llama 3.1 8B and Qwen 2.5 7B on this pipeline

Mode	What gets measured
Models	2 to 5 models from your pipeline, benchmarked sequentially on the same GPU target with the same serving configuration. Each model gets its own measured baseline.
Serving engines	The same model and the same serving configuration benchmarked across serving backends, for example vLLM vs SGLang.
Techniques	Candidate serving or quantization configurations for one model, benchmarked side by side.

Results render as a comparison card in chat and in the Compare Results panel in the session workspace: a side-by-side table of latency, throughput, VRAM, quality, and cost, with the winner highlighted and the execution that produced the measurements identified. Both surfaces rebuild from persisted results, so a comparison verdict survives page reloads.

Comparisons stay honest by construction:

A model never borrows another model’s baseline. Absent data renders as a gap, not a borrowed number.
A candidate that fails to benchmark shows as failed, never as a zero.
A candidate without a recorded measurement shows as not measured, never estimated.
A winner is only declared when the measured candidates ran on the same GPU. When a comparison would be unfair, RunInfra vetoes the winner and shows the reason.
If you cancel mid-comparison, candidates that completed before the cancel keep their measured rows and the run shows as canceled, never as completed.
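
The rules above amount to a small decision procedure. The sketch below is one hedged reading of them, using latency as the compared metric; the `Candidate` type, the status strings, the `verdict` function, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    gpu: str
    status: str                          # "measured", "failed", or "not_measured"
    latency_ms: Optional[float] = None   # None is a gap; it is never shown as zero


def verdict(candidates):
    """Return (winner_name, reason); winner_name is None when the winner is vetoed."""
    measured = [c for c in candidates if c.status == "measured" and c.latency_ms is not None]
    if not measured:
        return None, "no measured candidates"
    if len({c.gpu for c in measured}) > 1:
        return None, "measured candidates ran on different GPUs"
    winner = min(measured, key=lambda c: c.latency_ms)
    return winner.name, "lowest measured latency on the same GPU"


fair = [
    Candidate("model-a", "H100", "measured", 42.0),
    Candidate("model-b", "H100", "measured", 38.5),
    Candidate("model-c", "H100", "failed"),  # stays visible as failed, not as 0 ms
]
print(verdict(fair))    # ('model-b', 'lowest measured latency on the same GPU')

unfair = [
    Candidate("model-a", "H100", "measured", 42.0),
    Candidate("model-b", "A100", "measured", 38.5),
]
print(verdict(unfair))  # (None, 'measured candidates ran on different GPUs')
```

Keeping a missing measurement as `None` rather than `0.0` is what stops a failed candidate from winning a lowest-latency comparison by accident.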

Comparison runs can also be planned as runbooks: each candidate gets its own lane of phases, and a final verdict phase aggregates the measurements those lanes already recorded without spending additional GPU time.

Testing versions

The Test tab can run the current preview graph, the baseline graph, any saved optimization version, or the live deployed endpoint. Baseline tests ignore applied optimization state so you can compare output and latency against the optimized preview or a selected version without accidentally saving the optimized graph first.

Interruptions, reloads, and cancel

Optimization and runbook executions are durable. The run executes server-side, so refreshing the page, losing your connection, or closing the tab does not stop it.

Reload re-attaches live. Reopening the session mid-execution reconnects to the running execution within about a second over a dedicated event stream: phases, live cost, and the Stop control resume updating. A background snapshot poll covers the rare case where the stream cannot connect.
Interruptions never strand a run. If an execution is interrupted by a timeout, crash, or redeploy, it converges to a blocked, retryable state with resume and restart actions instead of appearing to run forever.
Stop actually stops. Cancel aborts the underlying GPU work and finalizes the run’s billing. A canceled run can never promote an optimization version afterward; work measured before the cancel stays visible as measured results.
Honest connection states. If only your view loses its live feed, the cost chip shows Reconnecting or Unavailable instead of a false $0.00. An Unavailable feed means this view lost its connection, not that the run stopped.
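
The guarantees in this list can be read as a small state machine. The following is an illustrative sketch only; the `RunState` values and `Run` methods are assumed names, not RunInfra's API.

```python
from enum import Enum


class RunState(Enum):
    RUNNING = "running"
    BLOCKED = "blocked"      # interrupted, but retryable
    CANCELED = "canceled"
    COMPLETED = "completed"


class Run:
    def __init__(self) -> None:
        self.state = RunState.RUNNING
        self.measured: list[str] = []

    def record(self, row: str) -> None:
        # Only a running execution produces new measurements.
        if self.state is RunState.RUNNING:
            self.measured.append(row)

    def interrupt(self) -> None:
        # Timeout, crash, or redeploy: converge to blocked, never "running forever".
        if self.state is RunState.RUNNING:
            self.state = RunState.BLOCKED

    def resume(self) -> None:
        if self.state is RunState.BLOCKED:
            self.state = RunState.RUNNING

    def cancel(self) -> None:
        # Cancel is terminal; rows measured before it are kept.
        if self.state in (RunState.RUNNING, RunState.BLOCKED):
            self.state = RunState.CANCELED

    def complete(self) -> None:
        if self.state is RunState.RUNNING:
            self.state = RunState.COMPLETED

    def can_promote(self) -> bool:
        # A canceled run can never promote an optimization version.
        return self.state is RunState.COMPLETED


run = Run()
run.record("fp8: measured")
run.interrupt()
run.resume()
run.cancel()
run.complete()  # no effect: the run is already canceled
print(run.state.value, run.can_promote(), run.measured)
# canceled False ['fp8: measured']
```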

Known limitations

Optimization is best-effort. If no variant satisfies your constraints, the run returns zero winners with a recommendation to relax a constraint (usually cost or latency). Failed or canceled runs are refunded to your credit balance in full.

Optimization runs are metered to the measured GPU cost of each run, drawn from your unified credit balance (1 credit = $1). A temporary hold is placed when a run starts and the unused amount is refunded once measured. New accounts start with $10 in free credits.
Typical run time: 2 to 5 minutes. Vision models and 70B+ targets take longer.
Models above 200 GB on disk are not auto-cached in Instant Start; expect longer cold starts on the winning variant.
Quality evidence depends on the selected modality and available eval path. When a measured quality gate is unavailable, RunInfra marks quality as pending or uncharacterized instead of treating an estimated score as a pass.

Common questions

Why are some variants missing from the results?

A variant is excluded if any of the following is true: no pre-quantized checkpoint exists on Hugging Face for that (model, quant) pair, the variant violates a hard constraint, or the variant requires a GPU your plan does not unlock (B200/H200 are Enterprise).

How are optimization runs billed?

Each run is metered to its measured GPU cost and drawn from your unified credit balance (1 credit = $1), independent of token use. Token billing applies separately to playground and deployed-endpoint traffic.
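
As a worked example of the hold-and-refund arithmetic at 1 credit = $1: the amounts below are invented, `settle_run` is a hypothetical helper, and the clamp at zero is an assumption about what happens if the measured cost reaches the hold.

```python
from decimal import Decimal


def settle_run(balance: Decimal, hold: Decimal, measured_cost: Decimal):
    """Return (refund, balance_after) for one finished run, in credits."""
    balance_during_run = balance - hold
    # Assumption: the refund is the unused part of the hold and never negative.
    refund = max(hold - measured_cost, Decimal("0"))
    return refund, balance_during_run + refund


# A run that held 2.00 credits and measured 1.35 credits of GPU cost.
refund, balance_after = settle_run(Decimal("10.00"), Decimal("2.00"), Decimal("1.35"))
print(refund, balance_after)  # 0.65 8.65

# A failed or canceled run that is refunded in full.
refund_full, balance_full = settle_run(Decimal("10.00"), Decimal("2.00"), Decimal("0"))
print(refund_full, balance_full)  # 2.00 10.00
```

`Decimal` is used instead of floats so that credit amounts stay exact to the cent.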

Can I re-optimize an existing deployment without downtime?

Yes. Run optimization against the pipeline (the deployment keeps serving the old variant), pick a new winner, and click Redeploy. RunInfra rolls replicas one at a time with connection draining: requests in flight complete on the old replica, and new ones land on the new variant.

Why did the agent skip FP8?

Either the selected GPU/runtime does not support the requested FP8 method, or your constraint set penalized compatible options. The optimizer only offers FP8-family artifacts when the compatibility check passes.

How is quality measured?

Quality can come from a measured gate for the modality, for example accuracy checks against a baseline. If the gate is unavailable or incomplete, RunInfra shows the quality state as pending or uncharacterized. For task-specific evaluation, use your own test set in the playground and compare variants manually.

Does optimization touch my deployed endpoint?

Never. Optimization runs happen on isolated benchmark hardware; your deployment serves the previously-chosen variant the whole time. You decide when and whether to switch.

Can I optimize custom fine-tuned models?

Yes, on Core. Upload the model first (see Custom models), then run optimization against it like any catalog model.

Next steps

Deploy your pipeline

Turn the winning variant into a live endpoint.

Speculation

Higher decode throughput when a draft model matches your traffic. Gains depend on draft selection and measured acceptance rate, and speculation only ships when it beats the tuned baseline.

GPUs and pricing

Which GPU tiers the optimizer will consider.

Debug bad results

Redirect the optimizer when results don’t fit.
