# Optimization
How RunInfra makes your models faster, cheaper, and smaller.
When you optimize a pipeline, RunInfra profiles your model on real GPUs, searches for optimized variants, and picks the best configuration for your priorities.
You don't configure any of this manually. Just tell the agent what you care about:
```
Optimize for latency, keep cost under $0.005 per request
```

## What happens during optimization
1. **GPU profiling** - Your model runs real inference on actual GPU hardware. RunInfra measures latency, throughput, memory usage, and cost.
2. **Experiment planning** - Based on your priority (latency, cost, throughput, quality, or balanced), the agent plans which optimized variants to test.
3. **Optimized model search** - The agent searches Hugging Face for pre-quantized and optimized versions of your model (AWQ, GPTQ, FP8 variants). Quantizing from scratch takes hours, so RunInfra finds ready-to-use optimized models and benchmarks them against your baseline.
4. **Kernel optimization** - Forge profiles your model's GPU bottlenecks and applies pre-optimized Triton kernels for additional speedup. Available on Pro+.
5. **Ranking** - All results are ranked against your constraints. The agent recommends the best one.
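The variant search in step 3 can be sketched with the public `huggingface_hub` client. The query-building helper below is purely illustrative; RunInfra's actual search logic is internal, and the function name and query format are assumptions.

```python
def variant_queries(base_model: str) -> list[str]:
    """Build Hugging Face search strings for common pre-quantized
    variants of a model, e.g. "Llama-3.1-8B-Instruct AWQ".

    Illustrative sketch only -- not RunInfra's real search logic.
    """
    name = base_model.split("/")[-1]  # strip the org prefix
    return [f"{name} {fmt}" for fmt in ("AWQ", "GPTQ", "FP8")]

# Each query could then be run against the Hugging Face Hub, e.g.:
#   from huggingface_hub import HfApi
#   hits = HfApi().list_models(search=query, sort="downloads", limit=5)
```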
## Optimized model types
The agent searches for these types of pre-optimized models:
| Type | What it is | Plan |
|---|---|---|
| AWQ | 4-bit models with minimal quality loss. Most common, best default. | All |
| GPTQ | 3/4/8-bit models with calibration-based optimization. More size options. | Pro+ |
| FP8 | 8-bit floating point models. Best quality preservation. Fastest on H100/H200. | Pro+ |
| TensorRT-LLM | NVIDIA compiled engine. 1.4x throughput boost. Not a quantization method, but a serving backend. | Team+ |
You don't need to pick a type. The agent searches for what's available, benchmarks each variant, and shows you the results. If you have a preference:
```
Search for an AWQ version of this model
```

## Priorities
Tell the agent what matters most:
| Priority | Best for |
|---|---|
| Latency | Chatbots, real-time UX |
| Throughput | Batch processing, high-volume APIs |
| Cost | Budget-sensitive workloads |
| Quality | Accuracy-critical tasks |
| Balanced | No strong preference |
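In effect, a priority picks the metric that benchmark results are ranked by. A minimal sketch, assuming hypothetical result fields (`p50_latency_ms`, `tokens_per_sec`, and so on) that are not RunInfra's real schema:

```python
# Hypothetical benchmark fields; RunInfra's real result schema may differ.
SORT_KEYS = {
    "latency":    lambda r: r["p50_latency_ms"],    # lower is better
    "throughput": lambda r: -r["tokens_per_sec"],   # higher is better
    "cost":       lambda r: r["cost_per_request"],  # lower is better
    "quality":    lambda r: -r["quality_score"],    # higher is better
}

def best_variant(results: list[dict], priority: str) -> dict:
    """Return the benchmark result that best matches the given priority."""
    return min(results, key=SORT_KEYS[priority])
```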
## Constraints
Set hard limits in natural language:
```
Max latency 200ms, min quality 0.9, budget under $500/month
```

Experiments that violate constraints are filtered out.
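Constraint filtering amounts to dropping any benchmarked variant that breaks a hard limit before ranking. A sketch with hypothetical field names (not RunInfra's real schema):

```python
def satisfies(result: dict, max_latency_ms=None, min_quality=None,
              max_monthly_cost=None) -> bool:
    """True if a benchmark result stays within every hard limit given.
    Field names are illustrative assumptions."""
    if max_latency_ms is not None and result["p50_latency_ms"] > max_latency_ms:
        return False
    if min_quality is not None and result["quality_score"] < min_quality:
        return False
    if max_monthly_cost is not None and result["monthly_cost_usd"] > max_monthly_cost:
        return False
    return True

# Only variants that pass every constraint move on to ranking:
#   candidates = [r for r in results
#                 if satisfies(r, max_latency_ms=200, min_quality=0.9)]
```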
## Versions
Each optimization creates a version. Compare versions and roll back anytime:
```
Compare version 1 and version 2
```

```
Go back to version 1
```