# Optimization
How RunInfra makes your models faster, cheaper, and smaller.
When you optimize a pipeline, RunInfra profiles your model on real GPUs, searches for optimized variants, and picks the best configuration for your priorities.
You don't configure any of this manually. Just tell the agent what you care about:
```
Optimize for latency, keep cost under $0.005 per request
```

## What happens during optimization
1. **GPU profiling** - Your model runs real inference on actual GPU hardware. RunInfra measures latency, throughput, memory usage, and cost.
2. **Experiment planning** - Based on your priority (latency, cost, throughput, quality, or balanced), the agent plans which optimized variants to test.
3. **Optimized model search** - The agent searches Hugging Face for pre-quantized and optimized versions of your model (AWQ, GPTQ, FP8 variants). Quantizing from scratch takes hours, so RunInfra finds ready-to-use optimized models and benchmarks them against your baseline.
4. **Kernel optimization** - Forge profiles your model's GPU bottlenecks and applies pre-optimized Triton kernels for additional speedup. Available on Pro+.
5. **Ranking** - All results are ranked against your constraints. The agent recommends the best one.
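The variant search in step 3 can be sketched with the public `huggingface_hub` client. The query-building helper below is purely illustrative; RunInfra's actual search logic is internal, and the function name and query format are assumptions.

```python
def variant_queries(base_model: str) -> list[str]:
    """Build Hugging Face search strings for common pre-quantized
    variants of a model, e.g. "Llama-3.1-8B-Instruct AWQ".

    Illustrative sketch only -- not RunInfra's real search logic.
    """
    name = base_model.split("/")[-1]  # strip the org prefix
    return [f"{name} {fmt}" for fmt in ("AWQ", "GPTQ", "FP8")]

# Each query could then be run against the Hugging Face Hub, e.g.:
#   from huggingface_hub import HfApi
#   hits = HfApi().list_models(search=query, sort="downloads", limit=5)
```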
## Optimized model types
The agent searches for these types of pre-optimized models:
| Type | What it is | Plan |
|---|---|---|
| AWQ | 4-bit models with minimal quality loss. Most common, best default. | All |
| GPTQ | 3/4/8-bit models with calibration-based optimization. More size options. | Pro+ |
| FP8 | 8-bit floating point models. Best quality preservation. Fastest on H100/H200. | Pro+ |
| TensorRT-LLM | NVIDIA compiled engine. 1.4x throughput boost. Not a quantization method, but a serving backend. | Team+ |
You don't need to pick a type. The agent searches for what's available, benchmarks each variant, and shows you the results. If you have a preference:
```
Search for an AWQ version of this model
```

## Priorities
Tell the agent what matters most:
| Priority | Best for |
|---|---|
| Latency | Chatbots, real-time UX |
| Throughput | Batch processing, high-volume APIs |
| Cost | Budget-sensitive workloads |
| Quality | Accuracy-critical tasks |
| Balanced | No strong preference |
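In effect, a priority picks the metric that benchmark results are ranked by. A minimal sketch, assuming hypothetical result fields (`p50_latency_ms`, `tokens_per_sec`, and so on) that are not RunInfra's real schema:

```python
# Hypothetical benchmark fields; RunInfra's real result schema may differ.
SORT_KEYS = {
    "latency":    lambda r: r["p50_latency_ms"],    # lower is better
    "throughput": lambda r: -r["tokens_per_sec"],   # higher is better
    "cost":       lambda r: r["cost_per_request"],  # lower is better
    "quality":    lambda r: -r["quality_score"],    # higher is better
}

def best_variant(results: list[dict], priority: str) -> dict:
    """Return the benchmark result that best matches the given priority."""
    return min(results, key=SORT_KEYS[priority])
```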
## Constraints
Set hard limits in natural language:
```
Max latency 200ms, min quality 0.9, budget under $500/month
```

Experiments that violate constraints are filtered out.
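Constraint filtering amounts to dropping any benchmarked variant that breaks a hard limit before ranking. A sketch with hypothetical field names (not RunInfra's real schema):

```python
def satisfies(result: dict, max_latency_ms=None, min_quality=None,
              max_monthly_cost=None) -> bool:
    """True if a benchmark result stays within every hard limit given.
    Field names are illustrative assumptions."""
    if max_latency_ms is not None and result["p50_latency_ms"] > max_latency_ms:
        return False
    if min_quality is not None and result["quality_score"] < min_quality:
        return False
    if max_monthly_cost is not None and result["monthly_cost_usd"] > max_monthly_cost:
        return False
    return True

# Only variants that pass every constraint move on to ranking:
#   candidates = [r for r in results
#                 if satisfies(r, max_latency_ms=200, min_quality=0.9)]
```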
## Versions
Each optimization creates a version. Compare versions and roll back anytime:
```
Compare version 1 and version 2
```

```
Go back to version 1
```