
Optimization

How RunInfra makes your models faster, cheaper, and smaller.

When you optimize a pipeline, RunInfra profiles your model on real GPUs, searches for optimized variants, and picks the best configuration for your priorities.

You don't configure any of this manually. Just tell the agent what you care about:

Optimize for latency, keep cost under $0.005 per request

What happens during optimization

  1. GPU profiling - RunInfra runs real inference with your model on actual GPU hardware and measures latency, throughput, memory usage, and cost.

  2. Experiment planning - Based on your priority (latency, cost, throughput, quality, or balanced), the agent plans which optimized variants to test.

  3. Optimized model search - The agent searches Hugging Face for pre-quantized and optimized versions of your model (AWQ, GPTQ, FP8 variants). Quantizing from scratch takes hours, so RunInfra finds ready-to-use optimized models and benchmarks them against your baseline.

  4. Kernel optimization - Forge profiles your model to find GPU bottlenecks and applies pre-optimized Triton kernels for additional speedup. Available on Pro+.

  5. Ranking - All results are ranked against your constraints. The agent recommends the best one.
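The variant-search step (step 3) boils down to matching optimized repos against your base model's family and quantization tag. Here is a minimal sketch of that idea; the repo names, the `QUANT_TAGS` list, and `find_variants` are illustrative assumptions, not RunInfra internals.

```python
# Sketch of step 3: match pre-quantized variants of a base model by naming
# convention. Repo names and QUANT_TAGS are illustrative, not RunInfra's code.
QUANT_TAGS = ("AWQ", "GPTQ", "FP8")

def find_variants(base_model: str, candidates: list[str]) -> dict[str, list[str]]:
    """Group candidate repo names by the quantization tag they advertise."""
    family = base_model.split("/")[-1].lower()
    variants: dict[str, list[str]] = {tag: [] for tag in QUANT_TAGS}
    for repo in candidates:
        name = repo.split("/")[-1].lower()
        if family in name:  # same model family as the baseline
            for tag in QUANT_TAGS:
                if tag.lower() in name:
                    variants[tag].append(repo)
    return variants

candidates = [
    "example/Llama-3-8B-Instruct-AWQ",        # hypothetical repo names
    "example/Llama-3-8B-Instruct-GPTQ-Int4",
    "example/Llama-3-8B-Instruct-FP8",
    "example/Mistral-7B-AWQ",                 # different family: ignored
]
print(find_variants("meta/Llama-3-8B-Instruct", candidates))
```

In practice each matched variant would then be benchmarked against the baseline, as the steps above describe.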

Optimized model types

The agent searches for these types of pre-optimized models:

| Type | What it is | Plan |
| --- | --- | --- |
| AWQ | 4-bit models with minimal quality loss. Most common, best default. | All |
| GPTQ | 3/4/8-bit models with calibration-based optimization. More size options. | Pro+ |
| FP8 | 8-bit floating point models. Best quality preservation. Fastest on H100/H200. | Pro+ |
| TensorRT-LLM | NVIDIA compiled engine. 1.4x throughput boost. Not a quantization method, but a serving backend. | Team+ |

You don't need to pick a type. The agent searches for what's available, benchmarks each variant, and shows you the results. If you have a preference:

Search for an AWQ version of this model

Priorities

Tell the agent what matters most:

| Priority | Best for |
| --- | --- |
| Latency | Chatbots, real-time UX |
| Throughput | Batch processing, high-volume APIs |
| Cost | Budget-sensitive workloads |
| Quality | Accuracy-critical tasks |
| Balanced | No strong preference |
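Conceptually, a priority just picks which measured metric drives the ranking, and in which direction. A toy sketch, assuming each benchmarked variant is a dict of metrics; the metric names, the direction map, and the numbers are assumptions, not RunInfra's schema (a "balanced" priority would need a weighted score and is omitted here):

```python
# Toy sketch of priority-driven ranking. Metric names, the direction map,
# and the benchmark numbers are illustrative, not RunInfra's actual schema.
PRIORITY_KEY = {
    "latency":    ("latency_ms", False),     # lower is better
    "throughput": ("tokens_per_sec", True),  # higher is better
    "cost":       ("cost_per_request", False),
    "quality":    ("quality_score", True),
}

def rank(results, priority):
    key, higher_is_better = PRIORITY_KEY[priority]
    return sorted(results, key=lambda r: r[key], reverse=higher_is_better)

results = [
    {"name": "baseline", "latency_ms": 480, "tokens_per_sec": 95,
     "cost_per_request": 0.009, "quality_score": 0.97},
    {"name": "awq", "latency_ms": 210, "tokens_per_sec": 180,
     "cost_per_request": 0.004, "quality_score": 0.95},
]
print(rank(results, "latency")[0]["name"])  # → awq
print(rank(results, "quality")[0]["name"])  # → baseline
```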

Constraints

Set hard limits in natural language:

Max latency 200ms, min quality 0.9, budget under $500/month

Experiments that violate constraints are filtered out.
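The filtering itself is simple once the limits are known: any experiment whose measurements breach a hard limit is dropped before ranking. A minimal sketch, with the constraints already parsed into arguments (the natural-language parsing is not shown, and the field names are assumptions):

```python
# Sketch of constraint filtering: drop experiments that violate hard limits.
# Field names and values are illustrative, not RunInfra's schema.
def meets_constraints(result, max_latency_ms=None, min_quality=None,
                      max_monthly_cost=None):
    if max_latency_ms is not None and result["latency_ms"] > max_latency_ms:
        return False
    if min_quality is not None and result["quality_score"] < min_quality:
        return False
    if max_monthly_cost is not None and result["monthly_cost"] > max_monthly_cost:
        return False
    return True

experiments = [
    {"name": "awq", "latency_ms": 180, "quality_score": 0.95, "monthly_cost": 320},
    {"name": "fp8", "latency_ms": 150, "quality_score": 0.97, "monthly_cost": 610},
]
# "Max latency 200ms, min quality 0.9, budget under $500/month"
kept = [e for e in experiments if meets_constraints(e, 200, 0.9, 500)]
print([e["name"] for e in kept])  # → ['awq'] (fp8 is over budget)
```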

Versions

Each optimization creates a version. Compare versions and roll back anytime:

Compare version 1 and version 2
Go back to version 1
