
GPU and Pricing

How RunInfra pricing works and which GPUs power your endpoints.

RunInfra charges per million tokens, not per GPU hour. You never see infrastructure costs. The agent picks the best GPU for your model automatically, but you can influence the choice.

Token pricing

Inference is billed per million tokens. Estimated starting rates by model size:

Model size       Input (from)    Output (from)
Small (1-8B)     $0.08 / MTok    $0.20 / MTok
Medium (8-30B)   $0.20 / MTok    $0.80 / MTok
Large (30-70B)   $0.45 / MTok    $1.50 / MTok
XL (70B+)        $0.80 / MTok    $2.50 / MTok

These are estimated starting prices. Your actual cost depends on your full pipeline configuration: model, quantization method, GPU tier, routing strategy, and deployment mode. The deploy tab shows your estimated per-token cost before you deploy.

Team plans get 10% off at 100M+ tokens/month. Enterprise gets up to 40% off.
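The rates above can be turned into a quick monthly estimate. This is a sketch using the starting ("from") rates from the table and the Team-plan volume discount described above; actual cost depends on your full pipeline configuration, and the `monthly_cost` helper is illustrative, not a RunInfra API.

```python
# Rough monthly estimate from the starting rates above. Actual cost
# depends on model, quantization, GPU tier, routing, and deployment mode.
RATES = {  # model size -> (input $/MTok, output $/MTok), "from" prices
    "small":  (0.08, 0.20),
    "medium": (0.20, 0.80),
    "large":  (0.45, 1.50),
    "xl":     (0.80, 2.50),
}

def monthly_cost(size, input_mtok, output_mtok, team_plan=False):
    """Estimate monthly spend in dollars for token volumes given in MTok."""
    rate_in, rate_out = RATES[size]
    cost = input_mtok * rate_in + output_mtok * rate_out
    # Team plans get 10% off once volume passes 100M tokens/month.
    if team_plan and (input_mtok + output_mtok) >= 100:
        cost *= 0.90
    return round(cost, 2)

# e.g. a medium model serving 60M input + 50M output tokens on a Team plan
print(monthly_cost("medium", 60, 50, team_plan=True))
```

The deploy tab remains the authoritative place to see your estimated per-token cost before deploying.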

Available GPUs

RunInfra selects from these GPU tiers during optimization:

GPU     VRAM      Tier      Best for
L4      24 GB     Budget    Quantized 7B-14B models
L40S    48 GB     Mid       Production 14B-32B, great cost/performance
A100    80 GB     High      32B FP16, 70B quantized
H100    80 GB     High      Production 70B, FP8, TensorRT-LLM
H200    141 GB    Premium   70B FP16, large MoE models
B200    180 GB    Premium   70B+ FP16, maximum performance

The agent matches your model to the right GPU based on model size, quantization method, and your performance priority.
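A rough way to see why model size and quantization drive the GPU choice: the weights alone need about parameters × bytes-per-parameter of VRAM, plus headroom for the KV cache and activations. The sketch below uses that common back-of-the-envelope formula with an assumed 20% overhead factor; it mirrors the intuition behind the tiers above, not RunInfra's actual selection logic.

```python
# Back-of-the-envelope VRAM estimate: weights = params * bytes per parameter,
# plus ~20% headroom (assumed) for KV cache and activations.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

# GPU tiers from the table above (VRAM in GB), smallest first.
GPUS = [("L4", 24), ("L40S", 48), ("A100", 80), ("H100", 80),
        ("H200", 141), ("B200", 180)]

def estimate_vram_gb(params_b, precision, overhead=1.2):
    """Approximate serving VRAM for a model with params_b billion parameters."""
    return params_b * BYTES_PER_PARAM[precision] * overhead

def smallest_fit(params_b, precision):
    """First GPU in the tier list with enough VRAM, or None if nothing fits."""
    need = estimate_vram_gb(params_b, precision)
    return next((name for name, vram in GPUS if vram >= need), None)

print(smallest_fit(14, "int4"))   # 14B at 4-bit needs ~8.4 GB
print(smallest_fit(32, "fp16"))   # 32B at FP16 needs ~76.8 GB
```

This is why a quantized 14B model lands on a budget L4 while the same model at FP16 needs a mid-tier card or better.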

How GPU selection affects your cost

Larger GPUs serve tokens faster, which means:

  • Faster GPU = lower latency per request
  • More VRAM = can run larger models or higher precision
  • Higher tier = better throughput (more requests per second)

You pay per token regardless of GPU. The GPU choice affects your latency and throughput, not your per-token rate.
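To make that concrete: at a fixed per-token price, the GPU only changes how quickly a response streams back. The throughput figures in this sketch are invented placeholders for illustration, not published RunInfra benchmarks, and the medium-model output rate is the starting price from the table above.

```python
# Per-token price is fixed; GPU choice changes speed, not cost.
# Tokens/sec numbers below are illustrative placeholders only.
PRICE_PER_OUTPUT_MTOK = 0.80  # medium-model output "from" rate
ILLUSTRATIVE_TOK_PER_SEC = {"budget_gpu": 30, "high_gpu": 120}

def request_cost_and_latency(gpu, output_tokens):
    """Return (cost in dollars, generation time in seconds) for one response."""
    cost = output_tokens / 1_000_000 * PRICE_PER_OUTPUT_MTOK
    latency_s = output_tokens / ILLUSTRATIVE_TOK_PER_SEC[gpu]
    return round(cost, 6), round(latency_s, 1)

# Same 600-token response: identical cost, 4x difference in generation time.
print(request_cost_and_latency("budget_gpu", 600))
print(request_cost_and_latency("high_gpu", 600))
```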

Deployment modes

Flex (Pro+): Scale-to-zero. You only pay for tokens while requests are being processed; when the endpoint is idle, nothing runs and nothing is billed.

Active (Team+): Always-on. Zero cold start. Better for high-traffic production endpoints.
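The trade-off between the two modes can be sketched as a toy decision helper. The traffic threshold here is invented for illustration and is not a RunInfra recommendation; it simply encodes that Flex costs nothing while idle but can cold-start after quiet periods, while Active stays warm.

```python
# Toy Flex-vs-Active heuristic. The 1000 requests/hour threshold is an
# invented illustration, not a RunInfra recommendation.
def suggest_mode(requests_per_hour, latency_sensitive):
    """Pick a deployment mode for a rough traffic profile."""
    if latency_sensitive or requests_per_hour > 1000:
        return "active"   # always-on, zero cold start (Team+)
    return "flex"         # scale-to-zero, idle costs nothing (Pro+)

print(suggest_mode(20, latency_sensitive=False))    # bursty dev traffic
print(suggest_mode(5000, latency_sensitive=True))   # high-traffic production
```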

Letting the agent choose

The agent picks the optimal GPU during optimization. To influence the choice, try prompts like:

"Use a budget GPU for this, I care about cost."
"Use an H100 for maximum performance."
"What GPU do you recommend for a 14B model?"

The agent considers your model size, quantization, priority, and constraints when recommending hardware.
