GPU and Pricing
How RunInfra pricing works and which GPUs power your endpoints.
RunInfra charges per million tokens, not per GPU hour. You never see infrastructure costs. The agent picks the best GPU for your model automatically, but you can influence the choice.
Token pricing
Inference is billed per million tokens. Estimated starting rates by model size:
| Model size | Input (from) | Output (from) |
|---|---|---|
| Small (1-8B) | $0.08 / MTok | $0.20 / MTok |
| Medium (8-30B) | $0.20 / MTok | $0.80 / MTok |
| Large (30-70B) | $0.45 / MTok | $1.50 / MTok |
| XL (70B+) | $0.80 / MTok | $2.50 / MTok |
These are estimated starting prices. Your actual cost depends on your full pipeline configuration: model, quantization method, GPU tier, routing strategy, and deployment mode. The deploy tab shows your estimated per-token cost before you deploy.
Team plans get 10% off at 100M+ tokens/month. Enterprise gets up to 40% off.
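As a rough illustration, the starting rates in the table can be turned into a monthly estimate. A minimal sketch in Python, assuming the table's starting rates and the 10% Team discount at 100M+ tokens/month (your actual per-token cost depends on your full pipeline configuration, as noted above):

```python
# Estimated starting rates in USD per million tokens (from the table above)
RATES = {
    "small":  {"input": 0.08, "output": 0.20},   # 1-8B
    "medium": {"input": 0.20, "output": 0.80},   # 8-30B
    "large":  {"input": 0.45, "output": 1.50},   # 30-70B
    "xl":     {"input": 0.80, "output": 2.50},   # 70B+
}

def estimate_monthly_cost(size, input_mtok, output_mtok, team_plan=False):
    """Rough monthly estimate; real cost depends on pipeline config."""
    r = RATES[size]
    cost = input_mtok * r["input"] + output_mtok * r["output"]
    # Team plans get 10% off at 100M+ tokens/month
    if team_plan and (input_mtok + output_mtok) >= 100:
        cost *= 0.90
    return round(cost, 2)

# 80M input + 40M output tokens on a medium model, Team plan
print(estimate_monthly_cost("medium", 80, 40, team_plan=True))  # 43.2
```

The deploy tab remains the source of truth for your actual per-token rate; this sketch only applies the published starting prices.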
Available GPUs
RunInfra selects from these GPU tiers during optimization:
| GPU | VRAM | Tier | Best for |
|---|---|---|---|
| L4 | 24 GB | Budget | Quantized 7B-14B models |
| L40S | 48 GB | Mid | Production 14B-32B, great cost/performance |
| A100 | 80 GB | High | 32B FP16, 70B quantized |
| H100 | 80 GB | High | Production 70B, FP8, TensorRT-LLM |
| H200 | 141 GB | Premium | 70B FP16, large MoE models |
| B200 | 180 GB | Premium | 70B+ FP16, maximum performance |
The agent matches your model to the right GPU based on model size, quantization method, and your performance priority.
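The matching logic can be approximated with a common rule of thumb (an illustration, not RunInfra's actual selection algorithm): model weights need roughly parameters × bytes-per-parameter of VRAM, plus headroom for the KV cache and activations. A sketch using the tiers from the table, with an assumed 20% headroom factor:

```python
# GPU tiers and VRAM (GB) from the table above, smallest first
GPUS = [("L4", 24), ("L40S", 48), ("A100", 80),
        ("H100", 80), ("H200", 141), ("B200", 180)]

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def smallest_fitting_gpu(params_b, precision, headroom=1.2):
    """Smallest GPU whose VRAM covers the weights plus ~20% headroom
    for KV cache and activations. Rule of thumb only."""
    needed_gb = params_b * BYTES_PER_PARAM[precision] * headroom
    for name, vram in GPUS:
        if vram >= needed_gb:
            return name
    return None  # nothing fits at this precision

print(smallest_fitting_gpu(14, "fp16"))  # L40S (~33.6 GB needed)
print(smallest_fitting_gpu(32, "fp16"))  # A100 (~76.8 GB needed)
```

Both examples line up with the "Best for" column: a 14B model at FP16 lands on the L40S, and a 32B model at FP16 needs an A100-class card.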
How GPU selection affects your cost
GPU tier changes how your endpoint performs in three ways:
- Faster GPU = lower latency per request
- More VRAM = can run larger models or higher precision
- Higher tier = better throughput (more requests per second)
You pay per token regardless of GPU. The GPU choice affects your latency and throughput, not your per-token rate.
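To make the point concrete: the same job costs the same on any GPU, but finishes sooner on a faster one. A sketch with hypothetical throughput numbers (the tokens-per-second figures below are illustrative assumptions, not published benchmarks):

```python
RATE_PER_MTOK = 1.50  # large-model output rate from the table above

def serving_estimate(tokens_m, tok_per_sec):
    """Token cost is GPU-independent; only wall-clock serving time changes."""
    cost = tokens_m * RATE_PER_MTOK
    hours = tokens_m * 1_000_000 / tok_per_sec / 3600
    return cost, hours

# Hypothetical throughputs, for illustration only
for gpu, tps in [("L40S", 2_000), ("H100", 8_000)]:
    cost, hours = serving_estimate(10, tps)
    print(f"{gpu}: ${cost:.2f} for 10M output tokens, ~{hours:.1f} h")
```

The dollar figure is identical for both GPUs; only the serving time differs, which is what drives latency and throughput for your users.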
Deployment modes
Flex (Pro+): Scale-to-zero. You only pay for tokens when requests are being processed. When idle, nothing runs, nothing costs.
Active (Team+): Always-on. Zero cold start. Better for high-traffic production endpoints.
Letting the agent choose
The agent picks the optimal GPU during optimization. If you want to influence it:
- "Use a budget GPU for this, I care about cost"
- "Use an H100 for maximum performance"
- "What GPU do you recommend for a 14B model?"

The agent considers your model size, quantization, priority, and constraints when recommending hardware.