GPU and Pricing
How RunInfra pricing works and which GPUs power your endpoints.
RunInfra charges per million tokens, not per GPU hour. You never see infrastructure costs. The agent picks the best GPU for your model automatically, but you can influence the choice.
Token pricing
Inference is billed per million tokens. Estimated starting rates by model size:
| Model size | Input (from) | Output (from) |
|---|---|---|
| Small (1-8B) | $0.08 / MTok | $0.20 / MTok |
| Medium (8-30B) | $0.20 / MTok | $0.80 / MTok |
| Large (30-70B) | $0.45 / MTok | $1.50 / MTok |
| XL (70B+) | $0.80 / MTok | $2.50 / MTok |
These are estimated starting prices. Your actual cost depends on your full pipeline configuration: model, quantization method, GPU tier, routing strategy, and deployment mode. The deploy tab shows your estimated per-token cost before you deploy.
Team plans get 10% off at 100M+ tokens/month. Enterprise gets up to 40% off.
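As a rough illustration, the starting rates in the table can be turned into a monthly estimate. A minimal sketch in Python, assuming the table's starting rates and the 10% Team discount at 100M+ tokens/month (your actual per-token cost depends on your full pipeline configuration, as noted above):

```python
# Estimated starting rates in USD per million tokens (from the table above)
RATES = {
    "small":  {"input": 0.08, "output": 0.20},   # 1-8B
    "medium": {"input": 0.20, "output": 0.80},   # 8-30B
    "large":  {"input": 0.45, "output": 1.50},   # 30-70B
    "xl":     {"input": 0.80, "output": 2.50},   # 70B+
}

def estimate_monthly_cost(size, input_mtok, output_mtok, team_plan=False):
    """Rough monthly estimate; real cost depends on pipeline config."""
    r = RATES[size]
    cost = input_mtok * r["input"] + output_mtok * r["output"]
    # Team plans get 10% off at 100M+ tokens/month
    if team_plan and (input_mtok + output_mtok) >= 100:
        cost *= 0.90
    return round(cost, 2)

# 80M input + 40M output tokens on a medium model, Team plan
print(estimate_monthly_cost("medium", 80, 40, team_plan=True))  # 43.2
```

The deploy tab remains the source of truth for your actual per-token rate; this sketch only applies the published starting prices.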
Available GPUs
RunInfra selects from these GPU tiers during optimization:
| GPU | VRAM | Tier | Best for |
|---|---|---|---|
| L4 | 24 GB | Budget | Quantized 7B-14B models |
| L40S | 48 GB | Mid | Production 14B-32B, great cost/performance |
| A100 | 80 GB | High | 32B FP16, 70B quantized |
| H100 | 80 GB | High | Production 70B, FP8, TensorRT-LLM |
| H200 | 141 GB | Premium | 70B FP16, large MoE models |
| B200 | 180 GB | Premium | 70B+ FP16, maximum performance |
The agent matches your model to the right GPU based on model size, quantization method, and your performance priority.
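The matching logic can be approximated with a common rule of thumb (an illustration, not RunInfra's actual selection algorithm): model weights need roughly parameters × bytes-per-parameter of VRAM, plus headroom for the KV cache and activations. A sketch using the tiers from the table, with an assumed 20% headroom factor:

```python
# GPU tiers and VRAM (GB) from the table above, smallest first
GPUS = [("L4", 24), ("L40S", 48), ("A100", 80),
        ("H100", 80), ("H200", 141), ("B200", 180)]

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def smallest_fitting_gpu(params_b, precision, headroom=1.2):
    """Smallest GPU whose VRAM covers the weights plus ~20% headroom
    for KV cache and activations. Rule of thumb only."""
    needed_gb = params_b * BYTES_PER_PARAM[precision] * headroom
    for name, vram in GPUS:
        if vram >= needed_gb:
            return name
    return None  # nothing fits at this precision

print(smallest_fitting_gpu(14, "fp16"))  # L40S (~33.6 GB needed)
print(smallest_fitting_gpu(32, "fp16"))  # A100 (~76.8 GB needed)
```

Both examples line up with the "Best for" column: a 14B model at FP16 lands on the L40S, and a 32B model at FP16 needs an A100-class card.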
How GPU selection affects your cost
GPU tier changes how your endpoint performs in three ways:
- Faster GPU = lower latency per request
- More VRAM = can run larger models or higher precision
- Higher tier = better throughput (more requests per second)
You pay per token regardless of GPU. The GPU choice affects your latency and throughput, not your per-token rate.
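To make the point concrete: the same job costs the same on any GPU, but finishes sooner on a faster one. A sketch with hypothetical throughput numbers (the tokens-per-second figures below are illustrative assumptions, not published benchmarks):

```python
RATE_PER_MTOK = 1.50  # large-model output rate from the table above

def serving_estimate(tokens_m, tok_per_sec):
    """Token cost is GPU-independent; only wall-clock serving time changes."""
    cost = tokens_m * RATE_PER_MTOK
    hours = tokens_m * 1_000_000 / tok_per_sec / 3600
    return cost, hours

# Hypothetical throughputs, for illustration only
for gpu, tps in [("L40S", 2_000), ("H100", 8_000)]:
    cost, hours = serving_estimate(10, tps)
    print(f"{gpu}: ${cost:.2f} for 10M output tokens, ~{hours:.1f} h")
```

The dollar figure is identical for both GPUs; only the serving time differs, which is what drives latency and throughput for your users.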
Deployment modes
Flex (Pro+): Scale-to-zero. You only pay for tokens when requests are being processed. When idle, nothing runs, nothing costs.
Active (Team+): Always-on. Zero cold start. Better for high-traffic production endpoints.
Letting the agent choose
The agent picks the optimal GPU during optimization. If you want to influence it:
- "Use a budget GPU for this, I care about cost"
- "Use an H100 for maximum performance"
- "What GPU do you recommend for a 14B model?"

The agent considers your model size, quantization, priority, and constraints when recommending hardware.