RunInfra charges per million tokens, not per GPU hour. You never interact with infrastructure directly: the agent selects the right GPU for your model automatically based on size, quantization method, and your performance priority. This page explains how token pricing works, which GPUs are available, and how to influence hardware selection when you need to.
Cost
Inference cost depends on the full pipeline: model, quantization, GPU tier, deployment mode. The Deploy tab projects your actual per-request cost for the exact configuration before you commit. See Plans for plan-level details and session budgets.
Available GPU tiers
RunInfra selects from these GPU tiers during the optimization process. The agent matches your model to the right GPU based on model size, quantization method, and your performance priority.
| GPU | VRAM | Tier | Best for |
|---|---|---|---|
| L4 | 24 GB | Budget | Quantized 7B-14B models |
| L40S | 48 GB | Mid | Production 14B-32B, great cost/performance ratio |
| A100 | 80 GB | High | 32B FP16, 70B quantized |
| H100 | 80 GB | High | Production 70B, FP8, TensorRT-LLM |
| H200 | 141 GB | Premium | 70B FP16, large MoE models |
| B200 | 180 GB | Premium | 70B+ FP16, maximum performance |
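A rough weights-only memory estimate helps explain why the table pairs model sizes with tiers. The sketch below is an illustration, not RunInfra's actual selection logic: the function names, the bytes-per-parameter values, and the "smallest tier that fits" rule are all assumptions. It also ignores KV cache and activation memory, which need extra headroom in practice.

```python
from typing import Optional

# Hypothetical sketch: weights-only VRAM estimate against the tiers in the table.
# Not RunInfra's selection logic; KV cache and activations are ignored.

GPU_VRAM_GB = {  # VRAM values copied from the table
    "L4": 24, "L40S": 48, "A100": 80, "H100": 80, "H200": 141, "B200": 180,
}

# Assumed storage cost per parameter; "int4" stands in for 4-bit AWQ/GPTQ.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def smallest_fitting_gpu(params_billion: float, precision: str) -> Optional[str]:
    """Return the lowest-VRAM GPU whose memory covers the weights, or None."""
    need = weights_gb(params_billion, precision)
    for gpu, vram in sorted(GPU_VRAM_GB.items(), key=lambda kv: kv[1]):
        if vram >= need:
            return gpu
    return None
```

Under these assumptions, a 32B model in FP16 needs about 64 GB for weights and lands on the A100, while a 70B model in FP16 needs about 140 GB and lands on the H200, in line with the corresponding table rows.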
How GPU selection affects your cost
Larger GPUs serve tokens faster, which changes the economics of your workload in predictable ways:
Faster GPU
Lower latency per request. Better for real-time applications where response time matters.
More VRAM
Supports larger models or higher precision (FP16 instead of quantized). Expands your model options.
Higher tier
Better throughput, more requests per second from a single replica.
Deployment modes
The deployment mode you choose affects when you’re billed, not how much you pay per token. Two modes are available:
- Flex (Pro+)
- Active (Team+)
Flex is scale-to-zero. Your endpoint only runs when it’s processing requests. When idle, nothing runs and you pay nothing.
- Pay per token only when requests are being processed
- Cold starts under 2 seconds on RunInfra Cloud
- Endpoint scales down after 5 minutes of inactivity
- Best for development, variable traffic, and cost-sensitive workloads
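To get a feel for how the 5-minute idle window interacts with a traffic pattern, the sketch below counts how many requests would arrive at a scaled-down endpoint. It is a simplified model, not RunInfra's scheduler: the function name is hypothetical, and it assumes each request is instantaneous so the idle timer restarts at the arrival time.

```python
from typing import Sequence

# Hypothetical sketch of Flex scale-to-zero behaviour; not RunInfra's scheduler.
IDLE_TIMEOUT_S = 5 * 60  # endpoint scales down after 5 minutes of inactivity


def count_cold_starts(request_times_s: Sequence[float]) -> int:
    """Count requests that arrive while the endpoint is scaled to zero.

    Assumes requests are instantaneous, so the idle timer restarts
    at each arrival time.
    """
    cold_starts = 0
    last_request = None
    for t in sorted(request_times_s):
        if last_request is None or t - last_request >= IDLE_TIMEOUT_S:
            cold_starts += 1  # first request, or idle window has elapsed
        last_request = t
    return cold_starts
```

For arrival times of 0, 60, 200, 600, and 650 seconds, this model counts two cold starts: the first request, and the one at 600 seconds, which arrives 400 seconds after the previous request.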
Influencing GPU selection
The agent picks the optimal GPU during optimization. If you want to guide the selection, tell it directly.
The agent considers your model size, quantization method, priority, and any hard constraints when recommending hardware. If you specify a GPU tier that is underprovisioned for your model, the agent will warn you before proceeding.
Known limitations
- FP8 requires the H100 tier or higher (H100, H200, B200). FP8 is not available on L4, L40S, or A100; the optimizer falls back to AWQ or GPTQ on those tiers.
- TensorRT-LLM requires the Team plan or higher.
- Active mode requires the Team plan or higher.
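The FP8 limitation can be read as a simple lookup. The following is a minimal sketch under stated assumptions: the function and constant names are illustrative rather than a RunInfra API, and because the page only says the optimizer falls back to "AWQ or GPTQ", picking AWQ here is an arbitrary choice.

```python
# Hypothetical sketch of the FP8 fallback rule; names are illustrative, not a RunInfra API.
FP8_TIERS = {"H100", "H200", "B200"}  # tiers where this page says FP8 is available


def resolve_quantization(requested: str, gpu: str) -> str:
    """Return the quantization method that would run on `gpu`.

    The page says the optimizer falls back to AWQ or GPTQ on other tiers;
    returning AWQ is an assumption made for this example.
    """
    if requested.upper() == "FP8" and gpu not in FP8_TIERS:
        return "AWQ"
    return requested
```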
Common questions
Why do output tokens cost more than input tokens?
Output tokens are generated autoregressively, one forward pass per token, so each one consumes GPU time. Input tokens are processed in a single batched pass; they still cost compute, but far less per token. Most major providers price this way.
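As a worked example of per-million-token billing with separate input and output rates, consider the sketch below. The rates are placeholders chosen for easy arithmetic, not RunInfra prices, and the function name is hypothetical; the Deploy tab remains the source for real projections.

```python
# Worked example of per-million-token billing with separate input/output rates.
# The rates used below are placeholders, NOT RunInfra prices.

def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Cost of one request when each rate is quoted per million tokens."""
    total = input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    return total / 1_000_000


# 2,000 input tokens at a placeholder $0.20/M plus 500 output tokens at $0.80/M.
cost = request_cost_usd(2_000, 500, input_rate_per_m=0.20, output_rate_per_m=0.80)
```

With these placeholder rates, the 500 output tokens contribute as much to the total as the 2,000 input tokens, and the request costs about $0.0008.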
What's the per-second billing granularity?
Token billing is per exact token, not per second, and has no minimum. Active-mode base fees accrue per second of warm-replica time, subject to the 60-second minimum per start.
Can I see the GPU tier the optimizer picked?
Yes. The variant card in the optimization results shows the GPU for each variant. The Deploy tab shows the active deployment’s GPU under Details.
What if no GPU tier fits my constraints?
The optimizer reports “no viable variants” with a recommendation to relax a constraint. Raising the cost ceiling or the latency ceiling usually unlocks options. Failed runs do not consume a session.
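The effect of relaxing a constraint can be illustrated with a small filter. This is a sketch only: the `Variant` class, its fields, and the candidate figures are made up for illustration and do not reflect RunInfra's data model or real prices.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    """Hypothetical stand-in for one optimization result."""
    name: str
    cost_per_m_tokens: float  # placeholder USD value, not a real price
    latency_ms: float         # placeholder latency figure


def viable_variants(variants: List[Variant],
                    cost_ceiling: Optional[float] = None,
                    latency_ceiling_ms: Optional[float] = None) -> List[Variant]:
    """Keep variants that satisfy every hard constraint that is set."""
    return [
        v for v in variants
        if (cost_ceiling is None or v.cost_per_m_tokens <= cost_ceiling)
        and (latency_ceiling_ms is None or v.latency_ms <= latency_ceiling_ms)
    ]


# Made-up candidates for illustration only.
candidates = [
    Variant("awq-l40s", cost_per_m_tokens=0.30, latency_ms=900),
    Variant("fp8-h100", cost_per_m_tokens=0.90, latency_ms=350),
]

strict = viable_variants(candidates, cost_ceiling=0.50, latency_ceiling_ms=500)
relaxed = viable_variants(candidates, cost_ceiling=1.00, latency_ceiling_ms=500)
```

Here `strict` is empty because no candidate meets both ceilings, while raising the cost ceiling in `relaxed` admits the faster, more expensive variant.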
Do I pay for the playground?
Playground requests on Starter count against your 100/day cap. On Pro+ they are unlimited and free: playground tokens are not billed.
Is there a minimum GPU-time billing on Active mode?
Active billing is per-second with a 60-second minimum, so a start that runs for less than 60 seconds is billed as 60 seconds. After that, billing continues per second.
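The minimum can be expressed as a one-line rule. The sketch below assumes the minimum applies once per start and that warm time is already measured in whole seconds; the function name is hypothetical.

```python
# Hypothetical sketch of the Active-mode minimum; assumes it applies once per start
# and that warm time is measured in whole seconds.
ACTIVE_MINIMUM_S = 60


def billable_seconds(warm_seconds: int) -> int:
    """Seconds billed for one warm period under a 60-second minimum."""
    if warm_seconds <= 0:
        return 0  # the replica never started, so nothing is billed
    return max(ACTIVE_MINIMUM_S, warm_seconds)
```

Under this reading, a 12-second start is billed as 60 seconds, while a 61-second run is billed as 61 seconds.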
Next steps
Plans and sessions
Compare Starter, Pro, Team, and Enterprise plans.
Supported models
LLMs, speech-to-text, and text-to-speech with pricing by size.
Optimize for cost
Use cost priority and hard budget constraints in optimization.
Monitor cost in real time
Track spend with daily charts, per-model breakdowns, and alerts.