GPUs and pricing - RunInfra

RunInfra charges per million tokens, not per GPU hour. You never interact with infrastructure directly: the agent automatically selects the right GPU for your model based on its size, quantization method, and your performance priority. This page explains how token pricing works, which GPUs are available, and how to influence hardware selection when you need to.

Cost

Inference cost depends on the full pipeline: model, quantization, GPU tier, deployment mode. The Deploy tab projects your actual per-request cost for the exact configuration before you commit. See Plans for plan-level details and session budgets.
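As a rough illustration of per-million-token pricing, the sketch below estimates the cost of a single request. The function name and the rates are hypothetical placeholders, not RunInfra prices; the Deploy tab remains the source of the actual projected cost for your configuration.

```python
def estimate_request_cost(input_tokens: int, output_tokens: int,
                          input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Return the dollar cost of one request under per-million-token pricing."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Each rate is expressed in dollars per one million tokens.
    input_cost = input_tokens / 1_000_000 * input_rate_per_m
    output_cost = output_tokens / 1_000_000 * output_rate_per_m
    return input_cost + output_cost


# Hypothetical rates (assumptions for illustration only):
# $0.20 per million input tokens, $0.60 per million output tokens.
cost = estimate_request_cost(1_500, 500, input_rate_per_m=0.20, output_rate_per_m=0.60)
print(f"${cost:.6f} per request")
```

Multiplying the result by your expected request volume gives a first-order monthly estimate, assuming the token counts per request stay roughly constant.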

Available GPU tiers

RunInfra selects from these GPU tiers during the optimization process. The agent matches your model to the right GPU based on model size, quantization method, and your performance priority.

GPU	VRAM	Tier	Best for
L4	24 GB	Budget	Quantized 7B-14B models
L40S	48 GB	Mid	Production 14B-32B, great cost/performance ratio
A100	80 GB	High	32B FP16, 70B quantized
H100	80 GB	High	Production 70B, FP8, TensorRT-LLM
H200	141 GB	Premium	70B FP16, large MoE models
B200	180 GB	Premium	70B+ FP16, maximum performance
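To get a feel for why model size and precision map to these tiers, the sketch below computes a weights-only memory estimate and picks the smallest GPU from the table whose VRAM holds it. This is a rule of thumb, not the agent's actual selection logic: it ignores KV cache, activations, and runtime overhead, so real deployments need more headroom than it suggests. The function names are illustrative.

```python
from typing import Optional

# (GPU name, VRAM in GB) taken from the table above, smallest first.
GPU_TIERS = [("L4", 24), ("L40S", 48), ("A100", 80),
             ("H100", 80), ("H200", 141), ("B200", 180)]


def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Weights-only footprint: parameters times bytes per parameter."""
    return params_billions * bits_per_param / 8


def smallest_fitting_gpu(params_billions: float, bits_per_param: int) -> Optional[str]:
    """Return the first GPU whose VRAM holds the weights, or None if none does."""
    needed = weight_memory_gb(params_billions, bits_per_param)
    for name, vram_gb in GPU_TIERS:
        if needed <= vram_gb:
            return name
    return None


print(weight_memory_gb(70, 16))      # 70B parameters at FP16
print(smallest_fitting_gpu(14, 4))   # 14B parameters at 4-bit quantization
print(smallest_fitting_gpu(32, 16))  # 32B parameters at FP16
```

Because the estimate is a lower bound, treat a model that barely fits as a candidate for the next tier up.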

How GPU selection affects your cost

Larger GPUs serve tokens faster, which changes the economics of your workload in predictable ways:

Faster GPU

Lower latency per request. Better for real-time applications where response time matters.

More VRAM

Supports larger models or higher precision (FP16 instead of quantized). Expands your model options.

Higher tier

Better throughput, more requests per second from a single replica.

You pay per token regardless of which GPU runs your model. GPU selection affects latency and throughput, not your per-token rate.

Deployment modes

The deployment mode you choose affects when you’re billed, not how much you pay per token.

Flex (Core)
Active (Core)

Scale-to-zero. Your endpoint only runs when it’s processing requests. When idle, nothing runs and you pay nothing.

Pay per token only when requests are being processed
Cold starts under 2 seconds on RunInfra Cloud
Endpoint scales down after 5 minutes of inactivity
Best for development, variable traffic, and cost-sensitive workloads
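The scale-down rule above can be sketched as a simple idle check. This is a minimal model of the behavior described, not RunInfra's implementation: the function name is hypothetical, and treating exactly 5 minutes as "scaled down" is an assumption.

```python
IDLE_TIMEOUT_SECONDS = 5 * 60  # from the list above: 5 minutes of inactivity


def is_scaled_down(last_request_at: float, now: float,
                   idle_timeout: float = IDLE_TIMEOUT_SECONDS) -> bool:
    """Return True if the endpoint would have scaled to zero by `now`.

    Both timestamps are in seconds on the same clock. The inclusive
    boundary (>=) is an assumption made for this sketch.
    """
    return now - last_request_at >= idle_timeout


# A request arriving 4 minutes after the previous one finds a warm endpoint;
# one arriving 6 minutes later would incur a cold start.
print(is_scaled_down(last_request_at=0.0, now=4 * 60))
print(is_scaled_down(last_request_at=0.0, now=6 * 60))
```

If your traffic regularly has gaps longer than the idle timeout, expect a cold start on the first request after each gap.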

Influencing GPU selection

The agent picks the optimal GPU during optimization. If you want to guide the selection, tell it directly:

Use a budget GPU for this, I care about cost

Use an H100 for maximum performance

What GPU do you recommend for a 14B model?

The agent considers your model size, quantization method, priority, and any hard constraints when recommending hardware. If you specify a GPU tier that is underprovisioned for your model, the agent will warn you before proceeding.

Known limitations

FP8 availability depends on the exact GPU architecture, runtime, and FP8 method. RunInfra only offers FP8-family artifacts when the compatibility check passes, and marks SM-bound exports accordingly.
TensorRT-LLM requires a paid Core plan.
Active mode requires a paid Core plan.

Common questions

Why do output tokens cost more than input tokens?

Output tokens cost real GPU-seconds to generate (autoregressive, one forward pass per token). Input tokens are processed in a single batched pass; they still cost compute, but dramatically less per token. Every major provider prices this way.
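The asymmetry can be made concrete by counting forward passes under the simplified model described above: one batched pass for the whole prompt and one pass per generated token. The function is an illustrative sketch only; it ignores cross-request batching, caching, and the fact that a prefill pass does more work than a single decode pass.

```python
def forward_passes(input_tokens: int, output_tokens: int) -> dict:
    """Count forward passes under a simplified prefill/decode model."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # All input tokens are processed together in one batched pass.
    prefill = 1 if input_tokens > 0 else 0
    # Each output token is generated autoregressively, one pass per token.
    decode = output_tokens
    return {"prefill": prefill, "decode": decode}


passes = forward_passes(input_tokens=1_000, output_tokens=200)
print(passes)
```

In this example, 1,000 input tokens share a single pass while 200 output tokens need 200 sequential passes, which is why output tokens dominate GPU time per token.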

What's the per-second billing granularity?

Token billing is per exact token, not per second, and has no minimum. Active-mode base fees accrue per second of warm-replica time, subject to a 60-second minimum per start.

Can I see the GPU tier the optimizer picked?

Yes. The variant card in the optimization results shows the GPU for each variant. The Deploy tab shows the active deployment’s GPU under Details.

What if no GPU tier fits my constraints?

The optimizer reports “no viable variants” with a recommendation to relax a constraint. Usually raising the cost ceiling or widening the latency ceiling unlocks options. Failed runs do not consume a session.
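The effect of relaxing a constraint can be sketched as a filter over candidate variants. Everything here is hypothetical (the `Variant` fields, the names, and the numbers); it only illustrates how hard ceilings can eliminate every candidate and how widening one brings options back.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    name: str
    cost_per_m_tokens: float  # dollars per million tokens (hypothetical)
    p95_latency_ms: float     # hypothetical latency figure


def viable_variants(variants: List[Variant], max_cost: float,
                    max_latency_ms: float) -> List[Variant]:
    """Keep only variants that satisfy both hard ceilings."""
    return [v for v in variants
            if v.cost_per_m_tokens <= max_cost and v.p95_latency_ms <= max_latency_ms]


candidates = [Variant("a", 0.40, 900.0), Variant("b", 0.90, 350.0)]

# Ceilings that no candidate satisfies: the "no viable variants" case.
strict = viable_variants(candidates, max_cost=0.30, max_latency_ms=300.0)

# Raising the cost ceiling and widening the latency ceiling unlocks variant "a".
relaxed = viable_variants(candidates, max_cost=0.50, max_latency_ms=1000.0)
print([v.name for v in strict], [v.name for v in relaxed])
```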

Do I pay for the playground?

Free (trial) workspaces are capped at 100 playground requests per day. On Core and Enterprise, the playground is unlimited and free; we do not charge credits for playground use.

Is there a minimum GPU-time billing on Active mode?

Active billing is per-second with a 60-second minimum, so a start that runs for less than 60 seconds is billed as 60 seconds. Beyond that minimum, every second of running time is billed.
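A minimal sketch of that rule, assuming partial seconds round up and a replica that never started bills nothing (both are assumptions, as is the function name):

```python
import math

MINIMUM_BILLABLE_SECONDS = 60  # the 60-second minimum stated above


def billable_seconds(warm_seconds: float) -> int:
    """Return billed seconds for one Active-mode start.

    Assumptions for this sketch: fractional seconds round up, and a
    non-positive duration means the replica never ran.
    """
    if warm_seconds <= 0:
        return 0
    return max(MINIMUM_BILLABLE_SECONDS, math.ceil(warm_seconds))


for duration in (12, 60, 61.2, 3600):
    print(duration, "->", billable_seconds(duration))
```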

Next steps

Plans and sessions

Compare the Core and Enterprise plans.

Supported models

LLMs, speech-to-text, and text-to-speech with pricing by size.

Optimize for cost

Use cost priority and hard budget constraints in optimization.

Monitor cost in real time

Track spend with daily charts, per-model breakdowns, and alerts.
