RunInfra charges per million tokens, not per GPU hour. You never interact with infrastructure directly: the agent automatically selects the right GPU for your model based on its size, quantization method, and your performance priority. This page explains how token pricing works, which GPUs are available, and how to influence hardware selection when you need to.

Cost

Inference cost depends on the full pipeline: model, quantization, GPU tier, deployment mode. The Deploy tab projects your actual per-request cost for the exact configuration before you commit. See Plans for plan-level details and session budgets.
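
As a rough illustration of per-million-token billing, the sketch below computes the cost of a single request. The rates are made-up placeholders, not RunInfra prices, and the split into separate input and output rates is an assumption; the Deploy tab remains the source for your actual projected cost.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class TokenRates:
    """Hypothetical per-million-token prices in USD (placeholders, not real rates)."""
    input_per_million: float
    output_per_million: float


def request_cost(rates: TokenRates, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: (tokens / 1,000,000) * rate, summed over input and output."""
    input_cost = input_tokens / TOKENS_PER_MILLION * rates.input_per_million
    output_cost = output_tokens / TOKENS_PER_MILLION * rates.output_per_million
    return input_cost + output_cost


# Placeholder rates chosen only to make the arithmetic easy to follow.
rates = TokenRates(input_per_million=0.20, output_per_million=0.80)
cost = request_cost(rates, input_tokens=1_500, output_tokens=500)
print(f"${cost:.6f} per request")
```

With these placeholder rates, a request with 1,500 input tokens and 500 output tokens comes to 0.0003 + 0.0004 = 0.0007 USD.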

Available GPU tiers

RunInfra selects from these GPU tiers during the optimization process. The agent matches your model to the right GPU based on model size, quantization method, and your performance priority.

| GPU | VRAM | Tier | Best for |
|------|--------|---------|----------|
| L4 | 24 GB | Budget | Quantized 7B-14B models |
| L40S | 48 GB | Mid | Production 14B-32B, great cost/performance ratio |
| A100 | 80 GB | High | 32B FP16, 70B quantized |
| H100 | 80 GB | High | Production 70B, FP8, TensorRT-LLM |
| H200 | 141 GB | Premium | 70B FP16, large MoE models |
| B200 | 180 GB | Premium | 70B+ FP16, maximum performance |
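
To build intuition for why model size and quantization drive the tier, here is a minimal sketch of a VRAM-fit check. It is not the agent's actual selection logic: it counts weight memory only (parameters times bits per parameter) and ignores KV cache, activations, and your performance priority, so real recommendations will often land on a larger tier. The function names are hypothetical.

```python
from typing import Optional

# (GPU, VRAM in GB) taken from the tier table above, smallest first.
GPU_TIERS = [
    ("L4", 24),
    ("L40S", 48),
    ("A100", 80),
    ("H100", 80),
    ("H200", 141),
    ("B200", 180),
]


def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight footprint: 1B parameters at 8 bits is about 1 GB."""
    return params_billion * bits_per_param / 8


def smallest_fitting_gpu(params_billion: float, bits_per_param: int) -> Optional[str]:
    """Return the first listed GPU whose VRAM holds the weights, or None."""
    needed = weight_memory_gb(params_billion, bits_per_param)
    for name, vram_gb in GPU_TIERS:
        if vram_gb >= needed:
            return name
    return None


print(smallest_fitting_gpu(7, 4))    # 4-bit 7B model
print(smallest_fitting_gpu(32, 16))  # FP16 32B model
print(smallest_fitting_gpu(70, 16))  # FP16 70B model
```

Because the check ignores runtime overhead, treat its answer as a lower bound on the tier rather than a recommendation.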

How GPU selection affects your cost

Larger GPUs serve tokens faster, which changes the economics of your workload in predictable ways:

Faster GPU

Lower latency per request. Better for real-time applications where response time matters.

More VRAM

Supports larger models or higher precision (FP16 instead of quantized). Expands your model options.

Higher tier

Better throughput: more requests per second from a single replica.

You pay per token regardless of which GPU runs your model. GPU selection affects latency and throughput, not your per-token rate.

Deployment modes

The deployment mode you choose affects when you’re billed, not how much you pay per token.

Scale-to-zero: your endpoint runs only while it is processing requests. When it is idle, nothing runs and you are charged nothing.
  • Pay per token only when requests are being processed
  • Cold starts under 2 seconds on RunInfra Cloud
  • Endpoint scales down after 5 minutes of inactivity
  • Best for development, variable traffic, and cost-sensitive workloads
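
To estimate how often a traffic pattern would hit a cold start under scale-to-zero, the sketch below counts requests that arrive after the 5-minute idle window has elapsed. It is a simplification: it assumes the idle clock restarts at each request's arrival, ignores request duration, and treats a gap of exactly 5 minutes as already scaled down. The function name is hypothetical.

```python
from typing import Sequence

IDLE_TIMEOUT_S = 5 * 60  # scale-down delay stated in the list above


def count_cold_starts(request_times_s: Sequence[float]) -> int:
    """Count requests that find the endpoint scaled to zero.

    request_times_s must be sorted arrival times in seconds.
    The very first request always counts as a cold start.
    """
    cold_starts = 0
    last_arrival = None
    for arrival in request_times_s:
        if last_arrival is None or arrival - last_arrival >= IDLE_TIMEOUT_S:
            cold_starts += 1
        last_arrival = arrival
    return cold_starts


# Arrivals at 0 s, 1 min, 3 min 20 s, 10 min, 11 min 40 s, and 20 min.
arrivals = [0, 60, 200, 600, 700, 1200]
print(count_cold_starts(arrivals))
```

In this example the requests at 0 s, 600 s, and 1200 s each follow an idle gap of at least 5 minutes, so three of the six requests would incur a cold start.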

Influencing GPU selection

The agent picks the optimal GPU during optimization. If you want to guide the selection, tell it directly:

  • "Use a budget GPU for this, I care about cost"
  • "Use an H100 for maximum performance"
  • "What GPU do you recommend for a 14B model?"

The agent considers your model size, quantization method, priority, and any hard constraints when recommending hardware. If you specify a GPU tier that is underprovisioned for your model, the agent will warn you before proceeding.

Known limitations

  • FP8 requires H100 or newer. L4, L40S, and A100 do not support FP8; the optimizer falls back to AWQ or GPTQ on those tiers.
  • TensorRT-LLM requires the Team plan or higher.
  • Active mode requires the Team plan or higher.
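
The FP8 limitation can be pictured as a simple fallback rule. The sketch below is illustrative only and does not reflect the optimizer's real implementation: the set of FP8-capable GPUs is read off this page ("H100 or newer" taken to mean H100, H200, and B200), and the choice of AWQ rather than GPTQ as the default fallback is an arbitrary assumption.

```python
# GPUs from this page that support FP8 ("H100 or newer"), as an assumption.
FP8_CAPABLE_GPUS = {"H100", "H200", "B200"}


def resolve_quantization(gpu: str, requested: str, fallback: str = "AWQ") -> str:
    """Return the quantization method to use on the given GPU.

    FP8 is kept only on FP8-capable GPUs; elsewhere the fallback
    (AWQ or GPTQ, per the limitation above) is substituted.
    Other methods pass through unchanged.
    """
    if requested.upper() == "FP8" and gpu.upper() not in FP8_CAPABLE_GPUS:
        return fallback
    return requested


print(resolve_quantization("H100", "FP8"))
print(resolve_quantization("A100", "FP8"))
print(resolve_quantization("L4", "FP8", fallback="GPTQ"))
```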

Common questions

Output tokens cost real GPU-seconds to generate (autoregressive, one forward pass per token). Input tokens are processed in a single batched pass; they still cost compute, but dramatically less per token. Every major provider prices this way.
Token billing is per-exact-token, not per-second. Active-mode base fees accrue per-second of warm-replica time. No minimums on either.
Yes, you can see which GPU your model runs on. The variant card in the optimization results shows the GPU for each variant, and the Deploy tab shows the active deployment's GPU under Details.
If no configuration satisfies your constraints, the optimizer reports "no viable variants" and recommends a constraint to relax. Raising the cost ceiling or widening the latency ceiling usually unlocks options. Failed runs do not consume a session.
Playground requests on Starter count against your 100/day cap. On Pro and higher plans they are unlimited and free; playground requests are not billed for tokens.
Active billing is per-second with a 60-second minimum (so brief starts are rounded up). Once running, every second counts.
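
Taking the 60-second minimum in the answer above at face value, the billed duration for one Active-mode run can be sketched as follows. The rounding of partial seconds up to whole seconds, the per-run application of the minimum, and the per-second rate are all assumptions for illustration, not published RunInfra values.

```python
import math

ACTIVE_MIN_BILLED_S = 60  # minimum stated in the answer above


def billed_active_seconds(warm_seconds: float) -> int:
    """Seconds billed for one warm run: whole seconds, floored at the minimum."""
    if warm_seconds <= 0:
        return 0  # the replica never started, so nothing is billed
    return max(ACTIVE_MIN_BILLED_S, math.ceil(warm_seconds))


def active_base_fee(warm_seconds: float, rate_per_second: float) -> float:
    """Base fee for one run at a hypothetical per-second rate (USD)."""
    return billed_active_seconds(warm_seconds) * rate_per_second


print(billed_active_seconds(12))    # brief start, rounded up to the minimum
print(billed_active_seconds(90.2))  # past the minimum, every second counts
```

Under these assumptions a 12-second run is billed as 60 seconds, while a 90.2-second run is billed as 91 seconds.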

Next steps

Plans and sessions

Compare Starter, Pro, Team, and Enterprise plans.

Supported models

LLMs, speech-to-text, and text-to-speech with pricing by size.

Optimize for cost

Use cost priority and hard budget constraints in optimization.

Monitor cost in real time

Track spend with daily charts, per-model breakdowns, and alerts.