Skip to main content

Documentation Index

Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

One-sentence definitions for the vocabulary that shows up across the docs.

Inference and serving

A RunInfra resource that maps a chat prompt to one or more models plus configuration (caching, routing, guardrails). Deployed pipelines become OpenAI-compatible endpoints.
A running, callable instance of a pipeline on RunInfra Cloud. Each deployment has a URL, an API key scope, a replica count, and either Flex or Active mode.
One process on one GPU serving requests for a deployment. Deployments autoscale the number of replicas up to a plan cap.
Scale-to-zero. Replicas shut down after 5 idle minutes; next request pays a cold start. Pay per token only.
Always-on. Replicas stay warm 24/7. No cold start. Flat base fee plus per-token.
The time from zero replicas to first generated token. Instant Start keeps it under 2 seconds.
RunInfra Cloud’s weight-caching layer. Keeps model weights resident next to GPU hosts so Flex replicas spin up fast. See Instant Start.
The software that runs the model on the GPU. RunInfra uses vLLM, SGLang, TensorRT-LLM, or vLLM Omni depending on the model modality and deployment path.

Optimization

One execution of the RunInfra optimizer on a pipeline. Produces ranked variants. Counts as one “optimization session” against your plan’s monthly budget.
One (model, quantization, GPU, serving backend) combination evaluated during an optimization run. Variants have measured latency, throughput, cost, and quality scores.
Compressing a model’s weights to smaller numeric types to fit more model on a GPU and run faster with minimal quality loss.
Activation-aware Weight Quantization. 4-bit per weight. Best default for 7 to 70B models; minor quality loss vs FP16.
Calibration-based quantization at 3, 4, or 8 bits. More size options than AWQ, slightly noisier.
8-bit floating point. Fastest option on H100/H200 hardware. Preserves more quality than 4-bit quantization.
NVIDIA’s compiled inference engine. Produces a fixed binary per model per GPU. Highest throughput; longest build time. Team plan.
RunInfra’s GPU kernel optimization layer. Profiles bottlenecks and swaps in pre-compiled Triton kernels.
A small draft model proposes multiple candidate tokens; the target model verifies in one pass. 1.5 to 3x throughput speedup with no quality change. See Speculation.
Cached attention keys and values from previous tokens. Reusing the cache across turns avoids re-computing the context.

Agent and prompting

RunInfra’s chat-driven builder. Takes plain-English descriptions, builds pipelines, and runs optimizations.
The single optimization dimension the agent ranks against. One of latency, cost, throughput, quality, or balanced.
A hard limit the optimizer must respect. Example: max_latency_ms: 200. Variants that violate constraints are filtered out, not ranked down.
Letting the model invoke typed functions. The model decides which tool to call and generates structured arguments; your code executes and returns the result. See the Tool calling cookbook.
Constraining the model to return JSON matching a JSON Schema. See the Structured output cookbook.
Retrieval-Augmented Generation. Embedding a corpus, retrieving relevant chunks for each query, and grounding the generator on retrieved context. See the RAG cookbook.

Billing and plan

The unit billed for inference. Input tokens (prompt) and output tokens (completion) are billed separately at different rates.
One million tokens. RunInfra lists pricing per MTok.
One optimization run on one pipeline. Counts against your plan’s monthly session budget.
An optimization session beyond your monthly budget. Billed at $2.50 each from your credit balance on Pro and Team.
One inference call inside the dashboard’s test playground. Starter is capped at 100/day; Pro+ is unlimited.
Credential used to call a RunInfra endpoint. Workspace-scoped keys reach every verified deployment in the workspace; pipeline-scoped keys are bound to one pipeline. Manage at Settings > API Keys. See Authentication.

Next steps

Plans

Pricing, sessions, and limits in one table.

Optimization

Where the quantization terms apply.

Deployments overview

Flex, Active, replicas, cold starts.

FAQ

Common questions.