One-sentence definitions for the vocabulary that shows up across the docs.Documentation Index
Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Inference and serving
Pipeline
Pipeline
A RunInfra resource that maps a chat prompt to one or more models plus configuration (caching, routing, guardrails). Deployed pipelines become OpenAI-compatible endpoints.
Deployment
Deployment
A running, callable instance of a pipeline on RunInfra Cloud. Each deployment has a URL, an API key scope, a replica count, and either Flex or Active mode.
Replica
Replica
One process on one GPU serving requests for a deployment. Deployments autoscale the number of replicas up to a plan cap.
Flex mode
Flex mode
Scale-to-zero. Replicas shut down after 5 idle minutes; next request pays a cold start. Pay per token only.
Active mode
Active mode
Always-on. Replicas stay warm 24/7. No cold start. Flat base fee plus per-token.
Cold start
Cold start
The time from zero replicas to first generated token. Instant Start keeps it under 2 seconds.
Instant Start
Instant Start
RunInfra Cloud’s weight-caching layer. Keeps model weights resident next to GPU hosts so Flex replicas spin up fast. See Instant Start.
Serving backend
Serving backend
The software that runs the model on the GPU. RunInfra uses vLLM, SGLang, TensorRT-LLM, or vLLM Omni depending on the model modality and deployment path.
Optimization
Optimization run
Optimization run
One execution of the RunInfra optimizer on a pipeline. Produces ranked variants. Counts as one “optimization session” against your plan’s monthly budget.
Variant
Variant
One (model, quantization, GPU, serving backend) combination evaluated during an optimization run. Variants have measured latency, throughput, cost, and quality scores.
Quantization
Quantization
Compressing a model’s weights to smaller numeric types to fit more model on a GPU and run faster with minimal quality loss.
AWQ
AWQ
Activation-aware Weight Quantization. 4-bit per weight. Best default for 7 to 70B models; minor quality loss vs FP16.
GPTQ
GPTQ
Calibration-based quantization at 3, 4, or 8 bits. More size options than AWQ, slightly noisier.
FP8
FP8
8-bit floating point. Fastest option on H100/H200 hardware. Preserves more quality than 4-bit quantization.
TensorRT-LLM
TensorRT-LLM
NVIDIA’s compiled inference engine. Produces a fixed binary per model per GPU. Highest throughput; longest build time. Team plan.
Forge
Forge
RunInfra’s GPU kernel optimization layer. Profiles bottlenecks and swaps in pre-compiled Triton kernels.
Speculative decoding
Speculative decoding
A small draft model proposes multiple candidate tokens; the target model verifies in one pass. 1.5 to 3x throughput speedup with no quality change. See Speculation.
KV cache
KV cache
Cached attention keys and values from previous tokens. Reusing the cache across turns avoids re-computing the context.
Agent and prompting
The agent
The agent
RunInfra’s chat-driven builder. Takes plain-English descriptions, builds pipelines, and runs optimizations.
Priority
Priority
The single optimization dimension the agent ranks against. One of latency, cost, throughput, quality, or balanced.
Constraint
Constraint
A hard limit the optimizer must respect. Example:
max_latency_ms: 200. Variants that violate constraints are filtered out, not ranked down.Tool calling
Tool calling
Letting the model invoke typed functions. The model decides which tool to call and generates structured arguments; your code executes and returns the result. See the Tool calling cookbook.
Structured output
Structured output
Constraining the model to return JSON matching a JSON Schema. See the Structured output cookbook.
RAG
RAG
Retrieval-Augmented Generation. Embedding a corpus, retrieving relevant chunks for each query, and grounding the generator on retrieved context. See the RAG cookbook.
Billing and plan
Token
Token
The unit billed for inference. Input tokens (prompt) and output tokens (completion) are billed separately at different rates.
MTok
MTok
One million tokens. RunInfra lists pricing per MTok.
Optimization session
Optimization session
One optimization run on one pipeline. Counts against your plan’s monthly session budget.
Overage session
Overage session
An optimization session beyond your monthly budget. Billed at $2.50 each from your credit balance on Pro and Team.
Playground request
Playground request
One inference call inside the dashboard’s test playground. Starter is capped at 100/day; Pro+ is unlimited.
API key
API key
Credential used to call a RunInfra endpoint. Workspace-scoped keys reach every verified deployment in the workspace; pipeline-scoped keys are bound to one pipeline. Manage at Settings > API Keys. See Authentication.
Next steps
Plans
Pricing, sessions, and limits in one table.
Optimization
Where the quantization terms apply.
Deployments overview
Flex, Active, replicas, cold starts.
FAQ
Common questions.