Glossary - RunInfra

One-sentence definitions for the vocabulary that shows up across the docs.

Inference and serving

Pipeline

A RunInfra resource that maps a chat prompt to one or more models plus configuration (caching, routing, guardrails). Deployed pipelines become OpenAI-compatible endpoints. Some node types are design placeholders today; see “Design placeholder node” below.

Design placeholder node

A pipeline canvas node that records design intent but is not enforced at serving time. Guardrail, rate limiter, load balancer, and cache nodes are design placeholders today: they carry a “Not enforced” badge on the canvas, ship in generated code as configuration only, and the agent says so when it adds one. Enforce rate limiting at your own gateway until these nodes go live.

Deployment

A running, callable instance of a pipeline on RunInfra Cloud. Each deployment has a URL, an API key scope, a replica count, and either Flex or Active mode.

Replica

One process on one GPU serving requests for a deployment. Deployments autoscale the number of replicas up to a plan cap.

Flex mode

Scale-to-zero. Replicas shut down after 5 idle minutes; next request pays a cold start. Pay per token only.

Active mode

Always-on. Replicas stay warm 24/7. No cold start. Flat base fee plus per-token.

Cold start

The time from zero replicas to first generated token. Instant Start keeps it under 2 seconds.

Instant Start

RunInfra Cloud’s weight-caching layer. Keeps model weights resident next to GPU hosts so Flex replicas spin up fast. See Instant Start.

Serving backend

The software that runs the model on the GPU. RunInfra uses vLLM, SGLang, TensorRT-LLM, or vLLM Omni depending on the model modality and deployment path.

Optimization

Optimization run

One execution of the RunInfra optimizer on a pipeline. Produces ranked variants. Metered to the measured GPU cost and drawn from your unified credit balance.

Variant

One (model, quantization, GPU, serving backend) combination evaluated during an optimization run. Variants have measured latency, throughput, cost, and quality scores.

Quantization

Compressing a model’s weights to smaller numeric types to fit more model on a GPU and run faster with minimal quality loss.

AWQ

Activation-aware Weight Quantization. 4-bit per weight. Best default for 7 to 70B models; minor quality loss vs FP16.

GPTQ

Calibration-based quantization at 3, 4, or 8 bits. More size options than AWQ, slightly noisier.

FP8

8-bit floating point. Fastest option on H100/H200 hardware. Preserves more quality than 4-bit quantization.

TensorRT-LLM

NVIDIA’s compiled inference engine. Produces a fixed binary per model per GPU. Highest throughput; longest build time.

Forge

RunInfra’s GPU kernel optimization layer. Profiles bottlenecks and swaps in pre-compiled Triton kernels.

Speculative decoding

A small draft model proposes multiple candidate tokens; the target model verifies in one pass. 1.5 to 3x throughput speedup with no quality change. See Speculation.

KV cache

Cached attention keys and values from previous tokens. Reusing the cache across turns avoids re-computing the context.

Agent and prompting

The agent

RunInfra’s chat-driven builder. Takes plain-English descriptions, builds pipelines, and runs optimizations.

Priority

The single optimization dimension the agent ranks against. One of latency, cost, throughput, quality, or balanced.

Constraint

A hard limit the optimizer must respect. Example: max_latency_ms: 200. Variants that violate constraints are filtered out, not ranked down.

Tool calling

Letting the model invoke typed functions. The model decides which tool to call and generates structured arguments; your code executes and returns the result. See the Tool calling cookbook.

Structured output

Constraining the model to return JSON matching a JSON Schema. See the Structured output cookbook.

RAG

Retrieval-Augmented Generation. Embedding a corpus, retrieving relevant chunks for each query, and grounding the generator on retrieved context. See the RAG cookbook.

Billing and plan

Token

The unit billed for inference. Input tokens (prompt) and output tokens (completion) are billed separately at different rates.

MTok

One million tokens. RunInfra lists pricing per MTok.

Credit

The unit of the unified balance: 1 credit =

1. Funds the agent, optimization and benchmarking runs, deploys, and inference. New accounts start with

10 free.

Optimization hold

When an optimization run starts, a temporary hold is placed on your credit balance and settled to the measured GPU cost once the run finishes; the unused amount is refunded. Failed or cancelled runs refund in full.

Playground request

One inference call inside the dashboard’s test playground. Free (trial) workspaces are capped at 100/day; Core and Enterprise are unlimited.

API key

Credential used to call a RunInfra endpoint. Workspace-scoped keys reach every verified deployment in the workspace; pipeline-scoped keys are bound to one pipeline. Manage at Settings > API Keys. See Authentication.

Next steps

Plans

Core and Enterprise pricing, credits, and limits in one table.

Optimization

Where the quantization terms apply.

Deployments overview

Flex, Active, replicas, cold starts.

FAQ

Common questions.

​Inference and serving

​Optimization

​Agent and prompting

​Billing and plan

​Next steps

Plans

Optimization

Deployments overview

FAQ

Inference and serving

Optimization

Agent and prompting

Billing and plan

Next steps