Skip to main content

Documentation Index

Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

This changelog tracks every RunInfra release. Each entry lists what shipped and when. The roadmap section at the bottom covers features currently in development.
May 18, 2026
DocsSite
Docs polish: use cases, research, news, SSE event reference, deployment targets

Documentation expansion

A round of docs additions to match what shipped on the product:New sections
  • Use cases. Six pre-built workflow pages live under /use-cases: voice-agent, ai-assistant, embeddings, rag-search, document-ai, transcription. Each walks through the architecture, the canonical model stack, an example prompt to paste into Pipes, and a code snippet for the OpenAI-compatible API.
  • Research index at /research/overview with five published arXiv papers grouped into Compute efficiency and Model architectures. Each entry links to arXiv (PDF + abstract) and the code repo on GitHub.
  • News overview at /news/overview pointing at the live newsroom plus RSS / Atom subscription URLs and the structured-data setup AI engines rely on.
  • Deployment targets at /deployments/targets explaining the three places a pipeline can ship: managed RunPod (default), self-hosted Modal, and custom GPU.
  • SSE event reference at /api-reference/sse-events with every server-sent-event the engine emits during chat, optimization, and runbook streams, including heartbeats and reconnection rules.
Deepened pages
  • Instant Start now covers the regional cache architecture, eviction rules, multi-GPU shard staging, and the cold-start time breakdown.
  • Autoscaling explains how replica count is computed from concurrency and queue depth, with a cost-versus-latency knob table.
  • Rate limits documents the leaky-bucket burst behavior and the per-key versus workspace scope.
  • OpenAI compatibility lists the unsupported parameters explicitly, the HTTP-status to OpenAI error.type mapping, and the strict_params=true flag.
  • Plans now spells out the two independent credit pools, optimization credits and inference credits, that paid plans use.
Site polish
  • Mintlify theme tuned to sharp edges (no rounded corners on callouts or sidebar active items), brand-lime primary collapsed to a single accent across light/dark, and Inter Display as the font family.
  • Every page swept for em-dashes, en-dashes, and middle-dot separators, all replaced with commas, hyphens, or rewrites per the house style.
May 10, 2026
PlatformReliabilityPrivacy
Measured-only metrics, realtime reliability, and privacy hardening

Measured numbers, no more guesses

Every number you see in the product is now backed by a real benchmark. If we can’t measure it on real hardware, we don’t show a number at all.Optimization and feasibility
  • Feasibility cards now report fits or doesn’t-fit only. Before, the GPU comparison grid showed estimated latency and dollars-per-request derived from a physics roofline. Those numbers were ±25% accurate at best and looked authoritative. They’re gone. Real latency lands during the runbook on a real GPU.
  • KV cache quality scores now use measured FP16 comparisons in fast and deep modes. Fast-mode used to ship hardcoded heuristics (FP8 = 0.99, INT4 = 0.95). Both modes now run a real per-model inference comparison against FP16 (8 prompts in fast, 20 in deep).
  • Optimization rows drop placeholder metrics on failure paths. When a re-profile on the recommended GPU fails, the row no longer ships the orchestrator’s heuristic latency. You see the method, GPU, and quantization. No fake numbers.
  • Synthetic HuggingFace configs are gone. Gated models (Llama 3.1, Mixtral, Qwen 2.5) now require HF_TOKEN with license access. Hardcoded architecture fallbacks could drift from the actual model and silently break deployments. They’re removed.

Reliability and stability

  • Deployment subscriptions now use per-consumer Supabase channels to avoid callback registration races. Fixes the “cannot add postgres_changes callbacks” error some users hit when opening a pipeline with multiple optimization versions in history.
  • SSE drain failures surface to Sentry. Stream-truncation events that used to silently drop now show up with workspace context so we can act on them.
  • Chat, deploy, infer, and optimize requests now include workspace trace headers. X-User-Id, X-Workspace-Id, X-Plan-Tier, and X-Request-Id thread end-to-end so a single trace can be linked across RunPipe and the engine.

Privacy and observability

  • Do Not Track is respected. PostHog initialization aborts when navigator.doNotTrack === "1" regardless of cookie consent.
  • Client IPs are no longer sent to analytics. PostHog client init now passes ip: false.
  • Signout resets analytics identity. Posthog identity and Sentry scope clear on logout so the next user on the same device gets a clean session.
  • URL secret scrubbing. Sentry now redacts ?api_key=, ?token=, ?secret=, and ?password= query params from captured request URLs.

UX polish

  • User chat bubbles now use the same rounded-corner styling as dashboard tool cards. No more lone sharp panel sitting next to rounded surfaces.
  • Deployments page now has a loading skeleton instead of flashing blank during slow first loads.
April 28, 2026
PlatformAPIDeployment
Runtime selection, embeddings API, and endpoint testing

Runtime and endpoint expansion

RunInfra now exposes more of the serving stack directly in the product, so teams can choose the runtime and endpoint shape that matches their model type.Serving and models
  • Runtime-aware deployments. Pipelines can target vLLM, SGLang, TensorRT-LLM, or vLLM Omni when the selected model category supports that runtime.
  • Embeddings API. Deployed embedding models can be called through the OpenAI-compatible POST /v1/embeddings endpoint for RAG, semantic search, clustering, and retrieval workflows.
  • Voice and audio endpoints. Speech-to-text and text-to-speech deployments expose OpenAI-compatible /v1/audio/transcriptions and /v1/audio/speech endpoints.
Deployment
  • Instant Start. FlashBoot is now Instant Start, RunInfra’s weight-caching layer for faster Flex cold starts.
  • Exact endpoint playground tests. The Deploy tab playground now targets the selected deployment endpoint, so tests match the endpoint row you are inspecting.
Developer experience
  • Workspace-scoped keys. One API key can reach every verified active deployment in a workspace. Pass the target model in the request body or discover available deployments with GET /v1/models.
April 5, 2026
Release
v1.0, Initial release

Initial release

RunInfra is now live. Here is everything that shipped in the first release.Core platform
  • RunInfra is live. Build, optimize, and deploy AI inference pipelines through conversation. Describe what you need in plain English, and the agent handles the rest, model selection, GPU configuration, optimization, and deployment.
  • Chat-driven pipeline builder. No YAML, no DevOps. The AI agent selects models, configures routing, and optimizes your pipeline from a single chat interface.
  • Visual pipeline canvas. Drag-and-drop node composition with Model, Cache, Guardrail, Rate Limiter, Router, and Load Balancer nodes for teams that prefer a visual workflow.
  • Session persistence. Conversations, optimization results, and pipeline state survive page reloads.
Optimization engine
  • GPU optimization. Benchmarks models across GPU types (L4, L40S, A100, H100, H200, B200) using real inference. Results show P50/P99 latency, throughput, and cost per request for every experiment.
  • Quantization search. Finds and tests pre-optimized model variants (AWQ, GPTQ, FP8) against your baseline and ranks them by your stated constraints.
  • Forge kernel optimization. Profiles GPU bottlenecks and applies pre-optimized Triton kernels for additional throughput improvements beyond quantization alone.
  • NVIDIA TensorRT-LLM. Compiled inference engine for maximum throughput on NVIDIA GPUs. Available on the Team plan.
  • Optimization dashboard. Compare optimization versions side by side with real metrics: latency (P50, P99), throughput, cost per request, and quality score.
Deployment
  • One-click deploy. Push optimized pipelines to production API endpoints with managed GPU hosting, auto-scaling, and monitoring.
  • Deployment modes. Flex (scale-to-zero, pay only when processing) or Active (always-on with zero cold start, Team plan). Cold starts under 2 seconds with cached model weights.
  • OpenAI-compatible endpoints. Every deployed pipeline works with the OpenAI SDK. Change two lines of code to switch from OpenAI to RunInfra.
  • Per-token pricing. Transparent billing based on model size. See estimated costs before you deploy.
Developer tools
  • API playground. Test your pipeline with real requests before deploying. See response quality, latency, and token usage in real time.
  • Code export. Generate production-ready deployment files: Python scripts, Dockerfiles, Kubernetes manifests, and Docker Compose configurations.
  • Usage analytics. Track requests, tokens, cost, and latency across all endpoints with daily charts and per-model breakdowns on the Observe dashboard.
Model support
  • LLMs. Llama, Qwen, Mistral, DeepSeek, Gemma, Phi, and Cohere models supported out of the box.
  • Speech-to-text. Whisper (all sizes) for automatic speech recognition.
  • Text-to-speech. XTTS and Bark for speech synthesis.
  • Custom models. Upload models from Hugging Face and run them through the full optimization and deployment workflow (Team plan).
Plans
  • Starter. Free. 3 pipelines, 3 optimization sessions/month, 100 playground requests/day.
  • Pro. $49/month. 20 optimization sessions, deployment to live API endpoints, priority email support.
  • Team. $249/seat/month. 100 sessions per seat, TensorRT-LLM, Active deployment mode, shared Slack support. Unused sessions roll over.
  • Enterprise. Custom pricing. Dedicated customer success manager and custom contract terms.
  • Overage sessions. $2.50 each on all paid plans.
Documentation
  • Full documentation published: prompting guide, example conversations, feature docs, and troubleshooting.

Roadmap

RunInfra currently supports LLMs, embeddings, speech-to-text (Whisper), text-to-speech (XTTS, Bark), and vision-language pipelines where the selected model and runtime support them. The following capabilities are in active development:
  • Image generation. Stable Diffusion, FLUX, and other diffusion models with GPU optimization.
  • Database integration: managed vector databases and traditional databases connected directly to inference pipelines.
  • End-to-end AI infrastructure: ingest data, store embeddings, run inference, and serve results from one platform.
Want early access to any of these features? Contact us and tell us what you’re building.