Latest customer-facing RunInfra releases and product updates, newest first.
June 10, 2026
Optimization, Runbooks, Reliability, UX
Measured comparisons, durable runbook executions with live re-attach, plan-edit honesty, and Code tab validation

Measured comparisons and durable executions

  • Measured comparisons. The compare models, compare engines, and compare techniques chat modes now run real benchmarks instead of advisory summaries. Compare 2 to 5 models sequentially on the same GPU target with the same serving configuration, each with its own measured baseline, or benchmark the same model and configuration across serving backends. Results render as a side-by-side comparison card in chat and a Compare Results panel in the session workspace, survive reload, and only declare a winner from measured rows on matching hardware. Plans can be comparison-shaped, with one lane per candidate and a final verdict phase that aggregates recorded measurements without extra GPU time. See Compare models, engines, and techniques.
  • Durable runbook executions. Executions now survive interruption. A refresh, timeout, crash, or redeploy converges the run to a blocked, retryable state with resume and restart actions instead of stranding it at running. Reloading mid-execution re-attaches live within about a second over a dedicated execution stream, restoring phases, live cost, and the Stop control, with a snapshot poll as fallback. Cancel stops the underlying GPU work, and a canceled run can never promote an optimization version afterward.
  • Plan-edit honesty. Saving a runbook plan edit now reports whether execution actually changed, for example “Saved. Execution steps unchanged (prose only)” vs “Runbook updated: N phases changed”. Edits that would corrupt the executable graph are rejected with an actionable message at save and at accept. Agent GPU and cost lines are validated against feasibility data when a snapshot exists, and visibly labeled as estimates otherwise.
  • Pipeline node honesty. Guardrail, rate limiter, load balancer, and cache nodes are design placeholders today. They now carry a “Not enforced” badge on the canvas, and the agent says so when it adds one.
  • Code tab updates. When a new optimized version is promoted, the Code tab resyncs its generated files and shows a dismissible “Code updated” notice; files you edited are never overwritten, and the agent is aware of your edited files. A new Validate action checks the current artifact set against the deploy gates with a pass, fail, or incomplete verdict, never green on incomplete checks. Each file keeps a restorable history of its last 10 versions that survives reload.
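The Validate verdict rule (pass, fail, or incomplete, never green on incomplete checks) can be pictured as a small aggregation over per-gate results. The following is only a sketch under assumptions, not RunInfra's implementation: `CheckStatus` and `overall_verdict` are hypothetical names, and the precedence of fail over incomplete is assumed rather than documented.

```python
from enum import Enum


class CheckStatus(Enum):
    # Hypothetical per-gate outcomes; the real deploy gates are not listed in these notes.
    PASS = "pass"
    FAIL = "fail"
    INCOMPLETE = "incomplete"


def overall_verdict(checks: dict[str, CheckStatus]) -> CheckStatus:
    """Aggregate per-gate results into one verdict.

    Assumed precedence: any failure wins, then any incomplete check,
    and only a non-empty set of all-passing checks is reported as pass.
    """
    statuses = list(checks.values())
    if any(s is CheckStatus.FAIL for s in statuses):
        return CheckStatus.FAIL
    if not statuses or any(s is CheckStatus.INCOMPLETE for s in statuses):
        # Never report green when something has not been checked.
        return CheckStatus.INCOMPLETE
    return CheckStatus.PASS
```

Treating an empty check set as incomplete is a deliberate choice in this sketch: with nothing verified, a pass would be misleading.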
June 6, 2026
Runbooks, Reliability, Sandbox
Runbook deployment intent and sandbox replay now fail closed

Runbook execution hardening

Optimization-only runbooks no longer infer a deployment target from missing-target wording. Deploy phases now require explicit deploy intent and only expose target choices that the selected model and serving path can actually execute.

Sandbox replay also scopes optimization lane summaries to the active runbook execution, so stale or legacy lane state cannot appear as the current execution result after hydration.
June 5, 2026
Testing, Models, Reliability
Stricter Hugging Face model access checks before sandbox testing

Model access checks

The pipeline Test tab now rejects malformed Hugging Face model IDs before a sandbox runtime starts. For self-hosted sandbox tests, RunInfra checks whether the resolved Hugging Face repository is reachable and blocks missing, gated, invalid-token, or access-denied models before starting GPU work.

Readiness checks now treat valid raw owner/repo Hugging Face IDs as resolvable even when they are not in the static catalog, while proprietary or malformed IDs fail clearly.

Runbook plans now filter agent-authored optimization techniques against the executable phase graph, so stale or unsupported technique prose is not shown as part of the plan.
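To catch a malformed ID in your own tooling before submitting a test, a shape check along these lines may help. It is a rough approximation of the raw owner/repo form, not RunInfra's validator or the Hugging Face Hub's exact rules; the function name and the length limit are assumptions, and it cannot tell whether a repository is missing, gated, or access-denied.

```python
import re

# Approximation of the Hugging Face "owner/repo" shape: each segment uses
# letters, digits, "-", "_" or ".", and starts and ends with an alphanumeric.
_SEGMENT = r"[A-Za-z0-9](?:[A-Za-z0-9._-]{0,94}[A-Za-z0-9])?"
_REPO_ID = re.compile(rf"^{_SEGMENT}/{_SEGMENT}$")


def looks_like_hf_repo_id(model_id: str) -> bool:
    """Return True when model_id has a plausible raw owner/repo form."""
    if "--" in model_id or ".." in model_id:
        return False  # Doubled separators are treated as malformed here.
    return _REPO_ID.fullmatch(model_id) is not None
```

Reachability still has to be confirmed against the Hub with a valid token; this check only filters out IDs that could never resolve.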
June 2, 2026
Optimization, Runbooks, UX
Optimization-first session tabs with smoother Runbook transition during plan review

Session tab flow

New pipeline sessions now open on the Optimization tab first, with Runbook shown when the agent is actively drafting a plan.

When a runbook starts streaming, RunInfra moves into Runbook for review. After the runbook is accepted, the session returns to Optimization so live execution metrics and measured results stay in focus.
June 1, 2026
Optimization, Testing, UX
Test tab comparison targets for baseline, optimized versions, current preview, and live endpoints

Test tab comparison targets

The pipeline Test tab now lets you choose whether a run should target the current preview graph, the baseline graph, a saved optimization version, or the live endpoint. Baseline runs ignore applied optimization state, while version runs replay the selected optimization snapshot.

The batch test suite uses the same target selection, so quality checks and manual probes compare the same runtime target.

Cleaner export experience. The Export tab now focuses on one optimized pipeline package instead of raw runtime formats or model-card editing. Self-hosted users get the package download, setup path, and verification flow in one place.
May 30, 2026
API, SDK, Docs, Export
API reference, native SDK docs, export bundles, and integration examples refreshed for deployment-aware gateway usage

API and SDK coverage expansion

RunInfra’s public integration docs now describe the OpenAI-compatible gateway across text, vision-language, embeddings, and deployment-gated image and audio routes.
  • Responses adapter. Added a dedicated /v1/responses reference for the Responses-shaped chat-completions adapter, including streaming, instructions, response_format, and supported tool pass-through fields.
  • Multimodal endpoints. Updated the API reference for /v1/images/generations, /v1/audio/speech, /v1/audio/transcriptions, and /v1/embeddings with current request fields, model examples, billing-aware retry guidance, and error behavior.
  • Portable export bundles. Optimized deployments can now be downloaded as portable packages for local testing or self-hosted deployment, including the files and instructions needed to run the selected model outside RunInfra Cloud.
  • Export readiness labels. Export pages now show whether a package includes benchmark and quality evidence, or whether you should validate the model before relying on it.
  • Native SDK guide. TypeScript and Python examples now cover scoped keys, pipeline IDs, request IDs, typed errors, streaming, model discovery, replay-safe retry limits, and local webhook signature verification. Image, audio, and voice helpers are marked as deployment-gated preview surfaces.
  • Integration examples. OpenAI, LangChain, LlamaIndex, and Vercel AI SDK pages now call out that apps must pass a callable model ID from GET /v1/models instead of an OpenAI model name.
  • Error reference. Expanded the OpenAPI spec and error guide with the public gateway statuses developers should handle, including rate limits, credit exhaustion, idempotency conflicts, replay-unavailable responses, upstream failures, and gateway timeouts.
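The model-discovery step mentioned under Integration examples might look roughly like this with only the Python standard library. It is a sketch under assumptions: the base URL is a placeholder you supply, Bearer-token authentication and the OpenAI-style `{"data": [{"id": ...}]}` list shape are assumed from the gateway being OpenAI-compatible, and both function names are hypothetical.

```python
import json
import urllib.request


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET /v1/models request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/models",
        # Assumption: the gateway accepts an OpenAI-style Bearer token.
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def callable_model_ids(models_payload: str) -> list[str]:
    """Extract model IDs from an OpenAI-style list response body."""
    body = json.loads(models_payload)
    return [entry["id"] for entry in body.get("data", []) if "id" in entry]
```

Sending the prepared request with `urllib.request.urlopen` and passing the decoded body to `callable_model_ids` gives the IDs to use in place of an OpenAI model name.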
May 26, 2026
Optimization, UX, Planning
Smarter optimization planning, clearer GPU choices, and smoother runbook streaming

Optimization planning improvements

RunInfra now does a better job turning plain-English goals into executable optimization plans, especially when users care about cost, latency, or serving configuration.
  • Latency clarification. When a prompt mentions latency without a target, RunInfra asks whether to use a specific latency budget or optimize for the cheapest compatible GPU first.
  • Cheapest-compatible GPU search. Cost-first prompts now prioritize GPUs that fit the model and constraints before spending time on higher-cost options.
  • Serving-first runbooks. LLM optimization now tunes serving settings earlier, so batch size, prefix cache, and GPU utilization become the baseline for later quantization and kernel work.
  • Faster hardware confirmation. When intake already identifies a compatible GPU, the hardware phase validates that choice instead of running an unnecessary full sweep.
  • LLM inference detection. Prompts that mention Llama, Qwen, Mistral, DeepSeek, vLLM, SGLang, or TensorRT-LLM now route into the LLM inference workflow automatically.
  • Plan transparency. Runbooks now show open questions, quality gates, estimate assumptions, and total duration ranges directly in the plan UI.
  • Smoother plan reveal. Long runbooks stream into the Plan tab with a steady reading cadence, then switch into the editor after the rendered text catches up.
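The idea behind cheapest-compatible GPU search (filter to GPUs that fit, then rank by cost) can be illustrated as follows. This is not RunInfra's planner: `GpuOption`, `cheapest_compatible`, and the memory-only fit test are hypothetical simplifications, and any catalog values you pass in are your own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuOption:
    # Hypothetical catalog entry; real fit checks involve more than memory.
    name: str
    memory_gb: float
    hourly_cost: float


def cheapest_compatible(
    options: list[GpuOption],
    required_memory_gb: float,
    max_hourly_cost: Optional[float] = None,
) -> Optional[GpuOption]:
    """Return the lowest-cost GPU that satisfies the constraints, or None."""
    fits = [
        gpu for gpu in options
        if gpu.memory_gb >= required_memory_gb
        and (max_hourly_cost is None or gpu.hourly_cost <= max_hourly_cost)
    ]
    # Filter first, then rank by price, so costlier options are only
    # chosen when nothing cheaper is compatible.
    return min(fits, key=lambda gpu: gpu.hourly_cost, default=None)
```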
May 18, 2026
Docs, Site
Use cases, research, news, SSE event reference, and deployment-target docs

Documentation expansion

  • Use cases. Added workflow guides for voice agents, AI assistants, embeddings, RAG search, document AI, and transcription.
  • Research index. Published a research overview with grouped paper links, abstracts, PDFs, and code repositories.
  • News overview. Added newsroom links, RSS and Atom subscription URLs, and structured-data guidance.
  • Deployment targets. Documented managed RunPod, self-hosted Modal, and custom GPU deployment paths.
  • SSE event reference. Added event names, heartbeats, reconnection rules, and stream handling guidance for chat, optimization, and runbook streams.
  • Site polish. Tightened the docs theme, typography, navigation, and house style so examples and reference pages read consistently.
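For readers handling these streams without an SDK, a minimal parser for the standard Server-Sent Events wire format could look like the sketch below. It follows the generic SSE format only; the event names RunInfra emits and its reconnection rules are in the SSE event reference, and treating comment lines as heartbeats is an assumption about how keep-alives are sent.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield events from SSE text lines, skipping comment heartbeats."""
    event, data, last_id = "message", [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it has data.
            if data:
                yield {"event": event, "data": "\n".join(data), "id": last_id}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # Comment line, commonly used as a keep-alive heartbeat.
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # The format strips one leading space.
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
        elif field == "id":
            last_id = value  # Kept so a client can send Last-Event-ID on reconnect.
```

The last seen `id` is what a client would send back in the `Last-Event-ID` header when it reconnects after a dropped stream.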
May 10, 2026
Platform, Reliability, Privacy
Measured metrics, realtime reliability, privacy controls, and dashboard polish

Metrics, reliability, and privacy

  • Measured optimization metrics. Optimization surfaces now highlight measured latency, throughput, cost, and quality from real runs.
  • Quality checks. Quantization comparisons now use real model-output checks against a baseline, with faster and deeper modes depending on the workflow.
  • Reliable deployment updates. Deployment subscriptions use per-consumer channels so dashboard state stays current when multiple optimization versions are open.
  • Traceable requests. Chat, deploy, inference, and optimization requests now carry request and workspace trace headers so support can correlate issues quickly.
  • Privacy controls. Analytics now respect Do Not Track, avoid client IP collection, clear identity on signout, and scrub secrets from captured URLs.
  • Dashboard polish. Chat bubbles and deployment loading states now match the rest of the dashboard UI.
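As an illustration of what scrubbing secrets from a captured URL can involve, the sketch below redacts sensitive query values and drops embedded credentials. It is not RunInfra's analytics code: the parameter list and the `scrub_url` name are assumptions, and a production scrubber would likely cover more cases (for example, secrets in fragments or paths).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of sensitive parameter names; a real list would be broader.
SENSITIVE_PARAMS = {"token", "api_key", "apikey", "access_token", "secret", "password"}


def scrub_url(url: str, placeholder: str = "REDACTED") -> str:
    """Redact sensitive query values and drop any user:password@ credentials."""
    parts = urlsplit(url)
    query = [
        (key, placeholder if key.lower() in SENSITIVE_PARAMS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # Rebuild the host without userinfo so credentials never reach analytics.
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, urlencode(query), parts.fragment))
```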
April 28, 2026
Platform, API, Deployment
Runtime selection, embeddings API, audio endpoints, and endpoint testing

Runtime and endpoint expansion

  • Runtime-aware deployments. Pipelines can target vLLM, SGLang, TensorRT-LLM, or vLLM Omni when the selected model and modality support that runtime.
  • Embeddings API. Deployed embedding models can be called through OpenAI-compatible POST /v1/embeddings for RAG, semantic search, clustering, and retrieval workflows.
  • Audio endpoints. Speech-to-text and text-to-speech deployments expose OpenAI-compatible /v1/audio/transcriptions and /v1/audio/speech routes.
  • Instant Start. RunInfra’s weight-caching layer reduces Flex cold starts for compatible deployments.
  • Endpoint playground tests. The Deploy tab playground now targets the selected deployment endpoint so tests match the endpoint row being inspected.
  • Workspace-scoped keys. Workspace keys can reach verified deployments by passing the target model in the request body or by discovering available models with GET /v1/models.
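A semantic-search client built on the embeddings route might prepare requests and score results roughly as sketched here. The request and response shapes are assumed from the OpenAI embeddings format the route is described as compatible with, the helper names are hypothetical, and the model ID is whatever GET /v1/models returns for your deployment.

```python
import json
import math


def embeddings_body(model_id: str, texts: list[str]) -> str:
    """Serialize a JSON body for POST /v1/embeddings."""
    return json.dumps({"model": model_id, "input": texts})


def vectors_from_response(payload: str) -> list[list[float]]:
    """Return embeddings ordered by their index field (OpenAI-style shape assumed)."""
    rows = sorted(json.loads(payload)["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity, a common ranking score for semantic search."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Sorting by `index` keeps each vector aligned with the input text it came from, which matters when ranking documents against a query.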
April 5, 2026
Release
RunInfra public launch

Initial release

RunInfra launched with chat-driven pipeline building, optimization, and deployment for production AI inference.
  • Core platform. Describe a use case in plain English and RunInfra helps select models, configure routing, optimize the pipeline, and deploy an API endpoint.
  • Visual pipeline canvas. Compose Model, Cache, Guardrail, Rate Limiter, Router, and Load Balancer nodes when a visual workflow is preferred.
  • Optimization engine. Benchmark GPU choices, quantization options, serving settings, and kernel optimizations against latency, throughput, cost, and quality goals.
  • Deployment. Ship endpoints with managed GPU hosting, autoscaling, monitoring, Flex scale-to-zero mode, and Active always-on mode for teams that need zero cold start.
  • OpenAI-compatible APIs. Call deployed pipelines with familiar OpenAI-style routes and SDKs.
  • Developer tools. Test endpoints in the playground, export deployment code, and inspect usage analytics across requests, tokens, cost, and latency.
  • Model support. Start with LLMs, speech-to-text, text-to-speech, and custom Hugging Face models.

Roadmap

RunInfra currently supports LLMs, embeddings, speech-to-text, text-to-speech, vision-language, and image-generation pipelines where the selected model and runtime support them. Upcoming product areas:
  • Managed data connections: vector databases and application databases connected directly to inference pipelines.
  • End-to-end AI infrastructure: ingest data, store embeddings, run inference, and serve results from one platform.
Want early access to an upcoming capability? Contact us and tell us what you’re building.