Changelog - RunInfra

Latest customer-facing RunInfra releases and product updates, newest first.

July 8, 2026

BillingDocs

Free signup credits doubled to $10 and per-modality optimization runbooks published

Free credits and runbooks

10 in free signup credits.** New accounts now start with

10 in free credits, up from $5, with no credit card required. The raise applies to new signups; existing balances are unchanged.Optimization runbooks published. The per-modality optimization runbooks are now in the docs navigation. Start with the overview for LLM, embedding, ASR, TTS, audio-LM, image, video, and vision-language optimization, plus load testing, SLO monitoring, and customer override guides.

July 5, 2026

DeploymentsAPIExport

Voice and vision managed deployments mint keys on verify, gateway key env rename, tei export kits, and persistent worker sizing

Deployments, keys, and export kits

Managed voice and vision deployments. Managed deployments for voice pipelines, speech-to-text, text-to-speech, audio-language, and vision-language models now verify through a modality-native gateway canary and can mint working API keys immediately after the deployment verifies. Previously only LLM and embedding deployments could. Minted keys pair with the flat base URL https://api.runinfra.ai/v1. Image generation, rerank, and classification managed deployments remain unavailable; rerank and classification are delivered through export kits.Gateway key env name. The dashboard reveal panel and generated snippets now suggest storing the managed gateway key as RUNINFRA_GATEWAY_KEY. The self-hosted export kit keeps RUNINFRA_API_KEY for its own self-minted secret, so the two secrets never share a name. See Authentication.Rerank and classification export kits. These kits now generate a text-embeddings-inference serving setup with an explicit tei export format. The kit’s serve script answers the kit’s own documented endpoints, /rerank for rerank kits and /predict for classification kits.Persistent worker sizing. Worker counts requested at deploy time now persist across restart, start, and GPU changes. Explicit request values always win, then the deployment’s original sizing, then plan defaults. Worker counts are capped at 32 per deployment; plan or operational caps can be lower.

July 3, 2026

BillingAccountSecurity

One honest credit balance, self-serve account deletion, and a hardened sign-in

Billing you can trust at a glance

One balance, everywhere. The navbar, billing page, and session views now read the same live available balance. It updates within seconds of any spend, top-up, upgrade, or auto-recharge, and shows a “$X held” note while a run has an active hold.The full session money story. Before an optimization run starts you see the estimated charge and the hold that will be placed (about 2x the estimate, minimum

5). During the run, one live estimated-charge figure. At the end, a receipt: charged amount and released hold. Failed or cancelled runs release the hold in full. Agent chat messages hold

1.00 and settle to actual usage, typically under $0.20.Free tier sees its credits. The $5 signup credits and every transaction against them are now visible on Settings > Cost from day one, and the live balance shows in the top-bar credits chip on every page.Real charts. Cost and usage graphs plot the real measured values instead of smoothed curves.

Account

Self-serve account deletion. Settings > Workspace now has a danger zone. The deletion dialog previews exactly what will be removed, lists any blockers (live deployments, running jobs) with inline actions to clear them, cancels your subscription immediately, and requires a typed confirmation plus your password. Deletion is immediate and permanent.

Security

Sign-in hardening. Passwords now require 8 characters minimum. Email confirmation and password-reset links work when opened in a different browser, and expired links say so with a resend path. Signup credit grants are protected against abuse behind the scenes, with no change for real users.

July 2, 2026

SessionUX

Session tabs become an open working set with a searchable view catalog

Session tabs, redesigned

Main tabs by default. A session now opens with the two main surfaces, Optimization and Plan. Every other view (Pipeline, Test, Deploy, Workspace, Code, Config, Export) opens on demand and joins the strip as a real tab.Open, close, and come back. Tabs are a working set now: hover a tab to reveal its close control, close the ones you are not using, and the strip always keeps at least one tab. Closing the active tab activates its neighbor, and your open tabs persist per session on your device.The plus button opens a searchable catalog. The + at the end of the strip opens a full-panel view catalog: jump to an existing tab, pick a suggested view derived from live session state (a session without models suggests the canvas, a linked runbook suggests Plan), or open any other view. Type to filter, use arrow keys and Enter to select, press Escape to close. Views the agent needs, like Plan when a runbook starts streaming, still open automatically.

June 17, 2026

BillingCostUX

Billing overview and daily cost explorer updates

Billing and cost visibility

Billing overview. Settings > Billing now starts with a compact overview of workspace balance, plan, monthly amount, renewal, and deployment access before the detailed plan and invoice sections.Daily cost explorer. Settings > Cost now keeps spend charts on daily UTC ticks and adds a daily cost table with spend, requests, average request cost, and GPU cost from the billing data already returned for the page.

June 10, 2026

OptimizationRunbooksReliabilityUX

Measured comparisons, durable runbook executions with live re-attach, plan-edit honesty, and Code tab validation

Measured comparisons and durable executions

Measured comparisons. The compare models, compare engines, and compare techniques chat modes now run real benchmarks instead of advisory summaries. Compare 2 to 5 models sequentially on the same GPU target with the same serving configuration, each with its own measured baseline, or benchmark the same model and configuration across serving backends. Results render as a side-by-side comparison card in chat and a Compare Results panel in the session workspace, survive reload, and only declare a winner from measured rows on matching hardware. Plans can be comparison-shaped, with one lane per candidate and a final verdict phase that aggregates recorded measurements without extra GPU time. See Compare models, engines, and techniques.Durable runbook executions. Executions now survive interruption. A refresh, timeout, crash, or redeploy converges the run to a blocked, retryable state with resume and restart actions instead of stranding it at running. Reloading mid-execution re-attaches live within about a second over a dedicated execution stream, restoring phases, live cost, and the Stop control, with a snapshot poll as fallback. Cancel stops the underlying GPU work, and a canceled run can never promote an optimization version afterward.Plan-edit honesty. Saving a runbook plan edit now reports whether execution actually changed, for example “Saved. Execution steps unchanged (prose only)” vs “Runbook updated: N phases changed”. Edits that would corrupt the executable graph are rejected with an actionable message at save and at accept. Agent GPU and cost lines are validated against feasibility data when a snapshot exists, and visibly labeled as estimates otherwise.Pipeline node honesty. Guardrail, rate limiter, load balancer, and cache nodes are design placeholders today. They now carry a “Not enforced” badge on the canvas, and the agent says so when it adds one.Code tab updates. When a new optimized version is promoted, the Code tab resyncs its generated files and shows a dismissible “Code updated” notice; files you edited are never overwritten, and the agent is aware of your edited files. A new Validate action checks the current artifact set against the deploy gates with a pass, fail, or incomplete verdict, never green on incomplete checks. Each file keeps a restorable history of its last 10 versions that survives reload.

June 6, 2026

RunbooksReliabilitySandbox

Runbook deployment intent and sandbox replay now fail closed

Runbook execution hardening

Optimization-only runbooks no longer infer a deployment target from missing-target wording. Deploy phases now require explicit deploy intent and only expose target choices that the selected model and serving path can actually execute.Sandbox replay also scopes optimization lane summaries to the active runbook execution, so stale or legacy lane state cannot appear as the current execution result after hydration.

June 5, 2026

TestingModelsReliability

Stricter Hugging Face model access checks before sandbox testing

Model access checks

The pipeline Test tab now rejects malformed Hugging Face model IDs before a sandbox runtime starts. For self-hosted sandbox tests, RunInfra checks whether the resolved Hugging Face repository is reachable and blocks missing, gated, invalid-token, or access-denied models before starting GPU work.Readiness checks now treat valid raw owner/repo Hugging Face IDs as resolvable even when they are not in the static catalog, while proprietary or malformed IDs fail clearly.Runbook plans now filter agent-authored optimization techniques against the executable phase graph, so stale or unsupported technique prose is not shown as part of the plan.

June 2, 2026

OptimizationRunbooksUX

Optimization-first session tabs with smoother Runbook transition during plan review

Session tab flow

New pipeline sessions now open on the Optimization tab first, with Runbook shown when the agent is actively drafting a plan.When a runbook starts streaming, RunInfra moves into Runbook for review. After the runbook is accepted, the session returns to Optimization so live execution metrics and measured results stay in focus.

June 1, 2026

OptimizationTestingUX

Test tab comparison targets for baseline, optimized versions, current preview, and live endpoints

Test tab comparison targets

The pipeline Test tab now lets you choose whether a run should target the current preview graph, the baseline graph, a saved optimization version, or the live endpoint. Baseline runs ignore applied optimization state, while version runs replay the selected optimization snapshot.The batch test suite uses the same target selection, so quality checks and manual probes compare the same runtime target.Cleaner export experience. The Export tab now focuses on one optimized pipeline package instead of raw runtime formats or model-card editing. Self-hosted users get the package download, setup path, and verification flow in one place.

May 30, 2026

APISDKDocsExport

API reference, native SDK docs, export bundles, and integration examples refreshed for deployment-aware gateway usage

API and SDK coverage expansion

RunInfra’s public integration docs now describe the OpenAI-compatible gateway across text, vision-language, embeddings, and deployment-gated image and audio routes.Responses adapter. Added a dedicated /v1/responses reference for the Responses-shaped chat-completions adapter, including streaming, instructions, response_format, and supported tool pass-through fields.Multimodal endpoints. Updated the API reference for /v1/images/generations, /v1/audio/speech, /v1/audio/transcriptions, and /v1/embeddings with current request fields, model examples, billing-aware retry guidance, and error behavior.Portable export bundles. Optimized deployments can now be downloaded as portable packages for local testing or self-hosted deployment, including the files and instructions needed to run the selected model outside RunInfra Cloud.Export readiness labels. Export pages now show whether a package includes benchmark and quality evidence, or whether you should validate the model before relying on it.Native SDK guide. TypeScript and Python examples now cover scoped keys, pipeline IDs, request IDs, typed errors, streaming, model discovery, replay-safe retry limits, and local webhook signature verification. Image, audio, and voice helpers are marked as deployment-gated preview surfaces.Integration examples. OpenAI, LangChain, LlamaIndex, and Vercel AI SDK pages now call out that apps must pass a callable model ID from GET /v1/models instead of an OpenAI model name.Error reference. Expanded the OpenAPI spec and error guide with the public gateway statuses developers should handle, including rate limits, credit exhaustion, idempotency conflicts, replay-unavailable responses, upstream failures, and gateway timeouts.

May 26, 2026

OptimizationUXPlanning

Smarter optimization planning, clearer GPU choices, and smoother runbook streaming

Optimization planning improvements

RunInfra now does a better job turning plain-English goals into executable optimization plans, especially when users care about cost, latency, or serving configuration.Latency clarification. When a prompt mentions latency without a target, RunInfra asks whether to use a specific latency budget or optimize for the cheapest compatible GPU first.Cheapest-compatible GPU search. Cost-first prompts now prioritize GPUs that fit the model and constraints before spending time on higher-cost options.Serving-first runbooks. LLM optimization now tunes serving settings earlier, so batch size, prefix cache, and GPU utilization become the baseline for later quantization and kernel work.Faster hardware confirmation. When intake already identifies a compatible GPU, the hardware phase validates that choice instead of running an unnecessary full sweep.LLM inference detection. Prompts that mention Llama, Qwen, Mistral, DeepSeek, vLLM, SGLang, or TensorRT-LLM now route into the LLM inference workflow automatically.Plan transparency. Runbooks now show open questions, quality gates, estimate assumptions, and total duration ranges directly in the plan UI.Smoother plan reveal. Long runbooks stream into the Plan tab with a steady reading cadence, then switch into the editor after the rendered text catches up.

May 18, 2026

DocsSite

Use cases, research, news, SSE event reference, and deployment-target docs

Documentation expansion

Use cases. Added workflow guides for voice agents, AI assistants, embeddings, RAG search, document AI, and transcription.Research index. Published a research overview with grouped paper links, abstracts, PDFs, and code repositories.News overview. Added newsroom links, RSS and Atom subscription URLs, and structured-data guidance.Deployment targets. Documented managed RunPod, self-hosted Modal, and custom GPU deployment paths.SSE event reference. Added event names, heartbeats, reconnection rules, and stream handling guidance for chat, optimization, and runbook streams.Site polish. Tightened the docs theme, typography, navigation, and house style so examples and reference pages read consistently.

May 10, 2026

PlatformReliabilityPrivacy

Measured metrics, realtime reliability, privacy controls, and dashboard polish

Metrics, reliability, and privacy

Measured optimization metrics. Optimization surfaces now highlight measured latency, throughput, cost, and quality from real runs.Quality checks. Quantization comparisons now use real model-output checks against a baseline, with faster and deeper modes depending on the workflow.Reliable deployment updates. Deployment subscriptions use per-consumer channels so dashboard state stays current when multiple optimization versions are open.Traceable requests. Chat, deploy, inference, and optimization requests now carry request and workspace trace headers so support can correlate issues quickly.Privacy controls. Analytics now respect Do Not Track, avoid client IP collection, clear identity on signout, and scrub secrets from captured URLs.Dashboard polish. Chat bubbles and deployment loading states now match the rest of the dashboard UI.

April 28, 2026

PlatformAPIDeployment

Runtime selection, embeddings API, audio endpoints, and endpoint testing

Runtime and endpoint expansion

Runtime-aware deployments. Pipelines can target vLLM, SGLang, TensorRT-LLM, or vLLM Omni when the selected model and modality support that runtime.Embeddings API. Deployed embedding models can be called through OpenAI-compatible POST /v1/embeddings for RAG, semantic search, clustering, and retrieval workflows.Audio endpoints. Speech-to-text and text-to-speech deployments expose OpenAI-compatible /v1/audio/transcriptions and /v1/audio/speech routes.Instant Start. RunInfra’s weight-caching layer reduces Flex cold starts for compatible deployments.Endpoint playground tests. The Deploy tab playground now targets the selected deployment endpoint so tests match the endpoint row being inspected.Workspace-scoped keys. Workspace keys can reach verified deployments by passing the target model in the request body or by discovering available models with GET /v1/models.

April 5, 2026

Release

RunInfra public launch

Initial release

RunInfra launched with chat-driven pipeline building, optimization, and deployment for production AI inference.Core platform. Describe a use case in plain English and RunInfra helps select models, configure routing, optimize the pipeline, and deploy an API endpoint.Visual pipeline canvas. Compose Model, Cache, Guardrail, Rate Limiter, Router, and Load Balancer nodes when a visual workflow is preferred.Optimization engine. Benchmark GPU choices, quantization options, serving settings, and kernel optimizations against latency, throughput, cost, and quality goals.Deployment. Ship endpoints with managed GPU hosting, autoscaling, monitoring, Flex scale-to-zero mode, and Active always-on mode for teams that need zero cold start.OpenAI-compatible APIs. Call deployed pipelines with familiar OpenAI-style routes and SDKs.Developer tools. Test endpoints in the playground, export deployment code, and inspect usage analytics across requests, tokens, cost, and latency.Model support. Start with LLMs, speech-to-text, text-to-speech, and custom Hugging Face models.

Roadmap

RunInfra currently supports LLMs, embeddings, speech-to-text, text-to-speech, vision-language, and image-generation pipelines where the selected model and runtime support them. Upcoming product areas:

Managed data connections: vector databases and application databases connected directly to inference pipelines.
End-to-end AI infrastructure: ingest data, store embeddings, run inference, and serve results from one platform.

Want early access to an upcoming capability? Contact us and tell us what you’re building.

​Free credits and runbooks

​Deployments, keys, and export kits

​Billing you can trust at a glance

​Account

​Security

​Session tabs, redesigned

​Billing and cost visibility

​Measured comparisons and durable executions

​Runbook execution hardening

​Model access checks

​Session tab flow

​Test tab comparison targets

​API and SDK coverage expansion

​Optimization planning improvements

​Documentation expansion

​Metrics, reliability, and privacy

​Runtime and endpoint expansion

​Initial release

​Roadmap

Free credits and runbooks

Deployments, keys, and export kits

Billing you can trust at a glance

Account

Security

Session tabs, redesigned

Billing and cost visibility

Measured comparisons and durable executions

Runbook execution hardening

Model access checks

Session tab flow

Test tab comparison targets

API and SDK coverage expansion

Optimization planning improvements

Documentation expansion

Metrics, reliability, and privacy

Runtime and endpoint expansion

Initial release

Roadmap