Changelog
New features, improvements, and fixes shipped to RunInfra.
Initial Release
New
- RunInfra is live. Build, optimize, and deploy AI inference pipelines through conversation. Describe what you need in plain English, and the agent handles the rest.
- Chat-driven pipeline builder. No YAML, no DevOps. The AI agent selects models, configures routing, and optimizes your pipeline from a single chat.
- GPU optimization engine. Benchmarks models across GPU types (L4 through B200) using real inference. Searches for pre-optimized model variants (AWQ, GPTQ, FP8) and tests them against your baseline.
- Forge kernel optimization. Profiles GPU bottlenecks and applies pre-optimized Triton kernels for additional speedup.
- NVIDIA TensorRT-LLM. Compiled inference engine for maximum throughput on NVIDIA hardware. Available on the Team plan.
- One-click deploy. Push optimized pipelines to production API endpoints with managed GPU hosting, auto-scaling, and monitoring.
- Deployment modes. Scale-to-zero (pay only when processing) or always-on (zero cold start, Team plan). Cold starts under 2 seconds with cached model weights; see the timing sketch after this list.
- Per-token pricing. Simple, transparent billing based on model size. See estimated costs before you deploy.
- Plan tiers. Starter (free), Pro ($99/mo), Team ($249/seat/mo), Enterprise (custom). Each tier unlocks more optimization methods, deployment options, and scaling capacity.
- OpenAI-compatible endpoints. Every deployed pipeline works with the OpenAI SDK. Change two lines to switch from OpenAI; see the example after this list.
- API playground. Test your pipeline with real requests before deploying. See response quality, latency, and token usage.
- Optimization dashboard. Real metrics for every experiment: latency (P50, P99), throughput, cost per request, and quality score. Compare optimization versions side by side.
- Visual pipeline canvas. Drag-and-drop node composition with Model, Cache, Guardrail, Rate Limiter, Router, and Load Balancer nodes.
- Code export. Generate production-ready deployment files: Python scripts, Dockerfiles, Kubernetes manifests, and Docker Compose configurations.
- Session persistence. Conversations, optimization results, and pipeline state survive page reloads.
- Usage analytics. Track requests, tokens, cost, and latency across all endpoints with daily charts and per-model breakdowns.
- Full documentation. Prompting guide, example conversations, feature docs, and troubleshooting.
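The endpoint switch in practice: a minimal sketch using the OpenAI Python SDK. The base URL, API key, and model name below are placeholders, not real RunInfra values; substitute the ones from your deployed pipeline.

```python
from openai import OpenAI

# Changed line 1: point the client at your RunInfra endpoint
# (placeholder URL shown here).
# Changed line 2: use your RunInfra API key instead of an OpenAI key.
client = OpenAI(
    base_url="https://api.runinfra.example/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

# Everything else is standard OpenAI SDK usage.
response = client.chat.completions.create(
    model="my-optimized-pipeline",  # placeholder pipeline identifier
    messages=[{"role": "user", "content": "Hello from RunInfra!"}],
)
print(response.choices[0].message.content)
```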
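And a rough way to observe the two deployment modes: time a first request against a warm one. On a scale-to-zero deployment that has been idle, the first call should include the cold start; on an always-on deployment, both should be comparable. Endpoint and model names are the same placeholders as above.

```python
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.example/v1",  # placeholder endpoint
    api_key="YOUR_RUNINFRA_API_KEY",             # placeholder key
)

def timed_request(label: str) -> None:
    """Send a tiny request and print how long it took."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="my-optimized-pipeline",  # placeholder pipeline identifier
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# After the deployment has scaled to zero, the first request pays the
# cold start; the second hits a warm replica.
timed_request("first request (cold)")
timed_request("second request (warm)")
```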
Roadmap
RunInfra is currently focused on large language models. Here's what's coming next:
- Vision models. Image classification, object detection, and visual Q&A with the same chat-driven optimization workflow.
- Speech models. Speech-to-text and text-to-speech endpoints, optimized and deployed like any LLM pipeline.
- Image generation. Stable Diffusion, FLUX, and other diffusion models with GPU optimization.
- Embedding models. Optimized embedding endpoints for RAG, semantic search, and retrieval pipelines.
- Database integration. Managed vector databases and traditional databases connected directly to your inference pipelines.
- End-to-end AI infrastructure. Build complete AI systems: ingest data, store embeddings, run inference, and serve results, all from one platform.
Want early access to any of these? Contact us.