RunInfra/Docs
GuideChangelog
Sign inGet started
Documentation
Introduction
Welcome to RunInfraQuickstartPlans and PricingFAQ
Prompting
Prompting Best PracticesExample PromptsDebugging Prompts
Features
OptimizationDeploymentMonitoringModelsGPU and Pricing
Tips & Tricks
From Idea to PipelineTroubleshooting
Changelog
Documentation
Introduction
Welcome to RunInfraQuickstartPlans and PricingFAQ
Prompting
Prompting Best PracticesExample PromptsDebugging Prompts
Features
OptimizationDeploymentMonitoringModelsGPU and Pricing
Tips & Tricks
From Idea to PipelineTroubleshooting
Changelog

Changelog

New features, improvements, and fixes shipped to RunInfra.

Apr 5, 2026

Initial Release

New
  • RunInfra is live. Build, optimize, and deploy AI inference pipelines through conversation. Describe what you need in plain English, and the agent handles the rest.
  • Chat-driven pipeline builder. No YAML, no DevOps. The AI agent selects models, configures routing, and optimizes your pipeline from a single chat.
  • GPU optimization engine. Benchmarks models across GPU types (L4 through B200) using real inference. Searches for pre-optimized model variants (AWQ, GPTQ, FP8) and tests them against your baseline.
  • Forge kernel optimization. Profiles GPU bottlenecks and applies pre-optimized Triton kernels for additional speedup.
  • NVIDIA TensorRT-LLM. Compiled inference engine for maximum throughput on NVIDIA GPUs. Available on Team plan.
  • One-click deploy. Push optimized pipelines to production API endpoints with managed GPU hosting, auto-scaling, and monitoring.
  • Deployment modes. Scale-to-zero (pay only when processing) or always-on (zero cold start, Team plan). Cold starts under 2 seconds with cached model weights.
  • Per-token pricing. Simple, transparent billing based on model size. See estimated costs before you deploy.
  • Plan tiers. Starter (free), Pro ($99/mo), Team ($249/seat/mo), Enterprise (custom). Each tier unlocks more optimization methods, deployment options, and scaling.
  • OpenAI-compatible endpoints. Every deployed pipeline works with the OpenAI SDK. Change two lines to switch from OpenAI.
  • API playground. Test your pipeline with real requests before deploying. See response quality, latency, and token usage.
  • Optimization dashboard. Real metrics for every experiment: latency (P50, P99), throughput, cost per request, and quality score. Compare optimization versions side by side.
  • Visual pipeline canvas. Drag-and-drop node composition with Model, Cache, Guardrail, Rate Limiter, Router, and Load Balancer nodes.
  • Code export. Generate production-ready deployment files: Python scripts, Dockerfiles, Kubernetes manifests, and Docker Compose configurations.
  • Session persistence. Conversations, optimization results, and pipeline state survive page reloads.
  • Usage analytics. Track requests, tokens, cost, and latency across all endpoints with daily charts and per-model breakdowns.
  • Full documentation. Prompting guide, example conversations, feature docs, and troubleshooting.

Roadmap

RunInfra is currently focused on large language models. Here's what's coming next:

  • Vision models. Image classification, object detection, and visual Q&A with the same chat-driven optimization workflow.
  • Speech models. Speech-to-text and text-to-speech endpoints, optimized and deployed like any LLM pipeline.
  • Image generation. Stable Diffusion, FLUX, and other diffusion models with GPU optimization.
  • Embedding models. Optimized embedding endpoints for RAG, semantic search, and retrieval pipelines.
  • Database integration. Managed vector databases and traditional databases connected directly to your inference pipelines.
  • End-to-end AI infrastructure. Build complete AI systems: ingest data, store embeddings, run inference, and serve results, all from one platform.

Want early access to any of these? Contact us.

NextDocumentation