Changelog
New features, improvements, and fixes shipped to RunInfra.
Initial Release
New
- RunInfra is live. Build, optimize, and deploy AI inference pipelines through conversation. Describe what you need in plain English, and the agent handles the rest.
- Chat-driven pipeline builder. No YAML, no DevOps. The AI agent selects models, configures routing, and optimizes your pipeline from a single chat.
- GPU optimization engine. Benchmarks models across GPU types (L4 through B200) using real inference. Searches for pre-optimized model variants (AWQ, GPTQ, FP8) and tests them against your baseline.
- Forge kernel optimization. Profiles GPU bottlenecks and applies pre-optimized Triton kernels for additional speedup.
- NVIDIA TensorRT-LLM. Compiled inference engine for maximum throughput on NVIDIA hardware. Available on the Team plan.
- One-click deploy. Push optimized pipelines to production API endpoints with managed GPU hosting, auto-scaling, and monitoring.
- Deployment modes. Scale-to-zero (pay only when processing) or always-on (zero cold start, Team plan). Cold starts under 2 seconds with cached model weights; see the timing sketch after this list.
- Per-token pricing. Simple, transparent billing based on model size. See estimated costs before you deploy.
- Plan tiers. Starter (free), Pro ($99/mo), Team ($249/seat/mo), Enterprise (custom). Each tier unlocks more optimization methods, deployment options, and scaling capacity.
- OpenAI-compatible endpoints. Every deployed pipeline works with the OpenAI SDK. Change two lines to switch from OpenAI; see the example after this list.
- API playground. Test your pipeline with real requests before deploying. See response quality, latency, and token usage.
- Optimization dashboard. Real metrics for every experiment: latency (P50, P99), throughput, cost per request, and quality score. Compare optimization versions side by side.
- Visual pipeline canvas. Drag-and-drop node composition with Model, Cache, Guardrail, Rate Limiter, Router, and Load Balancer nodes.
- Code export. Generate production-ready deployment files: Python scripts, Dockerfiles, Kubernetes manifests, and Docker Compose configurations.
- Session persistence. Conversations, optimization results, and pipeline state survive page reloads.
- Usage analytics. Track requests, tokens, cost, and latency across all endpoints with daily charts and per-model breakdowns.
- Full documentation. Prompting guide, example conversations, feature docs, and troubleshooting.
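The endpoint switch in practice: a minimal sketch using the OpenAI Python SDK. The base URL, API key, and model name below are placeholders, not real RunInfra values; substitute the ones from your deployed pipeline.

```python
from openai import OpenAI

# Changed line 1: point the client at your RunInfra endpoint
# (placeholder URL shown here).
# Changed line 2: use your RunInfra API key instead of an OpenAI key.
client = OpenAI(
    base_url="https://api.runinfra.example/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

# Everything else is standard OpenAI SDK usage.
response = client.chat.completions.create(
    model="my-optimized-pipeline",  # placeholder pipeline identifier
    messages=[{"role": "user", "content": "Hello from RunInfra!"}],
)
print(response.choices[0].message.content)
```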
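And a rough way to observe the two deployment modes: time a first request against a warm one. On a scale-to-zero deployment that has been idle, the first call should include the cold start; on an always-on deployment, both should be comparable. Endpoint and model names are the same placeholders as above.

```python
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.example/v1",  # placeholder endpoint
    api_key="YOUR_RUNINFRA_API_KEY",             # placeholder key
)

def timed_request(label: str) -> None:
    """Send a tiny request and print how long it took."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="my-optimized-pipeline",  # placeholder pipeline identifier
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# After the deployment has scaled to zero, the first request pays the
# cold start; the second hits a warm replica.
timed_request("first request (cold)")
timed_request("second request (warm)")
```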
Roadmap
RunInfra is currently focused on large language models. Here's what's coming next:
- Vision models. Image classification, object detection, and visual Q&A with the same chat-driven optimization workflow.
- Speech models. Speech-to-text and text-to-speech endpoints, optimized and deployed like any LLM pipeline.
- Image generation. Stable Diffusion, FLUX, and other diffusion models with GPU optimization.
- Embedding models. Optimized embedding endpoints for RAG, semantic search, and retrieval pipelines.
- Database integration. Managed vector databases and traditional databases connected directly to your inference pipelines.
- End-to-end AI infrastructure. Build complete AI systems: ingest data, store embeddings, run inference, and serve results, all from one platform.
Want early access to any of these? Contact us.