
Welcome to RunInfra

Build, optimize, and deploy AI inference pipelines through conversation.

RunInfra turns plain English into production AI endpoints. Describe what you need, and RunInfra's AI agent builds it, optimizes it, and deploys it for you.

No YAML. No DevOps. No GPU configuration. Just chat.

What you can build

RunInfra currently supports LLM workloads, with vision, speech, and image models coming soon. Here are some examples you can type right into the chat:

Deploy Llama 3.1 8B as a low-latency customer support chatbot
Build a multi-model pipeline: Phi-3 for simple queries, Llama 70B for complex reasoning
Optimize Qwen 2.5 14B for throughput and deploy as a batch summarization API
I need a code generation endpoint using DeepSeek V3, keep cost under $0.005 per request

The agent handles everything: model selection, GPU benchmarking, finding optimized model variants, kernel optimization, deployment, and scaling.

How it works

Describe

Tell RunInfra what you need in plain English. The agent asks clarifying questions if needed, then builds your pipeline automatically.

Optimize

RunInfra profiles your model on real GPUs, searches for optimized variants (AWQ, GPTQ, FP8), benchmarks them, and applies pre-optimized GPU kernels for additional speedup. You see results in real time and pick the best configuration.

Deploy

One click deploys your optimized pipeline as a production API. Choose scale-to-zero (pay only when you receive traffic) or always-on (no cold starts). RunInfra Cloud takes scale-to-zero endpoints from cold to serving in under 2 seconds.

Integrate

Your endpoint is OpenAI-compatible. Point any OpenAI SDK at your RunInfra URL and it works: Python, JavaScript, cURL, LangChain, LlamaIndex.
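Because the endpoint speaks the OpenAI wire format, you don't even need an SDK. Here is a minimal, dependency-free sketch using only Python's standard library; the base URL, API key, and model name are placeholders, not real values (substitute the ones shown on your dashboard after deploying):

```python
import json
import urllib.request

# Placeholder values -- replace with the URL, key, and model name
# from your own RunInfra deployment.
BASE_URL = "https://api.runinfra.example/v1"
API_KEY = "YOUR_RUNINFRA_API_KEY"


def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }


def send_chat_request(payload: dict) -> dict:
    """POST the payload to the endpoint and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Build the request body; call send_chat_request(payload) against a
# live endpoint to get a standard OpenAI-format response back.
payload = build_chat_request("llama-3.1-8b", "Hello!")
print(json.dumps(payload))
```

The same request body works unchanged with the official OpenAI SDKs if you set their base URL to your RunInfra endpoint, which is what makes drop-in tools like LangChain and LlamaIndex work out of the box.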

Why RunInfra

Own your AI. With closed-source APIs, you pay per token with no control over latency, throughput, or cost. With RunInfra, you own the model and the infrastructure. RunInfra optimizes GPU kernels so your open-source models run as fast as (or faster than) proprietary APIs, at a fraction of the cost.

Get started

Quickstart
Your first pipeline in 5 minutes.

Prompting Guide
How to talk to RunInfra's agent effectively.

Example Prompts
Real conversations for every use case.
