RunInfra is an AI-powered platform that lets you describe the endpoint you need in plain English and handles everything else: model selection, GPU benchmarking, optimization, deployment, and scaling. Whether you’re building a low-latency chatbot, a batch summarization API, or a multi-model reasoning pipeline, you go from idea to live endpoint without writing infrastructure code. No YAML. No DevOps. No GPU configuration. Just chat.Documentation Index
Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
What you can build
RunInfra covers large language models, speech-to-text, text-to-speech, embeddings, vision-language, and image generation out of the box. Six pre-built use cases ship as starting points you can fork in chat.Voice agents
Streaming STT, LLM, and TTS fused on one GPU at sub-600ms turn-taking.
AI assistants
Llama, Hermes, Qwen with tool use, streaming, and structured output.
Embeddings + rerank
BGE encoders + cross-encoder reranker fused on one GPU in one round-trip.
RAG search
Hybrid retrieval, grounded generation, citation spans you can audit.
Document AI
Qwen2.5-VL and Llama 3.2 Vision parsing PDFs and forms to JSON.
Transcription
Open Whisper with diarization and PII redaction.
Example prompts
Copy any of these into the Pipes chat to see how it works.How it works
A four-stage workflow from description to live endpoint.Describe
Tell RunInfra what you need in plain English. The agent asks clarifying questions when needed, then builds your pipeline automatically.
Optimize
Real GPU profiling across L4 to B200, Hugging Face variant search (AWQ, GPTQ, FP8), and Forge kernel tuning. Results stream in real time.
Deploy
One click ships an OpenAI-compatible endpoint. Flex (scale-to-zero) or Active (always-on). Cold starts under 2 seconds.
Why RunInfra
Closed-source APIs charge per token with no control over latency, throughput, or cost. With RunInfra you own the model and the infrastructure. RunInfra optimizes GPU kernels so your open-source models run as fast as, or faster than, proprietary APIs at a fraction of the cost.Get started
Quickstart
Your first pipeline in 5 minutes.
Use cases
Pre-built workflows you can fork in chat.
Which model?
Decision table by use case and priority.
Deployments
Flex vs Active, deployment targets, scaling.