Skip to main content

Documentation Index

Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

RunInfra is an AI-powered platform that lets you describe the endpoint you need in plain English and handles everything else: model selection, GPU benchmarking, optimization, deployment, and scaling. Whether you’re building a low-latency chatbot, a batch summarization API, or a multi-model reasoning pipeline, you go from idea to live endpoint without writing infrastructure code. No YAML. No DevOps. No GPU configuration. Just chat.

What you can build

RunInfra covers large language models, speech-to-text, text-to-speech, embeddings, vision-language, and image generation out of the box. Six pre-built use cases ship as starting points you can fork in chat.

Voice agents

Streaming STT, LLM, and TTS fused on one GPU at sub-600ms turn-taking.

AI assistants

Llama, Hermes, Qwen with tool use, streaming, and structured output.

Embeddings + rerank

BGE encoders + cross-encoder reranker fused on one GPU in one round-trip.

RAG search

Hybrid retrieval, grounded generation, citation spans you can audit.

Document AI

Qwen2.5-VL and Llama 3.2 Vision parsing PDFs and forms to JSON.

Transcription

Open Whisper with diarization and PII redaction.

Example prompts

Copy any of these into the Pipes chat to see how it works.
Deploy Llama 3.1 8B as a low-latency customer support chatbot.
Optimize for latency, keep P99 under 200ms.
The agent handles model selection, GPU benchmarking, optimized variant search, kernel optimization, deployment, and autoscaling.

How it works

A four-stage workflow from description to live endpoint.

Describe

Tell RunInfra what you need in plain English. The agent asks clarifying questions when needed, then builds your pipeline automatically.

Optimize

Real GPU profiling across L4 to B200, Hugging Face variant search (AWQ, GPTQ, FP8), and Forge kernel tuning. Results stream in real time.

Deploy

One click ships an OpenAI-compatible endpoint. Flex (scale-to-zero) or Active (always-on). Cold starts under 2 seconds.

Integrate

Every OpenAI SDK works unchanged. Python, TypeScript, curl, LangChain, LlamaIndex, Vercel AI SDK.

Why RunInfra

Closed-source APIs charge per token with no control over latency, throughput, or cost. With RunInfra you own the model and the infrastructure. RunInfra optimizes GPU kernels so your open-source models run as fast as, or faster than, proprietary APIs at a fraction of the cost.

Get started

Quickstart

Your first pipeline in 5 minutes.

Use cases

Pre-built workflows you can fork in chat.

Which model?

Decision table by use case and priority.

Deployments

Flex vs Active, deployment targets, scaling.