RunInfra is the fastest way to ship open-source AI models as production APIs. Describe the endpoint you need in plain English and RunInfra picks the model, benchmarks real GPUs, applies kernel optimizations, and deploys an OpenAI-compatible HTTP endpoint.Documentation Index
Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Get started in minutes
Start fast with a chat prompt
Describe your use case in Pipes. The agent builds, optimizes, and deploys in one flow.
Optimize on real GPUs
Profile across L4 to B200. Search AWQ, GPTQ, FP8 variants. Apply Forge kernels.
Deploy with one click
Flex scale-to-zero or Active always-on. Cold starts under 2 seconds.
What you can build
Voice agents
Streaming STT, LLM, TTS fused at sub-600ms turn-taking.
AI assistants
Llama, Hermes, Qwen with tools, streaming, structured output.
Embeddings + rerank
BGE encoder + cross-encoder reranker in one round-trip.
RAG search
Hybrid retrieval, grounded generation, auditable citations.
Document AI
Vision-language models parsing PDFs and forms to JSON.
Transcription
Whisper with diarization and PII redaction.
Resources and help
Which model should I use?
Pick the right model for your use case.
Example prompts
Copy-ready prompts for every pipeline shape.
API reference
Complete OpenAI-compatible HTTP API.
Plans and pricing
Compare Starter, Pro, Team, and Enterprise.
Troubleshooting
Fix 4xx, 5xx, cold starts, and deploy failures.
Talk to sales
Volume pricing, SLAs, and SOC 2 or HIPAA.