Skip to main content

Documentation Index

Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

RunInfra is the fastest way to ship open-source AI models as production APIs. Describe the endpoint you need in plain English and RunInfra picks the model, benchmarks real GPUs, applies kernel optimizations, and deploys an OpenAI-compatible HTTP endpoint.

Get started in minutes

Start fast with a chat prompt

Describe your use case in Pipes. The agent builds, optimizes, and deploys in one flow.

Optimize on real GPUs

Profile across L4 to B200. Search AWQ, GPTQ, FP8 variants. Apply Forge kernels.

Deploy with one click

Flex scale-to-zero or Active always-on. Cold starts under 2 seconds.
Not sure where to start? Pick a model with the model catalog, then choose Flex to prototype and move to Active for production traffic. Need help tuning a workload? Talk to our team.

What you can build

Voice agents

Streaming STT, LLM, TTS fused at sub-600ms turn-taking.

AI assistants

Llama, Hermes, Qwen with tools, streaming, structured output.

Embeddings + rerank

BGE encoder + cross-encoder reranker in one round-trip.

RAG search

Hybrid retrieval, grounded generation, auditable citations.

Document AI

Vision-language models parsing PDFs and forms to JSON.

Transcription

Whisper with diarization and PII redaction.

Resources and help

Which model should I use?

Pick the right model for your use case.

Example prompts

Copy-ready prompts for every pipeline shape.

API reference

Complete OpenAI-compatible HTTP API.

Plans and pricing

Compare Starter, Pro, Team, and Enterprise.

Troubleshooting

Fix 4xx, 5xx, cold starts, and deploy failures.

Talk to sales

Volume pricing, SLAs, and SOC 2 or HIPAA.