# RunInfra

> Plain English to production AI inference endpoints. RunInfra selects models, benchmarks GPUs, applies kernel optimizations, and deploys OpenAI-compatible APIs.

## Docs

- [AI onboarding prompt](https://runinfra.ai/docs/ai-onboarding/prompt-block.md): Copy-paste prompt for any LLM. Teaches your AI assistant how to ship code against RunInfra correctly.
- [Audio](https://runinfra.ai/docs/api-reference/audio.md): POST /v1/audio/speech and /v1/audio/transcriptions, text-to-speech and speech-to-text.
- [Authentication](https://runinfra.ai/docs/api-reference/authentication.md): API key scopes, creation, rotation, and expiration for the RunInfra inference API.
- [Chat completions](https://runinfra.ai/docs/api-reference/chat-completions.md): POST /v1/chat/completions, OpenAI-compatible chat with streaming, tools, and structured output.
- [Embeddings](https://runinfra.ai/docs/api-reference/embeddings.md): POST /v1/embeddings, vector embeddings for semantic search, RAG, and clustering.
- [Error codes](https://runinfra.ai/docs/api-reference/errors.md): HTTP status codes RunInfra returns, what causes them, and how to recover.
- [API reference](https://runinfra.ai/docs/api-reference/introduction.md): OpenAI-compatible inference API. Set base_url once, reach verified deployments with a workspace-scoped key or a pipeline-scoped key.
- [Models](https://runinfra.ai/docs/api-reference/models.md): GET /v1/models, lists the verified deployed models in your workspace.
- [Rate limits](https://runinfra.ai/docs/api-reference/rate-limits.md): Per-key request budgets, response headers, 429 behavior, and how to raise limits.
- [SSE events](https://runinfra.ai/docs/api-reference/sse-events.md): Server-sent event types fired by RunInfra during optimization sessions, chat streams, and infra activity. Covers event names, payload shapes, heartbeat rules, and reconnection patterns.
- [Changelog](https://runinfra.ai/docs/changelog.md): A full record of RunInfra releases, feature launches, and platform updates, newest entries first, with a look at what is coming next.
- [Cookbook](https://runinfra.ai/docs/cookbook/overview.md): Copy-paste recipes for the most common RunInfra inference patterns. Every recipe runs out of the box with a free Starter key.
- [Retrieval-augmented generation](https://runinfra.ai/docs/cookbook/rag.md): Embed, retrieve, generate. A complete RAG loop in 30 lines using two RunInfra pipelines.
- [Streaming responses](https://runinfra.ai/docs/cookbook/streaming.md): Token-by-token responses with the OpenAI SDK. Server-sent events over HTTPS, same format as OpenAI.
- [Structured output](https://runinfra.ai/docs/cookbook/structured-output.md): Guaranteed-parseable JSON responses via JSON Schema. Works with every RunInfra model that supports tool calling.
- [Tool calling](https://runinfra.ai/docs/cookbook/tool-calling.md): Function calling with typed arguments. The model picks a tool, you run it and feed the result back, repeating in a multi-turn loop.
- [Autoscaling](https://runinfra.ai/docs/deployments/autoscaling.md): How RunInfra replicas scale up and down with traffic. Flex scale-to-zero or Active always-on, with concurrency, queue depth, and cost-latency math.
- [Instant Start](https://runinfra.ai/docs/deployments/instant-start.md): RunInfra's weight-caching layer that keeps cold starts fast on scale-to-zero deployments. Covers the cache architecture, eviction rules, multi-GPU sync, and the parts of cold start it does not eliminate.
- [Deployments overview](https://runinfra.ai/docs/deployments/overview.md): Deploy any optimized RunInfra pipeline as an OpenAI-compatible production API. Two modes, sub-2s cold starts, per-token billing.
- [Speculative decoding](https://runinfra.ai/docs/deployments/speculation.md): A small draft model proposes tokens, the target model verifies them in a single pass. Higher throughput with no quality change.
- [Deployment targets](https://runinfra.ai/docs/deployments/targets.md): Three places RunInfra can ship your pipeline: managed RunPod Serverless (default), self-hosted Modal in your own account, or a custom GPU you bring.
- [Account and access](https://runinfra.ai/docs/faq/account.md): FAQ about sign up, API keys, workspaces, seats, and dashboard access.
- [Billing](https://runinfra.ai/docs/faq/billing.md): FAQ about token pricing, optimization sessions, invoices, overage, and credits.
- [Infrastructure](https://runinfra.ai/docs/faq/infrastructure.md): FAQ about GPUs, regions, data residency, uptime, and security.
- [Models and inference](https://runinfra.ai/docs/faq/models-inference.md): FAQ about supported models, quantization, context windows, streaming, tool calling, and fine-tuning.
- [GPUs and pricing](https://runinfra.ai/docs/features/gpu-pricing.md): RunInfra bills per million tokens, not per GPU hour. Understand how GPU selection, deployment mode, and model size affect your inference cost.
- [Image generation](https://runinfra.ai/docs/features/image-generation.md): Text-to-image inference on RunInfra: FLUX, SDXL, and Stable Diffusion 3.5 served through a Diffusers FastAPI runtime with torchao FP8 + torch.compile on Ada / Hopper / Blackwell GPUs.
- [Models](https://runinfra.ai/docs/features/models.md): RunInfra supports thousands of LLMs, embeddings, vision-language, speech-to-text, and text-to-speech models from Hugging Face, with custom model upload available on the Team plan.
- [Monitoring](https://runinfra.ai/docs/features/monitoring.md): Track requests, latency percentiles, throughput, token usage, and cost across all your RunInfra endpoints from a single real-time dashboard.
- [Optimization](https://runinfra.ai/docs/features/optimization.md): GPU profiling, quantized-variant search, Forge kernels, and speculation. The RunInfra optimizer picks the right configuration so you don't have to.
- [Build with RunInfra](https://runinfra.ai/docs/index.md): Plain English to production AI endpoints.
- [LangChain](https://runinfra.ai/docs/integrations/langchain.md): Use RunInfra as the LLM provider for any LangChain application. One-line config change.
- [LlamaIndex](https://runinfra.ai/docs/integrations/llamaindex.md): Use RunInfra as the LLM and embedding provider in any LlamaIndex pipeline.
- [Using RunInfra with other libraries](https://runinfra.ai/docs/integrations/overview.md): RunInfra's OpenAI-compatible HTTP API works with any library that speaks OpenAI. The fastest path for the most common frameworks.
- [Vercel AI SDK](https://runinfra.ai/docs/integrations/vercel-ai-sdk.md): Use RunInfra with the Vercel AI SDK. Works with Next.js, SvelteKit, Nuxt, and Remix.
- [Which model should I use?](https://runinfra.ai/docs/introduction/model-picker.md): Pick the right model for your use case. Decision table by task, size, and performance priority.
- [Plans and pricing](https://runinfra.ai/docs/introduction/plans.md): Compare Starter, Pro, Team, and Enterprise plans, including optimization sessions, token pricing, rollover rules, and overage costs.
- [Quickstart](https://runinfra.ai/docs/introduction/quickstart.md): Create an account, describe your pipeline, optimize it, and deploy a live OpenAI-compatible inference endpoint, all without writing infrastructure code.
- [What is RunInfra?](https://runinfra.ai/docs/introduction/welcome.md): RunInfra turns plain English into production AI inference endpoints. Describe your use case and the AI agent builds, optimizes, and deploys it for you.
- [News](https://runinfra.ai/docs/news/overview.md): Announcements, product notes, research updates, and engineering posts from the RunInfra team. Subscribe via RSS or Atom.
- [Prompting best practices](https://runinfra.ai/docs/prompting/best-practices.md): Learn what to include in every RunInfra prompt so the agent builds the right pipeline the first time, without back-and-forth clarification.
- [Debugging](https://runinfra.ai/docs/prompting/debugging.md): Fix common RunInfra issues: wrong model selection, poor optimization results, slow cold starts, and failed deployments, with direct corrective prompts.
- [Example prompts](https://runinfra.ai/docs/prompting/example-prompts.md): Copy-ready prompts for chatbots, summarizers, code generation, multilingual APIs, and more, with notes on what the RunInfra agent builds for each.
- [Glossary](https://runinfra.ai/docs/reference/glossary.md): RunInfra domain terms on one page. GPU, quantization, serving, and agent vocabulary.
- [Research](https://runinfra.ai/docs/research/overview.md): Open papers from the RunInfra team on attention efficiency, LLM inference, kernel optimization, and the architectures behind production AI infrastructure.
- [Idea to pipeline](https://runinfra.ai/docs/tips/from-idea-to-pipeline.md): Walk through every step of building, optimizing, deploying, and integrating a RunInfra AI pipeline, from blank page to production endpoint.
- [Troubleshooting](https://runinfra.ai/docs/tips/troubleshooting.md): Fix common issues with RunInfra pipeline building, optimization, deployment, and API integration, organized by category for fast diagnosis.
- [OpenAI compatibility](https://runinfra.ai/docs/tools-sdks/openai-compatibility.md): RunInfra exposes an OpenAI-shaped HTTP API for verified deployment endpoints. Point any OpenAI SDK at a RunInfra deployment and it works.
- [RunInfra SDK](https://runinfra.ai/docs/tools-sdks/runinfra-sdk.md): Use the native RunInfra TypeScript and Python SDKs for optimized deployment access, scoped API keys, request IDs, retries, streaming, audio, images, and webhook verification helpers.
- [AI assistants](https://runinfra.ai/docs/use-cases/ai-assistant.md): Personal assistants on the open models you own. Llama, Hermes, Qwen with tool use, policy, and streaming on a single GPU.
- [Document AI](https://runinfra.ai/docs/use-cases/document-ai.md): Open vision-language models parsing PDFs, forms, and tables into structured JSON. Per-page billing becomes per-million-token billing.
- [Embeddings and reranking](https://runinfra.ai/docs/use-cases/embeddings.md): BGE, E5, GTE, Nomic. Encoder and cross-encoder reranker fused on one GPU in a single round-trip.
- [Use cases](https://runinfra.ai/docs/use-cases/overview.md): Six pre-built workflows you can fork in chat: voice agents, AI assistants, embeddings, RAG search, document AI, and transcription. Each ships with its own model stack, optimization recipe, and benchmark targets.
- [RAG search](https://runinfra.ai/docs/use-cases/rag-search.md): Cited Q&A you can audit on your own corpus. Hybrid retrieval, grounded generation, and citation spans.
- [Transcription](https://runinfra.ai/docs/use-cases/transcription.md): Long-form audio to searchable transcripts. Speaker diarization, PII redaction, and export on your own stack with open Whisper.
- [Voice agents](https://runinfra.ai/docs/use-cases/voice-agent.md): Sub-600ms turn-taking on a single L40S. Streaming speech-to-text, an open LLM, and streaming text-to-speech fused on one GPU.

## OpenAPI Specs

- [openapi](https://runinfra.ai/docs/api-reference/openapi.json)