Build and deploy your first AI inference endpoint. Create an account, describe what you need in plain English, run optimization, deploy to a live URL, and make your first API call. About five minutes, start to finish.

Free to start

Starter plan is free forever. No credit card required.

OpenAI-compatible

Works with every OpenAI SDK and LangChain/LlamaIndex.

Under 2s cold start

Cached weights keep even scale-to-zero endpoints fast.

You can build, optimize, and test pipelines in the playground on the free Starter plan. Deploying to a live endpoint requires the Pro plan ($49/mo).

1. Create an account

Sign up at runinfra.ai/sign-up using GitHub or Google. No credit card is required to get started.

2. Describe your pipeline

Open Pipes and type what you need in plain English:
I need a fast chatbot using Llama 3.1 8B optimized for low latency
The agent builds your pipeline, selects the model, and configures everything automatically. To make changes, just continue the conversation:
Add a response cache and switch to Qwen 2.5 7B instead

3. Optimize

The agent benchmarks your model across GPUs, searches for optimized variants (AWQ, GPTQ, FP8), and identifies the best configuration. You see real-time progress as experiments complete.

You can set specific targets before optimization starts:
Optimize for latency, keep cost under $0.10 per request
Optimization takes 2-5 minutes. When it finishes, review the results and select the configuration you want to deploy.
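
A target such as "optimize for latency, keep cost under $0.10 per request" can be read as a constrained choice: among the configurations that meet the cost ceiling, prefer the one with the lowest latency. The sketch below illustrates that reading only. It is not RunInfra's actual selection logic, and the `Candidate` class, the `pick_config` function, and every number in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One benchmarked configuration (hypothetical structure)."""
    name: str
    latency_ms: float
    cost_per_request: float


def pick_config(candidates, max_cost):
    """Return the lowest-latency candidate whose cost is within max_cost.

    Returns None when no candidate satisfies the cost ceiling.
    """
    eligible = [c for c in candidates if c.cost_per_request <= max_cost]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.latency_ms)


# Made-up numbers for illustration only, not benchmark results.
results = [
    Candidate("config-a", latency_ms=120, cost_per_request=0.15),
    Candidate("config-b", latency_ms=180, cost_per_request=0.08),
    Candidate("config-c", latency_ms=210, cost_per_request=0.05),
]

best = pick_config(results, max_cost=0.10)
print(best.name if best else "no configuration meets the cost target")
```

Here the fastest candidate is excluded because it exceeds the cost ceiling, so the choice falls to the fastest of the remaining two.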

4. Deploy your endpoint

Click Deploy in the deploy tab. RunInfra provisions a GPU endpoint with scale-to-zero and fast cold starts (under 2 seconds). Your endpoint URL and API key appear as soon as the endpoint is ready.
Deployment requires the Pro plan ($49/mo). On the free Starter plan you can build, optimize, and test in the playground without upgrading.
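
This page does not say whether a request can time out while a scale-to-zero endpoint is starting up. If your client uses a short timeout, a generic retry wrapper with exponential backoff is one defensive option. The following is a minimal sketch under that assumption; `call_with_retry` and its parameters are hypothetical helpers, not part of any RunInfra SDK.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn() and retry on any exception with exponential backoff.

    Waits base_delay * 2**i seconds after the i-th failed attempt and
    re-raises the last exception if every attempt fails.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            sleep(base_delay * 2 ** i)
```

In practice `fn` would be a zero-argument callable that wraps your API call, for example a `lambda` around the chat completion request. The `sleep` parameter exists so the delay can be replaced in tests.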

5. Use your endpoint

Your endpoint is OpenAI-compatible. Use your RunInfra API key and the RunInfra base URL with any OpenAI SDK or HTTP client.
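from openai import OpenAI

# Point the OpenAI client at RunInfra instead of api.openai.com.
client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

# Send a single-turn chat request to the deployed pipeline.
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is RunInfra?"}],
)

# Print the text of the first completion.
print(response.choices[0].message.content)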
Replace YOUR_RUNINFRA_API_KEY with your actual API key from Settings > API Keys.
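
Because the endpoint is OpenAI-compatible, the same call can also be made with a plain HTTP client. The sketch below uses only the Python standard library and mirrors what the OpenAI SDK sends: a POST to `/chat/completions` under the base URL with a Bearer token. The environment variable name `RUNINFRA_API_KEY` and the helper names are assumptions made for illustration; check the API reference for the exact request and response fields.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.runinfra.ai/v1"  # base URL from the SDK example


def build_chat_request(prompt, api_key, model="default"):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request, timeout=30):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # RUNINFRA_API_KEY is a hypothetical environment variable name.
    key = os.environ.get("RUNINFRA_API_KEY")
    if key:
        print(send_chat_request(build_chat_request("What is RunInfra?", key)))
    else:
        print("Set RUNINFRA_API_KEY to send a live request.")
```

Reading the key from the environment keeps it out of source control, which is a reasonable default regardless of which client you use.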

Next steps

Use cases

Six pre-built workflows: voice, assistants, embeddings, RAG, document AI, transcription.

Prompting best practices

Write better prompts and get better pipelines from the agent.

Deployment

Flex vs Active endpoints, scaling, and cold-start configuration.

API reference

The full OpenAI-compatible HTTP API surface.