Quickstart - RunInfra

Build and deploy your first AI inference endpoint. Create an account, describe what you need in plain English, run optimization, deploy to a live URL, and make your first API call. About five minutes, start to finish.

Free to start

$10 in free credits on signup. No credit card required.

OpenAI-compatible

Works with the OpenAI Python and JavaScript SDKs, with documented examples for LangChain and LlamaIndex.

Under 2s cold start

Cached weights keep even scale-to-zero endpoints fast.

Your $10 in free credits let you build, optimize, and test pipelines in the playground. Deploying to a live endpoint requires a paid Core plan (custom amount from $50/mo).

Create an account

Describe your pipeline

Open the dashboard and type what you need in plain English:

I need a fast chatbot using Llama 3.1 8B optimized for low latency

The agent builds your pipeline, selects the model, and configures everything automatically. To make changes, just continue the conversation:

Add a response cache and switch to Qwen 2.5 7B instead

Optimize

The agent benchmarks your model across GPUs, searches for optimized variants (AWQ, GPTQ, FP8), and identifies the best configuration. You see real-time progress as experiments complete.

You can set specific targets before optimization starts:

Optimize for latency, keep cost under $0.10 per request

Optimization takes 2-5 minutes. When it finishes, review the results and select the configuration you want to deploy.

Deploy your endpoint

Click Deploy in the deploy tab. RunInfra provisions a GPU endpoint with scale-to-zero and fast cold starts (under 2 seconds). Your endpoint URL and API key appear as soon as the endpoint is ready.

Deployment requires a paid Core plan (custom amount from $50/mo). Your $10 in free credits let you build, optimize, and test in the playground without upgrading.

Use your endpoint

Your endpoint is OpenAI-compatible. Use your RunInfra API key and the RunInfra base URL with OpenAI Python or JavaScript SDKs, or direct HTTP requests.

Python

import os
from openai import OpenAI

# Point the OpenAI SDK at the RunInfra base URL.
client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

# RUNINFRA_MODEL must hold a model id returned by GET /v1/models.
response = client.chat.completions.create(
    model=os.environ["RUNINFRA_MODEL"],
    messages=[{"role": "user", "content": "What is RunInfra?"}],
)

print(response.choices[0].message.content)

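JavaScript

// Run as an ES module: the example uses import and top-level await.
import OpenAI from "openai";

// Point the OpenAI SDK at the RunInfra base URL.
const client = new OpenAI({
  baseURL: "https://api.runinfra.ai/v1",
  apiKey: "YOUR_RUNINFRA_API_KEY",
});

// RUNINFRA_MODEL must hold a model id returned by GET /v1/models.
const response = await client.chat.completions.create({
  model: process.env.RUNINFRA_MODEL,
  messages: [{ role: "user", content: "What is RunInfra?" }],
});

console.log(response.choices[0].message.content);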

cURL

# The ${RUNINFRA_MODEL:?...} expansion aborts with an error if RUNINFRA_MODEL is unset.
curl https://api.runinfra.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_RUNINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"${RUNINFRA_MODEL:?Set RUNINFRA_MODEL from GET /v1/models}\",\"messages\":[{\"role\":\"user\",\"content\":\"What is RunInfra?\"}]}"

Replace YOUR_RUNINFRA_API_KEY with your actual API key from Settings > API Keys, and set RUNINFRA_MODEL to a model id returned by GET /v1/models.
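To find a value for RUNINFRA_MODEL, you can call GET /v1/models yourself. The following is a minimal sketch using only the Python standard library. It assumes the route returns an OpenAI-style list response of the form {"object": "list", "data": [{"id": ...}]}, which this page does not confirm, and it reads the key from a RUNINFRA_API_KEY environment variable, a name chosen here for illustration. The helper names are likewise hypothetical.

```python
import json
import os
import urllib.request

# Base URL taken from the examples above.
BASE_URL = "https://api.runinfra.ai/v1"


def build_models_request(api_key: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build an authenticated GET request for the /models route."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def extract_model_ids(payload: dict) -> list[str]:
    """Collect model ids from a list response.

    Assumption: the payload follows the OpenAI-style shape
    {"object": "list", "data": [{"id": "..."}, ...]}.
    """
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def list_model_ids(api_key: str) -> list[str]:
    """Call GET /v1/models and return the ids it reports."""
    with urllib.request.urlopen(build_models_request(api_key), timeout=30) as resp:
        return extract_model_ids(json.load(resp))


# RUNINFRA_API_KEY is a hypothetical variable name; the network call is
# skipped when it is not set.
if __name__ == "__main__" and os.environ.get("RUNINFRA_API_KEY"):
    for model_id in list_model_ids(os.environ["RUNINFRA_API_KEY"]):
        print(model_id)
```

Pick one of the printed ids and export it as RUNINFRA_MODEL before running the chat completion examples above.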

Next steps

Use cases

Six pre-built workflows: voice, assistants, embeddings, RAG, document AI, transcription.

Prompting best practices

Write better prompts and get better pipelines from the agent.

Deployment

Flex vs Active endpoints, scaling, and cold-start configuration.

API reference

Supported OpenAI-compatible endpoints and fields.
