Quickstart

Build your first AI inference pipeline in 5 minutes.

Create an account

Describe what you need

Open Pipes and type what you want:

I need a fast chatbot using Llama 3.1 8B optimized for low latency

The agent builds your pipeline, selects the model, and configures everything automatically.

Want changes? Just say so:

Add a response cache and switch to Qwen 2.5 7B instead

Optimize

The agent benchmarks your model across GPUs, searches for optimized variants, and finds the best configuration. You see real-time progress as experiments complete.

Set specific targets if you want:

Optimize for latency, keep cost under $0.10 per request

Optimization takes 2-5 minutes.

Deploy

Deployment requires the Pro plan ($99/mo). The free plan lets you build, optimize, and test in the playground.

Click Deploy in the deploy tab. RunInfra provisions a GPU endpoint with scale-to-zero and fast cold starts (under 2 seconds). Your endpoint URL and API key appear when ready.

Use your endpoint

Your endpoint is OpenAI-compatible. Use any OpenAI SDK:

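from openai import OpenAI

# Point the SDK at your pipeline's endpoint instead of the default OpenAI API.
client = OpenAI(
    base_url="https://api.runinfra.ai/v1/YOUR_PIPELINE_ID",  # replace with your endpoint URL
    api_key="ri_your_api_key",  # replace with your API key
)

# Send a chat completion request to the deployed pipeline.
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is RunInfra?"}],
)

# Print the assistant's reply.
print(response.choices[0].message.content)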
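
If you would rather not install an SDK, the same call can be sketched with Python's standard library. This is a minimal illustration, not an official client. It assumes the endpoint accepts the standard OpenAI-style `POST {base_url}/chat/completions` request with a Bearer token, which is what the SDK sends for the example above. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the URL and key are the same placeholders as before.

```python
import json
import urllib.request

# Placeholders copied from the SDK example; replace with your own values.
BASE_URL = "https://api.runinfra.ai/v1/YOUR_PIPELINE_ID"
API_KEY = "ri_your_api_key"


def build_chat_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        # Assumption: the chat route lives at <base_url>/chat/completions,
        # as it does for the OpenAI SDK.
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        body = json.load(resp)
    # Assumption: the response follows the OpenAI chat completion schema.
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = build_chat_request(BASE_URL, API_KEY, "What is RunInfra?")
    # Inspect the request locally; no network call is made here.
    print(req.get_method(), req.full_url)
    # Once real values are filled in, send it:
    # print(send_chat_request(req))
```

Keeping request construction separate from sending lets you inspect the URL, headers, and body before any traffic reaches your endpoint.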

Next steps

Prompting Guide

Write better prompts, get better pipelines.

Example Prompts

See real conversations for chatbots, summarizers, and more.

Deployment

Flex vs Active, scaling, and more.
