Idea to pipeline - RunInfra

RunInfra turns a plain-English description into a production-ready AI inference endpoint. You don’t need to know which model, GPU, or quantization method to use; the agent handles all of that. This guide walks you through the complete journey, from clarifying your use case to monitoring a live API in production.

Start with the use case

Before opening RunInfra, spend a moment clarifying what you’re building:

What task? Chat, summarization, translation, code generation, Q&A, or classification.
Who uses it? End users (low latency matters), internal tools (cost matters), or batch jobs (throughput matters).
How much traffic? 10 requests per day or 10,000 requests per minute? The answer shapes GPU and deployment mode selection.

You don’t need to answer every question perfectly. Rough estimates are enough to get started. You can always refine later through conversation.

Describe it in chat

Open the dashboard and write a single prompt that covers your use case. Include the task, traffic estimate, latency target, and budget if you have them:

example prompt

I'm building a customer FAQ chatbot for our e-commerce site.
Needs to handle 200 requests per minute. Keep latency under 150ms.
Budget is $300/month.

The agent builds the pipeline, picks a model, and asks any clarifying questions it needs.
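If you write prompts like this often, the four details can be kept as structured fields and assembled into text. The following is a minimal sketch, not part of RunInfra: the `UseCase` class and `build_prompt` function are hypothetical names introduced here, and the wording simply mirrors the example prompt above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UseCase:
    """Hypothetical container for the details the prompt should mention."""
    task: str
    requests_per_minute: Optional[int] = None
    max_latency_ms: Optional[int] = None
    monthly_budget_usd: Optional[int] = None


def build_prompt(uc: UseCase) -> str:
    """Assemble a short prompt, skipping any detail that is unknown."""
    parts = [f"I'm building {uc.task}."]
    if uc.requests_per_minute is not None:
        parts.append(f"Needs to handle {uc.requests_per_minute} requests per minute.")
    if uc.max_latency_ms is not None:
        parts.append(f"Keep latency under {uc.max_latency_ms}ms.")
    if uc.monthly_budget_usd is not None:
        parts.append(f"Budget is ${uc.monthly_budget_usd}/month.")
    return " ".join(parts)


prompt = build_prompt(UseCase(
    task="a customer FAQ chatbot for our e-commerce site",
    requests_per_minute=200,
    max_latency_ms=150,
    monthly_budget_usd=300,
))
print(prompt)
```

Fields left as `None` are omitted, which matches the advice that rough or partial estimates are enough to get started.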

Refine through conversation

Don’t try to perfect the pipeline in a single prompt. Iterate with follow-up messages:

add multilingual support

Actually, make it multilingual, we have Spanish and French customers too.

add caching

Add a response cache for common questions.

ask for a recommendation

What model do you recommend for this?

Each message updates the pipeline in real time. The agent explains its choices so you stay in control.

Guardrail, rate limiter, load balancer, and cache nodes are design placeholders today. They record intent and carry a “Not enforced” badge on the canvas rather than enforcing behavior at serving time.

Optimize

When the pipeline looks right, ask the agent to optimize it:

trigger optimization

Optimize for latency.

The agent profiles GPUs, searches for pre-optimized model variants (AWQ, GPTQ, FP8), applies Forge kernel optimizations, and ranks the results. This takes 2-5 minutes.

Review the results in the optimization dashboard. If the numbers don’t meet your constraints, guide the agent:

adjust optimization target

The cost is too high. Can you try a smaller model?

change optimization goal

Try optimizing for cost instead of latency.

Optimization results show real inference metrics (P50/P99 latency, throughput, cost per request, and a quality score) so you can compare variants before committing.
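To make the latency figures concrete: P50 is the latency that half of requests stay at or below, and P99 is the latency that 99% of requests stay at or below. The sketch below uses the nearest-rank definition on made-up samples. How RunInfra computes its own percentiles is not documented here, so treat the method (and the `percentile` helper) as an assumption for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [40] * 50 + [60] * 48 + [300, 900]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99)  # prints "40 300"
```

The gap between the two values is why both are worth reading: a variant can look fast at P50 while a small share of requests is far slower.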

Test in the playground

Before deploying, send test prompts through the built-in playground. Check:

Does the output quality match your expectations?
Is the latency acceptable end to end?
Do edge cases (empty input, very long prompts, non-English text) behave correctly?

Catch quality issues here, not in production. Switching models or variants after deployment requires a redeployment.
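The playground is interactive, but the same checklist can also be written down so it is repeatable. The sketch below is one possible shape, under stated assumptions: `send` is a hypothetical callable that takes a prompt and returns the reply text, and `fake_send` is a stand-in so the example runs offline. Neither is a RunInfra API.

```python
import time


def edge_case_prompts():
    """Illustrative inputs for the three edge cases listed above."""
    return {
        "empty input": "",
        "very long prompt": "Where is my order? " * 2000,
        "non-English text": "¿Dónde está mi pedido?",
    }


def run_checks(send, max_latency_s):
    """Call send(prompt) for each case; record latency, errors, and a pass flag."""
    results = {}
    for name, prompt in edge_case_prompts().items():
        start = time.perf_counter()
        try:
            reply = send(prompt)
            error = None
        except Exception as exc:  # record the failure instead of stopping the run
            reply, error = "", str(exc)
        elapsed = time.perf_counter() - start
        results[name] = {
            "ok": error is None and bool(reply.strip()) and elapsed <= max_latency_s,
            "latency_s": elapsed,
            "error": error,
        }
    return results


# Stand-in for a real client (assumption): rejects empty input, answers otherwise.
def fake_send(prompt):
    if not prompt:
        raise ValueError("empty prompt")
    return "Your order is on its way."


report = run_checks(fake_send, max_latency_s=1.0)
for name, outcome in report.items():
    print(name, outcome["ok"], outcome["error"])
```

A pass here only means a non-empty reply arrived in time. Judging output quality still needs a human read of the responses.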

Deploy

When you’re satisfied, deploy with a single instruction:

deploy command

Deploy this.

Or click Deploy in the deploy tab. Choose Flex (scale-to-zero) for most use cases; you pay only when processing requests. Choose Active (always-on, paid Core plan) if you need zero cold start.

Your endpoint URL and API key appear in 1-3 minutes.
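With a scale-to-zero deployment, the first request after an idle period may arrive while the endpoint is still starting. How a cold start surfaces to the caller (a slow response, a timeout, or an error) is not specified in this guide, so the following is a generic sketch that assumes it can show up as an exception. `call_with_retry` and `flaky_request` are hypothetical names, and the retry counts and delays are arbitrary.

```python
import time


def call_with_retry(fn, attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay_s, then 2x, 4x, ... and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** attempt)


# Stand-in for a request that fails twice while the endpoint warms up.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("endpoint still starting")
    return "ok"


waits = []
# Record the waits instead of sleeping so the example finishes instantly.
result = call_with_retry(flaky_request, sleep=waits.append)
print(result, waits)  # prints "ok [1.0, 2.0]"
```

In real code, catch only the timeout and connection errors your client raises rather than every `Exception`, so that genuine request errors fail fast.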

Integrate

Drop the endpoint URL and API key into your application. For supported OpenAI-compatible endpoints, configure the RunInfra base URL, key, and deployed model ID:

# Python: call the endpoint through the openai SDK.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",  # replace with the key shown after deployment
)

response = client.chat.completions.create(
    model=os.environ["RUNINFRA_MODEL"],  # deployed model ID, read from the environment
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

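// JavaScript: run as an ES module so that top-level await is available.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.runinfra.ai/v1",
  apiKey: "YOUR_RUNINFRA_API_KEY", // replace with the key shown after deployment
});

const response = await client.chat.completions.create({
  model: process.env.RUNINFRA_MODEL, // deployed model ID, read from the environment
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);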

# curl: the ${RUNINFRA_MODEL:?...} expansion stops with the given message if the variable is unset or empty.
curl https://api.runinfra.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_RUNINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"${RUNINFRA_MODEL:?Set RUNINFRA_MODEL from GET /v1/models}\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}]}"

RunInfra endpoints also work with the documented LangChain and LlamaIndex integrations, plus clients that let you set a custom OpenAI-compatible base URL for the supported endpoint you call.
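The curl example's error message points to `GET /v1/models` as the place to find the deployed model ID. A minimal sketch of that lookup with the Python standard library follows. It assumes the response uses the OpenAI-style list shape (`{"data": [{"id": ...}]}`), which this guide does not confirm; `model_ids` and `fetch_model_ids` are hypothetical helper names, and `example-model` is a made-up ID.

```python
import json
import urllib.request

BASE_URL = "https://api.runinfra.ai/v1"


def model_ids(payload):
    """Extract model IDs, assuming the OpenAI-style shape {"data": [{"id": ...}]}."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(api_key):
    """GET /v1/models with a bearer token and return the model IDs."""
    request = urllib.request.Request(
        f"{BASE_URL}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return model_ids(json.load(response))


# Offline check with a made-up payload; "example-model" is not a real model ID.
sample = json.loads('{"object": "list", "data": [{"id": "example-model", "object": "model"}]}')
print(model_ids(sample))  # prints "['example-model']"
```

Calling `fetch_model_ids("YOUR_RUNINFRA_API_KEY")` with a real key would perform the network request; export the ID you want as `RUNINFRA_MODEL` for the examples above.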

Monitor and iterate

Check the Deployments metrics overview after your first real traffic:

Are latency numbers matching what optimization predicted?
Any errors or unexpected 5xx responses?
What is the actual cost per request?

If something needs adjustment, go back to the dashboard chat and ask the agent. You can re-optimize, switch GPU tiers, or swap models at any time without changing your integration code.
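To judge whether the observed cost per request is acceptable, it helps to turn the monthly budget into a per-request ceiling. The arithmetic below reuses the numbers from the example prompt (200 requests per minute, $300/month) and assumes, as a simplification, that the rate is sustained around the clock for a 30-day month. Real traffic is rarely flat, so read the result as a rough bound. The function names are illustrative only.

```python
def monthly_requests(requests_per_minute, days=30):
    """Requests in a month if the per-minute rate were sustained around the clock."""
    return requests_per_minute * 60 * 24 * days


def budget_per_request(monthly_budget_usd, requests_per_minute):
    """Highest average cost per request that still fits the monthly budget."""
    return monthly_budget_usd / monthly_requests(requests_per_minute)


def within_budget(observed_cost_per_request, monthly_budget_usd, requests_per_minute):
    """True if the observed cost per request fits under the budget ceiling."""
    return observed_cost_per_request <= budget_per_request(monthly_budget_usd, requests_per_minute)


total = monthly_requests(200)
ceiling = budget_per_request(300, 200)
print(total)  # prints "8640000"
print(f"${ceiling * 1000:.4f} per 1,000 requests")  # prints "$0.0347 per 1,000 requests"
```

If the cost per request shown in the metrics overview is above this ceiling at your real traffic level, that is a signal to ask the agent to re-optimize for cost or try a smaller model.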

Prompting best practices

Write prompts that get the pipeline right on the first try.

Optimization

Understand GPU profiling, quantization search, and Forge kernels.

Monitoring

Explore usage analytics, latency charts, and per-model breakdowns.