RunInfra turns a plain-English description into a production-ready AI inference endpoint. You don’t need to know which model, GPU, or quantization method to use; the agent handles all of that. This guide walks you through the complete journey, from clarifying your use case to monitoring a live API in production.
1. Start with the use case

Before opening RunInfra, spend a moment clarifying what you’re building:
  • What task? Chat, summarization, translation, code generation, Q&A, or classification.
  • Who uses it? End users (low latency matters), internal tools (cost matters), or batch jobs (throughput matters).
  • How much traffic? 10 requests per day or 10,000 requests per minute? The answer shapes GPU and deployment mode selection.
You don’t need to answer every question perfectly. Rough estimates are enough to get started. You can always refine later through conversation.
2. Describe it in chat

Open Pipes and write a single prompt that covers your use case. Include the task, traffic estimate, latency target, and budget if you have them:
example prompt
I'm building a customer FAQ chatbot for our e-commerce site.
Needs to handle 200 requests per minute. Keep latency under 150ms.
Budget is $300/month.
The agent builds the pipeline, picks a model, and asks any clarifying questions it needs.
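Before sending the prompt, it can help to check whether the traffic and budget figures are compatible. The sketch below is a hypothetical helper (`UseCase` and its methods are not part of RunInfra) that assembles a prompt in the same shape as the example above and derives a rough per-request budget ceiling. It assumes the stated rate is sustained around the clock over a 30-day month, which is a worst case for most workloads.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Rough requirements for the first prompt (hypothetical helper, not a RunInfra API)."""
    task: str
    requests_per_minute: int
    latency_ms: int
    monthly_budget_usd: float

    def to_prompt(self) -> str:
        # Same ingredients as the example prompt: task, traffic, latency, budget.
        return (
            f"I'm building {self.task}. "
            f"Needs to handle {self.requests_per_minute} requests per minute. "
            f"Keep latency under {self.latency_ms}ms. "
            f"Budget is ${self.monthly_budget_usd:.0f}/month."
        )

    def max_cost_per_request(self, days: int = 30) -> float:
        # Assumption: the stated rate is sustained 24 hours a day for `days` days.
        monthly_requests = self.requests_per_minute * 60 * 24 * days
        return self.monthly_budget_usd / monthly_requests


faq_bot = UseCase(
    task="a customer FAQ chatbot for our e-commerce site",
    requests_per_minute=200,
    latency_ms=150,
    monthly_budget_usd=300,
)
print(faq_bot.to_prompt())
print(f"{faq_bot.max_cost_per_request():.7f} USD per request at sustained peak")
```

At 200 requests per minute sustained, a month holds 8,640,000 requests, so a $300 budget leaves roughly $0.000035 per request. If real traffic is bursty rather than constant, the ceiling is correspondingly higher.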
3. Refine through conversation

Don’t try to perfect the pipeline in a single prompt. Iterate with follow-up messages:
add multilingual support
Actually, make it multilingual, we have Spanish and French customers too.
add caching
Add a response cache for common questions.
ask for a recommendation
What model do you recommend for this?
Each message updates the pipeline in real time. The agent explains its choices so you stay in control.
4. Optimize

When the pipeline looks right, ask the agent to optimize it:
trigger optimization
Optimize for latency.
The agent profiles GPUs, searches for pre-optimized model variants (AWQ, GPTQ, FP8), applies Forge kernel optimizations, and ranks the results. This takes 2-5 minutes.
Review the results in the optimization dashboard. If the numbers don’t meet your constraints, guide the agent:
adjust optimization target
The cost is too high. Can you try a smaller model?
change optimization goal
Try optimizing for cost instead of latency.
Optimization results show real inference metrics (P50/P99 latency, throughput, cost per request, and a quality score), so you can compare variants before committing.
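If you copy the dashboard numbers into a script, comparing variants against your constraints can be sketched as a filter followed by a sort. Everything here is an assumption for illustration: the `Variant` fields, the variant names, and the numbers are placeholders, not real RunInfra output, and the quality score is assumed to be "higher is better".

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    p99_latency_ms: float
    throughput_rps: float
    cost_per_request_usd: float
    quality: float  # assumed: higher is better


def shortlist(variants, max_p99_ms, max_cost_usd):
    """Keep variants that satisfy both constraints, best quality first."""
    ok = [
        v for v in variants
        if v.p99_latency_ms <= max_p99_ms and v.cost_per_request_usd <= max_cost_usd
    ]
    # Ties on quality are broken by lower cost per request.
    return sorted(ok, key=lambda v: (-v.quality, v.cost_per_request_usd))


# Placeholder numbers for illustration only, not real optimization results.
candidates = [
    Variant("variant-a", 120.0, 40.0, 0.00003, 0.91),
    Variant("variant-b", 180.0, 55.0, 0.00002, 0.93),  # misses the latency target
    Variant("variant-c", 95.0, 60.0, 0.00005, 0.88),   # misses the cost target
    Variant("variant-d", 140.0, 35.0, 0.00003, 0.89),
]

for v in shortlist(candidates, max_p99_ms=150, max_cost_usd=0.000035):
    print(v.name, v.p99_latency_ms, v.cost_per_request_usd, v.quality)
```

Filtering on hard constraints first and ranking on quality second keeps a high-scoring but too-slow variant from winning the comparison.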
5. Test in the playground

Before deploying, send test prompts through the built-in playground. Check:
  • Does the output quality match your expectations?
  • Is the latency acceptable end to end?
  • Do edge cases (empty input, very long prompts, non-English text) behave correctly?
Catch quality issues here, not in production. Switching models or variants after deployment requires a redeployment.
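The playground is interactive, but the same checklist can be kept as a small script and re-run after each redeployment. The sketch below is one way to do that under stated assumptions: `run_checks` and `fake_send` are hypothetical names, `send` stands for any function that takes a prompt and returns the reply text, and the 150 ms default is taken from the example prompt rather than from any RunInfra setting.

```python
import time

# Edge cases named in the checklist: empty input, a very long prompt, non-English text.
EDGE_CASES = {
    "empty": "",
    "very_long": "Where is my order? " * 2000,
    "non_english": "¿Dónde está mi pedido?",
}


def run_checks(send, cases=EDGE_CASES, max_latency_ms=150.0):
    """Call send(prompt) for each case and record latency, errors, and empty replies."""
    report = {}
    for name, prompt in cases.items():
        start = time.perf_counter()
        try:
            reply = send(prompt)
            error = None
        except Exception as exc:  # record the failure instead of stopping the run
            reply, error = None, repr(exc)
        elapsed_ms = (time.perf_counter() - start) * 1000
        report[name] = {
            "ok": error is None and bool(reply) and elapsed_ms <= max_latency_ms,
            "latency_ms": elapsed_ms,
            "error": error,
        }
    return report


def fake_send(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    if not prompt:
        raise ValueError("empty prompt")
    return "stub reply"


for name, result in run_checks(fake_send).items():
    print(name, result["ok"], result["error"])
```

To run it against a live endpoint, replace `fake_send` with a function that wraps your own client call. Output quality still needs a human read; the script only flags errors, empty replies, and slow responses.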
6. Deploy

When you’re satisfied, deploy with a single instruction:
deploy command
Deploy this.
Or click Deploy in the deploy tab. Choose Flex (scale-to-zero) for most use cases; you pay only when processing requests. Choose Active (always-on, Team plan) if you need zero cold start.
Your endpoint URL and API key appear in 1-3 minutes.
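With a scale-to-zero deployment, the first request after an idle period may arrive while capacity is still starting. How that surfaces to a client (a slow response, a timeout, or a transient error) is an assumption here, not documented behavior. A minimal sketch of a client-side retry with exponential backoff, using a hypothetical `call_with_retry` helper:

```python
import time


def call_with_retry(fn, attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay_s, then 2x, 4x, ... before retrying."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay_s * 2 ** attempt)


# Demo with a stand-in function that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("still warming up")
    return "ready"


delays = []  # record the waits instead of actually sleeping
print(call_with_retry(flaky, sleep=delays.append), delays)
```

In real use, `fn` would wrap your request and you would keep the default `sleep`. The OpenAI Python SDK also accepts `timeout` and `max_retries` when constructing the client, which may be enough on its own.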
7. Integrate

Drop the endpoint URL and API key into your application. Every RunInfra endpoint is OpenAI-compatible, so you only need to change two values, the base URL and the API key:
from openai import OpenAI

# Point the standard OpenAI client at your RunInfra endpoint.
client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

# Send a single-turn chat request and print the reply text.
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
RunInfra endpoints also work out of the box with LangChain, LlamaIndex, and any other library that can talk to an OpenAI-compatible API.
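If you would rather not add the SDK as a dependency, an OpenAI-compatible endpoint can in principle be called over plain HTTP. The sketch below uses only the Python standard library and assumes the endpoint follows the usual OpenAI conventions: a `/chat/completions` route under the base URL and a bearer token in the `Authorization` header. The `RUNINFRA_API_KEY` environment variable and both helper functions are hypothetical names chosen for this example.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.runinfra.ai/v1"


def build_chat_request(prompt, api_key, model="default"):
    """Build (but do not send) a POST to the chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(prompt, api_key, timeout_s=30):
    """Send the request and return the first choice's text."""
    request = build_chat_request(prompt, api_key)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Hypothetical variable name; keep the key out of source control.
    key = os.environ.get("RUNINFRA_API_KEY")
    if key:
        print(chat("Hello!", key))
```

Reading the key from the environment rather than hard-coding it also makes it easier to rotate keys without touching application code.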
8. Monitor and iterate

Check the Observe dashboard after your first real traffic:
  • Are latency numbers matching what optimization predicted?
  • Any errors or unexpected 5xx responses?
  • What is the actual cost per request?
If something needs adjustment, go back to the Pipes chat and ask the agent. You can re-optimize, switch GPU tiers, or swap models at any time without changing your integration code.
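The three dashboard questions can also be checked against your own application logs. The sketch below assumes you have exported a list of `(latency_ms, http_status)` pairs; the function names, the nearest-rank percentile method, and the sample values are choices made for this example, not how the Observe dashboard computes its figures.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests, predicted_p99_ms):
    """requests: list of (latency_ms, http_status) pairs from your own logs."""
    latencies = [latency for latency, _ in requests]
    errors = sum(1 for _, status in requests if status >= 500)
    p99 = percentile(latencies, 99)
    return {
        "p50_ms": percentile(latencies, 50),
        "p99_ms": p99,
        "error_rate": errors / len(requests),
        "p99_within_prediction": p99 <= predicted_p99_ms,
    }


# Placeholder traffic for illustration only.
sample = [(80, 200), (95, 200), (110, 200), (130, 200), (400, 503)]
print(summarize(sample, predicted_p99_ms=150))
```

With only a handful of samples, P99 is simply the slowest request, so collect a meaningful volume of traffic before comparing it with the optimization prediction.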

Prompting best practices

Write prompts that get the pipeline right on the first try.

Optimization

Understand GPU profiling, quantization search, and Forge kernels.

Monitoring

Explore usage analytics, latency charts, and per-model breakdowns.