RunInfra turns a plain-English description into a production-ready AI inference endpoint. You don’t need to know which model, GPU, or quantization method to use, the agent handles all of that. This guide walks you through the complete journey, from clarifying your use case to monitoring a live API in production.Documentation Index
Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Start with the use case
Before opening RunInfra, spend a moment clarifying what you’re building:
- What task? Chat, summarization, translation, code generation, Q&A, or classification.
- Who uses it? End users (low latency matters), internal tools (cost matters), or batch jobs (throughput matters).
- How much traffic? 10 requests per day or 10,000 requests per minute, the answer shapes GPU and deployment mode selection.
Describe it in chat
Open Pipes and write a single prompt that covers your use case. Include the task, traffic estimate, latency target, and budget if you have them:The agent builds the pipeline, picks a model, and asks any clarifying questions it needs.
example prompt
Refine through conversation
Don’t try to perfect the pipeline in a single prompt. Iterate with follow-up messages:Each message updates the pipeline in real time. The agent explains its choices so you stay in control.
add multilingual support
add caching
ask for a recommendation
Optimize
When the pipeline looks right, ask the agent to optimize it:The agent profiles GPUs, searches for pre-optimized model variants (AWQ, GPTQ, FP8), applies Forge kernel optimizations, and ranks the results. This takes 2-5 minutes.Review the results in the optimization dashboard. If the numbers don’t meet your constraints, guide the agent:
trigger optimization
adjust optimization target
change optimization goal
Optimization results show real inference metrics, P50/P99 latency, throughput, cost per request, and a quality score, so you can compare variants before committing.
Test in the playground
Before deploying, send test prompts through the built-in playground. Check:
- Does the output quality match your expectations?
- Is the latency acceptable end to end?
- Do edge cases (empty input, very long prompts, non-English text) behave correctly?
Deploy
When you’re satisfied, deploy with a single instruction:Or click Deploy in the deploy tab. Choose Flex (scale-to-zero) for most use cases, you pay only when processing requests. Choose Active (always-on, Team plan) if you need zero cold start.Your endpoint URL and API key appear in 1-3 minutes.
deploy command
Integrate
Drop the endpoint URL and API key into your application. Every RunInfra endpoint is OpenAI-compatible, so you only need to change two values:RunInfra endpoints also work out of the box with LangChain, LlamaIndex, and any other library that supports the OpenAI SDK.
Monitor and iterate
Check the Observe dashboard after your first real traffic:
- Are latency numbers matching what optimization predicted?
- Any errors or unexpected 5xx responses?
- What is the actual cost per request?
Prompting best practices
Write prompts that get the pipeline right on the first try.
Optimization
Understand GPU profiling, quantization search, and Forge kernels.
Monitoring
Explore usage analytics, latency charts, and per-model breakdowns.