The best way to learn what RunInfra’s agent can do is to see real prompts in action. Every example below is something you can type directly into the agent. Each one is followed by an explanation of what the agent builds and why, so you can adapt the pattern to your own use case.

Customer support chatbot

Deploy Llama 3.1 8B as a customer support chatbot.
Optimize for latency, P99 under 200ms.
What the agent does: Creates a pipeline with Llama 3.1 8B, profiles it on L4 and L40S GPUs, searches for an optimized AWQ 4-bit variant, benchmarks it, picks the fastest configuration that meets the latency target, and deploys it as a scale-to-zero endpoint.
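To make the "P99 under 200ms" constraint concrete, here is a minimal sketch of how a P99 check could be computed from benchmark samples. It uses the nearest-rank percentile method. The function names and the sample latencies are illustrative assumptions, not RunInfra's benchmarking code or real measurements.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_target(latencies_ms, target_ms=200.0):
    """True when the P99 latency is strictly under the target."""
    return percentile_nearest_rank(latencies_ms, 99) < target_ms


# Hypothetical latency samples in milliseconds (illustrative only).
latencies_ms = [120.0] * 98 + [180.0, 450.0]
print(percentile_nearest_rank(latencies_ms, 99))  # 180.0
print(meets_latency_target(latencies_ms))         # True
```

A P99 target tolerates a slow outlier in 1% of requests, which is why the single 450 ms sample above does not fail the check.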

Document summarizer

I need a summarization API using Qwen 2.5 14B.
Optimize for cost, max $0.003 per request.
Add a response cache for repeated documents.
What the agent does: Builds a pipeline with Qwen 2.5 14B, a response cache node, and cost-priority optimization. Searches for optimized model variants and picks the cheapest one that stays within your $0.003 per-request limit.
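The idea behind a response cache for repeated documents can be sketched as follows. This is a generic illustration, assuming the cache is keyed by a hash of the document text. The `ResponseCache` class and the `summarize` callable are hypothetical names and do not describe how RunInfra's cache node is implemented.

```python
import hashlib


class ResponseCache:
    """In-memory cache keyed by the SHA-256 hash of the input document."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(document):
        return hashlib.sha256(document.encode("utf-8")).hexdigest()

    def get_or_compute(self, document, summarize):
        """Return the cached summary, calling summarize() only on a miss."""
        key = self._key(document)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = summarize(document)
        self._store[key] = result
        return result


def fake_summarize(document):
    # Stand-in for a model call; it simply truncates the text.
    return document[:20]


cache = ResponseCache()
cache.get_or_compute("Quarterly report text", fake_summarize)
cache.get_or_compute("Quarterly report text", fake_summarize)  # served from cache
print(cache.hits, cache.misses)  # 1 1
```

Each cache hit skips a model call entirely, so the more often the same document is resubmitted, the lower the average cost per request.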

Multi-model routing

Build a pipeline with two models: Phi-3 Mini for simple questions
and Llama 3.1 70B for complex reasoning. Route based on query
complexity. Budget is $300/month.
What the agent does: Creates a router that analyzes query complexity and sends simple queries to the small, cheap model (Phi-3 Mini) and complex ones to the large model (Llama 3.1 70B). Optimizes both models and estimates the monthly cost to check that it fits your $300 budget.
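Complexity-based routing can be pictured with a toy heuristic like the one below. How the agent's router actually scores complexity is not described here, so the scoring rule, the keyword list, and the threshold are all assumptions chosen only to illustrate the pattern.

```python
# Hypothetical keywords that suggest a query needs multi-step reasoning.
REASONING_HINTS = ("why", "explain", "compare", "prove", "step by step")


def complexity_score(query):
    """Score a query from 0.0 to 1.0 using length and reasoning keywords."""
    text = query.lower()
    length_part = min(len(text.split()) / 50, 1.0)
    hint_part = 1.0 if any(hint in text for hint in REASONING_HINTS) else 0.0
    return 0.5 * length_part + 0.5 * hint_part


def route(query, threshold=0.4):
    """Return 'large' for complex queries and 'small' for simple ones."""
    return "large" if complexity_score(query) >= threshold else "small"


print(route("What are your opening hours?"))  # small
print(route("Explain why the two contracts differ and compare their risks"))  # large
```

Routing this way keeps the expensive model for the minority of queries that need it, which is what makes a fixed monthly budget achievable.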

Code generation API

Deploy DeepSeek Coder V2 optimized for throughput.
I need to handle 1000 RPM for our CI pipeline.
What the agent does: Profiles DeepSeek Coder V2, optimizes with throughput as the priority, configures scaling to handle 1000 requests per minute (RPM), and deploys with a replica count sized for that load.
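The replica count follows from simple arithmetic: 1000 RPM is about 16.7 requests per second, divided by what one replica can sustain. The sketch below assumes a hypothetical per-replica throughput and a 20% headroom factor. Both are placeholders, not measured or recommended values.

```python
import math


def replicas_needed(target_rpm, per_replica_rps, headroom=0.2):
    """Replicas required to serve target_rpm with some spare capacity."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    target_rps = target_rpm / 60
    return math.ceil(target_rps * (1 + headroom) / per_replica_rps)


# Assumes one replica sustains 6 requests per second (hypothetical figure).
print(replicas_needed(1000, per_replica_rps=6))  # 4
```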

Low-cost internal tool

Cheapest possible chatbot for an internal FAQ tool.
Under 50 requests per day. Doesn't need to be fast.
What the agent does: Recommends a small model (Phi-3 Mini or Qwen 2.5 3B), finds an optimized AWQ 4-bit variant, deploys on the cheapest GPU tier, and configures scale-to-zero to minimize cost during idle periods.
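To see why scale-to-zero matters at this traffic level, compare paying for a GPU around the clock with paying only while requests are being served. This is a rough sketch: the hourly rate and the busy time per request are hypothetical, and the busy time is assumed to include any warm-up time that gets billed.

```python
def daily_cost_always_on(hourly_rate_usd):
    """Cost of keeping one GPU running for a full day."""
    return 24 * hourly_rate_usd


def daily_cost_scale_to_zero(hourly_rate_usd, requests_per_day, busy_seconds_per_request):
    """Cost when the GPU is billed only while it is serving requests."""
    busy_hours = requests_per_day * busy_seconds_per_request / 3600
    return min(busy_hours, 24) * hourly_rate_usd


# Hypothetical: $1.00 per GPU-hour, 50 requests per day, 36 busy seconds each.
print(daily_cost_always_on(1.0))              # 24.0
print(daily_cost_scale_to_zero(1.0, 50, 36))  # 0.5
```

The trade-off is a cold start on the first request after an idle period, which is acceptable here because the tool does not need to be fast.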

Multilingual translation

I need a translation endpoint that handles English, Spanish, French,
German, and Japanese. Use @Qwen-2.5-7B since it's good at multilingual.
Optimize for quality.
What the agent does: Builds a Qwen 2.5 7B pipeline with quality-priority optimization. Searches for FP8 and other high-quality optimized variants to preserve multilingual accuracy, and only accepts variants with a quality score above 0.95.
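A quality threshold acts as a filter before any other ranking. The sketch below assumes that, among variants scoring above 0.95, the fastest one is chosen. That tie-breaking rule, the `Variant` type, and every number in the candidate list are illustrative placeholders rather than real benchmark results or the agent's actual selection logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    quality: float      # score relative to the unoptimized model
    latency_ms: float


def pick_variant(variants, min_quality=0.95):
    """Return the fastest variant whose quality exceeds the threshold."""
    eligible = [v for v in variants if v.quality > min_quality]
    if not eligible:
        return None
    return min(eligible, key=lambda v: v.latency_ms)


# Placeholder candidates; these are not measured scores.
candidates = [
    Variant("fp16", quality=0.99, latency_ms=40.0),
    Variant("fp8", quality=0.97, latency_ms=28.0),
    Variant("awq-4bit", quality=0.93, latency_ms=20.0),
]
print(pick_variant(candidates).name)  # fp8
```

In this example the 4-bit variant is the fastest but is rejected because it falls below the quality floor.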

Batch processing

Set up Mistral Small 22B for batch document processing.
I'll send 10,000 documents per day. Optimize for throughput
and keep total cost under $50/day.
What the agent does: Configures Mistral Small 22B with throughput priority, calculates the GPU tier and replica count needed for 10K docs/day within your $50 budget, and finds an optimized variant to reduce per-request cost.
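The budget in this prompt implies a per-document ceiling: $50 spread over 10,000 documents is $0.005 per document. A small sketch of that arithmetic follows. The function names are illustrative, and the even 24-hour spread used for the throughput figure is an assumption.

```python
def per_document_budget(daily_budget_usd, docs_per_day):
    """Maximum spend per document that keeps the day within budget."""
    return daily_budget_usd / docs_per_day


def sustained_docs_per_second(docs_per_day, window_hours=24.0):
    """Average throughput needed if documents are spread over the window."""
    return docs_per_day / (window_hours * 3600)


def fits_budget(cost_per_doc_usd, daily_budget_usd, docs_per_day):
    """True when the projected daily spend stays within the budget."""
    return cost_per_doc_usd * docs_per_day <= daily_budget_usd


print(per_document_budget(50, 10_000))            # 0.005
print(round(sustained_docs_per_second(10_000), 3))  # 0.116
print(fits_budget(0.004, 50, 10_000))             # True
```

If the documents arrive in a shorter window, pass a smaller `window_hours` to see how much the required throughput rises.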

Maximum performance

I want the absolute fastest inference for Llama 3.1 70B.
Use an H100 with TensorRT-LLM. Cost doesn't matter.
What the agent does: Configures Llama 3.1 70B on an H100 with the TensorRT-LLM backend, finds an FP8 variant (FP8 is supported natively on H100), and enables speculative decoding. Deploys as an always-on endpoint so there are no cold starts.
TensorRT-LLM support requires the Team plan.

Starting from scratch

If you don’t know which model to use, describe what you need and let the agent decide.
I'm building a chatbot for recipe recommendations.
What model would you suggest? I want it cheap and fast.
What the agent does: Recommends a small, cost-effective model (likely Phi-3 Mini or Qwen 2.5 3B based on the simple use case), explains the reasoning, and offers to build the pipeline once you confirm.
You don’t have to know anything about model sizes or GPU types to get started. Describing your use case and what matters most (cost, speed, quality) is enough for the agent to make a solid recommendation.

Refining an existing pipeline

After the agent builds something, keep iterating. The agent remembers the full conversation and updates the pipeline with each message.
The latency is too high, can you try a different GPU?
Switch from AWQ to FP8 and re-optimize
Add a guardrail to filter harmful content
Compare this version with the previous one
Each of these messages updates the pipeline without starting over. You can compare any two versions side by side to see exactly what changed and which one performs better.
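Conceptually, a side-by-side comparison is a diff between two pipeline configurations. The sketch below shows the idea with plain dictionaries. The field names and values are hypothetical and do not reflect RunInfra's actual pipeline schema.

```python
def diff_versions(old, new):
    """Return {field: (old_value, new_value)} for every field that changed."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes


# Hypothetical pipeline versions before and after the messages above.
v1 = {"gpu": "L4", "quantization": "AWQ"}
v2 = {"gpu": "L40S", "quantization": "FP8", "guardrail": "content-filter"}

for field, (before, after) in diff_versions(v1, v2).items():
    print(f"{field}: {before} -> {after}")
```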

Next steps

Prompting best practices

The four elements every strong prompt should include.

Debug the agent

Redirect the agent when a pipeline needs course correction.

End-to-end guide

From idea to live production API, step by step.

Supported models

LLMs, speech-to-text, and text-to-speech available today.