Example prompts - RunInfra

The best way to learn what RunInfra’s agent can do is to see real prompts in action. Every example below is something you can type directly into the agent. Each one is followed by an explanation of what the agent builds and why, so you can adapt the pattern to your own use case.

Customer support chatbot

Deploy Llama 3.1 8B as a customer support chatbot.
Optimize for latency, P99 under 200ms.

What the agent does: Creates a pipeline with Llama 3.1 8B, profiles it on L4 and L40S GPUs, searches for compatible low-VRAM variants, benchmarks them, picks the fastest configuration, and deploys it as a scale-to-zero endpoint.
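To make the "P99 under 200ms" constraint concrete, here is a minimal sketch of how a P99 target can be checked against a set of benchmark latencies. It uses the nearest-rank percentile definition; the function names and the sample data are hypothetical and are not taken from RunInfra's benchmarking code.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_target(samples_ms, target_ms=200.0, pct=99):
    """True when the chosen percentile is strictly under the target."""
    return percentile(samples_ms, pct) < target_ms


# Hypothetical benchmark run: 98 fast requests, one slower, one outlier.
latencies = [120.0] * 98 + [180.0, 450.0]
print(percentile(latencies, 99))        # 180.0
print(meets_latency_target(latencies))  # True
```

A single outlier does not break a P99 target on 100 samples, which is why P99 is usually a more forgiving constraint than a hard maximum.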

Document summarizer

I need a summarization API using Qwen 2.5 14B.
Optimize for cost, max $0.003 per request.
Add a response cache for repeated documents.

What the agent does: Builds a pipeline with Qwen 2.5 14B, a response cache node, and cost-priority optimization. Searches for optimized model variants and picks the cheapest that meets your constraints.
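The idea behind a response cache for repeated documents can be sketched as below. This is an illustrative in-memory version that keys on a hash of the document text; `ResponseCache` and `fake_summarize` are hypothetical names, and nothing here describes how RunInfra's cache node is implemented.

```python
import hashlib


class ResponseCache:
    """In-memory cache keyed on a SHA-256 hash of the document text."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(document):
        return hashlib.sha256(document.encode("utf-8")).hexdigest()

    def get_or_compute(self, document, summarize):
        """Return the cached summary, or call summarize(document) once and store it."""
        key = self._key(document)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = summarize(document)
        self._store[key] = result
        return result


# Stand-in for a real model call (assumption for illustration only).
calls = []

def fake_summarize(document):
    calls.append(document)
    return document[:20]


cache = ResponseCache()
cache.get_or_compute("Quarterly report text", fake_summarize)
cache.get_or_compute("Quarterly report text", fake_summarize)
print(len(calls), cache.hits, cache.misses)  # 1 1 1
```

Only cache misses reach the model, so the average cost per request falls as the share of repeated documents rises.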

Multi-model routing

Build a pipeline with two models: Phi-3 Mini for simple questions
and Llama 3.1 70B for complex reasoning. Route based on query
complexity. Budget is $300/month.

What the agent does: Creates a router that analyzes query complexity, sending simple queries to the small, cheaper model and complex ones to the large model. It optimizes both models and estimates the monthly cost to check that it fits your budget.
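The routing pattern above can be sketched roughly as follows. The complexity heuristic (query length plus a few reasoning keywords), the model identifiers, and the per-request prices are all assumptions chosen for illustration; the router the agent builds may use a different signal entirely.

```python
SMALL_MODEL = "phi-3-mini"     # hypothetical identifier
LARGE_MODEL = "llama-3.1-70b"  # hypothetical identifier
REASONING_HINTS = ("why", "explain", "compare", "prove", "step by step")


def pick_model(query, max_simple_words=20):
    """Route long or reasoning-heavy queries to the large model."""
    text = query.lower()
    is_long = len(text.split()) > max_simple_words
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    return LARGE_MODEL if (is_long or needs_reasoning) else SMALL_MODEL


def estimated_monthly_cost(requests_per_month, large_share, small_cost, large_cost):
    """Blend per-request costs by the share of traffic sent to the large model."""
    large_requests = requests_per_month * large_share
    small_requests = requests_per_month - large_requests
    return small_requests * small_cost + large_requests * large_cost


print(pick_model("What are your opening hours?"))               # phi-3-mini
print(pick_model("Explain why these two clauses conflict."))    # llama-3.1-70b

# Assumed traffic and prices: 100k requests, 20% complex, $0.001 vs $0.01 per request.
cost = estimated_monthly_cost(100_000, 0.2, 0.001, 0.01)
print(cost <= 300)  # True, roughly $280/month under these assumptions
```

The budget check is dominated by the share of traffic routed to the large model, so that share is the number worth measuring first.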

Code generation API

Deploy DeepSeek Coder V2 optimized for throughput.
I need to handle 1000 RPM for our CI pipeline.

What the agent does: Profiles DeepSeek Coder V2, optimizes for throughput priority, configures scaling to handle 1000 RPM, and deploys with an appropriate replica count.
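As a rough sketch of how a replica count follows from a throughput target: divide the target requests per minute, plus some headroom, by the sustained throughput of one replica. The per-replica figure of 350 RPM and the 20% headroom below are made-up numbers, not measured values for DeepSeek Coder V2.

```python
import math


def replicas_needed(target_rpm, per_replica_rpm, headroom=0.2):
    """Replicas required to serve target_rpm with a safety margin for bursts."""
    if per_replica_rpm <= 0:
        raise ValueError("per_replica_rpm must be positive")
    return max(1, math.ceil(target_rpm * (1 + headroom) / per_replica_rpm))


# Assumed: one replica sustains 350 requests per minute in profiling.
print(replicas_needed(1000, 350))                # 4
print(replicas_needed(1000, 350, headroom=0.0))  # 3
```

The per-replica throughput is the input that profiling supplies; everything else is arithmetic.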

Low-cost internal tool

Cheapest possible chatbot for an internal FAQ tool.
Under 50 requests per day. Doesn't need to be fast.

What the agent does: Recommends a small model (Phi-3 Mini or Qwen 2.5 3B), finds a compatible low-VRAM variant, deploys on the cheapest GPU tier, and configures scale-to-zero to minimize cost during idle periods.
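To see why scale-to-zero matters at this traffic level, compare the daily cost of an always-on GPU with one that is billed only while active. The hourly price and the 60 seconds of billed time per request (cold start, inference, and idle timeout combined) are assumptions for illustration, and the sketch assumes requests do not overlap.

```python
def daily_cost_always_on(gpu_hourly_usd):
    """Cost of keeping one GPU running for a full day."""
    return gpu_hourly_usd * 24


def daily_cost_scale_to_zero(gpu_hourly_usd, requests_per_day, active_seconds_per_request):
    """Cost when the GPU is billed only while serving (capped at 24 hours)."""
    active_hours = requests_per_day * active_seconds_per_request / 3600
    return gpu_hourly_usd * min(active_hours, 24.0)


# Assumed: $0.80/hour GPU, 50 requests/day, 60 billed seconds per request.
always_on = daily_cost_always_on(0.80)
scaled = daily_cost_scale_to_zero(0.80, 50, 60)
print(f"always-on: ${always_on:.2f}/day, scale-to-zero: ${scaled:.2f}/day")
```

At 50 requests per day the endpoint is idle for most of the day, so nearly all of the always-on cost is paid for unused capacity.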

Multilingual translation

I need a translation endpoint that handles English, Spanish, French,
German, and Japanese. Use @Qwen-2.5-7B since it's good at multilingual.
Optimize for quality.

What the agent does: Builds a Qwen 2.5 7B pipeline with quality-priority optimization. Searches for FP8 or other compatible variants that preserve multilingual accuracy, and keeps the safer baseline when quality evidence is missing or shows a regression.
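The "keep the safer baseline" rule described above amounts to a small decision function. The sketch below assumes quality is summarized as a single score where higher is better; the function name, the score scale, and the tolerance parameter are hypothetical.

```python
from typing import Optional


def choose_variant(baseline_score: float,
                   candidate_score: Optional[float],
                   tolerance: float = 0.0) -> str:
    """Prefer the optimized candidate only when quality evidence exists and holds up."""
    if candidate_score is None:
        return "baseline"  # no quality evidence for the candidate
    if candidate_score < baseline_score - tolerance:
        return "baseline"  # candidate regresses beyond the allowed tolerance
    return "candidate"


print(choose_variant(0.82, None))                   # baseline
print(choose_variant(0.82, 0.78))                   # baseline
print(choose_variant(0.82, 0.815, tolerance=0.01))  # candidate
```

Treating missing evidence the same as a regression is the conservative choice when quality is the stated priority.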

Batch processing

Set up Mistral Small 22B for batch document processing.
I'll send 10,000 documents per day. Optimize for throughput
and keep total cost under $50/day.

What the agent does: Configures Mistral Small 22B with throughput priority, calculates the GPU tier and replica count needed for 10K docs/day within your $50 budget, and finds an optimized variant to reduce per-request cost.
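The budget in this prompt translates into two numbers worth knowing before you send it: the maximum cost per document and the average throughput required. The sketch below works those out from the figures in the prompt; the GPU hourly prices passed to `fits_budget` are hypothetical.

```python
DOCS_PER_DAY = 10_000
DAILY_BUDGET_USD = 50.0
SECONDS_PER_DAY = 86_400

# $50 spread over 10,000 documents is half a cent per document.
max_cost_per_doc = DAILY_BUDGET_USD / DOCS_PER_DAY

# Spread evenly over a day, that is about 0.116 documents per second.
avg_docs_per_second = DOCS_PER_DAY / SECONDS_PER_DAY


def fits_budget(gpu_hourly_usd, replicas, hours_per_day=24.0):
    """True when the GPU bill for the day stays within the daily budget."""
    return gpu_hourly_usd * replicas * hours_per_day <= DAILY_BUDGET_USD


print(f"max cost per document: ${max_cost_per_doc:.4f}")
print(f"average load: {avg_docs_per_second:.3f} docs/s")
print(fits_budget(1.00, 2))  # True: $48/day
print(fits_budget(1.50, 2))  # False: $72/day
```

If documents arrive in bursts rather than evenly, the peak rate rather than the daily average determines how many replicas are needed.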

Maximum performance

I want the absolute fastest inference for Llama 3.1 70B.
Use an H100 with TensorRT-LLM. Cost doesn't matter.

What the agent does: Configures Llama 3.1 70B on an H100 with the TensorRT-LLM backend, finds an FP8 variant (FP8 is supported natively on H100), and enables speculative decoding. Deploys as an always-on endpoint so there are no cold starts.

TensorRT-LLM support requires a paid Core plan.

Starting from scratch

If you don’t know which model to use, describe what you need and let the agent decide.

I'm building a chatbot for recipe recommendations.
What model would you suggest? I want it cheap and fast.

What the agent does: Recommends a small, cost-effective model (likely Phi-3 Mini or Qwen 2.5 3B based on the simple use case), explains the reasoning, and offers to build the pipeline once you confirm.

You don’t have to know anything about model sizes or GPU types to get started. Describing your use case and what matters most (cost, speed, quality) is enough for the agent to make a solid recommendation.

Refining an existing pipeline

After the agent builds something, keep iterating. The agent remembers the full conversation and updates the pipeline with each message.

The latency is too high, can you try a different GPU?

Try FP8 where compatible and re-optimize

Add a guardrail to filter harmful content

Compare this version with the previous one

Each of these messages updates the pipeline without starting over. You can compare any two versions side by side to see exactly what changed and which one performs better.

Guardrail, rate limiter, load balancer, and cache nodes are design placeholders today. They record intent on the canvas and in generated code, carry a “Not enforced” badge, and are not enforced at serving time. The agent says so when it adds one. Enforce rate limiting at your own gateway until these nodes go live.
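Since rate limiting has to be enforced at your own gateway for now, here is a minimal token-bucket sketch of what that enforcement could look like. It is a generic, single-process illustration, not RunInfra code; the `TokenBucket` class and its parameters are hypothetical, and a production gateway would typically need shared state across instances.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Example: 5 requests per second with bursts of up to 10.
bucket = TokenBucket(rate_per_sec=5, capacity=10)
if bucket.allow():
    pass  # forward the request to the deployed endpoint
else:
    pass  # reject, for example with HTTP 429
```

The injectable `clock` argument exists only to make the limiter easy to test with a fake time source.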

Next steps

Prompting best practices

The four elements every strong prompt should include.

Debug the agent

Redirect the agent when a pipeline needs course correction.

End-to-end guide

From idea to live production API, step by step.

Supported models

LLMs, embeddings, rerankers, vision-language, speech-to-text, and text-to-speech models.
