
Example Prompts

Real conversations for every use case. Copy, paste, and customize.

These are real prompts you can type into RunInfra. Each one shows what to say and what the agent does.

Customer support chatbot

Deploy Llama 3.1 8B as a customer support chatbot. 
Optimize for latency, P99 under 200ms.

What the agent does: Creates a pipeline with Llama 3.1 8B, profiles on L4/L40S GPUs, searches for an optimized AWQ 4-bit variant, benchmarks it, finds the fastest configuration, and deploys as a scale-to-zero endpoint.
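If you want to sanity-check a P99 target like this against your own traffic, the percentile math is simple. A minimal sketch using the nearest-rank method (the latency samples below are synthetic; RunInfra measures this for you during profiling):

```python
def p99(latencies_ms):
    """Return the 99th-percentile latency from a list of samples."""
    ordered = sorted(latencies_ms)
    # Index below which 99% of observations fall (nearest-rank method).
    idx = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return ordered[idx]

# Synthetic samples: 1,000 requests between 50 ms and 149 ms.
samples = [50 + (i % 100) for i in range(1000)]
assert p99(samples) < 200  # meets the "P99 under 200ms" constraint
```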

Document summarizer

I need a summarization API using Qwen 2.5 14B. 
Optimize for cost, max $0.003 per request. 
Add a response cache for repeated documents.

What the agent does: Builds a pipeline with Qwen 2.5 14B, a response cache node, and cost-priority optimization. Searches for optimized model variants and picks the cheapest that meets your constraints.
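Conceptually, the response cache node is a hash-keyed lookup: identical documents skip the model entirely. A minimal sketch (the `ResponseCache` class and `fake_summarize` stand-in are illustrative, not RunInfra's API):

```python
import hashlib

class ResponseCache:
    """Illustrative response cache keyed by a hash of the document text,
    so repeated documents are answered without invoking the model."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, document, summarize):
        key = hashlib.sha256(document.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = summarize(document)  # cache miss: run the model
        return self._store[key]

# Stand-in for the real summarization call; records how often it runs.
model_calls = []
def fake_summarize(doc):
    model_calls.append(doc)
    return "summary of: " + doc[:20]

cache = ResponseCache()
first = cache.get_or_compute("quarterly report text", fake_summarize)
second = cache.get_or_compute("quarterly report text", fake_summarize)
assert first == second
assert len(model_calls) == 1  # the repeat was served from the cache
```

This is why the cache helps with the $0.003-per-request budget: repeated documents cost nothing beyond the lookup.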

Multi-model routing

Build a pipeline with two models: Phi-3 Mini for simple questions 
and Llama 3.1 70B for complex reasoning. Route based on query 
complexity. Budget is $300/month.

What the agent does: Creates a router that analyzes query complexity, sending simple queries to the small, cheap model and complex ones to the large model. It optimizes both models and estimates monthly cost to fit your budget.
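To see the routing idea concretely, here is a toy complexity heuristic. The scoring function, threshold, and model names are illustrative assumptions; RunInfra's router uses its own complexity analysis:

```python
def estimate_complexity(query: str) -> float:
    """Crude complexity score: longer, multi-clause questions score higher.
    A real router would use a classifier; this heuristic is illustrative."""
    words = query.split()
    clauses = query.count(",") + query.count(" and ") + query.count(" then ")
    return len(words) / 20 + clauses * 0.5

def route(query: str, threshold: float = 1.0) -> str:
    # Below the threshold, use the cheap small model; above it, the large one.
    if estimate_complexity(query) < threshold:
        return "phi-3-mini"
    return "llama-3.1-70b"

assert route("What are your hours?") == "phi-3-mini"
assert route(
    "Compare the tradeoffs between eventual and strong consistency, "
    "and then explain when each is appropriate for a payments system."
) == "llama-3.1-70b"
```

The budget constraint works because most traffic is simple: every query the router keeps on the small model costs a fraction of a 70B invocation.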

Code generation API

Deploy DeepSeek Coder V2 optimized for throughput. 
I need to handle 1000 RPM for our CI pipeline.

What the agent does: Profiles DeepSeek Coder V2, optimizes for throughput priority, configures scaling to handle 1000 RPM, and deploys with appropriate replica count.
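The replica-count arithmetic behind that scaling step is simple division, rounded up. The per-replica throughput below is an illustrative profiling result, not a published DeepSeek Coder figure:

```python
import math

target_rpm = 1000       # required by the CI pipeline
per_replica_rpm = 240   # illustrative: what profiling might measure

# Replicas needed to sustain the target, rounding up.
replicas = math.ceil(target_rpm / per_replica_rpm)
assert replicas == 5

# Headroom check: total capacity meets or exceeds the target.
assert replicas * per_replica_rpm >= target_rpm
```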

Low-cost internal tool

Cheapest possible chatbot for an internal FAQ tool. 
Under 50 requests per day. Doesn't need to be fast.

What the agent does: Recommends a small model (Phi-3 Mini or Qwen 2.5 3B), finds an optimized AWQ 4-bit variant, deploys on the cheapest GPU tier, and configures scale-to-zero to minimize cost.

Multilingual translation

I need a translation endpoint that handles English, Spanish, French, 
German, and Japanese. Use @Qwen-2.5-7B since it's good at multilingual. 
Optimize for quality.

What the agent does: Builds a Qwen 2.5 7B pipeline with quality-priority optimization. It searches for FP8 and other high-quality optimized variants to preserve multilingual accuracy, keeping the quality score above 0.95.

Batch processing

Set up Mistral Small 22B for batch document processing. 
I'll send 10,000 documents per day. Optimize for throughput 
and keep total cost under $50/day.

What the agent does: Configures Mistral Small 22B with throughput priority, calculates the GPU tier and replica count needed for 10K docs/day within $50 budget, and finds an optimized variant to reduce per-request cost.
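The budget math the agent works through reduces to a per-document ceiling. The $0.004 optimized per-request cost below is an illustrative number, not real RunInfra pricing:

```python
docs_per_day = 10_000
budget_per_day = 50.00

# Maximum spend per document that keeps the daily total within budget.
max_cost_per_doc = budget_per_day / docs_per_day
assert max_cost_per_doc == 0.005  # half a cent per document

# If an optimized variant brings per-request cost down to $0.004,
# the pipeline fits the budget with headroom:
optimized_daily_cost = 0.004 * docs_per_day
assert optimized_daily_cost <= budget_per_day
```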

Maximum performance

I want the absolute fastest inference for Llama 3.1 70B. 
Use an H100 with TensorRT-LLM. Cost doesn't matter.

What the agent does: Configures Llama 3.1 70B on H100 with TensorRT-LLM backend, finds an FP8 variant (native on H100), and enables speculative decoding. Deploys as always-on for zero cold start. (Requires Team plan for TensorRT-LLM.)

Starting from scratch

Don't know what model to use? Let the agent decide:

I'm building a chatbot for recipe recommendations. 
What model would you suggest? I want it cheap and fast.

What the agent does: Recommends a small, cost-effective model (likely Phi-3 Mini or Qwen 2.5 3B based on the simple use case), explains why, and offers to build the pipeline.

Refining an existing pipeline

After the agent builds something, keep iterating:

User: The latency is too high, can you try a different GPU?
User: Switch from AWQ to FP8 and re-optimize
User: Add a guardrail to filter harmful content
User: Compare this version with the previous one

The agent remembers your full conversation and updates the pipeline accordingly.
