
Prompting Best Practices

How to talk to RunInfra's agent to get the best results.

RunInfra's agent builds your pipeline from natural language. The more specific you are, the better the result. Here's how to write prompts that work.

Include these four things

The best prompts cover:

  1. What you're building: the use case (chatbot, summarizer, translator, code generation)
  2. Which model: name a model or describe what you need ("a fast 7B model")
  3. What matters most: latency, cost, throughput, or quality
  4. Scale: expected traffic ("100 RPM", "internal tool", "production API")

Great prompt

Deploy Mistral 7B as a customer support chatbot. Optimize for latency, 
keep cost under $0.001 per request, and target 500 requests per minute.

The agent knows exactly what to build, which model, what to optimize for, and how to size the infrastructure.

Weak prompt

Make me an AI

Too vague. The agent will ask clarifying questions, which slows things down.

Use @ to mention models

Reference specific models with @:

Optimize @Qwen-2.5-7B with AWQ quantization for a summarization API
Compare @Llama-3.1-8B and @Mistral-7B for code generation

The agent resolves these to Hugging Face model IDs automatically.
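To make the resolution step concrete, here is a minimal sketch of what an @-mention lookup could look like. The actual logic lives inside the agent, and the mapping table below is an assumption for illustration, not RunInfra's real data:

```python
# Illustrative only: a hypothetical table mapping @-mentions to
# Hugging Face model IDs. RunInfra's real resolver is internal.
MENTION_TO_HF_ID = {
    "@Qwen-2.5-7B": "Qwen/Qwen2.5-7B-Instruct",
    "@Llama-3.1-8B": "meta-llama/Llama-3.1-8B-Instruct",
    "@Mistral-7B": "mistralai/Mistral-7B-Instruct-v0.3",
}

def resolve_mentions(prompt: str) -> dict:
    """Return the Hugging Face ID for every known @-mention in a prompt."""
    return {m: hf_id for m, hf_id in MENTION_TO_HF_ID.items() if m in prompt}
```

So a prompt like "Optimize @Qwen-2.5-7B with AWQ quantization" would resolve to the `Qwen/Qwen2.5-7B-Instruct` repository on the Hub.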

Set constraints in natural language

You don't need special syntax. Just say what you need:

Keep latency under 100ms P99
Stay under $500/month total cost
Quality score must be above 0.9
I need at least 1000 requests per minute

The agent translates these into hard constraints that filter optimization results.
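A sketch of what "hard constraints that filter optimization results" means in practice. The field names (`p99_ms`, `cost_per_request`, `rpm`) and the candidate configurations are assumptions for illustration, not RunInfra's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    p99_ms: float            # P99 latency in milliseconds
    cost_per_request: float  # dollars per request
    rpm: int                 # sustained requests per minute

def meets_constraints(c: Candidate) -> bool:
    # "Keep latency under 100ms P99"            -> p99_ms < 100
    # "Keep cost under $0.001 per request"      -> cost_per_request < 0.001
    # "I need at least 1000 requests per minute" -> rpm >= 1000
    return c.p99_ms < 100 and c.cost_per_request < 0.001 and c.rpm >= 1000

# Hypothetical optimization candidates: only configs that satisfy
# every constraint survive the filter.
candidates = [
    Candidate("awq-a10g", p99_ms=85, cost_per_request=0.0008, rpm=1200),
    Candidate("fp16-a100", p99_ms=60, cost_per_request=0.002, rpm=3000),
]
viable = [c.name for c in candidates if meets_constraints(c)]
```

Here the fp16 config is faster but fails the cost ceiling, so only the AWQ config remains viable. This is why stating constraints explicitly matters: they prune the search space before you ever see results.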

Ask for what you want directly

The agent can do more than build pipelines. Just ask:

  • Change the model: "Switch to Qwen 2.5 14B"
  • Add caching: "Add a response cache"
  • Change optimization target: "Optimize for cost instead"
  • Run optimization: "Optimize now"
  • Compare versions: "Compare version 1 and 2"
  • Roll back: "Go back to version 1"
  • Deploy: click Deploy in the deploy tab
  • Change GPU: "Use an H100 for this"
  • Export code: "Generate deployment code"
  • Search models: "Find a good model for code generation under 10B params"
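As a rough illustration of the "Export code" action, the generated client might resemble the sketch below. The endpoint URL, header names, and payload fields are placeholders invented for this example, not RunInfra's real API; use the code the agent generates for your pipeline.

```python
import json

def build_request(prompt: str, max_tokens: int = 256):
    """Assemble a hypothetical request to a deployed pipeline endpoint."""
    url = "https://api.runinfra.example/v1/pipelines/my-chatbot/generate"  # placeholder URL
    headers = {
        "Authorization": "Bearer $RUNINFRA_API_KEY",  # placeholder auth scheme
        "Content-Type": "application/json",
    }
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    return url, headers, body
```

The point is the shape, not the specifics: exported code is a thin client around an HTTP endpoint, so you can drop it into any service that can make a POST request.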

Iterate, don't restart

You don't need to get everything right in one message. Build incrementally:

User: Build a chatbot with Llama 3.1 8B
Agent: [builds pipeline]

User: Add a cache with 1 hour TTL
Agent: [adds cache node]

User: Actually, switch to Qwen 2.5 7B, it's better for multilingual
Agent: [swaps model]

User: Optimize for latency
Agent: [runs optimization]

User: Deploy as scale-to-zero
Agent: [deploys]

Each message refines the pipeline. The agent remembers the full conversation context.

Let the agent decide when you're unsure

If you don't know which model, GPU, or quantization method to use, ask:

What model would you recommend for a low-cost translation API?
What GPU should I use for a 14B model?
Should I use AWQ or GPTQ for this?

The agent makes recommendations based on your use case, model size, and constraints.

Tips

  • Be specific about performance: "fast" is subjective. "Under 100ms P99" is measurable.
  • Mention the use case: "customer support chatbot" gives the agent context to make better decisions.
  • Don't worry about technical details: The agent handles GPU selection, quantization, serving backend, scaling, and configuration. You focus on what you want, not how to build it.
  • Review before deploying: The agent shows optimization results before deployment. Check the metrics.
