Prompting best practices

RunInfra’s agent turns natural language into production AI inference endpoints, but the quality of what it builds depends on the clarity of what you ask. The more context you give upfront (use case, model, performance priorities, and expected scale), the faster the agent can make the right decisions and the less time you spend iterating on a suboptimal result.

Include these four things

Every strong prompt covers four elements. Leaving any of them out forces the agent to guess or ask clarifying questions.

Element	What to provide	Example
Use case	What you’re building	“customer support chatbot”, “summarization API”
Model	A specific model or description	“Llama 3.1 8B” or “a fast 7B model”
Priority	Latency, cost, throughput, or quality	“optimize for latency”
Scale	Expected traffic or deployment type	“500 RPM”, “internal tool”, “production API”
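If you generate prompts from your own tooling, a small checklist can catch a missing element before the message is sent. The sketch below is one way to do that; `PromptSpec`, `missing_elements`, and `build_prompt` are hypothetical names for illustration and are not part of RunInfra.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PromptSpec:
    """The four elements from the table above. All names are hypothetical."""
    use_case: Optional[str] = None   # e.g. "customer support chatbot"
    model: Optional[str] = None      # e.g. "Mistral 7B"
    priority: Optional[str] = None   # e.g. "latency"
    scale: Optional[str] = None      # e.g. "500 requests per minute"


def missing_elements(spec: PromptSpec) -> List[str]:
    """Return the names of the elements that are still empty."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]


def build_prompt(spec: PromptSpec) -> str:
    """Assemble a short prompt; raise if any of the four elements is missing."""
    missing = missing_elements(spec)
    if missing:
        raise ValueError(f"prompt is missing: {', '.join(missing)}")
    return (
        f"Deploy {spec.model} as a {spec.use_case}. "
        f"Optimize for {spec.priority} and target {spec.scale}."
    )
```

Calling `missing_elements(PromptSpec(use_case="chatbot"))` lists `model`, `priority`, and `scale`, which are the gaps the agent would otherwise have to ask about.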

Strong prompt vs. weak prompt

Compare these two examples to see the difference specificity makes. The first is the strong prompt; the second is the weak one:

Deploy Mistral 7B as a customer support chatbot. Optimize for latency,
keep cost under $0.001 per request, and target 500 requests per minute.

Make me an AI

The strong prompt tells the agent exactly what to build, which model to use, what to optimize for, and how to size the infrastructure. The weak prompt leaves every decision open, so the agent will ask clarifying questions before it can do anything useful.

Use @ to mention models

Reference specific models by name with the @ symbol. The agent resolves these to Hugging Face model IDs automatically.

Optimize @Qwen-2.5-7B with a compatible low-VRAM variant for a summarization API

Compare @Llama-3.1-8B and @Mistral-7B for code generation

Use @ mentions when you know the exact model you want. If you’re unsure, describe what you need (“a fast 7B multilingual model”) and the agent will recommend one.
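To check which models a drafted prompt mentions before you send it, you can scan it for `@` mentions yourself. This is a minimal sketch under an assumed pattern (letters, digits, dots, hyphens, and underscores after the `@`); how the agent actually parses and resolves mentions is not described here, and `find_mentions` is a hypothetical helper.

```python
import re
from typing import List

# Assumed mention shape: "@" not preceded by a word character (so email
# addresses are skipped), then a name that starts and ends with a letter or
# digit. Ending on a letter or digit keeps trailing punctuation out.
MENTION = re.compile(r"(?<!\w)@([A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?)")


def find_mentions(prompt: str) -> List[str]:
    """Return every @-mentioned model name in the order it appears."""
    return MENTION.findall(prompt)


print(find_mentions("Compare @Llama-3.1-8B and @Mistral-7B for code generation"))
# prints ['Llama-3.1-8B', 'Mistral-7B']
```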

Set constraints in natural language

You don’t need special syntax to define limits. Write constraints the same way you’d say them out loud.

Keep latency under 100ms P99

Stay under $500/month total cost

Measured quality gate must pass

I need at least 1000 requests per minute

The agent translates these into hard constraints that filter optimization results. Anything that doesn’t meet your requirements gets ruled out before the agent presents options.
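Conceptually, a hard constraint is a pass/fail filter applied before any ranking. The sketch below illustrates that idea with the four example constraints above; it is not RunInfra's implementation, and the class names, field names, and candidate numbers are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One optimization result. Field names are hypothetical."""
    name: str
    p99_latency_ms: float
    monthly_cost_usd: float
    throughput_rpm: float
    quality_gate_passed: bool


@dataclass
class Constraints:
    """Limits taken from the four example prompts above."""
    max_p99_latency_ms: float = 100.0     # "Keep latency under 100ms P99"
    max_monthly_cost_usd: float = 500.0   # "Stay under $500/month total cost"
    min_throughput_rpm: float = 1000.0    # "at least 1000 requests per minute"
    require_quality_gate: bool = True     # "Measured quality gate must pass"


def satisfies(c: Candidate, limits: Constraints) -> bool:
    """True only if the candidate meets every limit."""
    return (
        c.p99_latency_ms < limits.max_p99_latency_ms
        and c.monthly_cost_usd < limits.max_monthly_cost_usd
        and c.throughput_rpm >= limits.min_throughput_rpm
        and (c.quality_gate_passed or not limits.require_quality_gate)
    )


def feasible(candidates: List[Candidate], limits: Constraints) -> List[Candidate]:
    """Drop every candidate that violates at least one limit."""
    return [c for c in candidates if satisfies(c, limits)]


# Invented numbers, for illustration only.
candidates = [
    Candidate("variant-a", 85.0, 420.0, 1200.0, True),
    Candidate("variant-b", 60.0, 650.0, 1500.0, True),   # over budget
    Candidate("variant-c", 95.0, 300.0, 1100.0, False),  # fails quality gate
]
print([c.name for c in feasible(candidates, Constraints())])
# prints ['variant-a']
```

Note that `variant-b` is the fastest option and is still excluded: a hard constraint is never traded off against the optimization target.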

“Fast” is subjective; “under 100ms P99” is measurable. The more precise your constraint, the more useful the optimization results will be.

Ask for what you want directly

The agent handles far more than the initial build. You can update any part of your pipeline by just saying what you want.

What you want	What to say
Change the model	“Switch to Qwen 2.5 14B”
Add caching	“Add a response cache”
Change optimization target	“Optimize for cost instead”
Run optimization	“Optimize now”
Compare models	“Compare Llama 3.1 8B and Mistral 7B on this pipeline”
Compare versions	“Compare version 1 and 2”
Roll back	“Go back to version 1”
Change GPU	“Use an H100 for this”
Export code	“Generate deployment code”
Search models	“Find a good model for code generation under 10B params”

Iterate, don’t restart

You don’t have to get everything right in a single message. Build incrementally: each message refines the pipeline, and the agent remembers the full conversation context.

User:  Build a chatbot with Llama 3.1 8B
Agent: [builds pipeline]

User:  Add a cache with 1 hour TTL
Agent: [adds cache node]

User:  Actually, switch to Qwen 2.5 7B, it's better for multilingual
Agent: [swaps model]

User:  Optimize for latency
Agent: [runs optimization]

User:  Deploy as scale-to-zero
Agent: [deploys]

Starting simple and refining is often faster than writing a perfect prompt upfront. Get a working pipeline first, then tune from there.

Let the agent decide when you’re unsure

If you don’t know which model, GPU, or quantization method to use, ask. The agent makes recommendations based on your use case, model size, and constraints.

What model would you recommend for a low-cost translation API?

What GPU should I use for a 14B model?

Should I use a 4-bit variant or FP8 for this?

You don’t need to know the technical details. The agent handles GPU selection, quantization, serving backend, scaling, and configuration. Your job is to describe what you want to build and what success looks like.

Good prompt vs. bad prompt

Every good prompt is specific about use case, model, priority, and scale. Every bad prompt drops one or more of those. Concrete pairs:

Starting from scratch

Bad: Build me an AI thing for my startup.

Good: Build a customer-support chatbot for our SaaS dashboard. Traffic is 300 RPM peak, latency budget P99 under 200ms, monthly cost under $800. English only.

Why: Every element is measurable. The agent can rank variants against hard targets and skip the clarifying-question loop.
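Measurable targets can also be checked against each other before you send them. The arithmetic below is a rough sketch that converts the monthly budget and peak traffic from the prompt above into a per-request figure. It assumes a 30-day month, traffic held at peak all month (a worst case), and pure per-request pricing; real billing may work differently, and the function names are hypothetical.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # assumes a 30-day month


def max_monthly_requests(peak_rpm: float) -> float:
    """Upper bound on monthly volume if traffic stayed at peak all month."""
    return peak_rpm * MINUTES_PER_MONTH


def per_request_budget(monthly_budget_usd: float, peak_rpm: float) -> float:
    """Worst-case cost allowed per request, in USD."""
    return monthly_budget_usd / max_monthly_requests(peak_rpm)


# 300 RPM peak with an $800/month ceiling, as in the prompt above.
print(max_monthly_requests(300))               # prints 12960000
print(round(per_request_budget(800, 300), 7))  # prints 6.17e-05
```

Because average traffic is usually below peak, the real per-request budget is looser than this bound. The check is still useful for spotting a monthly ceiling and a per-request ceiling that cannot both hold.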

Asking for a model

Bad: Use a smart model.

Good: Use Llama 3.1 8B. If it doesn't hit my latency target, suggest the next size up.

Why: You named a specific candidate and gave the agent permission to escalate if it fails.

Setting constraints

Bad: Make it fast and cheap.

Good: Optimize for latency. Hard ceiling $0.001 per request, P99 under 150ms, measured quality gate must pass.

Why: “Fast” and “cheap” are moods. Numbers are constraints the optimizer can respect.

Asking for changes

Bad: The latency is bad.

Good: Latency P99 is 340ms, I need it under 200ms. Try a smaller model or a faster GPU, whichever costs less.

Why: You gave the measured current value, the target value, and a menu of acceptable fixes.

When unsure what to pick

Bad: What model is best?

Good: For a multilingual FAQ bot handling 500 RPM with budget $300/month, which model do you recommend and why?

Why: The agent recommends well when it knows the full context. The answer to an open-ended “best model” question depends on a dozen dimensions.

Debugging a bad result

Bad: This is not working.

Good: The low-VRAM variant failed the quality gate. Can we try FP8 where compatible, keeping cost under $0.002 per request?

Why: You cited the specific metric that failed and proposed a direction. The agent can take one step forward instead of guessing.

Tips

Name the use case. “Customer support chatbot” gives the agent enough context to make opinionated picks. “A chatbot” leaves every decision open.

Review results before deploying. The agent shows optimization metrics first. Don’t deploy a variant that barely clears your quality floor; the playground is free to re-test.

Vague prompts don’t fail, they stall. The agent asks clarifying questions until it can act. Save round trips by including all four elements (use case, model, priority, scale) upfront.

Next steps

Example prompts

Real conversations for chatbots, summarizers, routing, and more.

Debug the agent

Redirect the agent when it picks the wrong model or over-builds.

Optimization

How GPU profiling, quantization search, and ranking work.

End-to-end guide

The full workflow, from first prompt to production traffic.
