RunInfra’s agent turns natural language into production AI inference endpoints, but the quality of what it builds depends on the clarity of what you ask. The more context you give upfront (use case, model, performance priorities, and expected scale), the faster the agent can make the right decisions and the less time you spend iterating on a suboptimal result.

Include these four things

Every strong prompt covers four elements. Leaving any of them out forces the agent to guess or ask clarifying questions.
| Element | What to provide | Example |
| --- | --- | --- |
| Use case | What you’re building | “customer support chatbot”, “summarization API” |
| Model | A specific model or description | “Llama 3.1 8B” or “a fast 7B model” |
| Priority | Latency, cost, throughput, or quality | “optimize for latency” |
| Scale | Expected traffic or deployment type | “500 RPM”, “internal tool”, “production API” |
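
As a rough illustration of the four-element checklist, the sketch below assembles a prompt from the four fields and reports any that are missing before you send it. The `PromptSpec` class and both helper functions are hypothetical names invented for this example; they are not part of RunInfra, which only needs the final natural-language text.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PromptSpec:
    """The four elements of a strong prompt (hypothetical helper, not a RunInfra API)."""
    use_case: Optional[str] = None
    model: Optional[str] = None
    priority: Optional[str] = None
    scale: Optional[str] = None


def missing_elements(spec: PromptSpec) -> List[str]:
    """Return the names of elements that were left empty."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]


def build_prompt(spec: PromptSpec) -> str:
    """Compose a one-paragraph prompt, refusing to build an incomplete one."""
    missing = missing_elements(spec)
    if missing:
        raise ValueError(f"prompt is missing: {', '.join(missing)}")
    return (
        f"Deploy {spec.model} as a {spec.use_case}. "
        f"Optimize for {spec.priority} and target {spec.scale}."
    )


spec = PromptSpec(
    use_case="customer support chatbot",
    model="Mistral 7B",
    priority="latency",
    scale="500 requests per minute",
)
print(build_prompt(spec))
# Deploy Mistral 7B as a customer support chatbot. Optimize for latency and target 500 requests per minute.
```

The wording of the generated sentence is only one possible phrasing; the point is that all four elements are present before the prompt is sent.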

Strong prompt vs. weak prompt

Compare a strong prompt with a weak one to see the difference specificity makes. The strong prompt:
Deploy Mistral 7B as a customer support chatbot. Optimize for latency,
keep cost under $0.001 per request, and target 500 requests per minute.
The strong prompt tells the agent exactly what to build, which model to use, what to optimize for, and how to size the infrastructure. The weak prompt leaves every decision open, so the agent will ask clarifying questions before it can do anything useful.

Use @ to mention models

Reference specific models by name with the @ symbol. The agent resolves these to Hugging Face model IDs automatically.
Optimize @Qwen-2.5-7B with AWQ quantization for a summarization API
Compare @Llama-3.1-8B and @Mistral-7B for code generation
Use @ mentions when you know the exact model you want. If you’re unsure, describe what you need (“a fast 7B multilingual model”) and the agent will recommend one.
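
If you generate prompts programmatically, it can help to check which @ mentions a prompt contains before sending it. The sketch below is an assumption about what a mention looks like (an `@` followed by letters, digits, dots, hyphens, or underscores); it does not reproduce how the agent resolves mentions to Hugging Face model IDs, and `extract_mentions` is a hypothetical helper.

```python
import re
from typing import List

# Assumed mention shape: "@" not preceded by a word character (to skip emails),
# followed by a model-like token such as "Llama-3.1-8B".
MENTION = re.compile(r"(?<![\w.])@([A-Za-z0-9][A-Za-z0-9._-]*)")


def extract_mentions(prompt: str) -> List[str]:
    """Return the model names mentioned with @, without trailing punctuation."""
    return [name.rstrip(".-_") for name in MENTION.findall(prompt)]


print(extract_mentions("Compare @Llama-3.1-8B and @Mistral-7B for code generation"))
# ['Llama-3.1-8B', 'Mistral-7B']
```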

Set constraints in natural language

You don’t need special syntax to define limits. Write constraints the same way you’d say them out loud.
Keep latency under 100ms P99
Stay under $500/month total cost
Quality score must be above 0.9
I need at least 1000 requests per minute
The agent translates these into hard constraints that filter optimization results. Anything that doesn’t meet your requirements gets ruled out before the agent presents options.
“Fast” is subjective; “under 100ms P99” is measurable. The more precise your constraint, the more useful the optimization results will be.
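
To make the idea of hard constraints concrete, here is a minimal sketch of how the four example constraints above could filter a list of candidate variants. Everything in it is an assumption for illustration: the `Variant` and `Constraints` classes, the field names, and the candidate numbers are invented, and the agent's actual filtering logic is not documented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    """One optimization result (hypothetical shape, illustrative numbers only)."""
    name: str
    p99_latency_ms: float
    monthly_cost_usd: float
    quality_score: float
    requests_per_minute: float


@dataclass
class Constraints:
    """The four example constraints, expressed as numeric limits."""
    max_p99_latency_ms: float = 100    # "Keep latency under 100ms P99"
    max_monthly_cost_usd: float = 500  # "Stay under $500/month total cost"
    min_quality_score: float = 0.9     # "Quality score must be above 0.9"
    min_requests_per_minute: float = 1000  # "at least 1000 requests per minute"


def satisfies(v: Variant, c: Constraints) -> bool:
    """A variant survives only if it meets every constraint."""
    return (
        v.p99_latency_ms < c.max_p99_latency_ms
        and v.monthly_cost_usd < c.max_monthly_cost_usd
        and v.quality_score > c.min_quality_score
        and v.requests_per_minute >= c.min_requests_per_minute
    )


# Made-up candidates, not measurements.
candidates: List[Variant] = [
    Variant("variant-a", 140, 620, 0.95, 1200),  # too slow and too expensive
    Variant("variant-b", 80, 410, 0.91, 1500),   # meets every limit
    Variant("variant-c", 60, 300, 0.85, 1800),   # quality below the floor
]

limits = Constraints()
feasible = [v.name for v in candidates if satisfies(v, limits)]
print(feasible)
# ['variant-b']
```

A vague word like “fast” has no place in a structure like this, which is why a numeric target produces more useful results.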

Ask for what you want directly

The agent handles far more than the initial build. You can update any part of your pipeline by just saying what you want.
| What you want | What to say |
| --- | --- |
| Change the model | “Switch to Qwen 2.5 14B” |
| Add caching | “Add a response cache” |
| Change optimization target | “Optimize for cost instead” |
| Run optimization | “Optimize now” |
| Compare versions | “Compare version 1 and 2” |
| Roll back | “Go back to version 1” |
| Change GPU | “Use an H100 for this” |
| Export code | “Generate deployment code” |
| Search models | “Find a good model for code generation under 10B params” |

Iterate, don’t restart

You don’t have to get everything right in a single message. Build incrementally: each message refines the pipeline, and the agent remembers the full conversation context.
User:  Build a chatbot with Llama 3.1 8B
Agent: [builds pipeline]

User:  Add a cache with 1 hour TTL
Agent: [adds cache node]

User:  Actually, switch to Qwen 2.5 7B, it's better for multilingual
Agent: [swaps model]

User:  Optimize for latency
Agent: [runs optimization]

User:  Deploy as scale-to-zero
Agent: [deploys]
Starting simple and refining is often faster than writing a perfect prompt upfront. Get a working pipeline first, then tune from there.

Let the agent decide when you’re unsure

If you don’t know which model, GPU, or quantization method to use, ask. The agent makes recommendations based on your use case, model size, and constraints.
What model would you recommend for a low-cost translation API?
What GPU should I use for a 14B model?
Should I use AWQ or GPTQ for this?
You don’t need to know the technical details. The agent handles GPU selection, quantization, serving backend, scaling, and configuration. Your job is to describe what you want to build and what success looks like.

Good prompt vs. bad prompt

Every good prompt is specific about use case, model, priority, and scale. Every bad prompt drops one or more of those. Concrete pairs:
Bad: Build me an AI thing for my startup.
Good: Build a customer-support chatbot for our SaaS dashboard. Traffic is 300 RPM peak, latency budget P99 under 200ms, monthly cost under $800. English only.
Why: Every element is measurable. The agent can rank variants against hard targets and skip the clarifying-question loop.

Bad: Use a smart model.
Good: Use Llama 3.1 8B. If it doesn't hit my latency target, suggest the next size up.
Why: You named a specific candidate and gave the agent permission to escalate if it fails.

Bad: Make it fast and cheap.
Good: Optimize for latency. Hard ceiling $0.001 per request, P99 under 150ms, quality score at least 0.9.
Why: “Fast” and “cheap” are moods. Numbers are constraints the optimizer can respect.

Bad: The latency is bad.
Good: Latency P99 is 340ms, I need it under 200ms. Try a smaller model or a faster GPU, whichever costs less.
Why: You gave the measured current value, the target value, and a menu of acceptable fixes.

Bad: What model is best?
Good: For a multilingual FAQ bot handling 500 RPM with budget $300/month, which model do you recommend and why?
Why: The agent recommends well when it knows the full context. Open-ended “best model” depends on a dozen dimensions.

Bad: This is not working.
Good: The AWQ variant scored 0.72 on quality, below my 0.9 floor. Can we try FP8 instead, keeping cost under $0.002 per request?
Why: You cited the specific metric that failed and proposed a direction. The agent can take one step forward instead of guessing.
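
A quick self-check before sending a prompt is to look for mood words without numbers. The sketch below is a deliberately crude heuristic, assuming a small hand-picked word list and treating any digit as a sign of a measurable target; it will misjudge some prompts (a model name such as “Llama 3.1 8B” contains digits but is not a constraint). Both function names are hypothetical.

```python
import re
from typing import List

# Hand-picked examples of subjective words; extend to taste.
VAGUE_WORDS = {"fast", "cheap", "smart", "best", "bad", "good"}


def vague_terms(prompt: str) -> List[str]:
    """Return the subjective words found in the prompt, sorted alphabetically."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return sorted(words & VAGUE_WORDS)


def looks_measurable(prompt: str) -> bool:
    """Crude signal: the prompt contains at least one digit."""
    return bool(re.search(r"\d", prompt))


bad = "Make it fast and cheap."
good = ("Optimize for latency. Hard ceiling $0.001 per request, "
        "P99 under 150ms, quality score at least 0.9.")

print(vague_terms(bad), looks_measurable(bad))
# ['cheap', 'fast'] False
print(vague_terms(good), looks_measurable(good))
# [] True
```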

Tips

Name the use case. “Customer support chatbot” gives the agent enough context to make opinionated picks. “A chatbot” leaves every decision open.
Review results before deploying. The agent shows optimization metrics first. Don’t deploy a variant that barely clears your quality floor; the playground is free to re-test.
Vague prompts don’t fail, they stall. The agent asks clarifying questions until it can act. Save round trips by including all four elements (use case, model, priority, scale) upfront.

Next steps

Example prompts

Real conversations for chatbots, summarizers, routing, and more.

Debug the agent

Redirect the agent when it picks the wrong model or over-builds.

Optimization

How GPU profiling, quantization search, and ranking work.

End-to-end guide

The full workflow, from first prompt to production traffic.