RunInfra’s agent turns natural language into production AI inference endpoints, but the quality of what it builds depends on the clarity of what you ask. The more context you give upfront (use case, model, performance priorities, and expected scale), the faster the agent can make the right decisions and the less time you spend iterating on a suboptimal result.
Include these four things
Every strong prompt covers four elements. Leaving any of them out forces the agent to guess or ask clarifying questions.

| Element | What to provide | Example |
|---|---|---|
| Use case | What you’re building | “customer support chatbot”, “summarization API” |
| Model | A specific model or description | “Llama 3.1 8B” or “a fast 7B model” |
| Priority | Latency, cost, throughput, or quality | “optimize for latency” |
| Scale | Expected traffic or deployment type | “500 RPM”, “internal tool”, “production API” |
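The four elements in the table can be treated as a checklist before you send a prompt. The following is a minimal sketch, not part of RunInfra: `PromptSpec`, `missing_elements`, and `render_prompt` are hypothetical helper names, and the sentence template is only one possible phrasing.

```python
from dataclasses import dataclass, fields


@dataclass
class PromptSpec:
    """The four elements from the table. All names here are hypothetical."""
    use_case: str = ""
    model: str = ""
    priority: str = ""
    scale: str = ""


def missing_elements(spec: PromptSpec) -> list[str]:
    """Return the names of elements left blank, in table order."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


def render_prompt(spec: PromptSpec) -> str:
    """Join the four elements into one plain-language request."""
    gaps = missing_elements(spec)
    if gaps:
        raise ValueError(f"prompt is missing: {', '.join(gaps)}")
    return (
        f"Build a {spec.use_case} using {spec.model}. "
        f"Optimize for {spec.priority}. Expected scale: {spec.scale}."
    )


spec = PromptSpec(
    use_case="customer support chatbot",
    model="Llama 3.1 8B",
    priority="latency",
    scale="500 RPM",
)
print(render_prompt(spec))
```

Raising on a blank element mirrors the point above: a gap in the prompt is caught before the agent has to guess or ask.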
Strong prompt vs. weak prompt
Compare these two examples to see the difference specificity makes:
Use @ to mention models
Reference specific models by name with the @ symbol. The agent resolves these to Hugging Face model IDs automatically.
Set constraints in natural language
You don’t need special syntax to define limits. Write constraints the same way you’d say them out loud.
“Fast” is subjective; “under 100ms P99” is measurable. The more precise your constraint, the more useful the optimization results will be.
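One way to self-check a constraint before sending it is to look for subjective words that have no number attached. This is a rough sketch under stated assumptions: the `VAGUE_WORDS` list and the unit pattern are hypothetical and deliberately small, and nothing here reflects how the agent itself parses constraints.

```python
import re

# Hypothetical word list; extend it for your own vocabulary.
VAGUE_WORDS = {"fast", "cheap", "good", "smart", "quick", "small"}

# A dollar amount, a number followed by a time or rate unit, or a percentage,
# for example "$800", "100ms", "500 RPM", "95%".
MEASURABLE = re.compile(
    r"\$\s?\d[\d,.]*|\d[\d,.]*\s?(?:ms|s|rpm|rps)\b|\d[\d,.]*\s?%",
    re.IGNORECASE,
)


def review_constraint(text: str) -> dict:
    """Report subjective words and whether any measurable quantity is present."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {
        "vague_words": sorted(words & VAGUE_WORDS),
        "has_number": bool(MEASURABLE.search(text)),
    }


print(review_constraint("Make it fast and cheap."))
print(review_constraint("P99 under 100ms, monthly cost under $800"))
```

A constraint that reports vague words and no number is a candidate for rewriting with a concrete target.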
Ask for what you want directly
The agent handles far more than the initial build. You can update any part of your pipeline by just saying what you want.

| What you want | What to say |
|---|---|
| Change the model | “Switch to Qwen 2.5 14B” |
| Add caching | “Add a response cache” |
| Change optimization target | “Optimize for cost instead” |
| Run optimization | “Optimize now” |
| Compare versions | “Compare version 1 and 2” |
| Roll back | “Go back to version 1” |
| Change GPU | “Use an H100 for this” |
| Export code | “Generate deployment code” |
| Search models | “Find a good model for code generation under 10B params” |
Iterate, don’t restart
You don’t have to get everything right in a single message. Build incrementally: each message refines the pipeline, and the agent remembers the full conversation context.
Let the agent decide when you’re unsure
If you don’t know which model, GPU, or quantization method to use, ask. The agent makes recommendations based on your use case, model size, and constraints.
You don’t need to know the technical details. The agent handles GPU selection, quantization, serving backend, scaling, and configuration. Your job is to describe what you want to build and what success looks like.
Good prompt vs. bad prompt
Every good prompt is specific about use case, model, priority, and scale. Every bad prompt drops one or more of those. Concrete pairs:
Starting from scratch
Bad: Build me an AI thing for my startup.
Good: Build a customer-support chatbot for our SaaS dashboard. Traffic is 300 RPM peak, latency budget P99 under 200ms, monthly cost under $800. English only.
Why: Every element is measurable. The agent can rank variants against hard targets and skip the clarifying-question loop.
Asking for a model
Bad: Use a smart model.
Good: Use Llama 3.1 8B. If it doesn't hit my latency target, suggest the next size up.
Why: You named a specific candidate and gave the agent permission to escalate if it fails.
Setting constraints
Bad: Make it fast and cheap.
Good: Optimize for latency. Hard ceiling $0.001 per request, P99 under 150ms, quality score at least 0.9.
Why: “Fast” and “cheap” are moods. Numbers are constraints the optimizer can respect.
Asking for changes
Bad: The latency is bad.
Good: Latency P99 is 340ms, I need it under 200ms. Try a smaller model or a faster GPU, whichever costs less.
Why: You gave the measured current value, the target value, and a menu of acceptable fixes.
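The pattern in the good prompt above (measured value, target value, acceptable fixes) can be templated if you write these follow-ups often. This is an illustrative sketch only; `change_request` is a hypothetical helper, not a RunInfra API, and the wording simply follows the example prompt.

```python
def change_request(
    metric: str, current: float, target: float, unit: str, options: list[str]
) -> str:
    """Format a follow-up stating the measured value, the target, and acceptable fixes."""
    if not options:
        raise ValueError("list at least one acceptable fix")
    gap = current - target  # how far the measurement is from the target
    return (
        f"{metric} is {current:g}{unit}, I need it under {target:g}{unit} "
        f"(a gap of {gap:g}{unit}). Try {' or '.join(options)}, whichever costs less."
    )


message = change_request(
    "Latency P99", 340, 200, "ms", ["a smaller model", "a faster GPU"]
)
print(message)
```

Requiring at least one option keeps the request from collapsing back into “the latency is bad”.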
When unsure what to pick
Bad: What model is best?
Good: For a multilingual FAQ bot handling 500 RPM with budget $300/month, which model do you recommend and why?
Why: The agent recommends well when it knows the full context. Open-ended “best model” depends on a dozen dimensions.
Debugging a bad result
Bad: This is not working.
Good: The AWQ variant scored 0.72 on quality, below my 0.9 floor. Can we try FP8 instead, keeping cost under $0.002 per request?
Why: You cited the specific metric that failed and proposed a direction. The agent can take one step forward instead of guessing.
Tips
Next steps
Example prompts
Real conversations for chatbots, summarizers, routing, and more.
Debug the agent
Redirect the agent when it picks the wrong model or over-builds.
Optimization
How GPU profiling, quantization search, and ranking work.
End-to-end guide
The full workflow, from first prompt to production traffic.