RunInfra’s agent gets most things right when you give it clear instructions, but sometimes it picks the wrong model, produces optimization results that don’t meet your requirements, or builds a more complex pipeline than you need. Every problem below has a straightforward fix, usually a single follow-up message that redirects the agent without losing your progress.
Be direct. Name the exact model you want and tell the agent to switch.
I said Llama, not Mistral. Switch to Llama 3.1 8B.
The agent will swap the model and preserve the rest of your pipeline configuration. To be unambiguous, use the @ mention syntax:
Switch to @Llama-3.1-8B
If the results don’t meet your expectations, re-run optimization with more specific instructions. Each run creates a new version you can compare against previous ones.
The latency is still too high. Try a faster GPU.
Can you try TensorRT-LLM instead of vLLM?
Optimize again but prioritize latency over cost.
Replace subjective feedback (“it’s slow”) with measurable targets (“latency is 400ms but I need under 100ms P99”). This gives the agent concrete constraints to optimize against.
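One way to arrive at a measurable target is to compute the P99 from latency samples you have collected yourself and turn it into a follow-up message. The sketch below is illustrative only: it uses the nearest-rank percentile definition, and the function names and the sample data are assumptions, not part of RunInfra.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Multiply before dividing to avoid rounding surprises such as 0.99 * n.
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]

def latency_feedback(samples_ms, target_p99_ms):
    """Build a concrete follow-up message from measured latencies (hypothetical helper)."""
    p99 = percentile(samples_ms, 99)
    if p99 <= target_p99_ms:
        return f"P99 latency is {p99:.0f}ms, within the {target_p99_ms:.0f}ms target."
    return f"P99 latency is {p99:.0f}ms but I need under {target_p99_ms:.0f}ms."

# Made-up measurements: 98 fast requests and 2 slow ones.
samples = [100] * 98 + [400, 400]
print(latency_feedback(samples, 100))
```

A message built this way gives the agent both the current value and the target, which is the form of feedback recommended above.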
If the agent asks for clarification instead of building, tell it to proceed with its best judgment.
Just go with your best recommendation and we'll iterate from there.
This tells the agent to make decisions and move forward. You can always refine the result afterward; that is faster than answering several questions upfront.
If the agent added nodes you don’t need, tell it exactly what to remove.
Remove the guardrail and the load balancer.
I just need the model and a cache.
The agent will simplify the pipeline to match your description. Being explicit about what you want to keep (“I just need the model and a cache”) helps the agent understand the target state, not just what to remove.
Optimization typically completes in 2-5 minutes. If it appears stuck, ask for a status update.
What's the status of the optimization?
The agent will show you current progress. If the run has genuinely stalled, ask it to restart with a smaller search space, for example by limiting it to a single GPU tier or fewer quantization options.
When deployment fails, the agent surfaces error diagnostics automatically. Common fixes:
Try deploying on a different GPU tier.
The model might be too large for this GPU. What do you recommend?
If the error message mentions an out-of-memory condition, the model likely exceeds the VRAM available on your selected GPU. Asking the agent for a recommendation gives it a chance to suggest a larger GPU tier or a quantized model variant that fits.
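To see why a quantized variant can fit where the full-precision model does not, a rough back-of-the-envelope estimate helps. The sketch below is an approximation under stated assumptions: it counts only the weights (parameters times bytes per parameter) and adds a flat 20% allowance for KV cache and activations. That allowance is a guess chosen for illustration, real overhead depends on batch size and context length, and none of these names come from RunInfra.

```python
# Approximate storage per parameter for common precisions (bytes).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions, precision):
    """Weights only, in decimal GB: billions of parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

def likely_fits(params_billions, precision, vram_gb, overhead_fraction=0.2):
    """Rough check: weights plus an assumed flat overhead must fit in VRAM."""
    needed_gb = weight_memory_gb(params_billions, precision) * (1 + overhead_fraction)
    return needed_gb <= vram_gb

# An 8B-parameter model on a hypothetical 16 GB GPU.
print(weight_memory_gb(8, "fp16"))   # weights alone at FP16
print(likely_fits(8, "fp16", 16))    # FP16 weights plus overhead
print(likely_fits(8, "int4", 16))    # 4-bit quantized variant
```

Under these assumptions, an 8B model needs about 16 GB for FP16 weights alone, so it does not fit a 16 GB card once overhead is included, while a 4-bit variant does. Treat the result as a sanity check before asking the agent, not as a replacement for its recommendation.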
The first request after a cold start takes 1-2 seconds on RunInfra Cloud. This is expected behavior for scale-to-zero endpoints: the GPU spins up on demand, and subsequent requests are fast. If cold starts are unacceptable for your use case, switch to always-on deployment.
Switch to always-on deployment so there's no cold start.
Always-on deployment requires the Team plan.
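Before deciding whether cold starts matter for your use case, it can help to measure them. The sketch below is a minimal, hypothetical harness: `send_request` stands in for whatever call you make to your own endpoint (it is not a RunInfra API), and the function simply separates the first request's latency from the median of the requests that follow.

```python
import time
from statistics import median

def split_cold_and_warm(send_request, n_requests=5, clock=time.perf_counter):
    """Time n_requests sequential calls and return
    (first-call latency, median latency of the remaining calls) in seconds."""
    if n_requests < 2:
        raise ValueError("need at least two requests to compare cold and warm")
    latencies = []
    for _ in range(n_requests):
        start = clock()
        send_request()  # placeholder: your own client call goes here
        latencies.append(clock() - start)
    return latencies[0], median(latencies[1:])

if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a network connection.
    cold, warm = split_cold_and_warm(lambda: time.sleep(0.01))
    print(f"first request: {cold:.3f}s, later requests (median): {warm:.3f}s")
```

Run it only after the endpoint has been idle long enough to scale to zero; otherwise the first request will already be warm and the two numbers will look alike.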

General debugging tips

When nothing above resolves the issue, these four approaches cover most remaining cases:
  • Be specific about what’s wrong. “It’s broken” doesn’t give the agent anything to act on. “Latency is 500ms but I need under 100ms” does.
  • Ask the agent to explain its decisions. “Why did you pick this GPU?” or “Why is this quantization method better?” surfaces the agent’s reasoning so you can correct any wrong assumptions.
  • Compare versions. “Compare version 1 and version 3” shows you exactly what changed and which configuration performs better on your target metrics.
  • Start over if needed. If the pipeline has drifted too far from what you want, it’s sometimes faster to reset than to keep patching. Say “Reset the pipeline and let’s start from scratch” and give the agent a cleaner, more specific prompt the second time.

Next steps

Prompting best practices

Write prompts that avoid the problems on this page entirely.

Troubleshooting

Fix pipeline, deployment, and API integration issues by category.

Optimization

Understand how constraints and priority affect ranked results.

Deployment

Flex scale-to-zero and Active always-on endpoints.