
Debugging Prompts

When things go wrong and how to guide the agent back on track.

Sometimes the agent misunderstands a request or makes a suboptimal choice. Here's how to course-correct.

The agent picked the wrong model

User: I said Llama, not Mistral. Switch to Llama 3.1 8B.

Be direct. Name the exact model you want.

Optimization results are bad

If the results don't meet your expectations:

User: The latency is still too high. Try a faster GPU.
User: Can you try TensorRT-LLM instead of vLLM?
User: Optimize again but prioritize latency over cost.

The agent re-runs optimization with your new parameters. Each run creates a new version you can compare.

The agent is asking too many questions

If the agent keeps asking for clarification instead of building:

User: Just go with your best recommendation and we'll iterate from there.

This tells the agent to make decisions and move forward.

The pipeline is too complex

If the agent added nodes you don't need:

User: Remove the guardrail and the load balancer. 
I just need the model and a cache.

Optimization is taking too long

Optimization typically takes 2-5 minutes. If it seems stuck:

User: What's the status of the optimization?

The agent will show you current progress.

Deployment failed

If deployment fails, the agent shows error diagnostics. Common fixes:

User: Try deploying on a different GPU tier.
User: The model might be too large for this GPU. What do you recommend?

The endpoint is slow on first request

The first request after a cold start takes 1-2 seconds on RunInfra Cloud. This is normal for scale-to-zero endpoints. Subsequent requests are fast.

If cold starts are unacceptable:

User: Switch to always-on deployment so there's no cold start.

(Requires Team plan.)
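Before switching to always-on, it can help to confirm that cold starts are actually the cause of the slowdown. A minimal, framework-agnostic sketch (nothing here is a RunInfra API; `call_endpoint` is a placeholder for however you invoke your endpoint) that times the first request against subsequent warm requests:

```python
import time

def first_vs_warm(call_endpoint, warm_runs=5):
    """Time the first (possibly cold-start) request, then average warm requests.

    call_endpoint: zero-argument callable that performs one request.
    Returns (cold_ms, avg_warm_ms).
    """
    start = time.perf_counter()
    call_endpoint()                      # first request may hit a cold start
    cold_ms = (time.perf_counter() - start) * 1000

    warm = []
    for _ in range(warm_runs):           # subsequent requests should be fast
        start = time.perf_counter()
        call_endpoint()
        warm.append((time.perf_counter() - start) * 1000)

    return cold_ms, sum(warm) / len(warm)
```

If the first request is 1-2 seconds and warm requests are fast, you're seeing normal scale-to-zero behavior rather than a slow model.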

General debugging tips

  • Be specific: "It's broken" doesn't help. "Latency is 500ms but I need under 100ms" does.
  • Ask the agent to explain: "Why did you pick this GPU?" or "Why is this quantization method better?"
  • Compare versions: "Compare version 1 and version 3" to see what changed.
  • Start over if needed: "Reset the pipeline and let's start from scratch."
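Being specific about latency is easier when you've measured it. A small sketch for gathering the numbers to quote back to the agent (again framework-agnostic; `call_endpoint` stands in for your client call), reporting p50 and p95 over repeated requests:

```python
import statistics
import time

def measure_latency(call_endpoint, warmup=1, runs=20):
    """Time repeated calls and report p50/p95 latency in milliseconds.

    call_endpoint: zero-argument callable that performs one request.
    warmup requests are discarded so cold starts don't skew the numbers.
    """
    for _ in range(warmup):
        call_endpoint()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call_endpoint()
        samples.append((time.perf_counter() - start) * 1000)

    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }
```

With these numbers in hand, "p95 is 500ms but I need under 100ms" gives the agent a concrete target to optimize against.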
