Debugging

The agent picked the wrong model

Be direct. Name the exact model you want and tell the agent to switch.

I said Llama, not Mistral. Switch to Llama 3.1 8B.

The agent will swap the model and preserve the rest of your pipeline configuration. To be unambiguous, use the @ mention syntax:

Switch to @Llama-3.1-8B

Optimization results are bad

If the results don’t meet your expectations, re-run optimization with more specific instructions. Each run creates a new version you can compare against previous ones.

The latency is still too high. Try a faster GPU.

Can you try TensorRT-LLM instead of vLLM?

Optimize again but prioritize latency over cost.

Replace subjective feedback (“it’s slow”) with measurable targets (“latency is 400ms but I need under 100ms P99”). This gives the agent concrete constraints to optimize against.
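To state a measurable target, you first need a number. The sketch below shows one way to compute a P99 from latency samples you have collected yourself and turn it into a prompt. The sample values and the `percentile` helper are illustrative assumptions, not part of the product; the nearest-rank method is only one of several common percentile definitions.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical per-request latencies in milliseconds (not real measurements).
latencies_ms = [82, 95, 90, 110, 400, 88, 93, 97, 101, 85]

p99 = percentile(latencies_ms, 99)
target_ms = 100  # the target you actually need

# Build a prompt that states the measured value and the target.
prompt = f"P99 latency is {p99}ms but I need under {target_ms}ms P99."
print(prompt)
```

With only ten samples, the P99 is simply the slowest request, so collect more samples before treating the number as representative.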

The agent keeps asking clarifying questions

If the agent asks for clarification instead of building, tell it to proceed with its best judgment.

Just go with your best recommendation and we'll iterate from there.

This tells the agent to make decisions and move forward. You can always refine the result afterward, which is faster than answering several questions upfront.

The pipeline is too complex

If the agent added nodes you don’t need, tell it exactly what to remove.

Remove the guardrail and the load balancer.
I just need the model and a cache.

The agent will simplify the pipeline to match your description. Being explicit about what you want to keep (“I just need the model and a cache”) helps the agent understand the target state, not just what to remove.

Optimization is taking too long

Optimization typically completes in 2-5 minutes. If it appears stuck, ask for a status update.

What's the status of the optimization?

The agent will show you current progress. If the run has genuinely stalled, ask it to restart with a smaller search space, for example by limiting it to a single GPU tier or fewer quantization options.
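To see why narrowing the options helps, it can be useful to picture the search space as the Cartesian product of the option lists. This is an assumption made for illustration: how the agent actually searches is not described here, and the tier names and option lists below are hypothetical.

```python
from itertools import product

def count_configs(options):
    """Count the combinations when one value is picked from each option list."""
    return sum(1 for _ in product(*options.values()))

# Hypothetical option lists; the real tiers and options may differ.
full = {
    "gpu_tier": ["small", "medium", "large"],
    "quantization": ["fp16", "int8", "int4"],
    "engine": ["vLLM", "TensorRT-LLM"],
}

# Same space, limited to a single GPU tier and fewer quantization options.
reduced = {**full, "gpu_tier": ["medium"], "quantization": ["fp16", "int8"]}

print(count_configs(full), count_configs(reduced))  # 18 4
```

Under this simplified model, limiting two of the three dimensions cuts the number of candidate configurations from 18 to 4.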

Deployment failed

When deployment fails, the agent surfaces error diagnostics automatically. Common fixes:

Try deploying on a different GPU tier.

The model might be too large for this GPU. What do you recommend?

If the error message mentions an out-of-memory condition, the model likely exceeds the VRAM available on your selected GPU. Asking the agent for a recommendation gives it a chance to suggest a larger GPU tier or a quantized model variant that fits.
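A rough back-of-the-envelope check can tell you whether an out-of-memory error is plausible before you ask. The sketch below estimates memory for model weights only, from parameter count and bytes per parameter. The 20% overhead margin is an assumed placeholder: real usage also depends on KV cache size, batch size, and context length, so treat the result as a lower-bound sanity check rather than what the platform computes.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions, precision):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

def fits(params_billions, precision, vram_gb, overhead_fraction=0.2):
    """True if weights plus an assumed overhead margin fit in vram_gb.

    overhead_fraction is a placeholder for KV cache and activations;
    it is an assumption, not a measured value.
    """
    needed = weight_memory_gb(params_billions, precision) * (1 + overhead_fraction)
    return needed <= vram_gb

# An 8B-parameter model at FP16 needs about 16 GB for weights alone.
print(weight_memory_gb(8, "fp16"))
print(fits(8, "fp16", vram_gb=16))  # weights plus margin exceed 16 GB
print(fits(8, "int4", vram_gb=16))  # a 4-bit variant leaves headroom
```

If the estimate is close to or above the GPU's VRAM, a larger GPU tier or a quantized variant is the likely fix.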

The endpoint is slow on the first request

The first request after a cold start takes 1-2 seconds on RunInfra Cloud. This is expected behavior for scale-to-zero endpoints: the GPU spins up on demand, and subsequent requests are fast. If cold starts are unacceptable for your use case, switch to always-on deployment.

Switch to always-on deployment so there's no cold start.

Always-on deployment requires a paid Core plan.
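Before switching plans, you may want to confirm that the slowness really is limited to the first request. The sketch below times a series of calls and applies a simple heuristic. `send_request` stands for whatever function calls your endpoint; here it is replaced by a hypothetical stand-in that sleeps, and the 3x ratio is an arbitrary threshold chosen for illustration.

```python
import time

def time_requests(send_request, count=5):
    """Call send_request `count` times; return each call's duration in ms."""
    durations_ms = []
    for _ in range(count):
        start = time.perf_counter()
        send_request()
        durations_ms.append((time.perf_counter() - start) * 1000)
    return durations_ms

def looks_like_cold_start(durations_ms, ratio=3.0):
    """Heuristic: the first call is at least `ratio` times slower than the
    (upper) median of the remaining calls."""
    if len(durations_ms) < 2:
        raise ValueError("need at least two timings")
    first, rest = durations_ms[0], sorted(durations_ms[1:])
    median_rest = rest[len(rest) // 2]
    return first >= ratio * median_rest

# Stand-in for a real HTTP call to your endpoint: slow once, then fast.
calls = {"n": 0}

def fake_request():
    calls["n"] += 1
    time.sleep(0.1 if calls["n"] == 1 else 0.005)

durations = time_requests(fake_request)
print([round(d) for d in durations], looks_like_cold_start(durations))
```

If only the first timing stands out, the delay is the cold start described above; if every request is slow, the cause is more likely the model or GPU choice.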

Prompting best practices

Write prompts that avoid the problems on this page entirely.

Troubleshooting

Fix pipeline, deployment, and API integration issues by category.

Optimization

Understand how constraints and priority affect ranked results.

Deployment

Flex scale-to-zero and Active always-on endpoints.

