Troubleshooting - RunInfra

Most RunInfra problems fall into one of four categories: chat and pipeline building, optimization, deployment, or API integration. This page covers the most common issues in each area and tells you exactly what to do. If your situation isn’t listed, check the support channels at the bottom.

Chat and pipeline building

The agent keeps asking questions instead of building

The agent asks clarifying questions when your initial prompt is ambiguous. If the back-and-forth is slowing you down, tell it to proceed:

Just go with your best recommendation and we'll iterate from there.

The agent will make opinionated choices and build the pipeline. You can always refine model selection, caching, and constraints in follow-up messages.

The agent picked the wrong model

Name the model you want explicitly:

Switch to Llama 3.1 8B.

RunInfra supports a wide range of open-source LLMs. If you’re unsure which model fits your use case, ask: “What model do you recommend for low-latency chat under $200/month?”

I want to start over

Reset the current pipeline entirely:

Reset the pipeline and start from scratch.

This clears the current configuration and conversation context. Your previous optimization results and deployment history are preserved in the dashboard.

Optimization

Optimization is taking too long

Optimization typically completes in 2-5 minutes. The agent is profiling GPUs and running real inference benchmarks across model variants, so this is expected.

If it appears stuck after 10 minutes, check in:

What's the status of the optimization?

The agent will report progress or surface any errors it encountered.

Results don't meet my constraints

Optimization results reflect the constraints you gave the agent. If the results don’t satisfy your latency, cost, or throughput targets, adjust the approach:

Optimize again with a smaller model.

Try a faster GPU.

Relax the latency constraint to 300ms.

Each of these triggers a new optimization run with updated parameters. You can run multiple optimization sessions and compare results side by side in the dashboard.

I refreshed the page during an optimization run

Nothing is lost. The run executes server-side, so a refresh, a dropped connection, or a closed tab does not stop it. When you reopen the session, the dashboard re-attaches to the running execution within about a second: phases, live cost, and the Stop control resume updating.

If a run was interrupted by a timeout, crash, or redeploy, it converges to a blocked state with retry actions instead of appearing to run forever. Use the resume or restart action to continue.

Stopping a run cancels the underlying GPU work and its billing. A canceled run never promotes an optimization version; any candidate measured before the cancel keeps its measured results.

Quality evidence is weak or the optimized variant regressed

Aggressive low-bit quantization can reduce output quality on complex tasks. If the measured gate fails, stays pending, or your own test set looks worse, request a higher-precision variant:

Try FP8 where compatible, or fall back to FP16 for this model.

FP8 can preserve more model fidelity than 4-bit quantization on compatible GPU/runtime pairs. Expect higher memory use or cost compared with low-bit variants.

Deployment

Deployment failed

The agent shows error diagnostics inline when deployment fails. The most common causes are GPU availability and model size mismatches. Try:

Try a different GPU tier.

The model might be too large for this GPU. Recommend something bigger.

If a specific GPU tier is temporarily unavailable in a region, switching tiers usually resolves the issue immediately.

The first request is slow (30-60 seconds)

This is normal behavior for the very first request after deployment. The model needs to load from storage and compile before it can serve inference. Subsequent requests are fast.

RunInfra Cloud uses weight caching to keep cold starts under 2 seconds for all requests after the first. You don’t need to configure anything to enable this.

If you need zero cold start on every request, deploy in Active mode (available on a paid Core plan), which keeps the model resident on GPU at all times.

Endpoint returns 503

A 503 response means the endpoint is stopped or still provisioning. Two things to check:

Open Deployments and verify the endpoint status.
Ask the agent: “What’s the status of my deployment?”

If the endpoint is in Stopped state, start it from the Deployments dashboard or ask the agent to start it. If it’s still provisioning, wait 1-3 minutes and retry.
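If a client calls the endpoint while it is still provisioning, a bounded polling loop avoids both hammering the endpoint and waiting forever. The following is a minimal sketch, not RunInfra client code: `wait_while_unavailable` is a hypothetical helper, and `send` stands in for whatever function in your application issues the request and returns the HTTP status code. The 180-second budget mirrors the upper end of the 1-3 minute provisioning window above; the 15-second interval is an arbitrary assumption.

```python
import time
from typing import Callable, Optional


def wait_while_unavailable(
    send: Callable[[], int],
    max_wait_s: float = 180.0,   # upper end of the 1-3 minute window
    interval_s: float = 15.0,    # assumed polling interval, tune as needed
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Call send() until it stops returning 503 or the wait budget runs out.

    Returns the first non-503 status code, or None if the endpoint was
    still returning 503 when the budget was exhausted.
    """
    deadline = clock() + max_wait_s
    while True:
        status = send()
        if status != 503:
            return status
        # Give up if another full interval would overshoot the budget.
        if clock() + interval_s > deadline:
            return None
        sleep(interval_s)
```

A `None` result suggests the endpoint is stopped rather than provisioning, so the next step is to check its status in Deployments.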

API integration

401 Unauthorized

Your API key is invalid, revoked, or not being sent correctly. Check all three:

The Authorization header is Bearer YOUR_KEY (no extra quotes or whitespace)
You are using a key from the same workspace as the target deployment
The key hasn’t been revoked in Settings > API Keys

Generate a new key if needed. Key creation is instant and doesn’t require redeployment.
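Stray quotes and whitespace often come from copying the key out of a shell file or an environment variable with a trailing newline. As a minimal sketch of guarding against that on the client side, the hypothetical helper below normalizes the raw value before building the header. The function name is an assumption for illustration; only the `Authorization: Bearer YOUR_KEY` format comes from this page.

```python
def build_auth_header(raw_key: str) -> dict:
    """Build an Authorization header from a possibly messy key string.

    Strips surrounding whitespace and one layer of surrounding quotes,
    then rejects keys that are empty or contain internal whitespace.
    """
    key = raw_key.strip().strip("'\"").strip()
    if not key or any(ch.isspace() for ch in key):
        raise ValueError("API key is empty or contains whitespace")
    return {"Authorization": f"Bearer {key}"}
```

Pass the returned dictionary as the request headers in your HTTP client. A `ValueError` here points to a problem in how the key is stored or loaded, before any request is sent.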

403 Forbidden

Common causes: a pipeline-scoped key that doesn’t match the pipeline ID in the URL, or an exceeded plan-level limit. For a key mismatch, switch to a workspace-scoped key or regenerate a key tied to the correct pipeline. See Authentication for the two scopes.

429 Too Many Requests

You’ve exceeded the rate limit for your API key. The response includes a Retry-After header. Wait that number of seconds before retrying.

To increase the rate limit permanently, go to Settings > API Keys and update the limit for the key. Higher limits may require the Core or Enterprise plan.
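The wait-then-retry behavior can be sketched as a small wrapper around your own request function. This is an illustration under stated assumptions, not an official client: `send` is a hypothetical callable that returns a `(status, headers)` pair, the header is assumed to carry a number of seconds as described above, and the 1-second fallback and 5-attempt cap are arbitrary choices.

```python
import time
from typing import Callable, Mapping, Tuple


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read Retry-After as a number of seconds, falling back to `default`.

    Assumes a plain dict lookup; most HTTP libraries expose headers
    through a case-insensitive mapping, which also works here.
    """
    try:
        return max(0.0, float(headers.get("Retry-After")))
    except (TypeError, ValueError):
        return default


def send_with_rate_limit_retry(
    send: Callable[[], Tuple[int, Mapping[str, str]]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send(), sleeping for Retry-After seconds after each 429.

    Returns the final status code, which is still 429 if every
    attempt was rate limited.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429 or attempt == max_attempts:
            return status
        sleep(retry_after_seconds(headers))
```

Capping the attempts keeps a persistently rate-limited client from looping forever. If the cap is reached regularly, raising the key’s limit is the better fix.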

403 with an upgrade prompt

You’ve hit a plan-level limit, for example because credits are running low or the workspace has no paid plan yet. To continue, add credits at Settings > Cost or move to the Core plan at Settings > Billing.
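If your application logs failed calls, attaching the matching first step from this page to each log entry can shorten debugging. The lookup below is a minimal sketch: the names `HINTS` and `hint_for` are hypothetical, and the hint text only paraphrases the guidance above.

```python
# Status codes covered on this page, mapped to the first thing to check.
HINTS = {
    401: "Check the Authorization header format, the key's workspace, and whether the key was revoked.",
    403: "Check the key scope against the pipeline ID in the URL, then plan limits and credits.",
    429: "Wait the number of seconds given in the Retry-After header, then retry.",
    503: "The endpoint is stopped or still provisioning; check its status in Deployments.",
}


def hint_for(status: int) -> str:
    """Return a troubleshooting hint for an HTTP status code."""
    return HINTS.get(status, "No specific guidance for this status; see the support channels.")
```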

Still stuck?

The right support channel depends on your plan:

Plan	Support channel
Core	Priority email support
Enterprise	Dedicated customer success manager

You can also send feedback directly from within the app, or email support from your billing page.

Related

Debug the agent

Redirect the agent when pipelines need course correction.

Monitor endpoints

Catch problems in the deployment metrics before they reach your users.

Deployment

Flex, Active, scaling, and cold-start configuration.

FAQ

Answers to common questions about the platform.