Most RunInfra problems fall into one of four categories: chat and pipeline building, optimization, deployment, or API integration. This page covers the most common issues in each area and tells you exactly what to do. If your situation isn’t listed, check the support channels at the bottom.
Chat and pipeline building
The agent keeps asking questions instead of building
The agent asks clarifying questions when your initial prompt is ambiguous. If the back-and-forth is slowing you down, tell it to proceed. The agent will then make opinionated choices and build the pipeline. You can always refine model selection, caching, and constraints in follow-up messages.
The agent picked the wrong model
Name the model you want explicitly in your message.
RunInfra supports a wide range of open-source LLMs. If you’re unsure which model fits your use case, ask: “What model do you recommend for low-latency chat under $200/month?”
I want to start over
Ask the agent to reset the current pipeline entirely. This clears the current configuration and conversation context. Your previous optimization results and deployment history are preserved in the dashboard.
Optimization
Optimization is taking too long
Optimization typically completes in 2-5 minutes. The agent is profiling GPUs and running real inference benchmarks across model variants, so a wait of this length is expected.
If it appears stuck after 10 minutes, ask the agent for a status update. The agent will report progress or surface any errors it encountered.
Results don't meet my constraints
Optimization results reflect the constraints you gave the agent. If the results don’t satisfy your latency, cost, or throughput targets, adjust your constraints or approach and ask the agent to optimize again. Each such request triggers a new optimization run with updated parameters. You can run multiple optimization sessions and compare results side by side in the dashboard.
Quality score is too low with the optimized variant
Aggressive quantization (AWQ 4-bit, GPTQ 4-bit) can reduce output quality on complex tasks. If the quality score is unacceptable, request a higher-precision variant such as FP8.
FP8 preserves more model fidelity than 4-bit quantization, especially on H100 and H200 GPUs. Expect slightly higher cost per token compared to AWQ variants.
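A rough back-of-the-envelope calculation helps explain the precision trade-off. The sketch below estimates memory for model weights only (parameters × bits per weight ÷ 8) and ignores quantization metadata, KV cache, and activations. The 7-billion parameter count is an illustrative assumption, not a RunInfra figure.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7B-parameter model, used only for illustration.
PARAMS = 7e9

for label, bits in [("FP16", 16), ("FP8", 8), ("4-bit (AWQ/GPTQ)", 4)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions an FP8 variant needs roughly twice the weight memory of a 4-bit variant, which is one plausible reason for a higher cost per token.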
Deployment
Deployment failed
The agent shows error diagnostics inline when deployment fails. The most common causes are GPU availability and model size mismatches. If a specific GPU tier is temporarily unavailable in a region, switching tiers usually resolves the issue immediately.
The first request is slow (30-60 seconds)
This is normal behavior for the very first request after deployment. The model needs to load from storage and compile before it can serve inference. Subsequent requests are fast.
If you need zero cold start on every request, upgrade to the Team plan and deploy in Active mode, which keeps the model resident on GPU at all times.
RunInfra Cloud uses weight caching to keep cold starts under 2 seconds for all requests after the first. You don’t need to configure anything to enable this.
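One way to keep the slow first request away from real users is to send a throwaway warm-up call right after deployment, with a client timeout longer than the 30-60 second window. The sketch below is a minimal illustration, not a RunInfra client: `send_request` is a hypothetical caller-supplied function that performs one inference call with the given timeout and returns the HTTP status code, and the 90-second default is an assumed safety margin.

```python
import time
from typing import Callable


def warm_up(send_request: Callable[[float], int], timeout_s: float = 90.0) -> float:
    """Send one throwaway request and return how long it took, in seconds.

    `send_request` is a hypothetical caller-supplied function: it makes a
    single inference call with the given timeout and returns the status code.
    """
    start = time.monotonic()
    status = send_request(timeout_s)
    elapsed = time.monotonic() - start
    if status != 200:
        raise RuntimeError(f"warm-up request returned HTTP {status}")
    return elapsed
```

Logging the returned duration makes it easy to confirm that later requests are faster than the warm-up call.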
Endpoint returns 503
A 503 response means the endpoint is stopped or still provisioning. Two things to check:
- Open Deployments and verify the endpoint status.
- Ask the agent: “What’s the status of my deployment?”
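If the endpoint is still provisioning, client code can poll for a bounded time before giving up. The sketch below assumes a hypothetical `get_status` function that sends one request to your endpoint and returns the HTTP status code. The wait budget and polling interval are illustrative defaults, not documented RunInfra values.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], int],
    max_wait_s: float = 120.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the endpoint stops returning 503 or the wait budget runs out.

    Returns True once a non-503 status is seen, False if the budget is spent.
    """
    waited = 0.0
    while True:
        if get_status() != 503:
            return True
        if waited >= max_wait_s:
            return False
        sleep(poll_interval_s)
        waited += poll_interval_s
```

A `False` result suggests the endpoint is stopped rather than provisioning, which is the point to check its status in Deployments.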
API integration
401 Unauthorized
403 Forbidden
Common causes: a pipeline-scoped key that doesn’t match the pipeline ID in the URL, or an exceeded plan-level limit. Switch to a workspace-scoped key or regenerate a key tied to the correct pipeline. See Authentication for the two scopes.
429 Too Many Requests
You’ve exceeded the rate limit for your API key. The response includes a Retry-After header; wait that number of seconds before retrying.
To increase the rate limit permanently, go to Settings > API Keys and update the limit for the key. Higher limits may require a plan upgrade.
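Honoring the Retry-After header can be sketched as a small retry wrapper. This is a minimal illustration under stated assumptions: `send` is a hypothetical function that performs one API call and returns a `(status, headers)` pair, the header value is a number of seconds, and the one-second fallback and three-attempt cap are arbitrary choices.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, response headers)


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read Retry-After as a number of seconds, falling back to `default`.

    Assumes the key is spelled exactly "Retry-After"; most HTTP clients
    expose a case-insensitive header mapping instead of a plain dict.
    """
    value = headers.get("Retry-After", "")
    try:
        return max(0.0, float(value))
    except ValueError:
        return default


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send`, waiting out 429 responses for up to `max_attempts` tries."""
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429 or attempt == max_attempts:
            return status, headers
        sleep(retry_after_seconds(headers))
    raise ValueError("max_attempts must be at least 1")
```

Capping the number of attempts keeps a persistently rate-limited client from looping forever; the final 429 is returned to the caller so it can be logged or surfaced.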
403 with an upgrade prompt
You’ve hit a plan-level limit, for example, exceeding the number of allowed pipelines, optimization sessions, or playground requests for your current plan. Upgrade your plan at Settings > Billing to continue.
Still stuck?
The right support channel depends on your plan:

| Plan | Support channel |
|---|---|
| Starter (free) | Community support |
| Pro | Priority email support |
| Team | Shared Slack channel |
| Enterprise | Dedicated customer success manager |
Related
- Debug the agent: Redirect the agent when pipelines need course correction.
- Monitor endpoints: Catch problems in Observe before they reach your users.
- Deployment: Flex, Active, scaling, and cold-start configuration.
- FAQ: Answers to common questions about the platform.