Most RunInfra problems fall into one of four categories: chat and pipeline building, optimization, deployment, or API integration. This page covers the most common issues in each area and tells you exactly what to do. If your situation isn’t listed, check the support channels at the bottom.

Chat and pipeline building

The agent asks clarifying questions when your initial prompt is ambiguous. If the back-and-forth is slowing you down, tell it to proceed:
Just go with your best recommendation and we'll iterate from there.
The agent will make opinionated choices and build the pipeline. You can always refine model selection, caching, and constraints in follow-up messages.
To change the model, name the one you want explicitly:
Switch to Llama 3.1 8B.
RunInfra supports a wide range of open-source LLMs. If you’re unsure which model fits your use case, ask: “What model do you recommend for low-latency chat under $200/month?”
To start over, reset the current pipeline entirely:
Reset the pipeline and start from scratch.
This clears the current configuration and conversation context. Your previous optimization results and deployment history are preserved in the dashboard.

Optimization

Optimization typically completes in 2-5 minutes. The agent is profiling GPUs and running real inference benchmarks across model variants, so a run of this length is expected. If it appears stuck after 10 minutes, check in:
What's the status of the optimization?
The agent will report progress or surface any errors it encountered.
Optimization results reflect the constraints you gave the agent. If the results don’t satisfy your latency, cost, or throughput targets, adjust the approach:
Optimize again with a smaller model.
Try a faster GPU.
Relax the latency constraint to 300ms.
Each of these triggers a new optimization run with updated parameters. You can run multiple optimization sessions and compare results side by side in the dashboard.
Aggressive quantization (AWQ 4-bit, GPTQ 4-bit) can reduce output quality on complex tasks. If the quality score is unacceptable, request a higher-precision variant:
Search for an FP8 version instead of AWQ 4-bit.
FP8 preserves more model fidelity than 4-bit quantization, especially on H100 and H200 GPUs. Expect slightly higher cost per token compared to AWQ variants.

Deployment

The agent shows error diagnostics inline when deployment fails. The most common causes are GPU availability and model size mismatches. Try:
Try a different GPU tier.
The model might be too large for this GPU. Recommend something bigger.
If a specific GPU tier is temporarily unavailable in a region, switching tiers usually resolves the issue immediately.
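A quick back-of-envelope check can show whether a model size mismatch is plausible before you ask for a different tier. The sketch below counts weights only (parameters times bits per weight, divided by 8) and ignores KV cache, activations, quantization metadata, and runtime overhead. The function names and the 0.8 headroom factor are illustrative assumptions, not RunInfra settings, and the GPU memory figure is whatever you supply.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (10^9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


def fits_on_gpu(
    params_billions: float,
    bits_per_weight: int,
    gpu_memory_gb: float,
    headroom: float = 0.8,  # assumed fraction of GPU memory usable for weights
) -> bool:
    """Rough check: do the weights fit in the usable share of GPU memory?"""
    return weight_memory_gb(params_billions, bits_per_weight) <= gpu_memory_gb * headroom


if __name__ == "__main__":
    # An 8B-parameter model at 16, 8, and 4 bits per weight
    for bits in (16, 8, 4):
        print(bits, "bits:", weight_memory_gb(8, bits), "GB")
```

If the estimate is already close to the GPU's memory, a larger tier or a lower-precision variant is the more likely fix than a retry.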
A slow first request after deployment is normal. The model needs to load from storage and compile before it can serve inference. Subsequent requests are fast.
RunInfra Cloud uses weight caching to keep cold starts under 2 seconds for all requests after the first. You don’t need to configure anything to enable this.
If you need zero cold start on every request, upgrade to the Team plan and deploy in Active mode, which keeps the model resident on GPU at all times.
A 503 response means the endpoint is stopped or still provisioning. Two things to check:
  1. Open Deployments and verify the endpoint status.
  2. Ask the agent: “What’s the status of my deployment?”
If the endpoint is in the Stopped state, start it from the Deployments dashboard or ask the agent to start it. If it’s still provisioning, wait 1-3 minutes and retry.
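On the client side, the wait-and-retry advice above can be sketched as follows. This is a minimal sketch, assuming a caller-supplied `send` function that performs one request and returns the HTTP status code. The function name, the 60-second default delay, and the attempt cap are illustrative assumptions, not part of the RunInfra API.

```python
import time
from typing import Callable


def call_when_ready(
    send: Callable[[], int],
    delay_s: float = 60.0,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry `send` while it returns 503, waiting `delay_s` between tries.

    `send` is a hypothetical callable that makes one request and returns
    the HTTP status code. The last status code seen is returned.
    """
    status = send()
    for _ in range(max_attempts - 1):
        if status != 503:
            break
        sleep(delay_s)  # the endpoint may still be provisioning
        status = send()
    return status
```

If the status is still 503 after the last attempt, the endpoint is more likely stopped than provisioning, so check its status in Deployments instead of retrying further.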

API integration

An authentication failure means your API key is invalid, revoked, or not being sent correctly. Check all three:
  • The Authorization header is Bearer YOUR_KEY (no extra quotes or whitespace)
  • You are using a key from the same workspace as the target deployment
  • The key hasn’t been revoked in Settings > API Keys
Generate a new key if needed. Key creation is instant and doesn’t require redeployment.
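The header format in the checklist above is easy to get wrong when a key is copied from a shell or a `.env` file with quotes or a trailing newline attached. The helper below is a minimal sketch of building the `Authorization: Bearer YOUR_KEY` header defensively. The function name is hypothetical, and nothing here validates the key against RunInfra.

```python
def build_auth_header(api_key: str) -> dict[str, str]:
    """Return an Authorization header of the form 'Bearer YOUR_KEY'."""
    # Strip surrounding whitespace and stray quotes picked up when copying the key
    cleaned = api_key.strip().strip("'\"").strip()
    if not cleaned or any(ch.isspace() for ch in cleaned):
        raise ValueError("API key is empty or contains whitespace")
    return {"Authorization": f"Bearer {cleaned}"}


if __name__ == "__main__":
    # A key pasted with quotes and a trailing newline
    print(build_auth_header(' "YOUR_KEY"\n'))
```

Pass the resulting dictionary as the request headers in whichever HTTP client you use.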
Common causes: a pipeline-scoped key that doesn’t match the pipeline ID in the URL, or an exceeded plan-level limit. Switch to a workspace-scoped key or regenerate a key tied to the correct pipeline. See Authentication for the two scopes.
You’ve exceeded the rate limit for your API key. The response includes a Retry-After header; wait that number of seconds before retrying.
To increase the rate limit permanently, go to Settings > API Keys and update the limit for the key. Higher limits may require a plan upgrade.
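Honoring the Retry-After header can be sketched as below. The sketch assumes the header carries a plain number of seconds, as described above, and falls back to a default delay when the header is missing or not numeric. The function name and the 1-second default are illustrative assumptions.

```python
import time
from typing import Callable, Mapping


def wait_for_retry_after(
    headers: Mapping[str, str],
    default_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep for the number of seconds in Retry-After and return that delay."""
    # HTTP header names are case-insensitive, so normalize before the lookup
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("retry-after", "")
    try:
        delay = max(0.0, float(raw))
    except ValueError:
        delay = default_s  # header missing or not a plain number
    sleep(delay)
    return delay
```

Call it with the headers of the rate-limited response, then resend the request once it returns.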
You’ve hit a plan-level limit, such as the maximum number of pipelines, optimization sessions, or playground requests allowed on your current plan. Upgrade your plan at Settings > Billing to continue.

Still stuck?

The right support channel depends on your plan:
Plan            Support channel
Starter (free)  Community support
Pro             Priority email support
Team            Shared Slack channel
Enterprise      Dedicated customer success manager
You can also send feedback directly from within the app, or email support from your billing page.
