Deployment
Deploy, test, and use your optimized pipeline as a production API.
Deployment requires the Pro plan ($99/mo) or higher. The free plan lets you build, optimize, and test in the playground.
Deploying
Click Deploy in the deploy tab, or tell the agent:
Deploy this pipeline
RunInfra provisions a GPU, loads your optimized model, and gives you an endpoint URL and API key. Provisioning takes 1-3 minutes.
Testing in the playground
Before deploying (or after), use the built-in playground in the Deploy tab to test your pipeline:
- Type a prompt
- Adjust temperature and max tokens
- Click Send (or Cmd+Enter)
The response streams in real time with latency and token metrics. Use this to check output quality, latency, and edge cases before going to production.
| Plan | Playground requests |
|---|---|
| Starter | 100 per day |
| Pro+ | Unlimited |
The first request may take 30-60 seconds as the model loads. Subsequent requests are fast.
Using your endpoint
Your deployed endpoint is OpenAI-compatible. The deploy tab generates ready-to-use code snippets for Python, JavaScript, and cURL with your real endpoint URL.
Point any OpenAI SDK at your RunInfra URL and it works:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1/YOUR_PIPELINE_ID",
    api_key="ri_your_api_key",
)
```
Works with LangChain, LlamaIndex, and any OpenAI-compatible client. Streaming, async, and error handling all work the same way.
If you're switching from OpenAI, change base_url and api_key. Everything else stays the same.
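The generated snippets cover the SDK path; if you'd rather avoid the dependency, the same OpenAI-compatible endpoint can be called over plain HTTP. A minimal sketch using only the Python standard library (the `model` field, pipeline ID, and API key are placeholders, not confirmed values; copy the real ones from the Deploy tab):

```python
import json
import urllib.request

def chat_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but don't send) an OpenAI-style /chat/completions request."""
    payload = {
        "model": "default",  # placeholder; RunInfra routes to your optimized model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request(
    "https://api.runinfra.ai/v1/YOUR_PIPELINE_ID", "ri_your_api_key", "Hello"
)
```

Sending the request with `urllib.request.urlopen(req)` returns the same JSON shape as the OpenAI API, which is why the SDK swap works.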
Two deployment modes
Flex (scale-to-zero)
Available on Pro and above. The endpoint scales down when idle and spins up on demand.
- Cost: Pay per token only when processing. Nothing when idle.
- Cold start: Under 2 seconds on RunInfra Cloud.
- Idle timeout: 5 minutes of no requests before scaling down.
- Best for: Development, variable traffic, cost-sensitive workloads.
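Because a Flex endpoint may be scaled to zero when traffic resumes, client code can guard the first call with a simple timeout-and-retry wrapper. A generic sketch (the attempt count and backoff values are illustrative, not RunInfra defaults):

```python
import time

def with_retries(call, attempts=3, base_delay=1.0):
    """Retry a callable with exponential backoff, for cold-start timeouts."""
    for attempt in range(attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Demo: a fake call that times out once (simulating a cold start), then succeeds
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] == 1:
        raise TimeoutError("model still loading")
    return "ok"

result = with_retries(flaky)  # succeeds on the second attempt
```

With sub-2-second cold starts on RunInfra Cloud, one retry with a short delay is usually enough.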
Active (always-on)
Always-on endpoints require the Team plan.
The endpoint stays warm 24/7. Zero cold start.
- Cost: Same per-token rate, but a flat base fee keeps your endpoint warm 24/7.
- Cold start: None.
- Best for: Production APIs with SLA requirements, latency-critical applications.
Fast cold starts
RunInfra Cloud caches your model weights near the GPU. Cold starts go from 20-60 seconds down to under 2 seconds. This is automatic on all Pro+ deployments.
The first deployment takes the full cold start time while the cache is built. After that, all subsequent starts use the cached weights.
Scaling
Each GPU replica handles up to 30 concurrent requests. Scale replicas for more capacity:
| Plan | Max replicas |
|---|---|
| Pro | 8 (up to 240 concurrent) |
| Team | 32 (up to 960 concurrent) |
| Enterprise | Custom |
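Since each replica handles up to 30 concurrent requests, the replica count you need is just your peak concurrency divided by 30, rounded up:

```python
import math

REQUESTS_PER_REPLICA = 30  # concurrent requests per GPU replica (from the table above)

def replicas_needed(peak_concurrent_requests: int) -> int:
    """Smallest replica count that covers the given peak concurrency."""
    return math.ceil(peak_concurrent_requests / REQUESTS_PER_REPLICA)

four = replicas_needed(100)   # 100 concurrent requests -> 4 replicas
eight = replicas_needed(240)  # 240 concurrent requests -> 8, the Pro ceiling
```

Compare the result against your plan's max replicas to see whether you fit on Pro or need Team.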
The agent recommends replica count based on your traffic target:
Deploy with 4 replicas for high traffic
Managing endpoints
After deployment, you can:
- Stop: Pause the endpoint. No charges while stopped.
- Start: Restart a stopped endpoint. Cached weights make restarts fast.
- Change GPU: Switch to a different GPU tier (the agent will warn you if re-optimization is recommended).
Do this from the Deployments page or ask the agent.