Deployment

Deploy, test, and use your optimized pipeline as a production API.

Deployment requires the Pro plan ($99/mo) or higher. The free plan lets you build, optimize, and test in the playground.

Deploying

Click Deploy in the Deploy tab, or tell the agent:

Deploy this pipeline

RunInfra provisions a GPU, loads your optimized model, and gives you an endpoint URL and API key. Provisioning takes 1-3 minutes.

Testing in the playground

Before deploying (or after), use the built-in playground in the Deploy tab to test your pipeline:

  1. Type a prompt
  2. Adjust temperature and max tokens
  3. Click Send (or Cmd+Enter)

The response streams in real time with latency and token metrics. Use this to check output quality, latency, and edge cases before going to production.

Plan      Playground requests
Starter   100 per day
Pro+      Unlimited

The first request may take 30-60 seconds as the model loads. Subsequent requests are fast.

Using your endpoint

Your deployed endpoint is OpenAI-compatible. The Deploy tab generates ready-to-use code snippets for Python, JavaScript, and cURL with your real endpoint URL.

Point any OpenAI SDK at your RunInfra URL and it works:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1/YOUR_PIPELINE_ID",
    api_key="ri_your_api_key",
)
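
From there, requests look like any other OpenAI call. A minimal sketch (the model value here is a placeholder; the generated snippet in the Deploy tab shows the exact value for your deployment):

response = client.chat.completions.create(
    model="YOUR_PIPELINE_ID",  # placeholder: copy the real value from your generated snippet
    messages=[{"role": "user", "content": "Summarize this ticket in one line."}],
)
print(response.choices[0].message.content)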

Works with LangChain, LlamaIndex, and any OpenAI-compatible client. Streaming, async, and error handling all work the same way.
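
For instance, streaming uses the standard OpenAI SDK flag (same placeholder model name as above):

stream = client.chat.completions.create(
    model="YOUR_PIPELINE_ID",  # placeholder, as above
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # tokens arrive as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)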

If you're switching from OpenAI, change base_url and api_key. Everything else stays the same.
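
If your app builds the client from environment variables, the swap needs no code change at all; the OpenAI Python SDK reads both of these when OpenAI() is constructed with no arguments (values are placeholders):

import os

# Set these in your shell or process environment instead of editing code.
os.environ["OPENAI_BASE_URL"] = "https://api.runinfra.ai/v1/YOUR_PIPELINE_ID"
os.environ["OPENAI_API_KEY"] = "ri_your_api_key"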

Two deployment modes

Flex (scale-to-zero)

Available on Pro and above. The endpoint scales down when idle and spins up on demand.

  • Cost: Pay per token only when processing. Nothing when idle.
  • Cold start: Under 2 seconds on RunInfra Cloud.
  • Idle timeout: 5 minutes of no requests before scaling down.
  • Best for: Development, variable traffic, cost-sensitive workloads.

Active (always-on)

Available on Team and above. The endpoint stays warm 24/7.

  • Cost: Same per-token rate as Flex, plus a flat base fee that keeps the endpoint warm.
  • Cold start: None.
  • Best for: Production APIs with SLA requirements, latency-critical applications.

Fast cold starts

RunInfra Cloud caches your model weights near the GPU. Cold starts go from 20-60 seconds down to under 2 seconds. This is automatic on all Pro+ deployments.

The first deployment takes the full cold start time while the cache is built. After that, all subsequent starts use the cached weights.

Scaling

Each GPU replica handles up to 30 concurrent requests. Scale replicas for more capacity:

Plan        Max replicas
Pro         8 (up to 240 concurrent)
Team        32 (up to 960 concurrent)
Enterprise  Custom
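
To turn a concurrency target into a replica count, divide and round up (a sketch using the 30-requests-per-replica figure above; the target is illustrative):

import math

REQUESTS_PER_REPLICA = 30  # stated capacity of one GPU replica

def replicas_needed(target_concurrency: int) -> int:
    # Round up so the last partial replica's worth of traffic is still covered.
    return math.ceil(target_concurrency / REQUESTS_PER_REPLICA)

print(replicas_needed(100))  # -> 4 replicas for 100 concurrent requests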

The agent recommends replica count based on your traffic target:

Deploy with 4 replicas for high traffic

Managing endpoints

After deployment, you can:

  • Stop: Pause the endpoint. No charges while stopped.
  • Start: Restart a stopped endpoint. Cached weights make restarts fast.
  • Change GPU: Switch to a different GPU tier (the agent will warn you if re-optimization is recommended).

Do this from the Deployments page or ask the agent.
