Skip to main content

Documentation Index

Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

When you deploy a pipeline, RunInfra provisions it on one of three targets. The default is managed RunPod Serverless because it covers most workloads with the least operational burden. The other two exist for teams that need data residency, custom hardware, or their own cost contract.

Managed RunPod

Scale-to-zero with FlashBoot. RunInfra runs the infra, you pay per-million-tokens (Flex) or per-second + base fee (Active).

Self-hosted Modal

Run on your Modal account using your reserved GPUs. You own the infra and the bill; RunInfra deploys + manages the pipeline software.

Custom GPU

Bring a Kubernetes cluster, a bare-metal box, or a cloud GPU instance. RunInfra ships a deployable image; you handle the runtime.

Managed RunPod (default)

The fastest path. Two modes:
ModeCold startBillingBest for
Flex (Pro+)Under 2 s with Instant StartPer-million-tokens, scale to zeroPrototyping, bursty traffic, anything with idle gaps
Active (Team+)Zero, always warmPer-second reserved-GPU fee + lower per-token rateSteady traffic, latency-critical workloads
Available regions are listed on the deploy form. The GPU tier (L4, L40S, A100, H100, H200, B200) is picked during optimization based on your priority (latency, throughput, cost, quality). How to use it: This is the default in Pipes. Optimize the pipeline, click Deploy, pick Flex or Active. The endpoint URL is ready in 1-3 minutes.

Self-hosted Modal

Run the same RunInfra pipeline against your own Modal account. When to pick this:
  • You already have reserved GPU capacity on Modal
  • You want the billing and audit trail to live in your account
  • You need a specific region not offered on managed RunPod
  • Compliance requires your data to never leave a specific cloud account
What you handle:
  • Your Modal account, billing, and any reserved-capacity contracts
  • IAM and access scoping inside Modal
  • Region selection and quota
What RunInfra handles:
  • Pipeline software (serving stack, optimizations, version pinning)
  • Routing and rate limiting on the OpenAI-compatible gateway
  • Optimization and re-optimization runs
  • Monitoring and audit logs surfaced into the RunInfra dashboard
Set up at Settings > Deployment > Self-hosted with a Modal token. See Talk to sales if you need this enabled.

Custom GPU

Bring your own infrastructure. Bare metal, on-prem, a cloud GPU instance, or a Kubernetes cluster. When to pick this:
  • Strict data residency (air-gapped, regulated industry)
  • Existing GPU contract you don’t want to leave (Crusoe, Lambda, CoreWeave, an enterprise NVIDIA contract)
  • Workload size that justifies dedicated hardware
What you get:
  • A deployable container image of your optimized pipeline
  • The serve.sh boot script, the model artifacts, the runtime config
  • A Dockerfile and a docker-compose.yaml for local validation
  • OpenAI-compatible HTTP endpoint exposed by the container
What you handle:
  • Hardware, networking, storage, observability
  • Container orchestration (Docker, Kubernetes, systemd, whatever you run)
  • Rolling updates and rollback
The Code tab of your pipeline shows the full export. Click Export to download the bundle. See the deployment overview for the export contract.

How to choose

Need data residency or custom hardware?    -> Custom GPU
Have a Modal contract or reserved GPUs?    -> Self-hosted Modal
Anything else (default)                    -> Managed RunPod
If you’re not sure, start with managed RunPod. You can migrate later: the pipeline definition is portable across all three targets.

Pricing

  • Managed RunPod (Flex): per-million-tokens, see GPU pricing for current rates
  • Managed RunPod (Active): lower per-million-tokens plus a per-second reserved fee, see GPU pricing
  • Self-hosted Modal: your Modal bill plus a RunInfra platform fee, billed monthly
  • Custom GPU: your infrastructure cost plus a RunInfra platform fee, billed monthly
Contact sales for self-hosted and custom GPU pricing.