Deployment targets

When you deploy a pipeline, RunInfra provisions it on one of four targets. The default is managed RunInfra Cloud because it covers most workloads with the least operational burden. The others exist for teams that want the bill and the data to live in their own cloud account, need custom hardware, or have an existing cost contract.

Managed RunInfra Cloud

Scale-to-zero with Instant Start. RunInfra runs the infra, you pay per-million-tokens (Flex) or per-second + base fee (Active).

Your RunPod (bring your own cloud)

One-click deliver the selected runtime to your OWN RunPod account with a stored scoped key. Optimization evidence still comes from RunInfra’s Modal measurement loop; you pay RunPod directly for delivery GPU usage.

Self-hosted Modal

Run on your Modal account using your reserved GPUs. You own the infra and the bill; RunInfra deploys + manages the pipeline software.

Custom GPU

Bring a Kubernetes cluster, a bare-metal box, or a cloud GPU instance. RunInfra ships a deployable image; you handle the runtime.

Managed RunInfra Cloud (default)

The fastest path. Two modes: Managed delivery is not an optimization proof source. Optimization, profiling, and measurement still run through RunInfra’s Modal loop before a runtime is treated as optimized.

Mode	Cold start	Billing	Best for
Flex (Core)	Under 2 s with Instant Start	Per-million-tokens, scale to zero	Prototyping, bursty traffic, anything with idle gaps
Active (Core)	Zero, always warm	Per-second reserved-GPU fee + lower per-token rate	Steady traffic, latency-critical workloads

Available regions are listed on the deploy form. The GPU tier (L4, L40S, A100, H100, H200, B200) is picked during optimization based on your priority (latency, throughput, cost, quality). How to use it: This is the default in the dashboard. Optimize the pipeline, click Deploy, pick Flex or Active. The endpoint URL is ready in 1-3 minutes.

Your RunPod (bring your own cloud)

The same one-click delivery flow as managed, but the endpoint runs on your OWN RunPod account. RunInfra provisions and manages it; the GPU bill is yours, paid to RunPod directly. RunPod BYOC is not the optimization substrate. Optimization proof is produced by the RunInfra Modal measurement loop before a model is treated as optimized. When to pick this:

You want the GPU cost and usage to sit in your own RunPod account, not RunInfra’s
You have RunPod credits, a committed-spend contract, or negotiated rates
Compliance requires inference to run in an account you control

How it works:

Connect a scoped RunPod API key from the deploy tab (Destination > Your RunPod) or Settings. The key is encrypted (AES-256-GCM); RunInfra never logs it and only the engine decrypts it to act on your account.
Optimize a pipeline through the RunInfra Modal measurement loop, click Deploy, choose Your RunPod.
RunInfra provisions a serverless endpoint on your account from a stock public worker image plus the selected runtime config (no custom image build), and returns an OpenAI-compatible endpoint.
Inference is proxied through the RunInfra gateway (api.runinfra.ai) to your endpoint, so your SDK code and API keys do not change.

Supported models: LLM and embedding only for BYOC delivery. The deploy uses a stock worker image with runtime config, not a custom-built image or optimization run on RunPod, so image, video, speech, and voice models stay blocked from BYOC until their Modal evidence and delivery path are separately certified. Billing: RunInfra does not charge for inference served on your own cloud; you pay RunPod directly for the GPU. Requires a Core plan or above. What RunInfra handles: provisioning, the OpenAI-compatible gateway and routing, delivery configuration, monitoring and audit logs, and teardown, all on your account under your key. Optimization and re-optimization evidence remains owned by the RunInfra Modal measurement loop. What you handle: your RunPod account, its balance, and the GPU bill. Set it up at Settings > connect RunPod, or pick Your RunPod from the Destination selector in the deploy tab. Run the same RunInfra pipeline against your own Modal account. When to pick this:

You already have reserved GPU capacity on Modal
You want the billing and audit trail to live in your account
You need a specific region not offered on managed RunInfra Cloud
Compliance requires your data to never leave a specific cloud account

What you handle:

Your Modal account, billing, and any reserved-capacity contracts
IAM and access scoping inside Modal
Region selection and quota

What RunInfra handles:

Pipeline software (serving stack, optimizations, version pinning)
Routing and rate limiting on the OpenAI-compatible gateway
Optimization and re-optimization runs
Monitoring and audit logs surfaced into the RunInfra dashboard

This target is sales-gated. Talk to sales to enable Self-hosted Modal for your account.

Custom GPU

Bring your own infrastructure. Bare metal, on-prem, a cloud GPU instance, or a Kubernetes cluster. When to pick this:

Strict data residency (air-gapped, regulated industry)
Existing GPU contract you don’t want to leave (Crusoe, Lambda, CoreWeave, an enterprise NVIDIA contract)
Workload size that justifies dedicated hardware

What you get:

A deployable container image of your optimized pipeline
A signed FETCH_MODEL.sh weight fetcher plus runtime files under runinfra/
runinfra/serve.sh, runinfra/Dockerfile, and runinfra/docker-compose.yml for local validation
An HTTP route matching the kit’s documented contract for the deployment modality: OpenAI-compatible where the modality has one, /rerank for rerank kits, /predict for classification kits

What you handle:

Hardware, networking, storage, observability
Container orchestration (Docker, Kubernetes, systemd, whatever you run)
Rolling updates and rollback

Open the Export tab to build and download the optimized pipeline package. See the deployment overview for the export contract.

How to choose

Air-gapped / custom hardware?                  -> Custom GPU
Have a Modal contract or reserved GPUs?        -> Self-hosted Modal
Want the GPU bill in your own RunPod account?  -> Your RunPod (bring your own cloud)
Anything else (default)                        -> Managed RunInfra Cloud

If you’re not sure, start with managed RunInfra Cloud. You can migrate later: the pipeline definition is portable across all four targets.

Pricing

Managed RunInfra Cloud (Flex): per-million-tokens, see GPU pricing for current rates
Managed RunInfra Cloud (Active): lower per-million-tokens plus a per-second reserved fee, see GPU pricing
Your RunPod (bring your own cloud): you pay RunPod directly for the GPU; RunInfra does not bill for inference served on your cloud. Requires a Core plan or above.
Self-hosted Modal: your Modal bill plus a RunInfra platform fee, billed monthly
Custom GPU: your infrastructure cost plus a RunInfra platform fee, billed monthly

Contact sales for self-hosted and custom GPU pricing.

Managed RunInfra Cloud

Your RunPod (bring your own cloud)

Self-hosted Modal

Custom GPU

​Managed RunInfra Cloud (default)

​Your RunPod (bring your own cloud)

​Self-hosted Modal

​Custom GPU

​How to choose

​Pricing

Managed RunInfra Cloud (default)

Your RunPod (bring your own cloud)

Self-hosted Modal

Custom GPU

How to choose

Pricing