RunInfra is now public. See what's new

Own your AI. Optimized down to the kernel.

Pick any open-source model, and RunInfra benchmarks GPUs, optimizes kernels, and deploys a production API. You ship faster, pay less.

How It Works

From model to production API

Describe what you need. The agent builds, optimizes, benchmarks, and deploys your inference pipeline end to end.

Describe

Tell the agent what you need

Pick models with @mentions, set constraints, and the agent designs your inference pipeline. Plain English in, production config out.

Build

Smart routing, multi-model pipelines

Visual pipeline with complexity-aware routing and inference-time scaling. Your models, your infrastructure, your data.
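
To make the routing idea concrete, here is a minimal sketch of a complexity-aware router. The heuristic, thresholds, and model names are illustrative assumptions, not RunInfra's actual logic:

```python
# Illustrative complexity-aware router (not RunInfra's real implementation).
# A cheap heuristic scores each prompt, then the cheapest model expected
# to handle it is chosen. Thresholds and model names are made up.

ROUTES = [
    (0.3, "llama-3.2-3b"),    # simple prompts -> small, cheap model
    (0.7, "llama-3.3-70b"),   # mid-complexity -> mid-size model
    (1.0, "deepseek-r1"),     # hard reasoning -> large reasoning model
]

def complexity(prompt: str) -> float:
    """Toy complexity score in [0, 1] from length and reasoning cues."""
    cues = ("prove", "step by step", "derive", "compare", "analyze")
    length_score = min(len(prompt) / 2000, 1.0)
    cue_score = 0.5 if any(c in prompt.lower() for c in cues) else 0.0
    return min(length_score + cue_score, 1.0)

def route(prompt: str) -> str:
    score = complexity(prompt)
    for threshold, model in ROUTES:
        if score <= threshold:
            return model
    return ROUTES[-1][1]

print(route("What is the capital of France?"))            # llama-3.2-3b
print(route("Derive the gradient step by step for ..."))  # llama-3.3-70b
```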

Optimize

Custom GPU kernels compiled

The agent generates optimized Triton kernels. FlashAttention-2, fused ops, and quantization validated end to end.
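
For a flavor of what generated kernel code looks like, here is a small hand-written Triton kernel that fuses an elementwise multiply with a ReLU. It is a toy sketch for illustration, not Forge output:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_mul_relu_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance processes one BLOCK-wide tile of the inputs.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n  # guard the ragged final tile
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    # Fusing mul + ReLU into one kernel skips a round trip to global
    # memory between the two ops, which is the point of fused kernels.
    tl.store(out_ptr + offs, tl.maximum(x * y, 0.0), mask=mask)

def fused_mul_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    fused_mul_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```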

Benchmark

See what changed

Side-by-side comparison of baseline vLLM against your optimized config. Latency, throughput, memory, cost.
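
The numbers are the kind you can reproduce yourself. A rough throughput measurement against stock vLLM might look like this (model name and prompts are placeholders):

```python
import time
from vllm import LLM, SamplingParams

# Measure decode throughput for one engine configuration.
# Run once with default engine args (the baseline), then again with
# the optimized settings, and compare the numbers.
prompts = ["Summarize the plot of Hamlet in three sentences."] * 32
params = SamplingParams(temperature=0.0, max_tokens=128)

def tokens_per_second(llm: LLM) -> float:
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed

baseline = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
print(f"baseline: {tokens_per_second(baseline):.0f} tok/s")
```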

Deploy

Pay per million tokens

Scale to zero when idle. Auto-scale to handle peak traffic. No GPU hourly costs, just tokens processed.
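
A quick worked example with a hypothetical rate (actual per-token prices depend on the model and tier):

```python
# Hypothetical rate of $0.60 per million tokens; not a quoted price.
PRICE_PER_MILLION = 0.60
tokens_this_month = 250_000_000  # 250M tokens processed

cost = tokens_this_month / 1_000_000 * PRICE_PER_MILLION
print(f"${cost:.2f}")  # $150.00, and $0.00 in idle GPU-hours
```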

Model Catalog

Every open-source model, optimized

The full Hugging Face catalog at your fingertips. Pick any model; the agent handles quantization, kernel compilation, and deployment.
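
For a sense of what quantization buys you, here is the generic open-source version of the idea, loading a model in 4-bit via Hugging Face transformers and bitsandbytes. RunInfra's own pipeline applies AWQ/GPTQ/FP8 instead (see the Pro tier), but the trade is the same: a small accuracy cost for a large memory and cost reduction. The model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Generic 4-bit quantized load via bitsandbytes (illustrative only;
# RunInfra's pipeline uses AWQ/GPTQ/FP8 rather than bitsandbytes).
model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model id

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",  # roughly quarters weight memory vs. fp16
)
```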

Llama 4 Maverick (400B MoE, Chat)
DeepSeek R1 (671B MoE, Reasoning)
Qwen 2.5 72B (72B, General)
Hermes 3 70B (70B, Instruct)
Mistral Large (123B, Chat)
Gemma 3 27B (27B, General)
Phi-4 (14B, Reasoning)
Nemotron 70B (70B, Instruct)
DeepSeek V3 (685B MoE, Coding)
Nous Hermes 2 Mixtral (47B MoE, Chat)
Llama 3.3 70B (70B, General)
Qwen 2.5 Coder 32B (32B, Coding)
Grok 2 (314B, Chat)
Nous Capybara 34B (34B, Chat)
Llama 3.1 405B (405B, General)
Command R+ (104B, RAG)
Mixtral 8x7B (56B MoE, MoE)
DeepSeek R1 Distill 70B (70B, Reasoning)
Gemma 3 12B (12B, Lightweight)
Hermes 3 8B (8B, Chat)
Llama 4 Scout (109B MoE, General)
Mixtral 8x22B (176B MoE, MoE)
Nous Hermes 2 Yi 34B (34B, General)
DeepSeek Coder V2 (236B MoE, Coding)
SDXL Turbo (3.5B, Image)
Qwen 2.5 7B (7B, Lightweight)
GPT-OSS 20B (20B, Chat)
Nous Hermes 13B (13B, Instruct)
Codestral (22B, Coding)
Llama 3.1 8B (8B, Lightweight)
Qwen 2.5 32B (32B, General)
DeepSeek R1 Distill 8B (8B, Reasoning)
Phi-3 Mini (3.8B, Edge)
Nous Hermes 2 Solar (10.7B, Instruct)
Gemma 2 9B (9B, General)
Mistral 7B (7B, Chat)
Llama 3.2 3B (3B, Edge)
DeepSeek Coder 33B (33B, Coding)
Command R (35B, RAG)
Nous Puffin 70B (70B, Chat)

Deployment

Two ways to run inference

Run on our managed GPUs with per-million-token pricing, or export and deploy on your own infrastructure.

Managed

RunInfra Cloud

Your optimized model runs on our infrastructure with auto-scaling and scale-to-zero. Pay per million tokens, no idle costs.
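
Most hosted inference endpoints of this kind are OpenAI-compatible. Assuming RunInfra's deployed APIs are too (the base URL and model id below are hypothetical), calling one looks like:

```python
from openai import OpenAI

# Hypothetical endpoint URL and model id; substitute the values
# shown for your own deployment.
client = OpenAI(
    base_url="https://api.runinfra.example/v1",
    api_key="YOUR_API_KEY",
)
resp = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```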

Per-million-token billing
Auto-scaling to demand
Scale-to-zero when idle
Observability and full analytics
Continuous optimization post-deploy

Bring your own

Self-Hosted

Export your optimized config and deploy anywhere. Your GPUs, your cloud, your rules. We generate the kernels, you own the runtime.
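
If the exported config targets vLLM (the engine the Benchmark step uses as its baseline), self-hosting can be as direct as handing those settings to a local engine. Every value below is a placeholder standing in for an exported config:

```python
from vllm import LLM, SamplingParams

# Placeholder settings standing in for an exported RunInfra config.
llm = LLM(
    model="your-org/llama-3.1-8b-awq",  # hypothetical AWQ-quantized checkpoint
    quantization="awq",
    tensor_parallel_size=2,             # split across two local GPUs
    gpu_memory_utilization=0.90,
)
out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```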

Export optimized model config
Any cloud (AWS, GCP, Azure)
Deploy on your own GPUs
Full infrastructure control
No vendor lock-in

Pricing

Simple, transparent pricing

Start free and scale as you grow. Only pay for the GPU compute you use.

Starter

Build and test pipelines, no deployment.

$ / month


Chat-driven pipeline builder
3 optimization sessions / month
Full Hugging Face model catalog
AWQ-optimized model search
Smart routing (complexity, cost, latency)
Pipeline playground (100 req/day)
3 active pipelines
7-day metrics retention
Community support

Pro

Unlocks the Deploy tab. Inference is billed from credits you top up separately.

$ / month

+ pay-per-million-token credits (purchased separately)

Everything in Starter, plus:
Deploy tab unlocked in chat sessions
Pay-per-million-token inference credits (top up any time)
Unlimited optimization sessions
Deployed API endpoints (scale-to-zero)
Unlimited active pipelines
Forge GPU kernel optimization
Full optimization suite (AWQ, GPTQ, FP8)
RunQuant: custom quantization engine
Pipeline versioning with comparison
Stress testing and preflight checks
Scaling config (up to 8 replicas)
Fast cold starts (under 2s)
90-day metrics with cost analytics
99.9% SLA, priority support

Team

For teams that need advanced optimization and collaboration.

$ / seat / month

Minimum 3 seats. 10% token discount at 100M+ tokens/month.

Everything in Pro, plus:
Always-on endpoints (zero cold start)
NVIDIA TensorRT-LLM integration
Speculative decoding
Advanced routing (weighted, multi-model)
Custom model uploads
Scaling config (up to 32 replicas)
1-year metrics retention
SSO (coming soon), audit logs, RBAC
99.95% SLA
Shared Slack support

Enterprise

Dedicated infrastructure, compliance, and volume pricing.

Custom


Everything in Team, plus:
Dedicated GPU infrastructure with reserved capacity
Private model onboarding (fine-tuned weights)
Custom SLAs (up to 99.99% uptime)
Volume token pricing (up to 40% off)
Unlimited metrics retention
SOC 2 and HIPAA compliance
Dedicated CSM and private Slack

FAQ

Common questions

Can't find what you're looking for? Get in touch

What is RunInfra?

RunInfra is a GPU optimization platform for open-source LLMs. You pick a model from Hugging Face, and RunInfra benchmarks it across GPU tiers, generates custom Triton kernels, and deploys an optimized production API. No YAML, no DevOps.

Deploy your first optimized model in under 5 minutes

Start Building for Free
End-to-end encryption
Isolated GPU infrastructure
Zero data retention
SOC 2 Type II (in progress)
RunInfra

Own your AI. We benchmark GPUs, optimize kernels, and deploy open-source models as production APIs.

Start building

© 2026 RunInfra. All rights reserved.