Pricing
Simple, transparent pricing
Start free and scale as you grow. Only pay for the GPU compute you use.
Starter
Build and test pipelines, no deployment.
$0 / month
Chat-driven pipeline builder
3 optimization sessions / month
Full Hugging Face model catalog
AWQ optimized model search
Smart routing (complexity, cost, latency)
Pipeline playground (100 req/day)
3 active pipelines
7-day metrics retention
Community support
Pro
Deploy optimized endpoints. Pay per token.
$99 / month
Inference billed per million tokens
Everything in Starter, plus:
Unlimited optimization sessions
Deployed API endpoints (scale-to-zero)
Unlimited active pipelines
Forge GPU kernel optimization
Full optimization suite (AWQ, GPTQ, FP8)
RunQuant: custom quantization engine
Pipeline versioning with comparison
Stress testing and preflight checks
Scaling config (up to 8 replicas)
Fast cold starts (under 2s)
90-day metrics with cost analytics
99.9% SLA, priority support
Team
For teams that need advanced optimization and collaboration.
$249 / seat / month
Min 3 seats. 10% token discount at 100M+ tokens/mo
Everything in Pro, plus:
Always-on endpoints (zero cold start)
NVIDIA TensorRT-LLM integration
Speculative decoding
Advanced routing (weighted, multi-model)
Custom model uploads
Scaling config (up to 32 replicas)
1-year metrics retention
SSO (coming soon), audit logs, RBAC
99.95% SLA
Shared Slack support
Enterprise
Dedicated infrastructure, compliance, and volume pricing.
Custom
Everything in Team, plus:
Dedicated GPU infrastructure with reserved capacity
Private model onboarding (fine-tuned weights)
Custom SLAs (up to 99.99% uptime)
Volume token pricing (up to 40% off)
Unlimited metrics retention
SOC 2 and HIPAA compliance
Dedicated CSM and private Slack
| Compare plans | Starter ($0/month) | Pro ($99/month) | Team ($249/seat/month) | Enterprise (Custom) |
|---|---|---|---|---|
| Chat-driven builder | ✓ | ✓ | ✓ | ✓ |
| Optimization sessions | 3 / month | Unlimited | Unlimited | Unlimited |
| Active pipelines | 3 | Unlimited | Unlimited | Unlimited |
| Model catalog | Full HF catalog | Full HF catalog | Full HF catalog | HF + private models |
| Deployed API endpoint | — | ✓ | ✓ | ✓ |
| Forge GPU kernel optimization | — | ✓ | ✓ | ✓ |
| Optimization methods | AWQ | AWQ, GPTQ, FP8, RunQuant | AWQ, GPTQ, FP8, RunQuant | All + custom |
| NVIDIA TensorRT-LLM | — | — | ✓ | ✓ |
| Speculative decoding | — | — | ✓ | ✓ |
| Smart routing | Basic | Full | Advanced (weighted) | Custom |
| Scaling replicas | — | Up to 8 | Up to 32 | Custom |
| Stress testing and preflight | — | ✓ | ✓ | ✓ |
| Pipeline versioning | — | ✓ | ✓ | ✓ |
| Inference pricing | Playground only | Per million tokens | 10% off at 100M+ | Up to 40% off |
| Metrics retention | 7 days | 90 days | 1 year | Unlimited |
| SLA guarantee | — | 99.9% | 99.95% | 99.99% |
| SSO (coming soon) and audit logs | — | — | ✓ | ✓ |
| Support | Community | Priority email | Shared Slack | Dedicated CSM |
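The per-token billing and volume discount above can be sketched as a quick cost estimate. Note that the per-million-token rate used here is a hypothetical placeholder (actual rates depend on the model and GPU tier, and are not listed on this page); only the 10% discount at 100M+ tokens/month comes from the Team plan terms.

```python
def monthly_inference_cost(tokens: int, rate_per_million: float) -> float:
    """Estimate a month's inference bill.

    Billing is per million tokens; the Team plan applies a 10% token
    discount once monthly volume reaches 100M tokens.
    `rate_per_million` is a hypothetical placeholder, not a published price.
    """
    cost = (tokens / 1_000_000) * rate_per_million
    if tokens >= 100_000_000:  # Team plan: 10% off at 100M+ tokens/month
        cost *= 0.90
    return cost

# Example: 150M tokens at a hypothetical $0.50 per million tokens
print(monthly_inference_cost(150_000_000, 0.50))  # 150 * 0.50 * 0.90 = 67.5
```

The discount applies to the whole month's volume once the threshold is crossed, which matches the "10% token discount at 100M+/mo" wording; a tiered (marginal) discount would need a different calculation.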
What is RunInfra?
RunInfra is a GPU optimization platform for open-source LLMs. You pick a model from Hugging Face, and RunInfra benchmarks it across GPU tiers, generates custom Triton kernels, and deploys an optimized production API. No YAML, no DevOps.
RunInfra
Own your AI. We benchmark GPUs, optimize kernels, and deploy open-source models as production APIs.
Start building