RunInfra is now public.
Pricing

Simple, transparent pricing

Start free and scale as you grow. Only pay for the GPU compute you use.

Starter

Build and test pipelines, no deployment.

$0 / month

Chat-driven pipeline builder
3 optimization sessions / month
Full Hugging Face model catalog
AWQ-optimized model search
Smart routing (complexity, cost, latency)
Pipeline playground (100 req/day)
3 active pipelines
7-day metrics retention
Community support

Pro

Deploy optimized endpoints. Pay per token.

$99 / month

Inference billed per million tokens

Everything in Starter, plus:
Unlimited optimization sessions
Deployed API endpoints (scale-to-zero)
Unlimited active pipelines
Forge GPU kernel optimization
Full optimization suite (AWQ, GPTQ, FP8)
RunQuant: custom quantization engine
Pipeline versioning with comparison
Stress testing and preflight checks
Scaling config (up to 8 replicas)
Fast cold starts (under 2s)
90-day metrics with cost analytics
99.9% SLA, priority support

Team

For teams that need advanced optimization and collaboration.

$249 / seat / month

Minimum 3 seats. 10% token discount at 100M+ tokens/month

Everything in Pro, plus:
Always-on endpoints (zero cold start)
NVIDIA TensorRT-LLM integration
Speculative decoding
Advanced routing (weighted, multi-model)
Custom model uploads
Scaling config (up to 32 replicas)
1-year metrics retention
SSO (coming soon), audit logs, RBAC
99.95% SLA
Shared Slack support
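
The Team plan's 10% volume discount at 100M+ tokens/month can be sketched as simple arithmetic. The per-million-token rate below is a hypothetical placeholder (actual rates depend on model and GPU tier and are not listed on this page); only the threshold and discount come from the plan description.

```python
# Sketch of Team-plan token billing with the 100M+/month volume discount.
# RATE_PER_M is an assumed illustrative rate, not a published RunInfra price.
RATE_PER_M = 0.50                    # assumed $ per 1M tokens
DISCOUNT_THRESHOLD = 100_000_000     # 100M tokens/month (from the Team plan)
DISCOUNT = 0.10                      # 10% off at or above the threshold

def monthly_token_cost(tokens: int) -> float:
    """Estimate the monthly inference bill for a given token volume."""
    cost = tokens / 1_000_000 * RATE_PER_M
    if tokens >= DISCOUNT_THRESHOLD:
        cost *= 1 - DISCOUNT
    return round(cost, 2)
```

At the assumed rate, 150M tokens would bill as 150 × $0.50 × 0.9 = $67.50 rather than $75.00.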

Enterprise

Dedicated infrastructure, compliance, and volume pricing.

Custom


Everything in Team, plus:
Dedicated GPU infrastructure with reserved capacity
Private model onboarding (fine-tuned weights)
Custom SLAs (up to 99.99% uptime)
Volume token pricing (up to 40% off)
Unlimited metrics retention
SOC 2 and HIPAA compliance
Dedicated CSM and private Slack
Compare plans

|  | Starter | Pro | Team | Enterprise |
| --- | --- | --- | --- | --- |
| Price | $0 / month | $99 / month | $249 / seat / month | Custom |
| Chat-driven builder | ✓ | ✓ | ✓ | ✓ |
| Optimization sessions | 3 / month | Unlimited | Unlimited | Unlimited |
| Active pipelines | 3 | Unlimited | Unlimited | Unlimited |
| Model catalog | Full HF catalog | Full HF catalog | Full HF catalog | HF + private models |
| Deployed API endpoints | — | ✓ | ✓ | ✓ |
| Forge GPU kernel optimization | — | ✓ | ✓ | ✓ |
| Optimization methods | AWQ | AWQ, GPTQ, FP8, RunQuant | AWQ, GPTQ, FP8, RunQuant | All + custom |
| NVIDIA TensorRT-LLM | — | — | ✓ | ✓ |
| Speculative decoding | — | — | ✓ | ✓ |
| Smart routing | Basic | Full | Advanced (weighted) | Custom |
| Scaling replicas | — | Up to 8 | Up to 32 | Custom |
| Stress testing and preflight | — | ✓ | ✓ | ✓ |
| Pipeline versioning | — | ✓ | ✓ | ✓ |
| Inference pricing | Playground only | Per million tokens | 10% off at 100M+ tokens/mo | Up to 40% off |
| Metrics retention | 7 days | 90 days | 1 year | Unlimited |
| SLA guarantee | — | 99.9% | 99.95% | 99.99% |
| SSO (coming soon) and audit logs | — | — | ✓ | ✓ |
| Support | Community | Priority email | Shared Slack | Dedicated CSM |
FAQ

Common questions

Can't find what you're looking for? Get in touch

What is RunInfra?

RunInfra is a GPU optimization platform for open-source LLMs. You pick a model from Hugging Face, and RunInfra benchmarks it across GPU tiers, generates custom Triton kernels, and deploys an optimized production API. No YAML, no DevOps.
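
The workflow described above ends in a plain HTTPS API. As an illustrative sketch only, a request to a deployed endpoint might be assembled like this; the URL, path, header, and payload field names are hypothetical assumptions, not a documented RunInfra API.

```python
import json

# Hypothetical sketch of building a JSON inference request for a deployed
# RunInfra endpoint. The URL and field names below are illustrative
# assumptions, not a documented RunInfra API.
API_URL = "https://api.runinfra.example/v1/pipelines/my-pipeline/infer"

def build_request(prompt: str, max_tokens: int = 256) -> tuple[dict, bytes]:
    """Return (headers, body) for a JSON inference request."""
    headers = {
        "Authorization": "Bearer $RUNINFRA_API_KEY",  # placeholder token
        "Content-Type": "application/json",
    }
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    return headers, body

headers, body = build_request("Summarize this release note.")
```

From there, any HTTP client (`curl`, `requests`, etc.) can POST the body to the endpoint.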

Deploy your first optimized model in under 5 minutes

Start Building for Free
RunInfra

Own your AI. We benchmark GPUs, optimize kernels, and deploy open-source models as production APIs.

Start building

© 2026 RunInfra. All rights reserved.