
Models

Any Hugging Face model, plus custom uploads.

RunInfra currently supports large language models (LLMs) from Hugging Face. Thousands of text generation models work out of the box. Just give the agent a model name or Hugging Face ID and it handles the rest.

Vision models, speech-to-text, text-to-speech, image generation, and embedding models are coming soon. The same chat-driven workflow will apply to the full AI stack.

Browse popular models on the Models page, or ask the agent:

Find a good 7B model for code generation

Popular model families

These are some of the most commonly used models on RunInfra, but you're not limited to this list:

| Family | Sizes / variants | Good for |
| --- | --- | --- |
| Llama 3.1/3.2/3.3/4 (Meta) | 1B-405B | General purpose, chat, reasoning |
| Qwen 2.5 (Alibaba) | 0.5B-72B | Multilingual, code, math |
| Mistral / Mixtral (Mistral AI) | 7B-123B | Instruction following, code |
| DeepSeek | V2, V3, R1 | Long context, reasoning |
| Gemma 2 (Google) | 2B-27B | Lightweight, edge deployment |
| Phi-3/Phi-4 (Microsoft) | 3.8B-14B | Small, fast, cost-effective |
| Cohere | Command-R/R+ | RAG, enterprise search |

Any transformer-based LLM on Hugging Face works. If the agent detects a compatibility issue with a specific model, it tells you before consuming any GPU time.

Don't know which model to use? Describe your use case and the agent recommends one:

I need a cheap, fast model for simple Q&A. What do you suggest?

Selecting models

By name

Use Llama 3.1 8B for this pipeline

With @ mention

Optimize @Qwen-2.5-14B with FP8

By Hugging Face ID

Deploy microsoft/Phi-3-mini-4k-instruct optimized for latency
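A Hugging Face ID always has the form `organization/model-name`. If you script against the agent and want to distinguish an ID from a plain display name before sending it, a minimal sketch (not part of RunInfra's API) might look like:

```python
import re

# Hugging Face repo IDs are "namespace/repo-name": alphanumerics,
# hyphens, underscores, and dots in each segment.
HF_ID_PATTERN = re.compile(r"^[\w.-]+/[\w.-]+$")

def looks_like_hf_id(model_ref: str) -> bool:
    """Return True if model_ref resembles a Hugging Face model ID."""
    return bool(HF_ID_PATTERN.match(model_ref))

print(looks_like_hf_id("microsoft/Phi-3-mini-4k-instruct"))  # True
print(looks_like_hf_id("Llama 3.1 8B"))  # False: a display name, not an ID
```

Either form works in chat; the agent resolves display names to concrete Hugging Face repos for you.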

Token pricing by model size

Estimated starting rates. Actual cost depends on your full pipeline configuration.

| Size | Input (from) | Output (from) |
| --- | --- | --- |
| Small (1-8B) | $0.08/MTok | $0.20/MTok |
| Medium (8-30B) | $0.20/MTok | $0.80/MTok |
| Large (30-70B) | $0.45/MTok | $1.50/MTok |
| XL (70B+) | $0.80/MTok | $2.50/MTok |
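Rates are per million tokens (MTok), so a rough estimate is `input_tokens / 1e6 × input_rate + output_tokens / 1e6 × output_rate`. A minimal sketch using the starting rates above (remember these are floors; your actual rate depends on your pipeline configuration):

```python
# Starting per-MTok rates from the table above: (input, output), in USD.
RATES = {
    "small": (0.08, 0.20),
    "medium": (0.20, 0.80),
    "large": (0.45, 1.50),
    "xl": (0.80, 2.50),
}

def estimate_cost(size: str, input_tokens: int, output_tokens: int) -> float:
    """Rough cost in USD for a size class and token volume."""
    rate_in, rate_out = RATES[size]
    return input_tokens / 1e6 * rate_in + output_tokens / 1e6 * rate_out

# e.g. 10M input + 2M output tokens on a small (1-8B) model:
print(f"${estimate_cost('small', 10_000_000, 2_000_000):.2f}")  # $1.20
```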

See GPU and Pricing for details on what affects your cost.

Custom model uploads

Custom model uploads require Team plan or higher.

Upload your own fine-tuned models on the Models page. Supported formats: SafeTensors, PyTorch, GGUF, ONNX. Maximum size: 50GB.
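Before uploading a large artifact, it can save time to sanity-check it locally. This is an illustrative pre-flight sketch, not a RunInfra API; the extension list maps the supported formats above to their common file extensions (an assumption on our part), and the cap reflects the 50GB limit:

```python
from pathlib import Path

# Common extensions for the supported formats (assumed mapping):
# SafeTensors, PyTorch (.pt/.bin), GGUF, ONNX.
SUPPORTED_EXTENSIONS = {".safetensors", ".pt", ".bin", ".gguf", ".onnx"}
MAX_BYTES = 50 * 1024**3  # 50GB upload cap

def check_upload(path: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {suffix or '(no extension)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"exceeds 50GB cap ({size_bytes / 1024**3:.1f}GB)")
    return problems

print(check_upload("fine-tuned-llama-7b.safetensors", 13 * 1024**3))  # []
```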

Uploaded models go through the same optimization pipeline as catalog models. Use them in chat just like any other model:

Use my uploaded model "fine-tuned-llama-7b" for this pipeline
