RunInfra supports large language models, embeddings, vision-language models, speech-to-text (ASR), and text-to-speech (TTS) models sourced from Hugging Face. Thousands of models work out of the box: give the agent a model name or Hugging Face ID and it handles the rest, including optimization and deployment. You can also chain multiple models into a single pipeline; for example, Whisper feeding into Llama feeding into XTTS for a complete voice assistant. Browse popular models at runinfra.ai/models, or describe your use case and let the agent recommend one.
Reranker endpoints are on the roadmap. Embedding and vision-language availability depends on the model catalog, selected runtime, and deployment support. Image generation (FLUX, SDXL, SD3.5) shipped in I0.5; see Image generation.
LLMs (text generation)
| Family | Sizes | Good for |
|---|---|---|
| Llama 3.1 / 3.2 / 3.3 / 4 (Meta) | 1B-405B | General purpose, chat, reasoning |
| Qwen 2.5 (Alibaba) | 0.5B-72B | Multilingual, code, math |
| Mistral / Mixtral (Mistral AI) | 7B-123B | Instruction following, code |
| DeepSeek | V2, V3, R1 | Long context, reasoning |
| Gemma 2 (Google) | 2B-27B | Lightweight, edge deployment |
| Phi-3 / Phi-4 (Microsoft) | 3.8B-14B | Small, fast, cost-effective |
| Cohere | Command-R / R+ | RAG, enterprise search |
Speech-to-text (ASR)
| Model | Provider | Parameters | Good for |
|---|---|---|---|
| Whisper Large V3 | OpenAI | 1.5B | Best accuracy, multilingual |
| Whisper Large V3 Turbo | OpenAI | 809M | Fast transcription |
| Distil-Whisper Large V3 | Hugging Face | 756M | Low-latency, distilled |
| Whisper Medium | OpenAI | 769M | Balanced speed and accuracy |
| Whisper Small / Base | OpenAI | 244M / 74M | Edge, real-time |
Text-to-speech (TTS)
| Model | Provider | Parameters | Good for |
|---|---|---|---|
| XTTS v2 | Coqui | 467M | Voice cloning, multilingual |
| Bark | Suno | 1.2B | Expressive, multi-speaker |
| SpeechT5 TTS | Microsoft | 143M | Lightweight |
| Parler TTS Large | Parler | 2.3B | Expressive, high quality |
Embeddings
Embedding models can be deployed behind the OpenAI-compatible /v1/embeddings endpoint for RAG, semantic search, clustering, and retrieval pipelines. Availability depends on the catalog entry and selected serving backend.
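As a rough illustration of calling such an endpoint, the sketch below builds a request in the OpenAI embeddings format and parses a response of that shape. The base URL, API key, Bearer-token header, and model name are placeholders and assumptions, not values taken from RunInfra; substitute the ones shown for your own deployment.

```python
import json
import urllib.request

# Hypothetical values: replace with your deployment's base URL and key.
BASE_URL = "https://example.invalid/v1"
API_KEY = "YOUR_API_KEY"


def build_embeddings_request(texts, model):
    """Build (but do not send) a POST request in the OpenAI embeddings format."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the deployment accepts a Bearer token.
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def parse_embeddings_response(body):
    """Return the vectors ordered by each item's `index` field."""
    items = sorted(json.loads(body)["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# "my-embedding-model" is a placeholder model name.
request = build_embeddings_request(["What is RunInfra?"], model="my-embedding-model")
# To send it: body = urllib.request.urlopen(request).read()
```

Sorting by `index` keeps each vector aligned with its input text even if the server returns items out of order.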
Vision-language
Vision-language pipelines are supported when the selected model and runtime accept image inputs. The agent checks the model/runtime combination before provisioning so unsupported combinations do not consume GPU time.
Selecting models
You can specify a model three ways:
- By name
- With @ mention
- By Hugging Face ID
Cost
Your per-request cost depends on your pipeline configuration: model, quantization, GPU tier, and deployment mode. The Deploy tab shows the projected cost for your exact setup before you deploy. See Plans for plan-level details.
Custom model uploads
Upload your own fine-tuned models at runinfra.ai/models. Supported formats are SafeTensors, PyTorch, GGUF, and ONNX, with a maximum file size of 50 GB. Uploaded models go through the same optimization pipeline as catalog models. Reference them in the agent by the name you gave them at upload time.
Known limitations
- Maximum context window is model-dependent. Llama 3.1/3.3 and DeepSeek V3 are 128K. Mistral Small 3 is 32K. Phi-3 Mini is 4K. Exceeding the window returns 400 context_length_exceeded.
- Multi-modal inputs: up to 30 images per vision request, 25 MiB per audio upload.
- Custom model uploads: 50 GB max per file. Weights above 200 GB are not cached in Instant Start; deploy on Active replicas.
- Image generation (Stable Diffusion, FLUX) and reranker endpoints are on the roadmap.
- Fine-tuning is external today (upload an externally trained model). Managed fine-tuning is on the roadmap.
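A client can check the request-level limits above before sending anything. The sketch below is one possible pre-flight check, not RunInfra's own validation. It assumes the context window counts prompt tokens plus requested output tokens, which is common for OpenAI-compatible servers but not stated on this page. Apart from context_length_exceeded, the problem labels are local names invented for this example.

```python
MIB = 1024 ** 2

MAX_IMAGES_PER_REQUEST = 30   # from the limits list above
MAX_AUDIO_BYTES = 25 * MIB    # from the limits list above


def preflight_problems(prompt_tokens, max_new_tokens, context_window,
                       n_images=0, audio_bytes=0):
    """Return a list of limit violations; an empty list means the request looks OK."""
    problems = []
    # Assumption: prompt and generated tokens share the same window.
    if prompt_tokens + max_new_tokens > context_window:
        problems.append("context_length_exceeded")
    if n_images > MAX_IMAGES_PER_REQUEST:
        problems.append("too_many_images")      # hypothetical local label
    if audio_bytes > MAX_AUDIO_BYTES:
        problems.append("audio_too_large")      # hypothetical local label
    return problems


# Example: a 4K-window model asked for 4,000 prompt tokens plus 500 new tokens.
print(preflight_problems(4_000, 500, context_window=4 * 1024))
```

Token counts depend on the model's tokenizer, so treat the context check as an estimate unless you count tokens with the same tokenizer the deployment uses.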
Common questions
How do I find the exact Hugging Face ID for a model?
Either browse huggingface.co, or ask the agent: “What’s the HF ID for Llama 3.1 8B?” The agent resolves names and can list common aliases.
Can I serve the same model with different quantizations in one pipeline?
Not in a single pipeline. Create two pipelines (one AWQ, one FP8) and A/B them at the app layer. Or deploy the same pipeline twice with different variants and route client-side.
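Client-side routing between two such deployments can be as simple as a deterministic split on a user identifier. The sketch below is one way to do it, under stated assumptions: the endpoint URLs are placeholders, and the hash-bucket scheme is an illustrative choice rather than anything RunInfra prescribes.

```python
import hashlib

# Hypothetical endpoint URLs for two deployments of the same model.
VARIANTS = {
    "awq": "https://example.invalid/pipelines/llama-awq/v1",
    "fp8": "https://example.invalid/pipelines/llama-fp8/v1",
}


def pick_variant(user_id, fp8_share=0.5):
    """Assign a user to a variant deterministically, so repeat calls agree."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2 ** 32
    return "fp8" if bucket < fp8_share else "awq"


variant = pick_variant("user-123")
base_url = VARIANTS[variant]
```

Hashing the user ID instead of choosing at random keeps each user on one variant, which makes per-variant quality and latency comparisons easier to interpret.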
What happens if a model weighs more than the GPU VRAM?
The agent flags the combination before consuming GPU time. You get a recommendation to quantize (AWQ 4-bit fits most 70B models on 80 GB GPUs) or move to a larger tier (H200, B200).
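The arithmetic behind that recommendation can be approximated by hand: weight memory is roughly parameter count times bits per weight divided by 8. The sketch below is a back-of-the-envelope estimate, not the agent's actual check. It ignores KV cache, activations, and quantization metadata such as scales, and the 20% headroom figure is an assumed rule of thumb.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, vram_gb, headroom=0.2):
    """True if the weights fit while leaving `headroom` of VRAM free.

    The headroom stands in for KV cache and activations (assumed value).
    """
    return weight_memory_gb(params_billions, bits_per_weight) <= vram_gb * (1 - headroom)


print(weight_memory_gb(70, 16))  # 16-bit weights for a 70B model: 140.0 GB
print(weight_memory_gb(70, 4))   # 4-bit weights for a 70B model: 35.0 GB
print(fits(70, 4, vram_gb=80))   # True
print(fits(70, 16, vram_gb=80))  # False
```

Under these assumptions a 70B model at 16 bits needs about 140 GB for weights alone, while 4-bit quantization brings that to about 35 GB, which is why it fits on an 80 GB GPU.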
Are there export restrictions?
Some model licenses (Gemma 2, Meta Llama 3) have commercial-use or user-count gates. RunInfra passes these through; you agree to them on your HF account. RunInfra does not add its own restrictions.
Can I use GGUF models?
Yes for custom uploads (Team+). GGUF is supported alongside SafeTensors, PyTorch, and ONNX. The optimizer picks the best serving backend per format.
Does the catalog update automatically?
Yes. New Hugging Face releases become discoverable within 24 hours. Breakthrough models (new Llama, DeepSeek, Qwen series) are surfaced as featured picks in the Pipes chat.
Next steps
Which model?
Decision table by use case and priority.
Optimize your model
GPU profiling, quantization search, Forge kernels.
GPU tiers and pricing
How model size picks a GPU.
Deploy your pipeline
Flex and Active modes.