Models - RunInfra

RunInfra supports large language models, embeddings, rerankers, vision-language models, speech-to-text (ASR), and text-to-speech (TTS) models sourced from Hugging Face. Give the agent a compatible model name or Hugging Face ID and it checks optimization and deployment support before provisioning. You can also chain multiple models into a single pipeline; for example, Whisper feeding into Llama feeding into XTTS for a complete voice assistant. Browse popular models at runinfra.ai/models or describe your use case and let the agent recommend one:

I need a cheap, fast model for simple Q&A. What do you suggest?

Reranker endpoints are available for verified rerank deployments through POST /v1/rerank and POST /v1/{pipelineId}/rerank. TEI rerank deployments accept text documents. Compatible vLLM vision rerank deployments can accept document content parts for text, images, and videos when the selected model supports them. Embedding, vision-language, and image-generation availability depends on the model catalog, selected runtime, and deployment support. See Image generation for FLUX, SDXL, and SD3.5 workflows.
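As a rough illustration of the two rerank routes above, the sketch below builds the route URL and a JSON body in Python. The routes come from this page. The base URL, the pipeline ID, and the body fields (`model`, `query`, `documents`) are assumptions modeled on common rerank APIs, so check the API reference for the actual schema.

```python
import json
from typing import List, Optional


def rerank_url(base_url: str, pipeline_id: Optional[str] = None) -> str:
    """Return the rerank route, optionally scoped to a single pipeline."""
    base = base_url.rstrip("/")
    if pipeline_id is None:
        return f"{base}/v1/rerank"
    return f"{base}/v1/{pipeline_id}/rerank"


def rerank_body(model: str, query: str, documents: List[str]) -> str:
    """Serialize a text-only rerank request.

    The field names are an assumption, not a confirmed RunInfra schema.
    """
    return json.dumps({"model": model, "query": query, "documents": documents})


# Hypothetical host and pipeline ID, for illustration only.
url = rerank_url("https://api.example.com", pipeline_id="my-pipeline")
body = rerank_body("my-reranker", "What is FP8?", ["FP8 is a number format.", "Bananas."])
```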

LLMs (text generation)

Family	Sizes	Good for
Llama 3.1 / 3.2 / 3.3 / 4 (Meta)	1B-405B	General purpose, chat, reasoning
Qwen 2.5 (Alibaba)	0.5B-72B	Multilingual, code, math
Mistral / Mixtral (Mistral AI)	7B-123B	Instruction following, code
DeepSeek	V2, V3, R1	Long context, reasoning
Gemma 2 (Google)	2B-27B	Lightweight, edge deployment
Phi-3 / Phi-4 (Microsoft)	3.8B-14B	Small, fast, cost-effective
Cohere	Command-R / R+	RAG, enterprise search

Speech-to-text (ASR)

Model	Provider	Parameters	Good for
Whisper Large V3	OpenAI	1.5B	Best accuracy, multilingual
Whisper Large V3 Turbo	OpenAI	809M	Fast transcription
Distil-Whisper Large V3	Hugging Face	756M	Low-latency, distilled
Whisper Medium	OpenAI	769M	Balanced speed and accuracy
Whisper Small / Base	OpenAI	244M / 74M	Edge, real-time

Text-to-speech (TTS)

Model	Provider	Parameters	Good for
XTTS v2	Coqui	467M	Voice cloning, multilingual
Bark	Suno	1.2B	Expressive, multi-speaker
SpeechT5 TTS	Microsoft	143M	Lightweight
Parler TTS Large	Parler	2.3B	Expressive, high quality

Any compatible model on Hugging Face works. If the agent detects a compatibility issue, it tells you before consuming any GPU time.

Embeddings

Embedding models can be deployed behind the OpenAI-compatible /v1/embeddings endpoint for RAG, semantic search, clustering, and retrieval pipelines. Availability depends on the catalog entry and selected serving backend.
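Because the endpoint is OpenAI-compatible, a request can be sketched with the standard OpenAI embeddings shape (`model` plus `input`). The example below only builds the request and parses a response. The base URL, model name, and Bearer-token authentication are assumptions, not confirmed RunInfra details.

```python
import json
import urllib.request
from typing import List


def build_embeddings_request(base_url: str, api_key: str, model: str,
                             texts: List[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/embeddings."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Bearer auth is an assumption; use whatever your deployment requires.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def parse_embeddings(response: dict) -> List[List[float]]:
    """Return the vectors in input order from an OpenAI-style response."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

To send the request, pass it to `urllib.request.urlopen` and decode the response with `json.load` before calling `parse_embeddings`.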

Vision-language

Vision-language pipelines are supported when the selected model and runtime accept image inputs. The agent checks the model/runtime combination before provisioning so unsupported combinations do not consume GPU time.

Selecting models

You can specify a model three ways:

By name

Use Llama 3.1 8B for this pipeline

The agent resolves the name to the correct Hugging Face model ID automatically.

With @ mention

Optimize @Qwen-2.5-14B with FP8

Use @ to reference a specific model inline within a longer instruction.

By Hugging Face ID

Deploy microsoft/Phi-3-mini-4k-instruct optimized for latency

Pass the full org/model-name ID from Hugging Face for precise control.
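If you build prompts programmatically, it can help to check that a string has the org/model-name shape before sending it. The sketch below is a client-side convenience only. The character set in the pattern is an assumption about typical Hugging Face repository names, and the agent's own resolution logic is not shown here.

```python
import re

# One "org/name" pair made of letters, digits, dots, underscores, and hyphens.
# This pattern is an approximation, not Hugging Face's official validation rule.
HF_ID_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def looks_like_hf_id(text: str) -> bool:
    """Return True if text has the shape of a full org/model-name ID."""
    return HF_ID_PATTERN.fullmatch(text) is not None
```

For example, `looks_like_hf_id("microsoft/Phi-3-mini-4k-instruct")` is true, while a plain name such as "Llama 3.1 8B" is not and is left for the agent to resolve.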

Cost

Your per-request cost depends on your pipeline configuration: model, quantization, GPU tier, and deployment mode. The Deploy tab shows the projected cost for your exact setup before you deploy. See Plans for plan-level details.

Custom model uploads

Custom model uploads require a paid Core plan.

Upload your own fine-tuned models at runinfra.ai/models. Supported formats are SafeTensors, PyTorch, GGUF, and ONNX, with a maximum file size of 50 GB. Uploaded models go through the same optimization pipeline as catalog models. Reference them in the agent by the name you gave them at upload time:

Use my uploaded model "fine-tuned-llama-7b" for this pipeline
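A local pre-upload check along these lines can catch format and size problems early. The supported formats and the 50 GB limit come from this page. The file-extension mapping and the reading of "GB" as decimal gigabytes are assumptions.

```python
import os
from typing import List

MAX_UPLOAD_BYTES = 50 * 10**9  # 50 GB, read as decimal gigabytes (assumption)

# Extension-to-format mapping is an assumption for illustration.
ALLOWED_EXTENSIONS = {
    ".safetensors": "SafeTensors",
    ".pt": "PyTorch",
    ".pth": "PyTorch",
    ".bin": "PyTorch",
    ".gguf": "GGUF",
    ".onnx": "ONNX",
}


def check_upload(filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(no extension)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 50 GB limit")
    return problems
```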

Don’t know which model to start with? Use the Which model? decision table, or describe your use case along with your latency and budget targets and let the agent recommend one.

Known limitations

Hugging Face license-gated models (Llama 3, Mistral, some Qwen) require accepting the license on your HF account first. The agent warns before provisioning if the gate is not yet accepted.

Maximum context window is model-dependent, set by the model card on Hugging Face. Exceeding the window returns 400 context_length_exceeded.
Multi-modal inputs: up to 30 images per vision request, 25 MiB per audio upload.
Custom model uploads: 50 GB max per file. Very large weights pull more slowly on first request under Flex (scale-to-zero) deployments. For these, deploy on an Active (always-warm) replica to avoid the cold-start weight pull.
Reranker endpoints require a verified rerank deployment and use POST /v1/rerank or POST /v1/{pipelineId}/rerank, not the chat-completions route. TEI rerank deployments are text-only; compatible vLLM vision rerank deployments can accept multimodal document content arrays.
Fine-tuning is external today: train the model elsewhere and upload it. Managed fine-tuning is on the roadmap.
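The multi-modal limits above can be checked client-side before a request is sent. This is a minimal sketch. The limit values come from the list above, while the function and its parameters are hypothetical helpers, not part of any RunInfra SDK.

```python
from typing import List

MAX_IMAGES_PER_REQUEST = 30
MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MiB


def check_limits(image_count: int = 0, audio_bytes: int = 0) -> List[str]:
    """Return a list of limit violations; an empty list means the request is within limits."""
    problems = []
    if image_count > MAX_IMAGES_PER_REQUEST:
        problems.append(
            f"{image_count} images exceeds the limit of {MAX_IMAGES_PER_REQUEST} per vision request"
        )
    if audio_bytes > MAX_AUDIO_BYTES:
        problems.append("audio upload exceeds the 25 MiB limit")
    return problems
```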

Common questions

How do I find the exact Hugging Face ID for a model?

Either browse huggingface.co, or ask the agent: “What’s the HF ID for Llama 3.1 8B?” The agent resolves names and can list common aliases.

Can I serve the same model with different quantizations in one pipeline?

Not in a single pipeline. Create two pipelines, for example one 4-bit variant and one FP8 variant, and A/B test them at the application layer. Alternatively, deploy the same pipeline twice with different variants and route between them client-side.
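One way to route client-side is a deterministic hash split, so that each user consistently hits the same variant. The sketch below assumes two pipeline IDs. The IDs and the 50/50 default split are placeholders, and this is only one possible routing scheme.

```python
import hashlib
from typing import Tuple


def pick_pipeline(user_id: str, pipelines: Tuple[str, str], split: float = 0.5) -> str:
    """Map a user to one of two pipelines; `split` is the share sent to the first."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return pipelines[0] if bucket < split else pipelines[1]


# Hypothetical pipeline IDs for a 4-bit and an FP8 variant.
VARIANTS = ("llama-8b-int4", "llama-8b-fp8")
chosen = pick_pipeline("user-123", VARIANTS)
```

Hashing the user ID instead of drawing a random number keeps the assignment stable across requests, which makes per-variant quality and latency comparisons easier to interpret.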

What happens if a model weighs more than the GPU VRAM?

The agent flags the combination before consuming GPU time. You get a recommendation to try a compatible 4-bit or FP8 variant, reduce context or batch size, or move to a larger tier such as H200 or B200.
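A back-of-the-envelope version of this check multiplies the parameter count by the bytes per weight. The sketch below covers weights only and ignores the KV cache and activations. The 20% headroom is an arbitrary assumption, and the agent's actual sizing logic may differ.

```python
def weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight size in GB: billions of parameters times bytes per parameter."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: int, vram_gb: float,
         headroom: float = 0.2) -> bool:
    """Return True if the weights fit while leaving `headroom` of VRAM free.

    The headroom stands in for KV cache and activations and is an assumption.
    """
    return weight_gb(params_billions, bits_per_weight) <= vram_gb * (1 - headroom)
```

By this estimate a 70B model needs roughly 140 GB at 16-bit, 70 GB at FP8, and 35 GB at 4-bit, which is why a quantized variant can fit on a GPU where the 16-bit weights do not.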

Are there export restrictions?

Some model licenses (Gemma 2, Meta Llama 3) have commercial-use or user-count gates. RunInfra passes these through; you agree to them on your HF account. RunInfra does not add its own restrictions.

Can I use GGUF models?

Yes, for custom uploads (Core plan). GGUF is supported alongside SafeTensors, PyTorch, and ONNX. The optimizer picks the best serving backend per format.

Does the catalog update automatically?

Yes. New Hugging Face releases become discoverable within 24 hours. Breakthrough models (new Llama, DeepSeek, Qwen series) are surfaced as featured picks in the dashboard.

Next steps

Which model?

Decision table by use case and priority.

Optimize your model

GPU profiling, quantization search, Forge kernels.

GPU tiers and pricing

How model size picks a GPU.

Deploy your pipeline

Flex and Active modes.