RunInfra supports large language models, embeddings, vision-language models, speech-to-text (ASR), and text-to-speech (TTS) models sourced from Hugging Face. Thousands of models work out of the box: give the agent a model name or Hugging Face ID and it handles the rest, including optimization and deployment. You can also chain multiple models into a single pipeline, for example Whisper feeding into Llama feeding into XTTS for a complete voice assistant. Browse popular models at runinfra.ai/models, or describe your use case and let the agent recommend one:
I need a cheap, fast model for simple Q&A. What do you suggest?
Reranker endpoints are on the roadmap. Embedding and vision-language availability depends on the model catalog, the selected runtime, and deployment support. Image generation (FLUX, SDXL, SD3.5) shipped in I0.5; see Image generation.

LLMs (text generation)

| Family | Sizes | Good for |
| --- | --- | --- |
| Llama 3.1 / 3.2 / 3.3 / 4 (Meta) | 1B-405B | General purpose, chat, reasoning |
| Qwen 2.5 (Alibaba) | 0.5B-72B | Multilingual, code, math |
| Mistral / Mixtral (Mistral AI) | 7B-123B | Instruction following, code |
| DeepSeek | V2, V3, R1 | Long context, reasoning |
| Gemma 2 (Google) | 2B-27B | Lightweight, edge deployment |
| Phi-3 / Phi-4 (Microsoft) | 3.8B-14B | Small, fast, cost-effective |
| Cohere | Command-R / R+ | RAG, enterprise search |

Speech-to-text (ASR)

| Model | Provider | Parameters | Good for |
| --- | --- | --- | --- |
| Whisper Large V3 | OpenAI | 1.5B | Best accuracy, multilingual |
| Whisper Large V3 Turbo | OpenAI | 809M | Fast transcription |
| Distil-Whisper Large V3 | Hugging Face | 756M | Low-latency, distilled |
| Whisper Medium | OpenAI | 769M | Balanced speed and accuracy |
| Whisper Small / Base | OpenAI | 244M / 74M | Edge, real-time |

Text-to-speech (TTS)

| Model | Provider | Parameters | Good for |
| --- | --- | --- | --- |
| XTTS v2 | Coqui | 467M | Voice cloning, multilingual |
| Bark | Suno | 1.2B | Expressive, multi-speaker |
| SpeechT5 TTS | Microsoft | 143M | Lightweight |
| Parler TTS Large | Parler | 2.3B | Expressive, high quality |
Any compatible model on Hugging Face works. If the agent detects a compatibility issue, it tells you before consuming any GPU time.

Embeddings

Embedding models can be deployed behind the OpenAI-compatible /v1/embeddings endpoint for RAG, semantic search, clustering, and retrieval pipelines. Availability depends on the catalog entry and selected serving backend.
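As a rough illustration of what a call to such an endpoint could look like, the sketch below builds (but does not send) an HTTP request. It assumes the endpoint accepts the OpenAI embeddings request shape (`model` and `input` fields, bearer-token auth). The base URL, API key, model ID, and the helper name `build_embeddings_request` are placeholders, not values taken from RunInfra.

```python
import json
import urllib.request


def build_embeddings_request(base_url, api_key, model, texts):
    """Build (but do not send) a POST request for an OpenAI-style embeddings endpoint."""
    # Assumption: the body follows the OpenAI embeddings schema.
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only.
req = build_embeddings_request(
    "https://example.invalid",
    "YOUR_API_KEY",
    "your-embedding-model",
    ["first document", "second document"],
)
print(req.get_method(), req.full_url)
```

To send the request, pass `req` to `urllib.request.urlopen` and parse the JSON response. In the OpenAI schema, each vector is returned under `data[i].embedding`; confirm against your deployment before relying on that shape.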

Vision-language

Vision-language pipelines are supported when the selected model and runtime accept image inputs. The agent checks the model/runtime combination before provisioning so unsupported combinations do not consume GPU time.

Selecting models

You can specify a model three ways:
Use Llama 3.1 8B for this pipeline
The agent resolves the name to the correct Hugging Face model ID automatically.

Cost

Your per-request cost depends on your pipeline configuration: model, quantization, GPU tier, and deployment mode. The Deploy tab shows the projected cost for your exact setup before you deploy. See Plans for plan-level details.

Custom model uploads

Custom model uploads require Team plan or higher.
Upload your own fine-tuned models at runinfra.ai/models. Supported formats are SafeTensors, PyTorch, GGUF, and ONNX, with a maximum file size of 50 GB. Uploaded models go through the same optimization pipeline as catalog models. Reference them in the agent by the name you gave them at upload time:
Use my uploaded model "fine-tuned-llama-7b" for this pipeline
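Before uploading, it can be convenient to check a file locally against the format and size limits stated above. The following is a minimal sketch, not a RunInfra tool: the extension-to-format mapping, the reading of "50 GB" as 50 × 10^9 bytes, and the helper name `check_upload` are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: these extensions stand in for the four formats named above.
SUPPORTED_EXTENSIONS = {
    ".safetensors": "SafeTensors",
    ".pt": "PyTorch",
    ".pth": "PyTorch",
    ".bin": "PyTorch",
    ".gguf": "GGUF",
    ".onnx": "ONNX",
}
# Assumption: "50 GB" is read as 50 * 10**9 bytes.
MAX_FILE_BYTES = 50 * 10**9


def check_upload(filename, size_bytes):
    """Return (ok, message) for a candidate upload; hypothetical helper."""
    ext = Path(filename).suffix.lower()
    fmt = SUPPORTED_EXTENSIONS.get(ext)
    if fmt is None:
        return False, f"unsupported extension {ext!r}"
    if size_bytes > MAX_FILE_BYTES:
        return False, f"{fmt} file exceeds the 50 GB per-file limit"
    return True, f"{fmt} file within the size limit"


print(check_upload("model.safetensors", 14 * 10**9))
print(check_upload("model.gguf", 60 * 10**9))
```

For a real file, `Path(filename).stat().st_size` gives the size in bytes.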
Don’t know which model to start with? Use the Which model? decision table, or describe your use case, latency target, and budget and let the agent recommend one.

Known limitations

  • Hugging Face license-gated models (Llama 3, Mistral, some Qwen) require accepting the license on your HF account first. The agent warns before provisioning if the gate is not yet accepted.
  • Maximum context window is model-dependent. Llama 3.1/3.3 and DeepSeek V3 are 128K. Mistral Small 3 is 32K. Phi-3 Mini is 4K. Exceeding the window returns 400 context_length_exceeded.
  • Multi-modal inputs: up to 30 images per vision request, 25 MiB per audio upload.
  • Custom model uploads: 50 GB max per file. Weights above 200 GB are not cached in Instant Start; deploy on Active replicas.
  • Reranker endpoints are on the roadmap. Image generation (Stable Diffusion, FLUX) has shipped; see Image generation.
  • Fine-tuning is external today (upload an externally trained model). Managed fine-tuning is on the roadmap.
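The context-window limits in the list above can be checked client-side before a request is sent, which avoids a round trip that ends in `400 context_length_exceeded`. The sketch below is illustrative only. It assumes that "K" means 1024 tokens, that requested output tokens count against the window alongside the prompt, and that token counts come from the model's own tokenizer. The dictionary keys and the helper name `fits_context` are hypothetical labels, not RunInfra model IDs.

```python
# Context windows from the list above. Assumption: "K" means 1024 tokens.
CONTEXT_WINDOWS = {
    "llama-3.1": 128 * 1024,
    "llama-3.3": 128 * 1024,
    "deepseek-v3": 128 * 1024,
    "mistral-small-3": 32 * 1024,
    "phi-3-mini": 4 * 1024,
}


def fits_context(model_key, prompt_tokens, max_new_tokens):
    """Return True if prompt plus requested output stays inside the window."""
    window = CONTEXT_WINDOWS[model_key]
    return prompt_tokens + max_new_tokens <= window


print(fits_context("phi-3-mini", 3500, 512))  # 4012 <= 4096
print(fits_context("phi-3-mini", 3900, 512))  # 4412 > 4096
```

When the check fails, the usual options are to shorten the prompt, lower the requested output length, or switch to a model with a larger window.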

Common questions

Either browse huggingface.co, or ask the agent: “What’s the HF ID for Llama 3.1 8B?” The agent resolves names and can list common aliases.
Not in a single pipeline. Create two pipelines (one AWQ, one FP8) and A/B them at the app layer. Or deploy the same pipeline twice with different variants and route client-side.
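Client-side routing between two deployments can be as simple as hashing a stable identifier, so that each user consistently hits the same variant. This is a minimal sketch under stated assumptions: the endpoint URLs are placeholders, and `pick_variant` is a hypothetical helper, not part of any RunInfra SDK.

```python
import hashlib

# Hypothetical endpoint URLs for two separately deployed pipelines.
VARIANTS = {
    "awq": "https://example.invalid/pipelines/awq",
    "fp8": "https://example.invalid/pipelines/fp8",
}


def pick_variant(user_id, fp8_share=0.5):
    """Deterministically assign a user to a variant by hashing the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "fp8" if bucket < fp8_share else "awq"


variant = pick_variant("user-123")
print(variant, VARIANTS[variant])
```

Hashing rather than random choice keeps assignments stable across sessions, which makes per-variant latency and quality comparisons easier to interpret.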
The agent flags the combination before consuming GPU time. You get a recommendation to quantize (AWQ 4-bit fits most 70B models on 80 GB GPUs) or move to a larger tier (H200, B200).
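The arithmetic behind the quantization recommendation can be sketched as follows. This is a back-of-the-envelope estimate of weight storage only: it ignores the KV cache, activations, and the small overhead that quantization formats add for scales and zero points, so real requirements are higher. The helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), ignoring KV cache and overhead."""
    return params_billion * bits_per_weight / 8


for label, bits in [("FP16", 16), ("FP8", 8), ("AWQ 4-bit", 4)]:
    print(f"70B at {label}: ~{weight_memory_gb(70, bits):.0f} GB")
```

At 16 bits per weight, a 70B model needs roughly 140 GB for weights alone, which exceeds a single 80 GB GPU. At 4 bits the weights take roughly 35 GB, leaving headroom for the KV cache.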
Some model licenses (Gemma 2, Meta Llama 3) have commercial-use or user-count gates. RunInfra passes these through; you agree to them on your HF account. RunInfra does not add its own restrictions.
Yes for custom uploads (Team+). GGUF is supported alongside SafeTensors, PyTorch, and ONNX. The optimizer picks the best serving backend per format.
Yes. New Hugging Face releases become discoverable within 24 hours. Breakthrough models (new Llama, DeepSeek, Qwen series) are surfaced as featured picks in the Pipes chat.

Next steps

Which model?

Decision table by use case and priority.

Optimize your model

GPU profiling, quantization search, Forge kernels.

GPU tiers and pricing

How model size picks a GPU.

Deploy your pipeline

Flex and Active modes.