Models and inference

Which model families are supported?

Llama 3.x/4, Qwen 2.5, Mistral and Mixtral, DeepSeek V2/V3/R1, Gemma 2, Phi-3/4, Cohere Command-R/R+, Whisper (all sizes), XTTS v2, Bark. Plus any compatible model on Hugging Face. See Models.

Can I use a custom fine-tuned model?

Yes. Upload SafeTensors, PyTorch, GGUF, or ONNX up to 50 GB at runinfra.ai/models. The optimizer runs on custom models just like catalog models.

What context window does each model have?

Depends on the model. Llama 3.1 is 128K, Qwen 2.5 is 128K, Mistral Small 3 is 32K, DeepSeek V3 is 128K. The /v1/models endpoint returns each model’s context window. Exceeding the window returns 400 context_length_exceeded.

Which quantization should I pick?

Let the optimizer rank compatible methods for your model, runtime, GPU, and quality target. 4-bit LLM variants reduce VRAM and FP8 can preserve more fidelity on compatible GPUs. See Optimization.

Does streaming work?

Yes for chat and Responses requests when the selected deployment supports streaming. Set stream: true and iterate chunks. See the streaming cookbook.

Does tool/function calling work?

Yes. Use the OpenAI-shaped tools array and tool_choice parameter on compatible chat deployments. See the tool calling cookbook.

Does structured output work?

Yes. response_format with json_object or json_schema (strict mode). Compatible with Pydantic and Zod helpers in the OpenAI SDK. See the structured output cookbook.

Does vision work?

Yes on VLM-backed pipelines (Llama 3.2 Vision, Qwen 2.5 VL, etc.). Send image_url or base64 content blocks. Up to 30 images per request.

Does RunInfra do fine-tuning?

Supervised fine-tuning and LoRA are on the roadmap. Today, fine-tune externally (HF Transformers, Axolotl, Unsloth) and upload the result as a custom model.

What's the maximum `max_tokens` I can request?

Up to the model’s context window minus your prompt length. Practical limits: 8K for most chat flows, 16-32K for long summaries. If you need more, split into streaming requests to avoid 504 timeouts.

Is there batch inference?

Managed batch inference is not a public API today. The recommended pattern is async parallel requests with AsyncOpenAI and your own queue.

Can I compare models side-by-side?

Yes, two ways. Ask the agent to compare 2 to 5 models and RunInfra benchmarks them sequentially on the same GPU with the same serving configuration, then shows a measured side-by-side table with a winner; results appear as a comparison card in chat and in the Compare Results panel in the session workspace, and survive reload. See Compare models, engines, and techniques. You can also compare two variants manually in the playground under the Deploy tab by sending the same prompt to both.

How do I know which model my pipeline is serving?

GET /v1/models returns the callable model IDs visible to your key. The response of any chat completion also includes the model value that served the request.

Can I pin a specific model version?

Yes. Pass the full Hugging Face ID (meta-llama/Llama-3.1-8B-Instruct) in your request, or use a deployed alias returned by GET /v1/models. Pipelines support explicit aliases that map names to specific commits.

Models catalog

Full supported list with parameters.

Optimization

How quantization and Forge work.

API reference

Endpoint parameter details.

Which model?

Decision table for picking a starting point.

​Not here?

Models catalog

Optimization

API reference

Which model?

Not here?