Documentation Index
Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Which model families are supported?
Which model families are supported?
Llama 3.x/4, Qwen 2.5, Mistral and Mixtral, DeepSeek V2/V3/R1, Gemma 2, Phi-3/4, Cohere Command-R/R+, Whisper (all sizes), XTTS v2, Bark. Plus any compatible model on Hugging Face on Team plan. See Models.
Can I use a custom fine-tuned model?
Can I use a custom fine-tuned model?
Team plan and above. Upload SafeTensors, PyTorch, GGUF, or ONNX up to 50 GB at runinfra.ai/models. The optimizer runs on custom models just like catalog models.
What context window does each model have?
What context window does each model have?
Depends on the model. Llama 3.1 is 128K, Qwen 2.5 is 128K, Mistral Small 3 is 32K, DeepSeek V3 is 128K. The
/v1/models endpoint returns each model’s context window. Exceeding the window returns 400 context_length_exceeded.Which quantization should I pick?
Which quantization should I pick?
AWQ 4-bit is the best default on 7 to 70B models. FP8 is best on H100/H200 when quality matters. GPTQ gives you size flexibility at 3, 4, or 8 bits. Let the optimizer rank them against your priority; see Optimization.
Does streaming work?
Does streaming work?
Yes on every chat and completion endpoint. Set
stream: true and iterate chunks. See the streaming cookbook.Does tool/function calling work?
Does tool/function calling work?
Yes. Use the OpenAI-shaped
tools array and tool_choice parameter on compatible chat deployments. See the tool calling cookbook.Does structured output work?
Does structured output work?
Yes.
response_format with json_object or json_schema (strict mode). Compatible with Pydantic and Zod helpers in the OpenAI SDK. See the structured output cookbook.Does vision work?
Does vision work?
Yes on VLM-backed pipelines (Llama 3.2 Vision, Qwen 2.5 VL, etc.). Send
image_url or base64 content blocks. Up to 30 images per request.Does RunInfra do fine-tuning?
Does RunInfra do fine-tuning?
Supervised fine-tuning and LoRA are roadmap for Team plans. Today, fine-tune externally (HF Transformers, Axolotl, Unsloth) and upload the result as a custom model.
What's the maximum `max_tokens` I can request?
What's the maximum `max_tokens` I can request?
Up to the model’s context window minus your prompt length. Practical limits: 8K for most chat flows, 16-32K for long summaries. If you need more, split into streaming requests to avoid 504 timeouts.
Is there batch inference?
Is there batch inference?
Team plan. Submit up to 50,000 requests via
POST /v1/batch, results ready within 24 hours at a 50 percent discount. Today the recommended pattern is async parallel with AsyncOpenAI and your own queue.Can I compare models side-by-side?
Can I compare models side-by-side?
Yes in the playground under the Deploy tab. Select two variants after an optimization run, send the same prompt to both, compare outputs and metrics.
How do I know which model my pipeline is serving?
How do I know which model my pipeline is serving?
GET /v1/YOUR_PIPELINE_ID/models returns the model id. The response of any chat completion includes model: "meta-llama/Llama-3.1-8B-Instruct" so you always know what served the request.Can I pin a specific model version?
Can I pin a specific model version?
Yes. Instead of
model: "default", pass the full Hugging Face ID (meta-llama/Llama-3.1-8B-Instruct) in your request. Pipelines support explicit aliases that map names to specific commits.Not here?
Models catalog
Full supported list with parameters.
Optimization
How quantization and Forge work.
API reference
Endpoint parameter details.
Which model?
Decision table for picking a starting point.