Skip to main content

Documentation Index

Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

Llama 3.x/4, Qwen 2.5, Mistral and Mixtral, DeepSeek V2/V3/R1, Gemma 2, Phi-3/4, Cohere Command-R/R+, Whisper (all sizes), XTTS v2, Bark. Plus any compatible model on Hugging Face on Team plan. See Models.
Team plan and above. Upload SafeTensors, PyTorch, GGUF, or ONNX up to 50 GB at runinfra.ai/models. The optimizer runs on custom models just like catalog models.
Depends on the model. Llama 3.1 is 128K, Qwen 2.5 is 128K, Mistral Small 3 is 32K, DeepSeek V3 is 128K. The /v1/models endpoint returns each model’s context window. Exceeding the window returns 400 context_length_exceeded.
AWQ 4-bit is the best default on 7 to 70B models. FP8 is best on H100/H200 when quality matters. GPTQ gives you size flexibility at 3, 4, or 8 bits. Let the optimizer rank them against your priority; see Optimization.
Yes on every chat and completion endpoint. Set stream: true and iterate chunks. See the streaming cookbook.
Yes. Use the OpenAI-shaped tools array and tool_choice parameter on compatible chat deployments. See the tool calling cookbook.
Yes. response_format with json_object or json_schema (strict mode). Compatible with Pydantic and Zod helpers in the OpenAI SDK. See the structured output cookbook.
Yes on VLM-backed pipelines (Llama 3.2 Vision, Qwen 2.5 VL, etc.). Send image_url or base64 content blocks. Up to 30 images per request.
Supervised fine-tuning and LoRA are roadmap for Team plans. Today, fine-tune externally (HF Transformers, Axolotl, Unsloth) and upload the result as a custom model.
Up to the model’s context window minus your prompt length. Practical limits: 8K for most chat flows, 16-32K for long summaries. If you need more, split into streaming requests to avoid 504 timeouts.
Team plan. Submit up to 50,000 requests via POST /v1/batch, results ready within 24 hours at a 50 percent discount. Today the recommended pattern is async parallel with AsyncOpenAI and your own queue.
Yes in the playground under the Deploy tab. Select two variants after an optimization run, send the same prompt to both, compare outputs and metrics.
GET /v1/YOUR_PIPELINE_ID/models returns the model id. The response of any chat completion includes model: "meta-llama/Llama-3.1-8B-Instruct" so you always know what served the request.
Yes. Instead of model: "default", pass the full Hugging Face ID (meta-llama/Llama-3.1-8B-Instruct) in your request. Pipelines support explicit aliases that map names to specific commits.

Not here?

Models catalog

Full supported list with parameters.

Optimization

How quantization and Forge work.

API reference

Endpoint parameter details.

Which model?

Decision table for picking a starting point.