The short answer: tell the agent your use case and priority, and let it pick. The long answer follows, so you can sanity-check what the agent suggests.

By use case

| Use case | Recommended starting point | Why |
| --- | --- | --- |
| Customer-support chatbot | Llama 3.1 8B + AWQ on L40S | Cheapest sub-200ms latency, solid instruction following |
| Document summarization | Qwen 2.5 14B + AWQ on A100 | Long context (128K), strong at compression |
| Code generation | DeepSeek Coder V2 on H100 | Best open coder, FP8 on H100 is the sweet spot |
| Multilingual chat | Qwen 2.5 7B on L40S | Native multilingual training; outperforms Llama on non-English |
| Reasoning / math | DeepSeek R1 on H100 | Best open reasoning model |
| Fast extraction / classification | Phi-3 Mini on L4 | Cheapest, still good enough for JSON extraction |
| Voice assistant | Whisper Large V3 Turbo + Llama 8B + XTTS v2 | 3-node pipeline, sub-500ms end-to-end |
| RAG backend | bge-m3 embeddings + Llama 8B | bge-m3 for retrieval, any instruction model for generation |
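If you want these starting points available in a script (for example, to pre-fill a request to the agent), a plain lookup is enough. This is a minimal sketch: the dictionary keys and the `starting_point` helper are hypothetical names chosen for illustration, and the values are copied from the use-case table.

```python
from typing import Optional

# Starting points copied from the use-case table.
# The snake_case keys are hypothetical identifiers, not part of any API.
STARTING_POINTS = {
    "customer_support_chatbot": "Llama 3.1 8B + AWQ on L40S",
    "document_summarization": "Qwen 2.5 14B + AWQ on A100",
    "code_generation": "DeepSeek Coder V2 on H100",
    "multilingual_chat": "Qwen 2.5 7B on L40S",
    "reasoning_math": "DeepSeek R1 on H100",
    "fast_extraction_classification": "Phi-3 Mini on L4",
    "voice_assistant": "Whisper Large V3 Turbo + Llama 8B + XTTS v2",
    "rag_backend": "bge-m3 embeddings + Llama 8B",
}


def starting_point(use_case: str) -> Optional[str]:
    """Return the suggested starting point, or None for an unlisted use case."""
    return STARTING_POINTS.get(use_case)
```

A `None` result is the cue to describe the use case to the agent instead of guessing.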

By priority

Pick the smallest model that still passes your quality bar. Quantize with AWQ 4-bit. Deploy on L4 or L40S. Enable speculation on the Team plan.

Target: P99 under 200 ms for 1 to 8B, under 400 ms for 14 to 30B.
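To check a deployment against these P99 targets, you can compute the 99th percentile from your own latency measurements. The sketch below assumes you have already collected per-request latencies in milliseconds. The function names are illustrative, and it uses the nearest-rank percentile definition, which may differ slightly from what your monitoring tool reports.

```python
# P99 targets quoted in this section, in milliseconds,
# keyed by an inclusive parameter-count range in billions.
P99_TARGETS_MS = [
    ((1, 8), 200),
    ((14, 30), 400),
]


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceil(0.99 * n), written to avoid floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def target_for(params_b):
    """Return the stated P99 target for a model size, or None if none is given."""
    for (low, high), target_ms in P99_TARGETS_MS:
        if low <= params_b <= high:
            return target_ms
    return None


def meets_target(latencies_ms, params_b):
    """True if measured P99 is under the stated target for this model size."""
    target_ms = target_for(params_b)
    if target_ms is None:
        raise ValueError(f"no P99 target stated for {params_b}B models")
    return p99(latencies_ms) < target_ms
```

For example, `meets_target(samples, 8)` compares the measured P99 against the 200 ms target for an 8B model. Sizes outside the two stated ranges raise an error because this section gives no target for them.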

By model size

| Size | Typical cost | Quality ceiling | Best for |
| --- | --- | --- | --- |
| 1 to 3B | Cheapest | Simple extraction, classification | Internal tools, FAQ bots |
| 7 to 8B | Low | Good chat, basic tool use | Production chat, customer support |
| 14B | Medium | Strong general-purpose | Most SaaS features |
| 30 to 32B | High | Great reasoning and code | Agent backbones, code copilots |
| 70B+ | Highest | SOTA open performance | Flagship products, research |
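The advice to pick the smallest model that still passes your quality bar can be read as a simple search over the size tiers, cheapest first. The sketch below assumes you supply your own evaluation as a callable. `passes_quality_bar` and `smallest_passing_tier` are hypothetical names, and the scores in the usage example are made-up placeholders, not measurements.

```python
from typing import Callable, Optional, Sequence

# Size tiers from the table, ordered smallest (cheapest) first.
SIZE_TIERS = ("1 to 3B", "7 to 8B", "14B", "30 to 32B", "70B+")


def smallest_passing_tier(
    passes_quality_bar: Callable[[str], bool],
    tiers: Sequence[str] = SIZE_TIERS,
) -> Optional[str]:
    """Return the first (smallest) tier that passes, or None if none do."""
    for tier in tiers:
        if passes_quality_bar(tier):
            return tier
    return None


# Usage with placeholder scores from a hypothetical evaluation run.
example_scores = {
    "1 to 3B": 0.61,
    "7 to 8B": 0.74,
    "14B": 0.83,
    "30 to 32B": 0.88,
    "70B+": 0.91,
}
choice = smallest_passing_tier(lambda tier: example_scores[tier] >= 0.80)
```

Stopping at the first passing tier keeps cost at the lowest level in the table that meets the bar. Larger tiers are only evaluated if the smaller ones fail.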

When to ask the agent to recommend

If you don’t know, just ask:
I need a chatbot for an e-commerce site. Budget $200/month, under 200ms latency,
traffic is ~50 RPM. What model do you recommend?
The agent will suggest a starting point, explain its reasoning, and let you adjust. See Best practices for how to phrase these requests.
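Before sending a request like the one above, it can help to turn the budget and traffic figures into a per-request number so you can sanity-check the agent's suggestion. This is back-of-the-envelope arithmetic only: it assumes the 50 RPM is sustained around the clock over a 30-day month, which real traffic rarely is, and the function names are illustrative.

```python
def monthly_requests(rpm: float, days: int = 30) -> float:
    """Requests per month, assuming a constant rate all day, every day."""
    return rpm * 60 * 24 * days


def budget_per_1k_requests(budget_usd: float, rpm: float, days: int = 30) -> float:
    """Dollars available per 1,000 requests at the given budget and rate."""
    return budget_usd / monthly_requests(rpm, days) * 1000


requests = monthly_requests(50)               # 2,160,000 requests per month
per_1k = budget_per_1k_requests(200, 50)      # roughly $0.093 per 1,000 requests
```

If your traffic is concentrated in business hours, the real monthly volume is lower and the budget per request is correspondingly higher.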

Next steps

Models catalog

Full list of supported models.

GPUs and pricing

Which GPUs match which model sizes.

Optimization

AWQ, GPTQ, FP8, TensorRT-LLM.

Example prompts

Copy-ready prompts for every shape.