The short answer: tell the agent your use case and priority, let it pick. The long answer, so you can sanity-check what the agent suggests:
By use case
| Use case | Recommended starting point | Why |
|---|---|---|
| Customer-support chatbot | Llama 3.1 8B + AWQ on L40S | Cheapest sub-200ms latency, solid instruction following |
| Document summarization | Qwen 2.5 14B + AWQ on A100 | Long context (128K), strong at compression |
| Code generation | DeepSeek Coder V2 on H100 | Best open coder; FP8 on H100 is the sweet spot |
| Multilingual chat | Qwen 2.5 7B on L40S | Native multilingual training; outperforms Llama on non-English |
| Reasoning / math | DeepSeek R1 on H100 | Best open reasoning model |
| Fast extraction / classification | Phi-3 Mini on L4 | Cheapest, still good enough for JSON extraction |
| Voice assistant | Whisper Large V3 Turbo + Llama 8B + XTTS v2 | 3-node pipeline, sub-500ms end-to-end |
| RAG backend | bge-m3 embeddings + Llama 8B | bge-m3 for retrieval, any instruction model for generation |
By priority
- Latency
- Cost
- Throughput
- Quality
For latency: pick the smallest model that still passes your quality bar. Quantize with AWQ 4-bit. Deploy on L4 or L40S. Enable speculation on the Team plan.
Target: P99 under 200 ms for 1B to 8B models, under 400 ms for 14B to 30B models.
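The P99 targets above can be checked against latencies you measure yourself. The following is a minimal sketch, not part of the platform: the function names are hypothetical, the percentile uses the nearest-rank method, and sizes outside the two stated ranges are treated as having no target because the text gives none.

```python
# P99 targets from the text above: (min size, max size) in billions of
# parameters mapped to a latency ceiling in milliseconds.
P99_TARGETS_MS = {
    (1, 8): 200.0,
    (14, 30): 400.0,
}


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty sequence of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n) computed in integer arithmetic, as a 1-based rank
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def target_for(params_b):
    """Return the P99 target in ms, or None if the text states no target."""
    for (low_b, high_b), target_ms in P99_TARGETS_MS.items():
        if low_b <= params_b <= high_b:
            return target_ms
    return None


def meets_target(latencies_ms, params_b):
    """True if the measured P99 is strictly under the target for this size."""
    target_ms = target_for(params_b)
    if target_ms is None:
        raise ValueError(f"no P99 target stated for {params_b}B models")
    return p99(latencies_ms) < target_ms


if __name__ == "__main__":
    # Hypothetical measurements: 98 fast requests and 2 slow outliers.
    samples = [120.0] * 98 + [450.0] * 2
    print(p99(samples))              # 450.0
    print(meets_target(samples, 8))  # False
```

With only 100 samples, two slow requests are enough to set the P99, so collect enough traffic before judging a deployment against the target.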
By model size
| Size | Typical cost | Quality ceiling | Best for |
|---|---|---|---|
| 1 to 3B | Cheapest | Simple extraction, classification | Internal tools, FAQ bots |
| 7 to 8B | Low | Good chat, basic tool use | Production chat, customer support |
| 14B | Medium | Strong general-purpose | Most SaaS features |
| 30 to 32B | High | Great reasoning and code | Agent backbones, code copilots |
| 70B+ | Highest | SOTA open performance | Flagship products, research |
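To sanity-check whether a size tier is likely to fit on a given GPU, a rough weights-only estimate is parameters times bits per weight divided by 8. The sketch below assumes that simplification; it ignores KV cache, activations, quantization metadata, and runtime overhead, so real usage will be higher. The function name and the sample sizes are illustrative.

```python
def weight_memory_gb(params_b, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    params_b is the parameter count in billions. This is a lower bound:
    KV cache, activations, and quantization scales are not included.
    """
    return params_b * bits_per_weight / 8


if __name__ == "__main__":
    # One representative size per tier in the table above (illustrative).
    for params_b in (3, 8, 14, 32, 70):
        four_bit = weight_memory_gb(params_b, 4)    # e.g. AWQ 4-bit
        sixteen_bit = weight_memory_gb(params_b, 16)  # FP16 / BF16
        print(f"{params_b}B: ~{four_bit:.1f} GB at 4-bit, "
              f"~{sixteen_bit:.1f} GB at 16-bit")
```

For example, an 8B model needs roughly 4 GB for weights at 4-bit versus 16 GB at 16-bit, which is part of why 4-bit quantization pairs well with smaller GPUs.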
When to ask the agent to recommend
If you don’t know which model to pick, just ask the agent.
Next steps
Models catalog
Full list of supported models.
GPUs and pricing
Which GPUs match which model sizes.
Optimization
AWQ, GPTQ, FP8, TensorRT-LLM.
Example prompts
Copy-ready prompts for every shape.