Which model should I use?

The short answer: tell the agent your use case and priority, and let it pick. The long answer, so you can sanity-check what the agent suggests:

By use case

Use case	Recommended starting point	Why
Customer-support chatbot	Llama 3.1 8B with a compatible 4-bit variant on L40S	Low latency, solid instruction following
Document summarization	Qwen 2.5 14B with a compatible 4-bit or FP8 variant	Long context (128K), strong at compression
Code generation	DeepSeek Coder V2 on a high-throughput GPU	Strong open coder, use FP8 only where the compatibility check passes
Multilingual chat	Qwen 2.5 7B on L40S	Native multilingual training; outperforms Llama on non-English
Reasoning / math	DeepSeek R1 on H100	Best open reasoning model
Fast extraction / classification	Phi-3 Mini on L4	Cheapest, still good enough for JSON extraction
Voice assistant	Whisper Large V3 Turbo + Llama 8B + XTTS v2	3-node pipeline, sub-500ms end-to-end
RAG backend	bge-m3 embeddings + Llama 8B	bge-m3 for retrieval, any instruction model for generation

By priority

Latency
Cost
Throughput
Quality

Pick the smallest model that still passes your quality bar. Let the optimizer try compatible 4-bit or FP8 variants. Deploy on the lowest GPU tier that clears your latency target. Enable speculation on a paid Core plan.

Target: P99 under 200 ms for 1 to 8B models, under 400 ms for 14 to 30B models.
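To sanity-check a latency recommendation yourself, the rule above can be sketched in a few lines of Python. This is a minimal sketch, not the optimizer's actual logic: the function names, the candidate tuple layout, and the `passes_quality` callback are all hypothetical, and the latency samples are assumed to come from your own load test. Sizes the guidance does not cover (for example 10B or 70B) are skipped rather than guessed.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of latency samples, in milliseconds."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n), which avoids floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_target_ms(params_billion):
    """P99 target from the guidance above; None where no target is stated."""
    if 1 <= params_billion <= 8:
        return 200
    if 14 <= params_billion <= 30:
        return 400
    return None


def pick_smallest(candidates, passes_quality):
    """Return the smallest candidate that clears both quality and latency.

    candidates: list of (name, params_billion, latencies_ms) tuples
                (hypothetical layout for this sketch).
    passes_quality: callable taking a model name and returning a bool;
                    stands in for whatever quality bar you define.
    """
    for name, size, latencies in sorted(candidates, key=lambda c: c[1]):
        target = latency_target_ms(size)
        if target is None:
            continue  # no stated target for this size, so skip it
        if passes_quality(name) and p99(latencies) < target:
            return name
    return None
```

For example, given an 8B model with a measured P99 of 150 ms and a 14B model at 300 ms, `pick_smallest` returns the 8B model as long as it passes your quality check, and falls back to the 14B model otherwise.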

By model size

Size	Typical cost	Quality ceiling	Best for
1 to 3B	Cheapest	Simple extraction, classification	Internal tools, FAQ bots
7 to 8B	Low	Good chat, basic tool use	Production chat, customer support
14B	Medium	Strong general-purpose	Most SaaS features
30 to 32B	High	Great reasoning and code	Agent backbones, code copilots
70B+	Highest	SOTA open performance	Flagship products, research
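If you want to check a suggested model against this table in a script, a small lookup is enough. The sketch below copies the rows above verbatim; the `SIZE_TIERS` name and the `tier_for` helper are hypothetical, and sizes that fall between the listed ranges (such as 5B or 20B) return `None` because the table does not say which tier they belong to.

```python
# Rows copied from the size table above:
# (min size in B, max size in B or None for open-ended, cost, quality ceiling, best for)
SIZE_TIERS = [
    (1, 3, "Cheapest", "Simple extraction, classification", "Internal tools, FAQ bots"),
    (7, 8, "Low", "Good chat, basic tool use", "Production chat, customer support"),
    (14, 14, "Medium", "Strong general-purpose", "Most SaaS features"),
    (30, 32, "High", "Great reasoning and code", "Agent backbones, code copilots"),
    (70, None, "Highest", "SOTA open performance", "Flagship products, research"),
]


def tier_for(params_billion):
    """Return the table row matching a parameter count, or None if unlisted."""
    for low, high, cost, ceiling, best_for in SIZE_TIERS:
        if params_billion >= low and (high is None or params_billion <= high):
            return {"cost": cost, "quality_ceiling": ceiling, "best_for": best_for}
    return None
```

Calling `tier_for(8)` returns the "Low" cost row, while `tier_for(5)` returns `None`, which is a cue to ask the agent rather than assume.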

When to ask the agent to recommend

If you don’t know, just ask:

I need a chatbot for an e-commerce site. Budget $200/month, under 200ms latency,
traffic is ~50 RPM. What model do you recommend?

The agent will suggest, explain its reasoning, and let you adjust. See Best practices for how to phrase these asks.
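The example ask above carries four pieces of information: use case, monthly budget, latency target, and traffic. If you generate these asks programmatically, a template along the following lines keeps all four present. This is only a sketch; `build_ask` and its parameter names are hypothetical, and the wording simply mirrors the example prompt rather than any required format.

```python
def build_ask(use_case, budget_usd_per_month, max_latency_ms, traffic_rpm):
    """Format a model-recommendation ask in the shape of the example prompt."""
    for label, value in (
        ("budget_usd_per_month", budget_usd_per_month),
        ("max_latency_ms", max_latency_ms),
        ("traffic_rpm", traffic_rpm),
    ):
        if value <= 0:
            raise ValueError(f"{label} must be positive")
    return (
        f"I need {use_case}. Budget ${budget_usd_per_month}/month, "
        f"under {max_latency_ms}ms latency, traffic is ~{traffic_rpm} RPM. "
        "What model do you recommend?"
    )


if __name__ == "__main__":
    print(build_ask("a chatbot for an e-commerce site", 200, 200, 50))
```

Stating the numbers explicitly gives the agent concrete constraints to reason about and gives you something to check its suggestion against.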

Next steps

Models catalog

Full list of supported models.

GPUs and pricing

Which GPUs match which model sizes.

Optimization

Quantization, FP8, TensorRT-LLM, and GPU selection.

Example prompts

Copy-ready prompts for every shape.
