What this does. Takes a user query, retrieves the top-k most relevant documents from a vector store using RunInfra embeddings, then generates an answer with a RunInfra chat pipeline grounded in that context. When to use it. Any knowledge-base Q&A, doc search, customer-support bot that needs to cite internal data, or any chatbot that should not hallucinate outside a fixed corpus.Documentation Index
Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Prereqs
- One embeddings pipeline deployed (e.g.
bge-m3). See Models. - One chat pipeline deployed (any LLM).
- A vector store. This recipe uses Postgres +
pgvector; swap for Pinecone, Qdrant, Weaviate, or Chroma freely.
Minimal code
Ingestion
Before the loop above works you need embeddings in the store. One-time or incremental:VECTOR(N) dimension must match the embeddings model. bge-m3 emits 1024, bge-small-en-v1.5 emits 384. See Models.
What to tune
| Knob | Effect |
|---|---|
k in retrieve | 3 to 10 is typical. More = more context but more tokens and slower |
| Chunk size at ingestion | 300 to 800 tokens per chunk. Smaller = more precise, larger = more context per hit |
| Chunk overlap | 10 to 20 percent. Preserves meaning across chunk boundaries |
| Distance operator | <-> (L2), <=> (cosine), <#> (inner product). Cosine is safest for normalized embeddings |
| Reranker | Add a reranker pass (e.g. bge-reranker) after retrieval for a quality bump |
Common mistakes
- Forgetting to normalize embeddings. If your model emits L2-normalized vectors, use cosine distance (
<=>). Mixing normalized vectors with raw L2 distance produces garbage ordering. - Letting the model answer from pretraining. The system prompt above explicitly says “only from the context”. Without that, the model will fall back to parametric knowledge and hallucinate.
- Over-chunking. Splitting on fixed character counts tears sentences apart. Split on paragraph or section boundaries first, then split further only if chunks exceed your token budget.
- Ingesting without dedup. If your source changes, deleting-and-reinserting beats “upserting” because you avoid stale embeddings. Key by document hash.
Next steps
Embeddings API
The endpoint this recipe calls.
Models
Which embedding models RunInfra serves.
Tool calling
Wrap retrieval as a tool the model can invoke.
Structured output
Return RAG answers in a guaranteed JSON shape.