Retrieval-augmented generation

What this does. Takes a user query, retrieves the top-k most relevant documents from a vector store using RunInfra embeddings, then generates an answer with a RunInfra chat pipeline grounded in that context.

When to use it. Any knowledge-base Q&A, doc search, customer-support bot that needs to cite internal data, or any chatbot that should not hallucinate outside a fixed corpus.

Prereqs

One embeddings pipeline deployed (e.g. bge-m3). See Models.
One chat pipeline deployed (any LLM).
A vector store. This recipe uses Postgres + pgvector; swap for Pinecone, Qdrant, Weaviate, or Chroma freely.

Minimal code

import os
import psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI

EMBED = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key=os.environ["RUNINFRA_GATEWAY_KEY"],
)
CHAT = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key=os.environ["RUNINFRA_GATEWAY_KEY"],
)

def embed(text: str) -> list[float]:
    return EMBED.embeddings.create(model="bge-m3", input=text).data[0].embedding

def retrieve(query: str, k: int = 5) -> list[str]:
    vec = embed(query)
    with psycopg.connect(os.environ["POSTGRES_URL"]) as conn:
        register_vector(conn)
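        # Cast the parameter explicitly: psycopg sends a Python list as float8[],
        # and pgvector's <-> operator needs a vector on both sides.
        rows = conn.execute(
            "SELECT content FROM docs ORDER BY embedding <-> %s::vector LIMIT %s",
            (vec, k),
        ).fetchall()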
    return [r[0] for r in rows]

def answer(query: str) -> str:
    context = "\n\n---\n\n".join(retrieve(query))
    response = CHAT.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content":
                "Answer only from the context below. If the context does not contain the answer, say 'I don't know.'"},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What is RunInfra's cold start time?"))

Ingestion

Before the query code above can return anything, the store needs embeddings. Run this once, or incrementally as new documents arrive:

def ingest(documents: list[str]):
    with psycopg.connect(os.environ["POSTGRES_URL"]) as conn:
        register_vector(conn)
        for doc in documents:
            vec = embed(doc)
            conn.execute(
                "INSERT INTO docs (content, embedding) VALUES (%s, %s)",
                (doc, vec),
            )
        conn.commit()
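
The ingest function above makes one embeddings request per document. For larger corpora, a batched variant may cut round trips. The sketch below is an illustration only: `embed_in_batches` and the `embed_batch` callable are hypothetical names, and it assumes the embeddings endpoint accepts a list of inputs and returns vectors in the same order, as the OpenAI API does. This recipe does not confirm that for RunInfra.

```python
from typing import Callable, Iterator

# Hypothetical callable: takes a list of texts and returns one vector per text,
# in the same order.
EmbedBatch = Callable[[list[str]], list[list[float]]]

def embed_in_batches(
    documents: list[str],
    embed_batch: EmbedBatch,
    batch_size: int = 32,
) -> Iterator[tuple[str, list[float]]]:
    """Yield (document, embedding) pairs, calling embed_batch once per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        vectors = embed_batch(batch)
        # Guard against a backend that drops or merges inputs.
        if len(vectors) != len(batch):
            raise RuntimeError("embedder returned a different number of vectors")
        yield from zip(batch, vectors)
```

Each yielded pair can be passed to the same INSERT statement that ingest uses. If list input is supported, `embed_batch` could wrap `EMBED.embeddings.create(model="bge-m3", input=texts)` and collect `.embedding` from each item in `.data`.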

Schema:

CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE docs (
  id        SERIAL PRIMARY KEY,
  content   TEXT NOT NULL,
  embedding VECTOR(1024) NOT NULL
);
CREATE INDEX ON docs USING hnsw (embedding vector_l2_ops);

The VECTOR(N) dimension must match the embeddings model. bge-m3 emits 1024, bge-small-en-v1.5 emits 384. See Models.
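
A dimension mismatch usually surfaces as a database error at insert time. A small guard in application code can fail earlier with a clearer message. This is a minimal sketch: `check_dimension` and `EXPECTED_DIM` are hypothetical names, and 1024 is taken from the bge-m3 figure above.

```python
EXPECTED_DIM = 1024  # must equal N in VECTOR(N); 1024 matches bge-m3

def check_dimension(vec: list[float], expected: int = EXPECTED_DIM) -> list[float]:
    """Return vec unchanged, or raise if its length differs from the column size."""
    if len(vec) != expected:
        raise ValueError(
            f"embedding has {len(vec)} dimensions, column expects {expected}"
        )
    return vec
```

Wrapping the result of embed in `check_dimension(...)` before the INSERT catches a swapped model name before any rows are written.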

What to tune

Knob	Effect
`k` in retrieve	3 to 10 is typical. More = more context but more tokens and slower
Chunk size at ingestion	300 to 800 tokens per chunk. Smaller = more precise, larger = more context per hit
Chunk overlap	10 to 20 percent. Preserves meaning across chunk boundaries
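Distance operator	`<->` (L2), `<=>` (cosine), `<#>` (negative inner product). All three rank unit-length vectors identically; cosine is the safe default when you are unsure whether embeddings are normalized. The index operator class must match the operator (`vector_l2_ops`, `vector_cosine_ops`, `vector_ip_ops`)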
Reranker	Add a reranker pass (e.g. bge-reranker) after retrieval for a quality bump
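
The chunk size and overlap knobs can be sketched as a sliding window. This is an illustration under a loud assumption: it counts whitespace-separated words as a stand-in for tokens, so real token counts from your model's tokenizer will differ. `chunk_words` is a hypothetical helper, not part of any RunInfra SDK.

```python
def chunk_words(text: str, size: int = 500, overlap: int = 75) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words
    with the previous window. Words approximate tokens here."""
    if size < 1:
        raise ValueError("size must be at least 1")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The defaults give a 15 percent overlap, inside the 10 to 20 percent range in the table. Each returned chunk would be embedded and stored as its own row.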

Common mistakes

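Forgetting to normalize embeddings. For unit-length vectors, L2 (<->), cosine (<=>), and inner product (<#>) all produce the same ordering, so the <-> query in this recipe ranks results the same way cosine would. If your model does not emit normalized vectors, L2 and inner product also depend on vector magnitude, and the ordering can diverge sharply from cosine similarity; use <=> or normalize before inserting. Whichever operator you choose, build the index with the matching operator class (vector_l2_ops, vector_cosine_ops, or vector_ip_ops), otherwise the query cannot use the index.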
Letting the model answer from pretraining. The system prompt above explicitly says “only from the context”. Without that, the model will fall back to parametric knowledge and hallucinate.
Over-chunking. Splitting on fixed character counts tears sentences apart. Split on paragraph or section boundaries first, then split further only if chunks exceed your token budget.
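Ingesting without dedup. When a source document changes, delete all of its rows and reinsert them rather than upserting chunk by chunk: if the new version splits into fewer chunks, an upsert leaves the extra old chunks and their stale embeddings in place. Key each document by a hash of its content so unchanged documents can be skipped.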
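
The hash-keyed delete-and-reinsert idea can be sketched as a planning step that runs before any database writes. This is a sketch under assumptions: it presumes each document has a stable ID and that the store records a content hash per document. Neither a `doc_id` nor a `content_hash` column exists in the schema above, so both are hypothetical additions, as are the function names.

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 hex digest of the document text, used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(
    stored: dict[str, str],    # doc_id -> hash currently recorded in the store
    incoming: dict[str, str],  # doc_id -> current document text
) -> tuple[list[str], list[str]]:
    """Return (ids_to_delete, ids_to_insert).

    Changed documents appear in both lists: delete all their old rows,
    then insert fresh ones. Unchanged documents appear in neither.
    """
    to_delete = []
    to_insert = []
    for doc_id, old_hash in stored.items():
        # Removed from the source, or content differs from what was embedded.
        if doc_id not in incoming or content_hash(incoming[doc_id]) != old_hash:
            to_delete.append(doc_id)
    for doc_id, text in incoming.items():
        # New document, or content differs from what was embedded.
        if stored.get(doc_id) != content_hash(text):
            to_insert.append(doc_id)
    return to_delete, to_insert
```

Running the deletes and inserts inside one transaction keeps readers from seeing a document with no rows partway through a sync.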

Next steps

Embeddings API

The endpoint this recipe calls.

Models

Which embedding models RunInfra serves.

Tool calling

Wrap retrieval as a tool the model can invoke.

Structured output

Return RAG answers in a guaranteed JSON shape.
