RAG search - RunInfra

A RAG (retrieval-augmented generation) pipeline takes a question, retrieves relevant chunks from your corpus, reranks them, generates a grounded answer with an LLM, and returns the answer along with the exact citation spans used to produce it. RunInfra ships the recipe end-to-end, so the citation evidence can be audited against the source text rather than taken on trust.

Architecture

Question
  -> Embedding (BGE)
  -> Hybrid retrieval (vector + BM25, top 50)
  -> Cross-encoder reranker (BGE reranker, top 8)
  -> LLM with chunks + citation instructions (Llama 3.1 8B FP8)
  -> Answer + citation spans (which chunk, which character range)

The default stack is BGE for embeddings and reranking and Llama 3.1 8B FP8 for generation. Every component is configurable: swap to GTE-Qwen2 for non-English corpora, to a 70B model for tougher questions, or to the Cohere reranker if your contract requires it.
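
To make the hybrid retrieval stage concrete, here is a minimal sketch of one common way to combine dense and BM25 scores with a configurable weight. The fusion method (min-max normalization followed by a weighted sum), the function name `hybrid_top_k`, and the `dense_weight` parameter are assumptions for illustration; this page does not specify how RunInfra fuses the two score sets.

```python
from typing import Dict, List, Tuple


def _min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie; treat them as equally relevant.
        return {cid: 1.0 for cid in scores}
    return {cid: (v - lo) / (hi - lo) for cid, v in scores.items()}


def hybrid_top_k(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    dense_weight: float = 0.5,  # hypothetical knob, not a documented setting
    k: int = 50,
) -> List[Tuple[str, float]]:
    """Fuse per-chunk dense and BM25 scores and return the k best chunks."""
    d, s = _min_max(dense), _min_max(sparse)
    fused = {
        cid: dense_weight * d.get(cid, 0.0) + (1.0 - dense_weight) * s.get(cid, 0.0)
        for cid in set(d) | set(s)
    }
    # Sort by score descending, then by chunk id for a deterministic order.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

A chunk that only one retriever returns still competes, with a zero contribution from the other side. The top 50 fused candidates would then go to the reranker, which narrows them to 8.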

What you get out of the box

Hybrid retrieval (dense + sparse) with configurable weights
Cross-encoder reranking so the LLM sees genuinely relevant chunks
Citation spans in the response: which chunk and which character range
Eval harness hook so you can score against your own gold set
One HTTP endpoint that does retrieve + rerank + generate end-to-end
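
Because each citation names a chunk and a character range, a client can resolve it back to the quoted text and check it before display. The sketch below assumes a hypothetical span shape (`chunk_id`, `start`, `end`, with half-open offsets); the actual response schema is not given on this page, so treat the field names as placeholders.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CitationSpan:
    # Hypothetical field names; adjust to the real response schema.
    chunk_id: str
    start: int  # inclusive character offset into the chunk text
    end: int    # exclusive character offset


def resolve_span(span: CitationSpan, chunks: Dict[str, str]) -> str:
    """Return the cited text, or raise if the span does not fit the chunk."""
    if span.chunk_id not in chunks:
        raise KeyError(f"unknown chunk id: {span.chunk_id}")
    text = chunks[span.chunk_id]
    if not (0 <= span.start < span.end <= len(text)):
        raise ValueError(
            f"span [{span.start}, {span.end}) is outside chunk of length {len(text)}"
        )
    return text[span.start:span.end]
```

Rejecting out-of-range spans is what makes the citation auditable: a span that does not resolve to real source text is surfaced as an error instead of being shown to the user.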

Example prompt

In the dashboard:

Build a RAG pipeline over our internal docs corpus.
Use BGE for embeddings + reranking, Llama 3.1 8B for the generator.
Return citations as character spans in the source chunk.
Optimize for answer quality, not throughput.

Cookbook

For full code that shows ingestion, embedding, retrieval, and generation against the OpenAI-compatible API, see the RAG cookbook.

Eval pattern

The RAG agent expects you to bring your own eval set. Three columns are enough:

column	meaning
`question`	The user question
`expected_answer`	The reference answer for human or LLM-judge scoring
`expected_citations`	The chunk ids the model should cite

Score the deployed pipeline against this set before promoting it from Flex to Active.
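
The `expected_citations` column can be scored automatically. As a sketch, the functions below compute citation precision and recall per question and average them over the set. The `cited` key (the chunk ids the pipeline actually returned) and both function names are assumptions for illustration, not part of the eval harness hook.

```python
from typing import Dict, Iterable, List


def citation_scores(expected: Iterable[str], cited: Iterable[str]) -> Dict[str, float]:
    """Precision and recall of cited chunk ids against the expected ones."""
    exp, got = set(expected), set(cited)
    hits = len(exp & got)
    return {
        "precision": hits / len(got) if got else 0.0,
        "recall": hits / len(exp) if exp else 0.0,
    }


def mean_citation_scores(rows: List[Dict[str, List[str]]]) -> Dict[str, float]:
    """Average per-question scores; each row holds expected and cited ids."""
    if not rows:
        return {"precision": 0.0, "recall": 0.0}
    per_row = [citation_scores(r["expected_citations"], r["cited"]) for r in rows]
    return {
        key: sum(score[key] for score in per_row) / len(per_row)
        for key in ("precision", "recall")
    }
```

Low recall means the model is missing chunks it should cite, while low precision means it is citing chunks outside the gold set. Scoring `expected_answer` still needs a human or LLM judge.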

Deeper details

See runinfra.ai/use-cases/rag-search for the marketing page with retrieval recall numbers and end-to-end latency budgets.
