Skip to main content

Documentation Index

Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

A RAG (retrieval-augmented generation) pipeline takes a question, retrieves relevant chunks from your corpus, reranks them, generates a grounded answer with the LLM, and returns the answer plus the exact citation spans used to produce it. RunInfra ships the recipe end-to-end so the citation evidence is auditable, not just a vibes-check.

Architecture

Question
  -> Embedding (BGE)
  -> Hybrid retrieval (vector + BM25, top 50)
  -> Cross-encoder reranker (BGE reranker, top 8)
  -> LLM with chunks + citation instructions (Llama 3.1 8B FP8)
  -> Answer + citation spans (which chunk, which character range)
The default stack is BGE for embeddings and reranking, Llama 3.1 8B FP8 for generation. Every component is configurable: swap to GTE-Qwen2 for non-English corpora, swap to a 70B model for tougher questions, or swap to Cohere reranker if your contract requires it.

What you get out of the box

  • Hybrid retrieval (dense + sparse) with configurable weights
  • Cross-encoder reranking so the LLM sees genuinely relevant chunks
  • Citation spans in the response: which chunk and which character range
  • Eval harness hook so you can score against your own gold set
  • One HTTP endpoint that does retrieve + rerank + generate end-to-end

Example prompt

In Pipes:
Build a RAG pipeline over our internal docs corpus.
Use BGE for embeddings + reranking, Llama 3.1 8B for the generator.
Return citations as character spans in the source chunk.
Optimize for answer quality, not throughput.

Cookbook

For full code that shows ingestion, embedding, retrieval, and generation against the OpenAI-compatible API, see the RAG cookbook.

Eval pattern

The RAG agent expects you to bring your own eval set. Three columns are enough:
columnmeaning
questionThe user question
expected_answerThe reference answer for human or LLM-judge scoring
expected_citationsThe chunk ids the model should cite
Score against the deployed pipeline before promoting from Flex to Active.

Deeper details

See runinfra.ai/use-cases/rag-search for the marketing page with retrieval recall numbers and end-to-end latency budgets.