Why hybrid retrieval instead of pure vector?

Dense embeddings miss jargon, internal codes, and rare terminology. Running BM25 lexical search alongside the dense retriever recovers those misses. The cross-encoder reranker then fuses both retrievers' outputs into one ranked list, so the LLM sees the best of both.
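As a rough illustration of that flow, the sketch below merges the candidate lists from two retrievers, drops duplicates, and ranks the union with a pairwise scorer. The names `rerank_candidates` and `toy_overlap_score` are hypothetical, and the term-overlap scorer is only a stand-in so the example runs without a model; in a real pipeline a cross-encoder would score each (query, passage) pair instead.

```python
from typing import Callable


def rerank_candidates(
    query: str,
    dense_hits: list[str],
    sparse_hits: list[str],
    passages: dict[str, str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Merge two candidate lists, dedupe, and rank the union by a pairwise scorer."""
    # dict.fromkeys keeps first-seen order while dropping duplicate ids.
    candidate_ids = list(dict.fromkeys(dense_hits + sparse_hits))
    scored = [(doc_id, score_pair(query, passages[doc_id])) for doc_id in candidate_ids]
    # Python's sort is stable, so ties keep their merged order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, passage: str) -> float:
    """Placeholder scorer (assumption): fraction of query terms found in the passage."""
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    return len(query_terms & passage_terms) / max(len(query_terms), 1)


passages = {
    "d1": "reset procedure for error code E-4471 on the pump controller",
    "d2": "general troubleshooting guide for pump controllers",
    "d3": "holiday schedule for the facilities team",
}
dense_hits = ["d2", "d3"]   # semantically close, but misses the exact code
sparse_hits = ["d1", "d2"]  # lexical match on the literal token "E-4471"

ranked = rerank_candidates(
    "error code E-4471 reset", dense_hits, sparse_hits, passages, toy_overlap_score
)
```

The point of the union step is that a passage only one retriever found (here `d1`) still reaches the reranker.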

How does the eval harness work?

Bring a gold set of question and reference-answer pairs (or generate one from your docs). The harness scores recall@k, faithfulness against the cited passages, and citation coverage. Compare two retrieval or LLM configurations side by side before you deploy.
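To make the recall metric concrete, here is a minimal sketch of recall@k averaged over a gold set and reported for two configurations. The data layout (query ids mapped to retrieved passage ids and to sets of relevant ids) and the names `config_a` and `config_b` are assumptions for illustration, not the harness's actual format.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    runs: dict[str, list[str]], gold: dict[str, set[str]], k: int
) -> float:
    """Average recall@k over every query in the gold set."""
    if not gold:
        return 0.0
    scores = [recall_at_k(runs.get(qid, []), relevant, k) for qid, relevant in gold.items()]
    return sum(scores) / len(scores)


# Hypothetical gold set: query id -> ids of passages that answer it.
gold = {"q1": {"d1", "d4"}, "q2": {"d7"}}

# Hypothetical retrieval output from two configurations, best hit first.
config_a = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d6", "d7"]}
config_b = {"q1": ["d1", "d4", "d2"], "q2": ["d7", "d5", "d6"]}

report = {
    name: mean_recall_at_k(run, gold, k=2)
    for name, run in [("config_a", config_a), ("config_b", config_b)]
}
print(report)  # {'config_a': 0.25, 'config_b': 1.0}
```

Faithfulness and citation coverage need the generated answers as well, so they are left out of this retrieval-only sketch.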

Which embedding model is the default?

The default is BGE-Large-EN for English-heavy corpora and BGE-M3 for multilingual ones. Any compatible Hugging Face embedding model can replace either, and the agent re-benchmarks retrieval on your GPU.

Can I bring my own vector store?

Yes. The pipeline integrates with Qdrant, Weaviate, and pgvector. BM25 runs as a sidecar (Tantivy or Elastic). You can also use the in-memory hybrid index for smaller corpora.
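For a sense of what the lexical side of a small in-memory index computes, the sketch below scores documents with the standard Okapi BM25 formula. It is not Tantivy's or Elastic's implementation: the class name `TinyBM25`, the whitespace tokenizer, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


class TinyBM25:
    """Minimal in-memory BM25 scorer. Assumes a non-empty corpus."""

    def __init__(self, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        # Per-document term counts using a naive lowercase whitespace tokenizer.
        self.term_counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
        self.doc_len = {d: sum(c.values()) for d, c in self.term_counts.items()}
        self.avg_len = sum(self.doc_len.values()) / len(docs)
        # Document frequency: number of documents containing each term.
        self.doc_freq = Counter(term for c in self.term_counts.values() for term in c)
        self.n_docs = len(docs)

    def idf(self, term: str) -> float:
        df = self.doc_freq.get(term, 0)
        # The "+1 inside the log" variant keeps idf non-negative.
        return math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))

    def score(self, query: str, doc_id: str) -> float:
        counts = self.term_counts[doc_id]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len)
        total = 0.0
        for term in query.lower().split():
            tf = counts.get(term, 0)
            if tf:
                total += self.idf(term) * tf * (self.k1 + 1) / (tf + norm)
        return total

    def search(self, query: str, k: int = 20) -> list[tuple[str, float]]:
        scored = [(d, self.score(query, d)) for d in self.term_counts]
        hits = [(d, s) for d, s in scored if s > 0]
        hits.sort(key=lambda item: item[1], reverse=True)
        return hits[:k]


docs = {
    "d1": "invoice code ZX-9 overdue",
    "d2": "invoice policy overview",
    "d3": "team lunch menu",
}
hits = TinyBM25(docs).search("ZX-9 invoice")
```

The rare token `ZX-9` carries a higher inverse document frequency than `invoice`, which is why the document containing the exact code ranks first.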

RunInfra by RightNow


Cited Q&A you can audit on your own corpus.

Hybrid retrieval, grounded generation, citation spans. Eval on your gold set, not the vendor's marketing page.


What you actually own

The optimization knobs, the codebase, the model choice. None of it locked away.

Citations you can audit.

Every answer carries source-doc spans per claim. When legal or finance asks for the source, you have it. Hallucinations stop being invisible.

Hybrid retrieval beats pure vector.

Dense embeddings miss jargon, codes, and rare terms. Dense plus BM25 plus rerank lifts recall on the queries that actually fail in production.

Eval on your gold set, not theirs.

Built-in eval harness scores recall, faithfulness, and citation coverage against your own Q&A. Compare two stacks side by side before you deploy.

Three ways to ship cited Q&A

Most teams pick between speed and control. RunInfra keeps both in one workflow.

Deployment comparison for RAG search across RunInfra, stacked APIs, and DIY self-hosting.
What matters	RunInfra (Recommended): fast path with model control and export.	Stacked APIs: vector DB plus rerank plus embeddings.	DIY self-hosting: full control, heavy operations.
01 Launch	Pick model, optimize, deploy. Start quickly and keep the production path open.	Call provider endpoint. Fast first demo, but the runtime stays rented.	Build serving stack first. Infrastructure work comes before product learning.
02 Model control	Bring the model ID. Keep model choice and serving decisions visible.	Provider catalog. You use what the provider exposes.	Your model. Full control if your team maintains the runtime.
03 Tuning	Measured latency and GPU cost. Compare serving choices before deployment.	Opaque. Latency and batching stay behind the API.	Manual profiling. Your team owns tuning and regressions.
04 Export	Managed now, export when needed. Use the endpoint first and take the deploy package later.	Locked endpoint. You keep calling the provider.	Already owned. Export exists because you built everything yourself.
05 Operations	Low until you choose to own it. Operate managed, then export with the same measured plan.	Low, with lock-in. Less infra work, less production control.	High. You own infra, failures, upgrades, and serving changes.
06 Security	SOC 2 Type II. Audited controls across access, logging, and incident response.	Varies by vendor. Compliance depends on the third party sitting in the request path.	You build it. Your team owns the audit trail, logging, and access controls.


Code you own. Deploy anywhere.

The full recipe ships with you. Codebase, kernels, engine config, weights. Run it anywhere.

Live

Build cited Q&A over our internal docs. Hybrid retrieval with @BGE-Large-en-v1.5 plus BM25, cross-encoder reranker, @Llama-3.1-8B grounded with citation spans. Score against our gold set on a single @L40S.

Agent

On it. I'll wire dense plus BM25 plus rerank plus grounded Llama on one L40S, then score against your gold set for recall, faithfulness, and citation coverage.

Ingested 10k docs, hybrid index: BGE-Large to Qdrant, BM25 to Tantivy sidecar

Tuned dense vs sparse weights: fused recall up on jargon and codes

Reranker over both retrievers: k=20 each, rerank to top-5

Compiled grounded Llama 3.1 8B: AWQ INT4, citation-span decoder

Ran gold-set eval, 500 queries: p95 620ms, faithfulness scored per claim
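The dense-versus-sparse weight tuned in the run above can be pictured as a weighted sum of normalized scores. The sketch below is one common way to do that, not a confirmed description of RunInfra's fusion: the function names and the `sparse_weight` parameter are hypothetical, and min-max normalization is an assumption made so cosine-style and BM25-style scores share a scale.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(
    dense: dict[str, float], sparse: dict[str, float], sparse_weight: float
) -> list[str]:
    """Rank documents by a weighted sum of normalized dense and sparse scores."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    fused = {
        doc_id: (1 - sparse_weight) * dense_n.get(doc_id, 0.0)
        + sparse_weight * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    # Sort by fused score, breaking ties on doc id for a deterministic order.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Hypothetical scores: d3 contains the exact code, so BM25 ranks it first.
dense = {"d1": 0.82, "d2": 0.80, "d3": 0.40}
sparse = {"d3": 12.0, "d1": 3.0}

low_bm25 = fuse(dense, sparse, sparse_weight=0.2)   # ['d1', 'd2', 'd3']
high_bm25 = fuse(dense, sparse, sparse_weight=0.8)  # ['d3', 'd1', 'd2']
```

Raising the BM25 weight moves the exact-match document to the top, which is the kind of change a gold-set rerun would then confirm or reject.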

Raise BM25 weight and rerun the gold-set eval...

RAG adapters (3)

runinfra-rag-search/

corpus/

evals/

Managed RunInfra

Our GPUs, per-million-tokens billing from L4 to B200.

Your infrastructure

AWS, GCP, RunPod, bare metal. Same Dockerfile, your cluster.

Local workstation

docker compose up. Full pipeline on a single GPU.

Supported HF retrieval models work

Supported retrieval-compatible models on Hugging Face run through the compatible recipe. Search the live catalog above. The examples below are just a starting view.

Llama 3.3 70B (Meta): 70B, LLM

Llama 3.1 8B (Meta): 8B, LLM

Llama 3.2 3B (Meta): 3B, Edge LLM

Qwen 2.5 72B (Alibaba): 72B, LLM

Qwen 2.5 32B (Alibaba): 32B, LLM

Qwen 2.5 7B (Alibaba): 7B, LLM

QwQ 32B (Alibaba): 32B, Reasoning

DeepSeek R1 Distill 70B (DeepSeek): 70B, Reasoning

DeepSeek R1 Distill 8B (DeepSeek): 8B, Reasoning

What RunInfra tunes

Every stage of the pipeline, retuned per model and GPU.

Hybrid retrieval

Dense embeddings plus BM25 lexical, fused. Catches jargon and rare terms that pure vector misses.

Reranker fusion

Cross-encoder reranks dense and sparse hits together in one server. No extra API hop.

Citation spans

Every claim resolves back to source-doc spans. Hallucinations stay traceable, audits stay cheap.

Eval harness

Built-in eval against your gold-set Q&A. Recall, faithfulness, citation coverage, all measured before deploy.

Grounded LLM

Llama 3.1 8B with retrieval-augmented prompts. FlashAttention v2 plus PagedAttention KV.

Vector and lexical store

Qdrant for dense, BM25 sidecar for sparse, pgvector or Weaviate also wired. No network hop.


Common questions


How does citation work?

Every answer carries source-doc spans per claim. The LLM is prompted to ground each sentence in retrieved passages, and the decoder tags each span with the source document and offset range. When legal or compliance asks for the source, you have it.
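One way to picture the data behind this is a claim that carries a list of spans, each pointing at a document and an offset range. The sketch below is a guess at a workable shape, not the pipeline's actual schema: the class names, the use of character offsets with an exclusive end, and the coverage definition (share of claims with at least one citation) are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CitationSpan:
    doc_id: str
    start: int  # inclusive character offset (assumed unit)
    end: int    # exclusive character offset


@dataclass
class Claim:
    text: str
    citations: list[CitationSpan] = field(default_factory=list)


def resolve(span: CitationSpan, corpus: dict[str, str]) -> str:
    """Return the exact source text a span points to, for audit display."""
    return corpus[span.doc_id][span.start:span.end]


def citation_coverage(claims: list[Claim]) -> float:
    """Share of claims that carry at least one citation span."""
    if not claims:
        return 0.0
    return sum(1 for claim in claims if claim.citations) / len(claims)


corpus = {"policy.md": "Refunds are issued within 14 days of a return."}
quote = "within 14 days"
start = corpus["policy.md"].index(quote)
span = CitationSpan("policy.md", start, start + len(quote))

claims = [
    Claim("Refunds arrive within two weeks.", [span]),
    Claim("Refunds are paid in store credit."),  # no supporting span
]

print(resolve(span, corpus))      # within 14 days
print(citation_coverage(claims))  # 0.5
```

An uncited claim like the second one is what lowers the coverage score, which makes unsupported statements visible instead of silent.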

Deploy your first optimized model, measured before you ship

Describe the goal. RunInfra builds and optimizes the stack.


End-to-end encryption

Isolated GPU infrastructure

Zero data retention

SOC 2 Type II


Backed by

Combinator

