Embeddings and reranking

An embeddings pipeline takes a list of texts and returns one vector representation per text. A cross-encoder reranking pass over candidate documents can optionally follow, and with a custom pipeline route both steps run in one HTTP round-trip. RunInfra ships the recipe with BGE encoders for embeddings and BGE or Cohere-style cross-encoders for reranking, fused on a single GPU.

Architecture

POST /v1/embeddings { input: [texts...] }
  -> BGE encoder (FP16 or FP8, batched)
  -> 1024-d vectors per text

POST /v1/{pipelineId}/rerank { query, texts }    # optional pipeline-scoped second hop
  -> BGE cross-encoder reranker
  -> Sorted documents with relevance scores

Both models live on the same GPU and share a CUDA stream. If you run the encoder and the reranker in the same request (via a custom pipeline route), they execute back-to-back with no HTTP hop between them.
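
As a rough illustration of the second hop, the sketch below builds a request for the pipeline-scoped rerank route using only the Python standard library. The path and the `query` / `texts` fields come from the diagram above. The bearer-token header and the helper names `build_rerank_request` and `rerank` are assumptions for illustration. The response schema is not described on this page, so the decoded JSON is returned untouched.

```python
import json
import urllib.request

BASE_URL = "https://api.runinfra.ai/v1"  # same base URL as the quick example


def build_rerank_request(pipeline_id, query, texts, api_key):
    """Build a POST request for the pipeline-scoped rerank route (hypothetical helper)."""
    body = json.dumps({"query": query, "texts": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{pipeline_id}/rerank",
        data=body,
        method="POST",
        headers={
            # Assumption: bearer-token auth, as on other OpenAI-compatible APIs.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def rerank(pipeline_id, query, texts, api_key):
    """Send the rerank request and return the decoded JSON response as-is."""
    req = build_rerank_request(pipeline_id, query, texts, api_key)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent.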

What you get out of the box

OpenAI-compatible /v1/embeddings with batched input, billing per input token
Pipeline-scoped /v1/{pipelineId}/rerank endpoint with a text array and scored output
Pooled inference sharing one GPU across both models when traffic is bursty
Tens of thousands of embeddings per second on L40S with FP8 batching
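
To make use of batched input from the client side, one option is to split a large corpus into fixed-size batches and send one request per batch. The sketch below assumes an OpenAI-style client object like the one in the quick example. The helper names `batched` and `embed_all` and the default batch size of 64 are illustrative assumptions, not documented limits.

```python
def batched(texts, batch_size=64):
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def embed_all(client, model, texts, batch_size=64):
    """Embed `texts` batch by batch and return the vectors in input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        resp = client.embeddings.create(model=model, input=batch)
        # Sort by index so each vector lines up with its input text.
        for item in sorted(resp.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

Because billing is per input token, the batch size should not change cost; it mainly trades request count against payload size.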

Example prompt

In the dashboard:

Build me an embeddings pipeline for English documents.
Use BGE-large-en-v1.5 plus the BGE reranker. Optimize for throughput.

Quick example

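from openai import OpenAI

# Point the OpenAI client at the RunInfra endpoint.
client = OpenAI(base_url="https://api.runinfra.ai/v1", api_key="YOUR_RUNINFRA_API_KEY")

# The pipeline ID is passed as the model name; input is a batch of texts.
resp = client.embeddings.create(
    model="your-pipeline-id",
    input=["RunInfra is a chat-native AI infrastructure platform.", "BGE is an embedding model."],
)
# Print the first five components of the first text's vector.
print(resp.data[0].embedding[:5])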
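
A common next step is to compare two of the returned vectors. The sketch below computes cosine similarity in plain Python; the function name `cosine_similarity` is an illustrative choice and is not part of the RunInfra API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

With the response from the quick example, `cosine_similarity(resp.data[0].embedding, resp.data[1].embedding)` would score how close the two input sentences are in embedding space.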

Models in the catalog

BGE (BAAI): bge-large-en-v1.5, bge-m3 (multilingual), bge-reranker-large
E5 (Microsoft): e5-large-v2, e5-mistral-7b-instruct
GTE (Alibaba): gte-large, gte-Qwen2-7B-instruct
Nomic: nomic-embed-text-v1.5

Deeper details

See the models catalog for the full embedding model list, dimensions, and license summaries, and runinfra.ai/use-cases/embeddings for benchmark numbers.
