An embeddings pipeline takes a list of texts and returns their vector representations, optionally followed by a cross-encoder reranking pass over candidate documents; with a custom pipeline route, both stages run in a single HTTP round-trip. RunInfra ships this recipe with BGE encoders for embeddings and BGE or Cohere-style cross-encoders for reranking, both served from the same GPU.

Architecture

POST /v1/embeddings { input: [texts...] }
  -> BGE encoder (FP16 or FP8, batched)
  -> 1024-d vectors per text

POST /v1/rerank { query, documents }    # optional second hop
  -> BGE cross-encoder reranker
  -> Sorted documents with relevance scores
Both models are loaded on the same GPU and share a CUDA stream. If a single request invokes both the encoder and the reranker (through a custom pipeline route), the two run back-to-back with no intermediate HTTP hop.
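A minimal sketch of calling the rerank hop from Python using only the standard library. The request body mirrors the `{ query, documents }` shape in the diagram. The `model` field, the Bearer-token header, and the Cohere-style response shape (a `results` list whose entries carry `index` and `relevance_score`) are assumptions, so check them against your pipeline's actual schema. `build_rerank_request`, `parse_rerank_response`, and `rerank` are hypothetical helpers, not part of any RunInfra SDK.

```python
import json
import urllib.request
from typing import Any, Dict, List

API_BASE = "https://api.runinfra.ai/v1"


def build_rerank_request(query: str, documents: List[str], model: str) -> Dict[str, Any]:
    # Body follows the { query, documents } shape; the "model" key is an ASSUMPTION
    # carried over from the OpenAI-style /v1/embeddings call.
    return {"model": model, "query": query, "documents": documents}


def parse_rerank_response(body: Dict[str, Any], documents: List[str]) -> List[Dict[str, Any]]:
    # ASSUMPTION: Cohere-style response {"results": [{"index": i, "relevance_score": s}, ...]}.
    # Sort defensively by score (highest first) and attach the original document text.
    ranked = sorted(body["results"], key=lambda r: r["relevance_score"], reverse=True)
    return [
        {"index": r["index"], "score": r["relevance_score"], "document": documents[r["index"]]}
        for r in ranked
    ]


def rerank(query: str, documents: List[str], model: str, api_key: str,
           timeout: float = 30.0) -> List[Dict[str, Any]]:
    payload = json.dumps(build_rerank_request(query, documents, model)).encode("utf-8")
    req = urllib.request.Request(
        f"{API_BASE}/rerank",
        data=payload,
        headers={
            # ASSUMPTION: same Bearer-token auth as the OpenAI-compatible endpoints.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_rerank_response(json.load(resp), documents)
```

Usage: `rerank("What is BGE?", docs, model="your-pipeline-id", api_key="YOUR_RUNINFRA_API_KEY")` returns the documents ordered from most to least relevant. Parsing is kept separate from the network call so you can adapt it if the real response schema differs.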

What you get out of the box

  • OpenAI-compatible /v1/embeddings with batched input, billed per input token
  • Custom /v1/rerank endpoint that takes a query and a documents array and returns the documents with relevance scores
  • Pooled inference sharing one GPU across both models when traffic is bursty
  • Tens of thousands of embeddings per second on L40S with FP8 batching
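The server batches work on the GPU, but a very large corpus is still easier to send in chunks. A minimal client-side sketch, assuming the official `openai` Python client pointed at RunInfra as in the quick example; `embed_in_batches` is a hypothetical helper, and its default `batch_size` of 256 is an arbitrary placeholder rather than a documented RunInfra limit.

```python
from typing import Any, List, Sequence


def embed_in_batches(
    client: Any, model: str, texts: Sequence[str], batch_size: int = 256
) -> List[List[float]]:
    """Embed texts in fixed-size chunks, returning vectors in input order.

    `client` is an `openai.OpenAI` instance (or any object exposing
    `embeddings.create(model=..., input=...)`). The default batch_size is a
    placeholder, not a documented RunInfra limit.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start : start + batch_size])
        resp = client.embeddings.create(model=model, input=batch)
        # Each item carries the index of its input within the batch; sort on it
        # so output order never depends on the order of resp.data.
        for item in sorted(resp.data, key=lambda d: d.index):
            vectors.append(list(item.embedding))
    return vectors
```

Sending lists rather than one text per request lets the server fill its GPU batches; tune `batch_size` against your latency and payload-size needs.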

Example prompt

In Pipes:
Build me an embeddings pipeline for English documents.
Use BGE-large-en-v1.5 plus the BGE reranker. Optimize for throughput.

Quick example

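```python
from openai import OpenAI

# Point the standard OpenAI client at RunInfra's OpenAI-compatible endpoint.
client = OpenAI(base_url="https://api.runinfra.ai/v1", api_key="YOUR_RUNINFRA_API_KEY")

resp = client.embeddings.create(
    model="your-pipeline-id",  # ID of your deployed embeddings pipeline
    input=["RunInfra is a chat-native AI infrastructure platform.", "BGE is an embedding model."],
)
# resp.data holds one embedding per input text; print the first 5 dimensions of the first.
print(resp.data[0].embedding[:5])
```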
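Returned vectors are usually compared with cosine similarity. A minimal plain-Python sketch that scores a query vector against candidate vectors and keeps the top k, the first-stage shortlist you would then send to `/v1/rerank`. `cosine` and `top_k` are illustrative helpers; the sketch does not assume the endpoint returns unit-normalized vectors, so it divides by both norms.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Full cosine similarity; no assumption that vectors are unit-normalized.
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    # Return (document index, score) pairs for the k most similar documents.
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Bi-encoder similarity is cheap enough to run over a whole corpus, while the cross-encoder reads each query-document pair jointly; shortlisting with `top_k` first keeps the reranker's workload small.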

Models in the catalog

  • BGE (BAAI): bge-large-en-v1.5, bge-m3 (multilingual), bge-reranker-large
  • E5 (Microsoft): e5-large-v2, e5-mistral-7b-instruct
  • GTE (Alibaba): gte-large, gte-Qwen2-7B-instruct
  • Nomic: nomic-embed-text-v1.5

Deeper details

See the models catalog for the full embedding model list, dimensions, and license summaries, and runinfra.ai/use-cases/embeddings for benchmark numbers.