An embeddings pipeline takes a list of texts and returns vector representations, optionally followed by a cross-encoder reranking pass over candidate documents, all in one HTTP round-trip. RunInfra ships the recipe with BGE encoders for embeddings and BGE or Cohere-style cross-encoders for reranking, fused on a single GPU.
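To make the two stages concrete, here is a minimal local sketch of embedding-based candidate selection followed by a rerank pass. It is an illustration of the general retrieve-then-rerank pattern, not RunInfra's implementation: the function names, the toy vectors, and the `rerank_score` callable (a stand-in for a real cross-encoder) are all assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    docs: List[str],
    rerank_score: Callable[[str], float],  # hypothetical stand-in for a cross-encoder
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap candidate selection by embedding similarity.
    by_similarity = sorted(
        range(len(docs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:top_k]

    # Stage 2: score only the surviving candidates with the (more expensive)
    # reranker, then order them by that score.
    scored = [(docs[i], rerank_score(docs[i])) for i in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The design point is that the reranker only sees the `top_k` candidates, so its cost stays bounded no matter how many documents were embedded.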
Architecture
What you get out of the box
- OpenAI-compatible /v1/embeddings with batched input, billing per input token
- Custom /v1/rerank endpoint with a documents array and scored output
- Pooled inference sharing one GPU across both models when traffic is bursty
- Tens of thousands of embeddings per second on L40S with FP8 batching
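As a rough sketch of what calling the two endpoints listed above could look like, the helpers below build the HTTP requests without sending them. The `input` and `model` fields follow the OpenAI embeddings request shape. Everything else is an assumption for illustration: the `BASE_URL` placeholder, bearer-token authentication, the exact model ID strings, and the `query` and `model` fields on the rerank body (the page only states that the endpoint takes a documents array and returns scored output).

```python
import json
import urllib.request
from typing import Any, Dict, List

# Hypothetical placeholder; substitute your deployment's actual base URL.
BASE_URL = "https://example.invalid"


def _post(path: str, body: Dict[str, Any], api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


def build_embeddings_request(
    texts: List[str], model: str, api_key: str
) -> urllib.request.Request:
    # OpenAI-compatible shape: a batch of texts goes in "input".
    return _post("/v1/embeddings", {"model": model, "input": texts}, api_key)


def build_rerank_request(
    query: str, documents: List[str], model: str, api_key: str
) -> urllib.request.Request:
    # "documents" comes from the endpoint description above;
    # "query" and "model" are assumed field names.
    body = {"model": model, "query": query, "documents": documents}
    return _post("/v1/rerank", body, api_key)
```

A request built this way can be sent with `urllib.request.urlopen(req)`. For an OpenAI-compatible embeddings response, the vectors are expected under `data[i]["embedding"]`; the rerank response schema is not specified here, so inspect it before relying on field names.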
Example prompt
In Pipes:
Quick example
Models in the catalog
- BGE (BAAI): bge-large-en-v1.5, bge-m3 (multilingual), bge-reranker-large
- E5 (Microsoft): e5-large-v2, e5-mistral-7b-instruct
- GTE (Alibaba): gte-large, gte-Qwen2-7B-instruct
- Nomic: nomic-embed-text-v1.5