How does this compare to OpenAI Embeddings?

At high volume, cost per million tokens drops 5 to 10x because GPU cost is amortized across batches. Latency is similar to ada-002. MTEB benchmarks place BGE-Large at parity with top closed encoders.

Can I combine encoder and reranker?

Yes. vLLM serves both on the same GPU, and you can rerank the same top-k in one round-trip. The agent benchmarks the combined stack on your queries.

What about multilingual?

BGE-M3 covers 100+ languages with the same recipe. The agent applies the same embedding-serving and retrieval-quality plan as the English path.

Can I push directly into a vector DB?

Yes. The serve script has an optional sink for Pinecone, Qdrant, Weaviate, or pgvector. Bring your credentials and the agent wires the writer.

RunInfra by RightNow

Embeddings and rerank, one round-trip.

BGE, E5, GTE, Nomic. Encoder and cross-encoder reranker fused on a single GPU.

Recs

Classification

Reranking

What you actually own

The optimization knobs, the codebase, the model choice. None of it locked away.

Encoder plus reranker, fused.

Cross-encoder reranks the same top-k in the same vLLM server. One round-trip beats the stacked Pinecone-plus-Cohere setup.
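The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not RunInfra's serving code: `retrieve_then_rerank` and `overlap_score` are hypothetical names, the vectors are toy values, and the word-overlap scorer is only a stand-in for a real cross-encoder such as BGE Reranker v2-m3.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    docs: List[str],
    doc_vecs: List[Sequence[float]],
    score_pair: Callable[[str, str], float],
    k: int = 2,
) -> List[int]:
    """Return document indices: top-k by embedding similarity, reordered by the pair scorer."""
    # Stage 1: cheap candidate selection with the encoder's vectors.
    by_cosine = sorted(
        range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True
    )[:k]
    # Stage 2: the (more expensive) pair scorer only sees the k candidates.
    return sorted(by_cosine, key=lambda i: score_pair(query, docs[i]), reverse=True)


def overlap_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: counts shared words (an assumption for the demo)."""
    return float(len(set(query.split()) & set(doc.split())))


docs = ["gpu batching guide", "cat pictures", "batching on gpu for embeddings", "tax forms"]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]  # toy embeddings
result = retrieve_then_rerank(
    "embeddings batching gpu", [1.0, 0.0], docs, doc_vecs, overlap_score, k=2
)
print(result)
```

Both stages run inside one function call here, which mirrors the single round-trip idea: the caller sends a query once and gets back reranked candidates, rather than calling a retrieval service and a rerank service separately.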

100+ languages, one recipe.

BGE-M3 covers Chinese, English, French, Hindi, and 96+ more with the same embedding-serving recipe. No model-per-language sprawl.

5-10x cheaper at scale.

Closed APIs bill per call. RunInfra amortizes GPU compute across batches. At 1M docs a day, you save thousands a month.
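To make the amortization argument concrete, here is a rough arithmetic sketch. It reuses two figures quoted elsewhere on this page for BGE-M3 on an L4 (12k tokens/s peak and $0.62 per hour) and assumes the hourly figure is the GPU price; the `utilization` parameter and the function name are assumptions added for illustration. Real costs depend on how busy the GPU actually is.

```python
def cost_per_million_tokens(
    gpu_usd_per_hour: float, tokens_per_second: float, utilization: float = 1.0
) -> float:
    """GPU cost to embed one million tokens at a given sustained throughput."""
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * utilization * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Figures quoted on this page for BGE-M3 on an L4; peak throughput is a best case.
peak = cost_per_million_tokens(0.62, 12_000)
# Hypothetical scenario: the GPU is busy only half of the time.
half = cost_per_million_tokens(0.62, 12_000, utilization=0.5)
print(f"peak: ${peak:.4f} per 1M tokens, half-utilized: ${half:.4f} per 1M tokens")
```

The second call shows why volume matters: the hourly GPU price is fixed, so the per-token cost doubles when the GPU sits idle half the time.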

Three ways to ship embeddings

Most teams pick between speed and control. RunInfra keeps both in one workflow.

Deployment comparison for embeddings across RunInfra, closed APIs, and DIY self-hosting.
What matters	RunInfra (Recommended): Fast path with model control and export.	Closed embedding APIs: Per-call, hosted.	DIY self-hosting: Full control, heavy operations.
01 Launch	Pick model, optimize, deploy. Start quickly and keep the production path open.	Call provider endpoint. Fast first demo, but the runtime stays rented.	Build serving stack first. Infrastructure work comes before product learning.
02 Model control	Bring the model ID. Keep model choice and serving decisions visible.	Provider catalog. You use what the provider exposes.	Your model. Full control if your team maintains the runtime.
03 Tuning	Measured latency and GPU cost. Compare serving choices before deployment.	Opaque. Latency and batching stay behind the API.	Manual profiling. Your team owns tuning and regressions.
04 Export	Managed now, export when needed. Use the endpoint first and take the deploy package later.	Locked endpoint. You keep calling the provider.	Already owned. Export exists because you built everything yourself.
05 Operations	Low until you choose to own it. Operate managed, then export with the same measured plan.	Low, with lock-in. Less infra work, less production control.	High. You own infra, failures, upgrades, and serving changes.
06 Security	SOC 2 Type 2. Audited controls across access, logging, and incident response.	Varies by vendor. Compliance depends on the third party sitting in the request path.	You build it. Your team owns the audit trail, logging, and access controls.

Code you own. Deploy anywhere.

The full recipe ships with you. Codebase, kernels, engine config, weights. Run it anywhere.

Live

Build an embeddings API. @BGE-M3 with the cross-encoder reranker. Serve with @vLLM on a single @L4.

Agent

On it. I'll profile BGE on the L4, configure vLLM with encoder plus reranker, then benchmark a 10k-doc index.

Profiled BGE-M3 on L4: 12k tokens/s peak, batch 32

Selected vLLM serving engine: encoder plus reranker on one server

Tuned embedding batching: higher vectors/sec at recall parity

Enabled batch padding fusion: 82% padding waste removed

Ran indexing harness, 10k docs: 1.2s end-to-end, $0.62 / hr

Switch to E5 Mistral and rerun the bench...

Vector adapters: 3

runinfra-embeddings/

models/

evals/

Managed RunInfra

Our GPUs, per-million-tokens billing from L4 to B200.

Your infrastructure

AWS, GCP, RunPod, bare metal. Same Dockerfile, your cluster.

Local workstation

docker compose up. Full pipeline on a single GPU.

Search the Hugging Face catalog

Encoders and rerankers, live from huggingface.co. Click through to inspect, or paste any compatible ID into the dashboard.

BGE-M3 (BAAI): 568M, Embed
BGE-Large EN v1.5 (BAAI): 335M, Embed
BGE Reranker v2-m3 (BAAI): 568M, Rerank
Jina Embeddings v3 (Jina AI): 570M, Embed
Mixedbread Embed Large (Mixedbread AI): 335M, Embed
Arctic Embed L (Snowflake): 335M, Embed
GTE-Qwen2 1.5B (Alibaba): 1.5B, Embed
E5 Mistral 7B (Intfloat): 7B, Embed
Nomic Embed v1.5 (Nomic AI): 137M, Embed

What RunInfra tunes

Every stage of the pipeline, retuned per model and GPU.

Batch indexing

Continuous batching for offline indexing. 12k tokens/s peak on L4.
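One common way to raise indexing throughput is to group sequences of similar length under a padded-token budget, so short texts are not padded up to the longest one in the batch. The sketch below illustrates that general technique only; it is not RunInfra's batching or its "batch padding fusion", and the function names, the budget of 256, and the example lengths are all assumptions.

```python
from typing import List


def token_budget_batches(lengths: List[int], max_padded_tokens: int) -> List[List[int]]:
    """Group sequence indices so that batch_size * longest_length stays within the budget."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for i in order:
        # Lengths are visited in ascending order, so lengths[i] is the longest so far.
        if current and (len(current) + 1) * lengths[i] > max_padded_tokens:
            batches.append(current)
            current = []
        # A single sequence longer than the budget still gets its own batch.
        current.append(i)
    if current:
        batches.append(current)
    return batches


def padding_waste(lengths: List[int], batches: List[List[int]]) -> float:
    """Fraction of padded token slots that hold padding rather than real tokens."""
    real = sum(lengths[i] for batch in batches for i in batch)
    padded = sum(len(batch) * max(lengths[i] for i in batch) for batch in batches)
    return 1 - real / padded


lengths = [8, 120, 10, 128, 12, 126]       # token counts in arrival order
arrival_batches = [[0, 1], [2, 3], [4, 5]]  # naive: pairs in arrival order
bucketed = token_budget_batches(lengths, max_padded_tokens=256)

naive_waste = padding_waste(lengths, arrival_batches)
bucketed_waste = padding_waste(lengths, bucketed)
print(bucketed, round(naive_waste, 3), round(bucketed_waste, 3))
```

In this toy input the arrival-order batches pair each short text with a long one, so nearly half of the padded slots are wasted; grouping by length brings that down to a few percent.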

Streaming embed

Low-latency single-query embedding for live search. p50 4ms.

Reranker fusion

Cross-encoder reranker in the same server. One round-trip for top-k.

Embedding serving sweep

Tune batch size, token budget, and padding policy against recall and cosine gates.
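A quality gate of this kind could look roughly like the following: a tuned configuration is accepted only if its vectors stay close to the baseline vectors and retrieval recall does not drop. The thresholds (`min_cosine`, `max_recall_drop`), function names, and sample data are hypothetical; the page does not state which values RunInfra uses.

```python
import math
from typing import List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant documents that appear in the first k retrieved results."""
    if not relevant:
        return 1.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def passes_gates(
    baseline_vecs: List[Sequence[float]],
    tuned_vecs: List[Sequence[float]],
    baseline_recall: float,
    tuned_recall: float,
    min_cosine: float = 0.999,      # hypothetical threshold
    max_recall_drop: float = 0.01,  # hypothetical threshold
) -> bool:
    """Accept the tuned config only if the worst-case drift and the recall loss are both small."""
    worst = min(cosine(b, t) for b, t in zip(baseline_vecs, tuned_vecs))
    return worst >= min_cosine and (baseline_recall - tuned_recall) <= max_recall_drop


baseline_vecs = [[1.0, 0.0], [0.0, 1.0]]
tuned_vecs = [[0.999, 0.001], [0.0, 1.0]]  # tiny numeric drift after tuning
relevant = {"d1", "d3", "d9"}
base_recall = recall_at_k(["d1", "d7", "d3"], relevant, k=3)
good = passes_gates(baseline_vecs, tuned_vecs, base_recall, base_recall)
bad = passes_gates(baseline_vecs, tuned_vecs, base_recall, base_recall - 0.2)
print(good, bad)
```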

Vector DB sync

Push to Pinecone, Qdrant, Weaviate, or pgvector. Built-in sink.
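The page does not show the sink interface, so the following is only a guess at its general shape: one small `upsert` contract that each vector database adapter would implement, plus a chunked writer. `VectorSink`, `InMemorySink`, and `write_in_chunks` are hypothetical names, and no real Pinecone, Qdrant, Weaviate, or pgvector client call is used here.

```python
from typing import Dict, List, Protocol, Sequence, Tuple

Record = Tuple[str, List[float]]  # (document id, embedding vector)


class VectorSink(Protocol):
    """Hypothetical adapter contract; a real adapter would wrap a database client."""

    def upsert(self, records: Sequence[Record]) -> None: ...


class InMemorySink:
    """Test double that satisfies VectorSink without any external service."""

    def __init__(self) -> None:
        self.rows: Dict[str, List[float]] = {}

    def upsert(self, records: Sequence[Record]) -> None:
        for doc_id, vector in records:
            self.rows[doc_id] = vector  # same id overwrites, as an upsert should


def write_in_chunks(sink: VectorSink, records: Sequence[Record], chunk_size: int = 2) -> int:
    """Send records to the sink in fixed-size chunks; returns the number of upsert calls."""
    calls = 0
    for start in range(0, len(records), chunk_size):
        sink.upsert(records[start:start + chunk_size])
        calls += 1
    return calls


sink = InMemorySink()
records = [(f"doc-{i}", [float(i), 0.0]) for i in range(5)]
calls = write_in_chunks(sink, records, chunk_size=2)
print(calls, sorted(sink.rows))
```

Keeping the contract this narrow means swapping one database for another only changes the adapter class, not the indexing loop.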

Per-stage scheduling

Encode plus rerank interleaved. Late-interaction pooling on retrieval.

Try this pipeline

Edit the model, engine, or GPU inline. Send to retune the stack in the dashboard.

Common questions

Which embedding models work?

Hugging Face encoders that load through vLLM or sentence-transformers are supported. The BGE family is the recommended starting point for English, and BGE-M3 for multilingual.

Deploy your first optimized model, measured before you ship

Describe the goal. RunInfra builds and optimizes the stack.

End-to-end encryption

Isolated GPU infrastructure

Zero data retention

SOC 2 Type II
