{ data: [{embedding, index}] } shape. Billing is per input token only; no output tokens exist for embeddings.
Request
Parameters
The embedding model id. Must be deployed in your workspace (e.g. "bge-m3", "mxbai-embed-large").
Single string or array of strings to embed. Arrays are processed in a single batched GPU call.
"float" (array of numbers) or "base64" (compact; use when wire size matters).
Response
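When the "base64" format is requested, the client has to turn the payload back into numbers. The following is a minimal sketch, assuming the payload is a base64 string of packed little-endian 32-bit floats (a common convention in OpenAI-compatible APIs; this page does not confirm the layout for this endpoint). `decode_embedding` is a hypothetical helper name.

```python
import base64
import struct


def decode_embedding(b64: str) -> list[float]:
    """Decode a base64 string of packed little-endian float32 values (assumed layout)."""
    raw = base64.b64decode(b64)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))


# Round-trip check with values that are exactly representable in float32.
sample = base64.b64encode(struct.pack("<3f", 0.5, -1.0, 2.0)).decode("ascii")
decoded = decode_embedding(sample)
print(decoded)
```

If the actual byte layout differs (for example a different float width), only the `struct` format string would need to change.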
Typical use cases
RAG pipeline
Embed your documents once, store in a vector DB (pgvector, Pinecone, Weaviate), retrieve by cosine similarity at query time.
Semantic dedup
Cluster near-duplicate tickets, support emails, or product reviews.
Classification
Embed labels and queries in the same space, pick nearest-neighbor label.
Hybrid search
Blend BM25 + embedding similarity for better recall than either alone.
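One simple way to blend the two signals for hybrid search is a weighted sum of min-max-normalized scores. The sketch below is illustrative only: `blend_scores` and the `alpha` weight are hypothetical names, and the score lists are made-up values standing in for BM25 and cosine scores of the same candidate documents.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def blend_scores(bm25: list[float], cosine: list[float], alpha: float = 0.5) -> list[float]:
    """Weighted sum of normalized BM25 and embedding scores (alpha weights BM25)."""
    if len(bm25) != len(cosine):
        raise ValueError("score lists must cover the same candidates")
    b, c = min_max(bm25), min_max(cosine)
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(b, c)]


# Hypothetical scores for three candidate documents.
bm25_scores = [12.0, 3.0, 9.0]
cosine_scores = [0.2, 0.9, 0.6]
blended = blend_scores(bm25_scores, cosine_scores, alpha=0.5)
ranking = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)
print(ranking)
```

Normalizing first matters because BM25 scores are unbounded while cosine similarity lies in [-1, 1]. Rank-based fusion is a reasonable alternative when the score scales are hard to compare.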
Billing
Embeddings are billed per input token only. Rates depend on your plan and the model’s parameter size; see Plans for the current rate card. Active-tier deployments carry a lower per-token rate in exchange for the reserved-GPU hourly fee.
Batch size limits
The input array can carry up to 2048 strings per request. The exact string limit and the combined token budget depend on the model class:
| Model class | Max strings | Max combined tokens | Notes |
|---|---|---|---|
| BGE small (bge-small-en, 384-d) | 2048 | 262,144 | Highest batch tolerance |
| BGE base / large (bge-large-en-v1.5, bge-m3, 1024-d) | 1024 | 131,072 | Sweet spot for most workloads |
| 7B-class embedders (e5-mistral-7b, gte-Qwen2-7B) | 256 | 32,768 | Smaller batches; LLM-sized models |
usage block.
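Corpora larger than these limits have to be split client-side. The sketch below greedily packs inputs into batches that respect both a string limit and a combined token limit. `batch_inputs` is a hypothetical helper, and the default `count_tokens` (whitespace splitting) is only a stand-in: real token counts depend on the model's tokenizer.

```python
from typing import Callable, Iterable, Iterator


def batch_inputs(
    texts: Iterable[str],
    max_strings: int,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda t: len(t.split()),
) -> Iterator[list[str]]:
    """Yield batches that stay within both the string and token limits."""
    batch: list[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        if n > max_tokens:
            raise ValueError("a single input exceeds the combined token budget")
        # Close the current batch if adding this text would break either limit.
        if batch and (len(batch) >= max_strings or used + n > max_tokens):
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += n
    if batch:
        yield batch


docs = ["a b c", "d e", "f g h i", "j"]
batches = list(batch_inputs(docs, max_strings=2, max_tokens=5))
print(batches)
```

For the BGE base / large row in the table above, the call would use `max_strings=1024` and `max_tokens=131072`, ideally with some headroom because the stand-in counter undercounts subword tokens.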
Pooling and normalization
The vector returned is the model’s canonical pooled output:
- BGE family: CLS-token pooling, L2-normalized
- E5 family: mean pooling over the final hidden state, L2-normalized
- Nomic / GTE: mean pooling, L2-normalized
vector_cosine_ops, Pinecone, Weaviate) handle this transparently.
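Because the returned vectors are L2-normalized, the dot product of two embeddings equals their cosine similarity, so either metric gives the same ranking. A small self-contained check, using made-up two-dimensional vectors in place of real embeddings:

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for embeddings.
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
print(dot(a, b), cosine(a, b))
```

In practice this means an inner-product index and a cosine index should return the same neighbors for these vectors, as long as nothing re-scales them after retrieval.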
Retry semantics
The endpoint is idempotent at the input level: passing the same input and model returns deterministic vectors. To make a retry safe across network failures, supply an idempotency key:
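A minimal sketch of the pattern, assuming the key travels in an HTTP header named `Idempotency-Key` (a common convention; this page does not confirm the exact header or parameter name). `build_request` is a hypothetical helper that only assembles the request and performs no network call. The important part is that the key is generated once per logical request and reused unchanged on every retry.

```python
import uuid
from typing import Optional


def build_request(model: str, inputs: list[str], idempotency_key: Optional[str] = None) -> dict:
    """Assemble headers and JSON body; generate a key only if none is supplied."""
    key = idempotency_key or str(uuid.uuid4())
    return {
        "headers": {"Idempotency-Key": key},  # assumed header name
        "json": {"model": model, "input": inputs},
    }


# First attempt generates the key; the retry reuses it verbatim.
first = build_request("bge-m3", ["hello world"])
retry = build_request(
    "bge-m3",
    ["hello world"],
    idempotency_key=first["headers"]["Idempotency-Key"],
)
print(first["headers"] == retry["headers"])
```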
max_retries and the SDK handles 429 / 503 with exponential backoff automatically:
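For clients that do not go through the SDK, similar behavior can be approximated by hand. The sketch below retries on 429 and 503 with exponential backoff plus jitter. It is not the SDK's implementation: `call_with_backoff`, `base_delay`, and the `send` callable (which returns a `(status, body)` tuple) are hypothetical names chosen for illustration.

```python
import random
import time
from typing import Any, Callable, Tuple

RETRYABLE = {429, 503}


def call_with_backoff(
    send: Callable[[], Tuple[int, Any]],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call send() until it returns a non-retryable status or retries run out."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt == max_retries:
            break
        delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, ...
        sleep(delay + random.uniform(0.0, delay / 2))  # add jitter
    return status, body


# Simulated endpoint: two retryable failures, then success.
responses = iter([(429, None), (503, None), (200, {"data": []})])
delays: list[float] = []
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, len(delays))
```

Injecting `sleep` keeps the helper testable without real waiting. Combining this loop with a fixed idempotency key per logical request keeps retries safe.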
Next steps
Embeddings + rerank use case
Fuse encoder + cross-encoder reranker on one GPU.
RAG cookbook
End-to-end retrieval with citations.
Rate limits
Per-key budgets and burst behavior.
Errors
Full error code reference.