POST https://api.runinfra.ai/v1/embeddings
Generate vector embeddings for one or more inputs. The response uses OpenAI’s { data: [{embedding, index}] } shape. Billing is per input token only; embeddings produce no output tokens.

Request

from openai import OpenAI
client = OpenAI(base_url="https://api.runinfra.ai/v1", api_key="YOUR_RUNINFRA_API_KEY")

# Single input
resp = client.embeddings.create(
    model="bge-m3",
    input="Quantum entanglement is correlated quantum states.",
)
vector = resp.data[0].embedding  # 1024-dim float array

# Batch input
resp = client.embeddings.create(
    model="bge-m3",
    input=["text one", "text two", "text three"],
)
for item in resp.data:
    print(item.index, len(item.embedding))

Parameters

model (string, required)
The embedding model id. Must be deployed in your workspace (e.g. "bge-m3", "mxbai-embed-large").

input (string | string[], required)
Single string or array of strings to embed. Arrays are processed in a single batched GPU call.

encoding_format (string, default: "float")
"float" (array of numbers) or "base64" (compact, use when wire size matters).

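With encoding_format set to "base64", the embedding field arrives as a string instead of a list of numbers. The sketch below decodes such a string, assuming the payload follows the OpenAI convention of packed little-endian float32 values. This page does not state the byte layout, so compare against a "float" response before relying on it. decode_embedding is a hypothetical helper name, not part of any SDK.

```python
import base64
import struct


def decode_embedding(b64: str) -> list[float]:
    """Decode a base64 embedding, ASSUMING packed little-endian float32."""
    raw = base64.b64decode(b64)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))


# Round-trip check with locally packed values (no API call involved).
sample = base64.b64encode(struct.pack("<3f", 0.5, -0.25, 1.0)).decode("ascii")
decoded = decode_embedding(sample)
```

The three sample values are exactly representable in float32, so the round trip returns them unchanged. Real embeddings decoded this way will differ from the "float" output only by float32 rounding, if the layout assumption holds.
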
Response

{
  "object": "list",
  "data": [
    { "object": "embedding", "embedding": [0.012, -0.043, ...], "index": 0 }
  ],
  "model": "BAAI/bge-m3",
  "usage": { "prompt_tokens": 8, "total_tokens": 8 }
}

Typical use cases

RAG pipeline

Embed your documents once, store in a vector DB (pgvector, Pinecone, Weaviate), retrieve by cosine similarity at query time.

Semantic dedup

Cluster near-duplicate tickets, support emails, or product reviews.

Classification

Embed labels and queries in the same space, pick nearest-neighbor label.

Hybrid search

Blend BM25 + embedding similarity for better recall than either alone.
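
For the hybrid search case above, one common way to blend the two signals is to rescale each score set to [0, 1] and take a weighted sum. The sketch below is an illustration under that assumption, not a method prescribed by this API; min_max, blend, and the alpha weight are hypothetical names, and the scores are made-up sample values.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def blend(
    bm25: dict[str, float],
    dense: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Rank documents by alpha * dense + (1 - alpha) * bm25 after rescaling.

    A document missing from one retriever contributes 0 for that side.
    """
    b, d = min_max(bm25), min_max(dense)
    fused = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
        for doc in set(b) | set(d)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Sample scores: BM25 is unbounded, embedding similarity is a dot product.
bm25_scores = {"doc-a": 12.0, "doc-b": 7.0, "doc-c": 2.0}
dense_scores = {"doc-b": 0.9, "doc-c": 0.8, "doc-d": 0.1}
ranking = blend(bm25_scores, dense_scores, alpha=0.5)
```

Here doc-b ranks first because it scores well on both sides, while doc-a (lexical match only) still outranks doc-c. Reciprocal rank fusion is a frequent alternative when the raw score scales are hard to compare.
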

Billing

Embeddings are billed per input token only. Rates depend on your plan and the model’s parameter size; see Plans for the current rate card. Active-tier deployments carry a lower per-token rate in exchange for the reserved-GPU hourly fee.

Batch size limits

The input array can carry up to 2048 strings per request. The exact cap on strings and the combined token budget depend on the model class:

| Model class | Max strings | Max combined tokens | Notes |
| --- | --- | --- | --- |
| BGE small (bge-small-en, 384-d) | 2048 | 262,144 | Highest batch tolerance |
| BGE base / large (bge-large-en-v1.5, bge-m3, 1024-d) | 1024 | 131,072 | Sweet spot for most workloads |
| 7B-class embedders (e5-mistral-7b, gte-Qwen2-7B) | 256 | 32,768 | Smaller batches; LLM-sized models |

Each input string is also subject to the model’s per-input context limit (typically 512 or 8192 tokens). Inputs longer than the limit are truncated with a warning in the response’s usage block.
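
A large corpus therefore has to be split client-side before it is sent. The sketch below groups texts so that each request stays under both caps. The defaults are the BGE base / large limits from the table; batches and count_tokens are hypothetical names, and the whitespace counter is only a crude stand-in for the model's real tokenizer.

```python
from typing import Callable, Iterator


def batches(
    texts: list[str],
    count_tokens: Callable[[str], int],
    max_strings: int = 1024,
    max_tokens: int = 131_072,
) -> Iterator[list[str]]:
    """Yield consecutive slices of texts that respect both batch caps."""
    current: list[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        if n > max_tokens:
            raise ValueError("a single input exceeds the combined token budget")
        # Start a new batch when adding this text would break either cap.
        if current and (len(current) == max_strings or used + n > max_tokens):
            yield current
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        yield current


def approx_tokens(text: str) -> int:
    """Whitespace word count; NOT the model's tokenizer, illustration only."""
    return len(text.split())


# Tiny caps so the split is visible: at most 2 strings and 5 tokens per batch.
demo = list(
    batches(["a b c", "d e", "f", "g h i j"], approx_tokens, max_strings=2, max_tokens=5)
)
```

Each yielded list can be passed as the input argument of one embeddings request. Because word counts undercount real tokens, leave headroom below the documented budget or plug in the model's actual tokenizer.
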

Pooling and normalization

The vector returned is the model’s canonical pooled output:
  • BGE family: CLS-token pooling, L2-normalized
  • E5 family: mean pooling over the final hidden state, L2-normalized
  • Nomic / GTE: mean pooling, L2-normalized
Because vectors arrive L2-normalized, cosine similarity reduces to a dot product in downstream code. Most vector databases (pgvector with vector_cosine_ops, Pinecone, Weaviate) handle this transparently.
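
The reduction follows from the definition: cosine similarity is a·b / (|a| |b|), and both norms equal 1 for L2-normalized vectors. A small pure-Python check of that identity, using made-up two-dimensional vectors rather than real embeddings (the helper names are illustrative):

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Full cosine similarity, valid for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = l2_normalize([3.0, 4.0])  # roughly [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # roughly [0.8, 0.6]

# For unit vectors the two quantities agree up to floating-point rounding.
difference = abs(dot(a, b) - cosine(a, b))
```

In practice this means a plain dot product (or an inner-product index) gives the same ranking as cosine similarity for vectors returned by this endpoint.
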

Retry semantics

The endpoint is idempotent at the input level: passing the same input and model returns deterministic vectors. To make a retry safe across network failures, supply an idempotency key:
curl https://api.runinfra.ai/v1/embeddings \
  -H "Authorization: Bearer YOUR_RUNINFRA_API_KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-m3","input":["hello"]}'
If a retry with the same key arrives within 24 hours of a successful first call, the cached response is returned: no inference runs and no billing event fires. For OpenAI SDK clients, set max_retries and the SDK retries 429 / 503 responses with exponential backoff automatically:
client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
    max_retries=5,
)
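
The cache only helps if every retry of one logical request carries the same key. Note that $(uuidgen) in the curl example produces a new key on each invocation, so a retry script should generate the key once and reuse it. The sketch below shows that pattern with the standard library only; call_with_idempotency and send are hypothetical names, and the flaky sender simulates network failures instead of calling the API.

```python
import time
import uuid
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_idempotency(
    send: Callable[[dict[str, str]], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(headers), reusing one Idempotency-Key across all retries."""
    # Generate the key once; a fresh key per attempt would bypass the cache.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    for attempt in range(attempts):
        try:
            return send(headers)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2**attempt)  # exponential backoff
    raise RuntimeError("attempts must be at least 1")


# Simulated transport: fails twice, then succeeds. Records the keys it saw.
seen_keys: list[str] = []


def flaky_send(headers: dict[str, str]) -> str:
    seen_keys.append(headers["Idempotency-Key"])
    if len(seen_keys) < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


result = call_with_idempotency(flaky_send, sleep=lambda _seconds: None)
```

With the OpenAI Python SDK, the same header can likely be attached per request through the extra_headers argument, for example client.embeddings.create(..., extra_headers={"Idempotency-Key": key}); whether SDK-internal retries reuse it is worth confirming against your SDK version.
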

Next steps

Embeddings + rerank use case

Fuse encoder + cross-encoder reranker on one GPU.

RAG cookbook

End-to-end retrieval with citations.

Rate limits

Per-key budgets and burst behavior.

Errors

Full error code reference.