Embeddings - RunInfra

POST https://api.runinfra.ai/v1/embeddings

Generate vector embeddings for one or more inputs. The response uses OpenAI’s { data: [{embedding, index}] } shape. Billing is per input token only; embeddings produce no output tokens.

Request

from openai import OpenAI
client = OpenAI(base_url="https://api.runinfra.ai/v1", api_key="YOUR_RUNINFRA_API_KEY")

# Single input
resp = client.embeddings.create(
    model="bge-m3",
    input="Quantum entanglement is correlated quantum states.",
)
vector = resp.data[0].embedding  # 1024-dim float array

# Batch input
resp = client.embeddings.create(
    model="bge-m3",
    input=["text one", "text two", "text three"],
)
for item in resp.data:
    print(item.index, len(item.embedding))

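import OpenAI from "openai";
const client = new OpenAI({ baseURL: "https://api.runinfra.ai/v1", apiKey: "YOUR_RUNINFRA_API_KEY" });

// Batch input
const resp = await client.embeddings.create({
  model: "bge-m3",
  input: ["text one", "text two"],
});
resp.data.forEach((d) => console.log(d.index, d.embedding.length));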

curl https://api.runinfra.ai/v1/embeddings \
  -H "Authorization: Bearer YOUR_RUNINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-m3","input":"hello world"}'

Parameters

model (string, required)

The embedding model id. The model must be deployed in your workspace (e.g. "bge-m3", "mxbai-embed-large").

input (string | string[], required)

A single string or an array of strings to embed. Arrays are processed in a single batched GPU call.

encoding_format (string, default: "float")

Either "float" (array of numbers) or "base64" (more compact; use it when wire size matters).

Response

{
  "object": "list",
  "data": [
    { "object": "embedding", "embedding": [0.012, -0.043, ...], "index": 0 }
  ],
  "model": "BAAI/bge-m3",
  "usage": { "prompt_tokens": 8, "total_tokens": 8 }
}
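A minimal sketch of reading this response shape in Python, covering both encoding_format values. It assumes that "base64" payloads are packed little-endian float32 values, which is the OpenAI convention; this page does not state the byte layout, so verify it against a real response. The helper names decode_embedding and vectors_in_input_order are hypothetical, and the sample payload is synthetic (it mixes both encodings only to exercise both branches).

```python
import base64
import struct


def decode_embedding(value):
    """Return a list of floats from either a JSON array or a base64 string.

    Assumption: base64 payloads are packed little-endian float32 values.
    """
    if isinstance(value, list):
        return [float(x) for x in value]
    raw = base64.b64decode(value)
    if len(raw) % 4 != 0:
        raise ValueError("base64 payload is not a whole number of float32 values")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def vectors_in_input_order(response):
    """Sort items by their index field so vectors line up with the inputs."""
    ordered = sorted(response["data"], key=lambda item: item["index"])
    return [decode_embedding(item["embedding"]) for item in ordered]


# Synthetic response: one base64 item and one float item, deliberately out of order.
packed = base64.b64encode(struct.pack("<2f", 0.5, -1.0)).decode("ascii")
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "embedding": packed, "index": 1},
        {"object": "embedding", "embedding": [0.25, 2.0], "index": 0},
    ],
    "model": "BAAI/bge-m3",
    "usage": {"prompt_tokens": 8, "total_tokens": 8},
}

vectors = vectors_in_input_order(sample)
billed_tokens = sample["usage"]["prompt_tokens"]
```

Sorting by index is a defensive choice: it keeps vectors aligned with the input array regardless of the order in which items arrive.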

Typical use cases

RAG pipeline

Embed your documents once, store them in a vector DB (pgvector, Pinecone, Weaviate), and retrieve by cosine similarity at query time.

Semantic dedup

Cluster near-duplicate tickets, support emails, or product reviews.

Classification

Embed labels and queries in the same space, then pick the nearest-neighbor label.

Hybrid search

Blend BM25 + embedding similarity for better recall than either alone.
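As an illustration of the classification pattern above, here is a minimal sketch that picks the label whose embedding is closest to a query embedding. The three-dimensional vectors are toy values standing in for real embeddings returned by the endpoint, and nearest_label is a hypothetical helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_label(query_vec, label_vecs):
    """Return the label whose vector has the highest cosine similarity to the query."""
    return max(label_vecs, key=lambda name: cosine(query_vec, label_vecs[name]))


# Toy vectors; in practice each one comes from client.embeddings.create(...).
label_vecs = {
    "billing": [1.0, 0.0, 0.0],
    "shipping": [0.0, 1.0, 0.0],
}
query_vec = [0.9, 0.1, 0.0]

predicted = nearest_label(query_vec, label_vecs)
```

Label embeddings only need to be computed once and can be cached, so each classification costs a single embedding call for the query.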

Billing

Embeddings are billed per input token only. Rates depend on your plan and the model’s parameter size; see Plans for the current rate card. Active-tier deployments carry a lower per-token rate in exchange for the reserved-GPU hourly fee.

Batch size limits

The input array can carry at most 2048 strings per request. The exact string cap and the combined token budget depend on the model class:

Model class	Max strings	Max combined tokens	Notes
BGE small (`bge-small-en`, 384-d)	2048	262,144	Highest batch tolerance
BGE base / large (`bge-large-en-v1.5`, `bge-m3`, 1024-d)	1024	131,072	Sweet spot for most workloads
7B-class embedders (`e5-mistral-7b`, `gte-Qwen2-7B`)	256	32,768	Smaller batches; LLM-sized models

Each input string is also subject to the model’s per-input context limit (typically 512 or 8192 tokens). Inputs longer than the limit are truncated with a warning in the response’s usage block.
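One way to stay inside these limits is to split a corpus into batches on the client before calling the endpoint. The sketch below is a greedy splitter under stated assumptions: approx_tokens is a hypothetical stand-in that counts whitespace-separated words, so swap in the model's real tokenizer for accurate budgets, and make_batches is an illustrative helper, not part of any SDK.

```python
def approx_tokens(text):
    """Hypothetical token counter: whitespace-separated words, minimum 1."""
    return max(1, len(text.split()))


def make_batches(texts, max_strings, max_tokens, count_tokens=approx_tokens):
    """Yield lists of texts that respect both the string cap and the token budget.

    A single text that exceeds max_tokens on its own is still emitted alone.
    """
    batch, used = [], 0
    for text in texts:
        cost = count_tokens(text)
        # Close the current batch if adding this text would break either limit.
        if batch and (len(batch) >= max_strings or used + cost > max_tokens):
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += cost
    if batch:
        yield batch


docs = ["one two", "three", "four five six", "seven"]
batches = list(make_batches(docs, max_strings=2, max_tokens=4))
```

For bge-m3, the table above suggests calling make_batches(texts, max_strings=1024, max_tokens=131072). Per-input context limits are a separate check that this sketch does not perform.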

Pooling and normalization

The vector returned is the model’s canonical pooled output:

BGE family: CLS-token pooling, L2-normalized
E5 family: mean pooling over the final hidden state, L2-normalized
Nomic / GTE: mean pooling, L2-normalized

Because vectors arrive L2-normalized, cosine similarity reduces to a dot product in downstream code. Most vector databases (pgvector with vector_cosine_ops, Pinecone, Weaviate) handle this transparently.
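The equivalence is easy to check numerically. The sketch below normalizes two toy vectors and compares the plain dot product with the full cosine formula; the vectors are illustrative values, not real model output.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Full cosine similarity, dividing by both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = l2_normalize([3.0, 4.0])  # approximately [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # approximately [0.8, 0.6]

# For unit vectors both norms are 1, so the division changes nothing.
same = math.isclose(dot(a, b), cosine(a, b))
```

Skipping the norm computation saves work at query time, which is why inner-product indexes are a common choice for pre-normalized vectors.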

Idempotency and retries

The endpoint is deterministic for the same deployed model and input, but native SDK clients do not automatically retry embeddings. The SDK sends embeddings once even when you provide an idempotency key because embeddings are not replay-cached operations today. Use X-Client-Request-Id when you need to correlate a batch with logs or support tickets:

curl https://api.runinfra.ai/v1/embeddings \
  -H "Authorization: Bearer YOUR_RUNINFRA_API_KEY" \
  -H "X-Client-Request-Id: $(uuidgen)" \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-m3","input":["hello"]}'

If you retry manually, assume the repeated request may run, and be billed, again. For cost-sensitive ingestion jobs, keep SDK maxRetries / max_retries at 0 unless your own application layer can deduplicate completed batches. For OpenAI SDK clients, set an explicit retry policy instead of relying on broad defaults for large embedding batches:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
    max_retries=0,  # disable automatic SDK retries for embedding batches
)
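A minimal sketch of the application-layer deduplication mentioned above: each batch is keyed by a hash of its model and inputs, and a completed batch is never sent twice. BatchLedger, batch_key, and the embed_fn callback are hypothetical names for illustration. The ledger here lives in memory, while a real ingestion job would persist it (for example in the same database that stores the vectors).

```python
import hashlib
import json


def batch_key(model, texts):
    """Stable fingerprint of a batch: same model and inputs give the same key."""
    payload = json.dumps({"model": model, "input": texts}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class BatchLedger:
    """In-memory record of completed batches, keyed by batch_key."""

    def __init__(self):
        self._done = {}

    def embed_once(self, model, texts, embed_fn):
        """Call embed_fn only if this exact batch has not completed before."""
        key = batch_key(model, texts)
        if key not in self._done:
            # If embed_fn raises, nothing is recorded and a retry will call it again.
            self._done[key] = embed_fn(model, texts)
        return self._done[key]


# Stand-in for a real call such as client.embeddings.create(model=..., input=...).
calls = []


def fake_embed(model, texts):
    calls.append(list(texts))
    return [[0.0] for _ in texts]


ledger = BatchLedger()
first = ledger.embed_once("bge-m3", ["hello"], fake_embed)
second = ledger.embed_once("bge-m3", ["hello"], fake_embed)  # served from the ledger
```

Caching by content is safe here only because the endpoint is described as deterministic for the same deployed model and input.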

Next steps

Embeddings + rerank use case

Fuse encoder + cross-encoder reranker on one GPU.

RAG cookbook

End-to-end retrieval with citations.

Rate limits

Per-key budgets and burst behavior.

Errors

Full error code reference.