RunInfra SDK

The RunInfra SDK is the native access layer for optimized deployments. It keeps OpenAI-compatible routes available when the selected deployment supports them, and adds RunInfra-specific safety features: pipeline IDs, scoped API keys, request IDs, idempotency headers for helpers proven replay-safe, typed errors, deployment-gated audio, image, and voice helpers, and local webhook signature verification.

Use the dashboard-generated snippet for the operation your deployment supports. The dashboard hides unsupported helpers for a deployment, includes the right endpoint scope, and shows the model ID returned by GET /v1/models.

When to use it

Use the RunInfra SDK

You want scoped RunInfra keys, pipeline IDs, typed errors, request IDs, replay-safe retries, streaming helpers, deployment-gated audio/image helpers, and webhook verification in one client.

Use the OpenAI SDK

You already have OpenAI-compatible code and only need to swap base_url, API key, and model ID.

Both paths call the same verified public gateway. After an optimization run finishes, open the Deploy view from the dashboard. It shows only the SDK operations that the verified endpoint supports, so copy the native or OpenAI-compatible snippet from there instead of guessing a route.
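To make the OpenAI-compatible path concrete, here is a minimal sketch of the raw HTTP request that such a client would send for a chat completion. It uses only the Python standard library and builds the request without sending it. The Bearer authorization header is an assumption based on the OpenAI API convention, and the base URL is the one used in the setup examples on this page; keep the dashboard-generated value if yours differs.

```python
import json
import urllib.request

# Base URL taken from the setup examples on this page.
BASE_URL = "https://api.runinfra.ai/v1"


def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: Bearer auth, following the OpenAI API convention.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Only the base URL, the key, and the model ID are RunInfra-specific here, which is why existing OpenAI-compatible code can usually be pointed at the gateway by swapping those three values.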

Install

npm install @runinfra/sdk

pip install runinfra

Base URL and pipeline ID

Use a workspace-scoped key when possible. It reaches every verified deployment in your workspace and selects the target with the model parameter. Use a pipeline-scoped key or pipelineId when you want one client locked to one optimized pipeline.

import { RunInfra } from "@runinfra/sdk";

const apiKey = process.env.RUNINFRA_GATEWAY_KEY;
if (!apiKey) throw new Error("Set RUNINFRA_GATEWAY_KEY.");

const client = new RunInfra({
  apiKey,
  baseURL: "https://api.runinfra.ai/v1",
  pipelineId: "your-optimized-pipeline-id",
});

import os
from runinfra import RunInfra

api_key = os.environ.get("RUNINFRA_GATEWAY_KEY")
if not api_key:
    raise RuntimeError("Set RUNINFRA_GATEWAY_KEY.")

client = RunInfra(
    api_key=api_key,
    base_url="https://api.runinfra.ai/v1",
    pipeline_id="your-optimized-pipeline-id",
)

If your dashboard snippet shows a different production base URL, keep the generated value. The pipeline ID should be the optimized inference pipeline ID from RunInfra, not the Hugging Face model ID.
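One way to keep these values out of source code is to resolve them from the environment and pass the pipeline ID only when you want a locked client. The sketch below is a suggestion, not part of the SDK: `RUNINFRA_BASE_URL` and `RUNINFRA_PIPELINE_ID` are hypothetical variable names, and it assumes the `pipeline_id` argument can be omitted for workspace-scoped keys.

```python
import os
from typing import Mapping, Optional

# Default taken from the examples above; a dashboard-generated value overrides it.
DEFAULT_BASE_URL = "https://api.runinfra.ai/v1"


def client_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve RunInfra client settings from environment variables."""
    env = os.environ if env is None else env

    api_key = env.get("RUNINFRA_GATEWAY_KEY", "").strip()
    if not api_key:
        raise RuntimeError("Set RUNINFRA_GATEWAY_KEY.")

    kwargs = {
        "api_key": api_key,
        # RUNINFRA_BASE_URL is a hypothetical override, not an SDK convention.
        "base_url": env.get("RUNINFRA_BASE_URL", "").strip() or DEFAULT_BASE_URL,
    }

    # RUNINFRA_PIPELINE_ID is also hypothetical. Leave it unset to let a
    # workspace-scoped key select the deployment through the model parameter.
    pipeline_id = env.get("RUNINFRA_PIPELINE_ID", "").strip()
    if pipeline_id:
        kwargs["pipeline_id"] = pipeline_id
    return kwargs
```

With the SDK installed, the result would be passed as `RunInfra(**client_kwargs())`.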

Supported operations

Modality	SDK operation	OpenAI-compatible route	Notes
LLM and vision-language	`chat.completions.create()`	`POST /v1/chat/completions`	Streaming, tools, structured output.
LLM and vision-language	`responses.create()`	`POST /v1/responses`	Chat-completions compatibility adapter for supported Responses-style input.
Embeddings	`embeddings.create()`	`POST /v1/embeddings`	Batched vector response. Sent once by the SDK even when you provide an idempotency key.
Text to speech	`audio.speech.create()`	`POST /v1/audio/speech`	Returns binary audio. Configure the deployment-supported voice or reference-audio mode. Sent once by the SDK even when you provide an idempotency key.
Speech to text	`audio.transcriptions.create()`	`POST /v1/audio/transcriptions`	Multipart audio upload. Sent once by the SDK even when you provide an idempotency key.
Voice pipeline	`voice.pipeline.create()`	Native `/pipeline` helper	Pipeline-scoped helper for co-located audio-to-response deployments. Not an OpenAI-compatible route.
Rerank	`rerank.create()`	Native `POST /v1/rerank` or `POST /v1/{pipelineId}/rerank` helper	Reranker helper for verified TEI text deployments and compatible vLLM vision rerank deployments. Flat workspace `/v1/rerank` requests are replay-safe with an idempotency key; pipeline-scoped rerank snippets are sent once.
Image generation	`images.generate()`	`POST /v1/images/generations`	Returns OpenAI-shaped image data from verified image deployments. Sent once by the SDK even when you provide an idempotency key.
Discovery	`models.list()` / `models.retrieve()`	`GET /v1/models` / `GET /v1/models/{model}`	Free model discovery for verified active deployments.
Webhooks	TypeScript `webhooks.verifySignature()` / `constructEvent()`, Python `webhooks.verify_signature()` / `construct_event()`	Local helper only	Delivery routes are not public yet. Verification helpers are available now.

Image, audio, rerank, and voice helpers are preview surfaces and require a deployment that exposes the matching route. Unsupported operations are hidden from dashboard deployment snippets or shown with a reason.
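Because the Discovery row lists only verified active deployments, it can help to confirm a model ID against that list before sending a charge-bearing request. The helper below is a sketch that assumes the response follows the OpenAI list shape (a `data` array of objects with an `id` field); `require_model` is a hypothetical name, not an SDK function.

```python
def deployed_model_ids(models_response: dict) -> list:
    """Extract model IDs from an OpenAI-style model list response."""
    # Assumption: the payload looks like {"object": "list", "data": [{"id": ...}]}.
    return [entry["id"] for entry in models_response.get("data", []) if "id" in entry]


def require_model(models_response: dict, model_id: str) -> str:
    """Return model_id if it is listed, otherwise raise LookupError."""
    available = deployed_model_ids(models_response)
    if model_id not in available:
        raise LookupError(
            f"Model {model_id!r} is not an active deployment. Available: {available}"
        )
    return model_id
```

If the Python SDK returns a dict-like object from `models.list()`, the check would read `require_model(client.models.list(), "your-model-id")`.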

Streaming chat

const stream = await client.chat.completions.create({
  model: "your-model-id",
  messages: [{ role: "user", content: "Hello" }],
  max_tokens: 512,
  stream: true,
});

console.error("request id:", stream.requestId);
for await (const event of stream) {
  process.stdout.write(event.choices?.[0]?.delta?.content ?? "");
}

stream = client.chat.completions.create(
    model="your-model-id",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=512,
    stream=True,
)

print(stream.request_id)
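for event in stream:
    # Some chunks may carry no choices or a null content field; print nothing for those.
    choices = event.get("choices") or [{}]
    print(choices[0].get("delta", {}).get("content") or "", end="")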

Streaming POST requests are not retried automatically because the partial stream may already have reached your app. Use non-streaming responses.create() or chat.completions.create() with an idempotency key when you need automatic replay-safe retries.
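Since a stream is never replayed for you, your app usually needs to keep whatever text has already arrived. The sketch below accumulates delta text from chunk dictionaries shaped like the Python example above and tolerates chunks without choices or content. `delta_text` and `collect_text` are hypothetical helper names.

```python
from typing import Iterable, Mapping


def delta_text(event: Mapping) -> str:
    """Return the text carried by one streamed chunk, or "" if it has none."""
    choices = event.get("choices") or []
    if not choices:
        return ""
    delta = choices[0].get("delta") or {}
    # Role-only chunks may omit content or set it to None.
    return delta.get("content") or ""


def collect_text(events: Iterable[Mapping]) -> str:
    """Concatenate the text of every chunk in arrival order."""
    return "".join(delta_text(event) for event in events)
```

If the connection drops midway, the text collected so far is still available, and your app can decide whether to show the partial answer or issue a fresh request.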

Responses adapter

RunInfra /v1/responses is a chat-completions compatibility adapter. The gateway converts supported input and instructions fields into chat messages, forwards the request through the chat-completions serving path, then wraps the result in a Responses-shaped envelope. It does not implement full OpenAI Responses state, include, reasoning, tool, conversation-item, or background-job semantics.
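The adapter's behavior can be pictured with a small sketch. This is an illustration of the idea only: the exact mapping the gateway applies is not documented on this page, so the choice of a system message for `instructions`, the handling of list input, and the envelope fields other than `output_text` are all assumptions.

```python
from typing import Optional, Union


def responses_to_chat_messages(
    input_value: Union[str, list], instructions: Optional[str] = None
) -> list:
    """Convert Responses-style input and instructions into chat messages."""
    messages = []
    if instructions:
        # Assumption: instructions become a leading system message.
        messages.append({"role": "system", "content": instructions})
    if isinstance(input_value, str):
        messages.append({"role": "user", "content": input_value})
    else:
        # Assumption: list input already holds role/content message items.
        for item in input_value:
            messages.append({"role": item["role"], "content": item["content"]})
    return messages


def wrap_as_response(chat_completion: dict) -> dict:
    """Wrap a chat completion in a minimal Responses-shaped envelope."""
    text = chat_completion["choices"][0]["message"]["content"]
    return {"object": "response", "output_text": text}
```

Seen this way, it is clear why stateful Responses features have no equivalent: nothing in the chat-completions path stores conversation items or background jobs.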

Idempotency headers

import { randomUUID } from "node:crypto";

const result = await client.responses.create(
  {
    model: "your-model-id",
    input: "Summarize this deployment in one sentence.",
  },
  {
    clientRequestId: randomUUID(),
    idempotencyKey: randomUUID(),
  },
);

console.log(result._request_id, result.output_text);

import uuid

result = client.responses.create(
    model="your-model-id",
    input="Summarize this deployment in one sentence.",
    request_options={
        "client_request_id": str(uuid.uuid4()),
        "idempotency_key": str(uuid.uuid4()),
    },
)

print(result.get("_request_id"), result.get("output_text"))

Only responses.create(), non-streaming chat.completions.create(), and flat workspace rerank.create() calls are currently auto-retry replay-safe when you provide an idempotency key. Audio, embeddings, image generation, streaming, and voice-pipeline requests are sent once even when you provide an idempotency key. Pipeline-scoped rerank requests are also sent once. Keep SDK maxRetries / max_retries at 0 for any cost-sensitive operation outside the replay-safe chat, Responses, and flat rerank helpers.
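The rules above can be condensed into a single predicate, which is handy when you wrap the SDK in your own client code. The function below is a sketch that encodes only what this page states; `is_auto_retry_safe` and the operation name strings are hypothetical and not part of the SDK.

```python
# Helpers this page lists as replay-safe when an idempotency key is present.
REPLAY_SAFE_OPERATIONS = {
    "responses.create",
    "chat.completions.create",
    "rerank.create",
}


def is_auto_retry_safe(
    operation: str,
    *,
    stream: bool = False,
    pipeline_scoped: bool = False,
    has_idempotency_key: bool = False,
) -> bool:
    """Return True if the SDK may automatically retry this request."""
    if not has_idempotency_key:
        return False  # no key, no replay-safe retry
    if stream:
        return False  # streaming requests are sent once
    if operation == "rerank.create" and pipeline_scoped:
        return False  # only flat workspace rerank is replay-safe
    return operation in REPLAY_SAFE_OPERATIONS
```

For anything that returns False, treating a retry as a second billable request is the safe assumption.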

Audio and images

Text to speech
Speech to text
Images
Rerank

Text to speech (TypeScript)

import { randomUUID } from "node:crypto";
import { writeFile } from "node:fs/promises";

const voice = process.env.RUNINFRA_TTS_VOICE?.trim();
const refAudio = process.env.RUNINFRA_TTS_REF_AUDIO?.trim();
const refText = process.env.RUNINFRA_TTS_REF_TEXT?.trim();
const taskType = process.env.RUNINFRA_TTS_TASK_TYPE?.trim() || "Base";
const speechVoice = voice
  ? { voice }
  : refAudio && refText
    ? { ref_audio: refAudio, ref_text: refText, task_type: taskType }
    : null;

if (!speechVoice) {
  throw new Error("Set RUNINFRA_TTS_VOICE, or RUNINFRA_TTS_REF_AUDIO and RUNINFRA_TTS_REF_TEXT.");
}

const audio = await client.audio.speech.create(
  {
    model: "your-tts-model-id",
    input: "Hello from your optimized RunInfra endpoint.",
    ...speechVoice,
  },
  {
    clientRequestId: randomUUID(),
    idempotencyKey: randomUUID(),
  },
);

await writeFile("output.wav", Buffer.from(await audio.arrayBuffer()));
console.log(audio.requestId, audio.contentType);
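The voice selection in the example above (a named voice first, otherwise reference audio plus reference text) is plain logic that ports directly to Python. The sketch below covers only that selection step, because the Python signature of `audio.speech.create()` is not shown on this page; `speech_voice_options` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def speech_voice_options(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick TTS voice fields from the environment, preferring a named voice."""
    env = os.environ if env is None else env

    voice = env.get("RUNINFRA_TTS_VOICE", "").strip()
    ref_audio = env.get("RUNINFRA_TTS_REF_AUDIO", "").strip()
    ref_text = env.get("RUNINFRA_TTS_REF_TEXT", "").strip()
    task_type = env.get("RUNINFRA_TTS_TASK_TYPE", "").strip() or "Base"

    if voice:
        return {"voice": voice}
    if ref_audio and ref_text:
        # Reference-audio mode needs both the clip and its transcript.
        return {"ref_audio": ref_audio, "ref_text": ref_text, "task_type": task_type}
    raise RuntimeError(
        "Set RUNINFRA_TTS_VOICE, or RUNINFRA_TTS_REF_AUDIO and RUNINFRA_TTS_REF_TEXT."
    )
```

Failing before the request is sent matters here because text-to-speech requests are sent once and are not deduplicated by an idempotency key.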

Speech to text (TypeScript)

import { randomUUID } from "node:crypto";
import { readFile } from "node:fs/promises";

const clip = await readFile("clip.wav");
if (clip.byteLength === 0) throw new Error("ASR file must be non-empty.");

const transcript = await client.audio.transcriptions.create(
  {
    model: "your-asr-model-id",
    file: new Blob([clip], { type: "audio/wav" }),
    filename: "clip.wav",
  },
  {
    clientRequestId: randomUUID(),
    idempotencyKey: randomUUID(),
  },
);

console.log(transcript._request_id, transcript.text);

Images (TypeScript)

const image = await client.images.generate({
  model: "your-image-model-id",
  prompt: "A clean product render of a green GPU server rack",
  n: 1,
  size: "1024x1024",
});

console.log(image._request_id, image.data[0].b64_json ?? image.data[0].url);

Rerank (TypeScript)

const ranked = await client.rerank.create({
  model: "your-reranker-model-id",
  query: "What is RunInfra Instant Start?",
  texts: [
    "Instant Start stages model weights before a Flex replica serves traffic.",
    "A GPU kernel is a compiled CUDA function.",
  ],
  top_n: 2,
});

console.log(ranked._request_id, ranked.results[0]);

For compatible vLLM vision rerank deployments, pass documents entries with content parts such as { type: "text", text: "..." }, { type: "image_url", image_url: { url: "https://..." } }, or { type: "video_url", video_url: { url: "https://..." } }. TEI rerank deployments are text-only.
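A possible way to assemble such multimodal entries is sketched below. The three content-part shapes come from the paragraph above. Wrapping the parts of one entry in a `content` list is an assumption, as are the helper names and the example URLs, so check the dashboard snippet for the exact entry shape your deployment expects.

```python
def text_part(text: str) -> dict:
    return {"type": "text", "text": text}


def image_part(url: str) -> dict:
    return {"type": "image_url", "image_url": {"url": url}}


def video_part(url: str) -> dict:
    return {"type": "video_url", "video_url": {"url": url}}


def document(*parts: dict) -> dict:
    """Group content parts into one documents entry."""
    if not parts:
        raise ValueError("A document needs at least one content part.")
    # Assumption: each documents entry wraps its parts in a "content" list.
    return {"content": list(parts)}


# Placeholder URLs for illustration only.
documents = [
    document(text_part("Rack layout overview"), image_part("https://example.com/rack.png")),
    document(video_part("https://example.com/walkthrough.mp4")),
]
```

For TEI deployments, keep using the plain `texts` list from the rerank example, since those deployments are text-only.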

Errors and retries

SDK errors carry status, type, and requestId when the gateway provides one. When the API returns OpenAI-style metadata, gateway errors also expose code and param.

Automatic retries are limited to transient failures on replay-safe requests. Safe GET requests can retry transient failures. Charge-bearing POST inference requests retry only when an idempotency key is present and the helper is known to be replay-safe. Currently that means responses.create(), non-streaming chat.completions.create(), and flat workspace rerank.create(). The SDK does not blindly retry streaming requests, embeddings, image generation, pipeline-scoped rerank, binary TTS responses, multipart ASR uploads, or binary voice-pipeline requests. An ASR file must be non-empty before the SDK builds the multipart upload.

Do not rely on Idempotency-Key to deduplicate audio, image, embedding, pipeline-scoped rerank, streaming, or voice-pipeline requests. For those operations, use X-Client-Request-Id for tracing and design client retries on the assumption that a repeated request may run again.
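If you set the SDK retry count to 0 and retry a replay-safe helper yourself, the important detail is that every attempt reuses the same idempotency key. The sketch below shows that pattern with exponential backoff. `GatewayError` is a hypothetical stand-in for the SDK's typed error, and the set of transient status codes is an assumption rather than a documented list.

```python
import time
import uuid
from typing import Callable, Optional

# Assumption: statuses treated as transient. Not taken from the RunInfra docs.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}


class GatewayError(Exception):
    """Hypothetical stand-in for the SDK's typed error."""

    def __init__(self, status: int, request_id: Optional[str] = None):
        super().__init__(f"gateway returned {status}")
        self.status = status
        self.request_id = request_id


def call_with_retries(
    send: Callable[[str], object],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call send(idempotency_key), retrying transient failures with one key."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    idempotency_key = str(uuid.uuid4())  # generated once, reused on every attempt
    for attempt in range(1, max_attempts + 1):
        try:
            return send(idempotency_key)
        except GatewayError as err:
            if err.status not in RETRYABLE_STATUSES or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
```

Use this pattern only for the helpers listed as replay-safe. For the other operations a second attempt may run, and be billed, again.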

Webhook verification

Public webhook delivery is not enabled yet, but local verification helpers are available so your handlers can be written before delivery is turned on.

import { constructWebhookEvent } from "@runinfra/sdk";

const secret = process.env.RUNINFRA_WEBHOOK_SECRET;
if (!secret) throw new Error("Set RUNINFRA_WEBHOOK_SECRET.");

// rawBody must be the unparsed request body exactly as received.
const event = constructWebhookEvent({
  payload: rawBody,
  signatureHeader: request.headers.get("runinfra-signature") ?? "",
  secret,
});

import os
from runinfra import construct_webhook_event

secret = os.environ.get("RUNINFRA_WEBHOOK_SECRET")
if not secret:
    raise RuntimeError("Set RUNINFRA_WEBHOOK_SECRET.")

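# raw_body must be the unparsed request body exactly as received.
event = construct_webhook_event(
    payload=raw_body,
    # Default to an empty string so a missing header fails verification instead of raising KeyError.
    signature_header=request.headers.get("runinfra-signature", ""),
    secret=secret,
)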
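For background on why the helpers need the raw body, the sketch below shows how shared-secret signature verification generally works. It is not RunInfra's scheme: the format of the runinfra-signature header is not described on this page, so the hex-encoded HMAC-SHA256 over the raw body used here is purely an assumption for illustration. Use the SDK helpers in real handlers.

```python
import hashlib
import hmac


def sign(payload: bytes, secret: str) -> str:
    """Compute a hex HMAC-SHA256 signature (illustrative scheme only)."""
    return hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature_hex: str, secret: str) -> bool:
    """Check a signature against the raw payload in constant time."""
    expected = sign(payload, secret)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature_hex.encode("utf-8"))
```

Any change to the bytes, including re-serializing parsed JSON with different whitespace, produces a different signature, which is why handlers should verify before parsing.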

Next steps

OpenAI compatibility

Use the OpenAI SDK against the same gateway.

Authentication

Create scoped keys and understand workspace versus pipeline access.

API reference

Endpoint-by-endpoint parameters and response fields.

Rate limits

Per-key limits and retry headers.
