Chat completions - RunInfra

POST https://api.runinfra.ai/v1/chat/completions

OpenAI-compatible chat requests support streaming, tools, structured output, temperature, top-p, and other common sampling controls, provided the selected serving backend supports them. Backend-specific parameters may be ignored or rejected if the backend cannot honor them.

Minimal request

from openai import OpenAI
client = OpenAI(base_url="https://api.runinfra.ai/v1", api_key="YOUR_RUNINFRA_API_KEY")

resp = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Summarize quantum entanglement."}],
)
print(resp.choices[0].message.content)

import OpenAI from "openai";
const client = new OpenAI({ baseURL: "https://api.runinfra.ai/v1", apiKey: "YOUR_RUNINFRA_API_KEY" });

const resp = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [{ role: "user", content: "Summarize quantum entanglement." }],
});
console.log(resp.choices[0].message.content);

curl https://api.runinfra.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_RUNINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Summarize quantum entanglement."}]
  }'

Streaming

Set stream: true. The response is a server-sent events (SSE) stream: each event carries a JSON delta chunk, and the stream ends with data: [DONE].

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Count to 5."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

const stream = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [{ role: "user", content: "Count to 5." }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
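If you read the stream without an SDK, the SSE framing described above has to be decoded by hand. The following is a minimal sketch, assuming the body has already been split into decoded text lines and that each chunk follows the choices[0].delta.content shape used in the SDK examples. The function name iter_sse_content and the sample events are illustrative, not captured from the service.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from decoded SSE lines until data: [DONE]."""
    for line in lines:
        line = line.strip()
        # Skip blank separators, comments, and any non-data fields.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = (choices[0].get("delta") or {}).get("content")
        if content:
            yield content


# Hand-written sample events (illustrative only).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "1, 2"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": ", 3"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))  # prints: 1, 2, 3
```

The first chunk in the sample carries only a role and no content, which is why the parser checks for a non-empty content field before yielding.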

Function / tool calling

Pass tools and tool_choice. The model decides whether to call a tool; your code executes the call and sends the result back in a follow-up message. The request and response shapes are identical to the OpenAI API.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
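message = resp.choices[0].message
if message.tool_calls:  # None when the model answers without calling a tool
    tool_call = message.tool_calls[0]
    # function.arguments is a JSON-encoded string such as '{"city": "Paris"}'.
    print(tool_call.function.name, tool_call.function.arguments)
    # Decode the arguments, run get_weather yourself, and send the result back.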
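The "send results back" step can be sketched as follows. This is a hedged example that works on plain dicts, assuming the assistant message has been converted to a dict first (for instance with the SDK's model_dump()). The get_weather body, the LOCAL_TOOLS registry, and the tool_result_messages helper are hypothetical names introduced for illustration.

```python
import json


def get_weather(city: str) -> dict:
    # Hypothetical stand-in; a real implementation would call a weather service.
    return {"city": city, "forecast": "unknown"}


# Maps tool names declared in `tools` to local Python callables.
LOCAL_TOOLS = {"get_weather": get_weather}


def tool_result_messages(assistant_message: dict) -> list:
    """Run each requested tool and return the messages to append to the conversation."""
    # The assistant turn that requested the calls must precede the tool results.
    follow_up = [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        result = LOCAL_TOOLS[fn["name"]](**args)
        follow_up.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the originating call
            "content": json.dumps(result),
        })
    return follow_up
```

Appending the returned list to the original messages and calling chat.completions.create again lets the model compose its final answer from the tool output.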

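Structured output

Set response_format to { type: "json_object" } for free-form JSON, or to { type: "json_schema", json_schema: {...} } to constrain the reply to a schema. The two requests below show each mode in turn.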

resp = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "Reply only in valid JSON."},
        {"role": "user", "content": "Name 3 planets with diameters in km."},
    ],
    response_format={"type": "json_object"},
)

resp = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Name 3 planets."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "Planets",
            "schema": {
                "type": "object",
                "properties": {
                    "planets": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "diameter_km": {"type": "number"},
                            },
                            "required": ["name", "diameter_km"],
                        },
                    },
                },
                "required": ["planets"],
            },
            "strict": True,
        },
    },
)
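Because a backend may ignore parameters it cannot honor, a client-side check on the returned content is a cheap safeguard. The sketch below validates a reply against the Planets shape from the request above using only the standard library. The parse_planets helper is a hypothetical name, not part of any SDK.

```python
import json


def parse_planets(content: str) -> list:
    """Decode the reply and check it against the Planets shape used above."""
    data = json.loads(content)  # raises json.JSONDecodeError on malformed output
    planets = data.get("planets") if isinstance(data, dict) else None
    if not isinstance(planets, list):
        raise ValueError("expected an object with a 'planets' array")
    for item in planets:
        if not isinstance(item, dict) or not isinstance(item.get("name"), str):
            raise ValueError(f"bad planet entry: {item!r}")
        diameter = item.get("diameter_km")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(diameter, bool) or not isinstance(diameter, (int, float)):
            raise ValueError(f"bad diameter_km in: {item!r}")
    return planets
```

Typical use would be parse_planets(resp.choices[0].message.content). Both failure modes surface as ValueError, since json.JSONDecodeError is a subclass of it.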

Request parameters

model

string

required

The model id (e.g. "llama-3.3-70b"). Must be a model deployed in your workspace. Get the list via GET /v1/models.

messages

object[]

required

Array of messages. Each has role ("system", "user", "assistant", "tool") and content (string). Tool calls use the same format as the OpenAI API.

stream

boolean

default:"false"

Return an SSE stream of delta chunks.

max_tokens

integer

default:"512"

Maximum tokens to generate.

temperature

number

default:"1"

Sampling temperature, 0-2. Lower is more deterministic.

top_p

number

default:"1"

Nucleus sampling cutoff.

stop

string | string[]

Up to 4 sequences that halt generation.

tools

object[]

Function-calling tool definitions. Forwarded unchanged.

tool_choice

string | object

"auto" | "none" | "required" | { type: "function", function: { name } }.

response_format

object

{ type: "json_object" } or { type: "json_schema", json_schema: {...} }.

seed

integer

Random seed for reproducible sampling on supported models.

logprobs

boolean

Return token-level log probabilities.

top_logprobs

integer

Number of top logprobs per token (0-5).

n

integer

default:"1"

Number of completions to generate.

presence_penalty

number

default:"0"

Penalty applied to any token that has already appeared in the output, regardless of how often. Range -2 to 2.

frequency_penalty

number

default:"0"

Penalty that scales with how often a token has already appeared in the output. Range -2 to 2.
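The documented ranges can be checked before a request is sent. This is a minimal sketch that encodes only the limits stated in the table above (temperature 0 to 2, penalties -2 to 2, top_logprobs 0 to 5, at most 4 stop sequences). The check_sampling_params helper is a hypothetical name, and the server remains the authority on what it accepts.

```python
def check_sampling_params(params: dict) -> list:
    """Return a list of problems found; an empty list means the values are in range."""
    problems = []

    def in_range(name, low, high):
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name} must be between {low} and {high}")

    in_range("temperature", 0, 2)
    in_range("presence_penalty", -2, 2)
    in_range("frequency_penalty", -2, 2)
    in_range("top_logprobs", 0, 5)

    # stop may be a single string or a list of up to 4 strings.
    stop = params.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most 4 sequences")
    return problems


print(check_sampling_params({"temperature": 2.5}))
# prints: ['temperature must be between 0 and 2']
```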

Response headers

Every successful response includes:

X-Request-Id: <uuid>            # correlation id for support
X-Model-Used: <upstream-model>  # resolved HF model id
X-Route-Model: <requested>      # model you asked for
X-RateLimit-Limit: 500          # your effective per-minute budget
X-RateLimit-Remaining: 498
X-RateLimit-Tier: core          # your plan tier
X-Backend: runpod-vllm          # serving framework
X-Deployment-Id: <uuid>         # the deployment that served this
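A client can log the correlation id and watch its remaining budget by reading the headers listed above. The sketch below operates on a plain mapping of header names to values. The summarize_headers helper and the keys of the dict it returns are hypothetical names chosen for illustration.

```python
def summarize_headers(headers: dict) -> dict:
    """Pull the routing and rate-limit fields out of a response's headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lower = {k.lower(): v for k, v in headers.items()}
    limit = int(lower.get("x-ratelimit-limit", 0))
    remaining = int(lower.get("x-ratelimit-remaining", 0))
    return {
        "request_id": lower.get("x-request-id"),
        "served_by": lower.get("x-model-used"),
        "backend": lower.get("x-backend"),
        # Share of the per-minute budget still available; 0.0 if no limit was sent.
        "remaining_fraction": remaining / limit if limit else 0.0,
    }
```

With the OpenAI Python SDK, the headers should be reachable through client.chat.completions.with_raw_response.create(...), whose result exposes .headers alongside .parse(). Treat that as an assumption to confirm against the SDK version you have installed.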
