POST https://api.runinfra.ai/v1/chat/completions
Fully OpenAI-compatible: tools, streaming, structured output, logprobs, n > 1, temperature, top-p, and presence/frequency penalties all pass straight through to the upstream vLLM / SGLang / TRT-LLM server.

Minimal request

from openai import OpenAI
client = OpenAI(base_url="https://api.runinfra.ai/v1", api_key="YOUR_RUNINFRA_API_KEY")

resp = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Summarize quantum entanglement."}],
)
print(resp.choices[0].message.content)

Streaming

Set stream: true (stream=True in the Python SDK). The response is a server-sent events (SSE) stream; each event carries a JSON delta, and the stream is terminated by data: [DONE].
stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Count to 5."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
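If you consume the stream without the SDK, the SSE framing described above can be parsed by hand. The following is a minimal sketch, assuming each event arrives as a single `data:` line whose JSON follows the OpenAI chunk shape (`choices[0].delta.content`); the helper name `iter_sse_content` and the sample events are illustrative, not captured from the API.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from the lines of an SSE chat-completion stream."""
    for raw in lines:
        line = raw.strip()
        # Skip blank separators and any non-data fields (comments, event names).
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # Stream terminator.
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # Some chunks may carry no choices at all.
        content = (choices[0].get("delta") or {}).get("content")
        if content:
            yield content


# Hand-written sample events (assumed shape, for illustration only).
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "1, "}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "2"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```

In practice the `lines` argument would come from an HTTP client that exposes the response body line by line; the parsing logic stays the same.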

Function / tool calling

Pass tools and tool_choice. The model decides whether to call a tool; your code executes the call and sends the result back as a follow-up message. The request and response shapes are identical to the OpenAI API.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
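# The model may answer directly, in which case tool_calls is None.
tool_calls = resp.choices[0].message.tool_calls
if tool_calls:
    tool_call = tool_calls[0]
    # tool_call.function.arguments is a JSON string such as '{"city": "Paris"}';
    # decode it with json.loads before calling your own get_weather(...).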
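The second half of the round trip (execute the tool, then send the result back) can be sketched as follows. This assumes the OpenAI message convention of a `role: "tool"` message carrying the matching `tool_call_id`; `get_weather` is a hypothetical local function returning canned data, and the tool calls are shown as plain dicts rather than SDK objects.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical local implementation; returns canned data for the example."""
    return {"city": city, "forecast": "unknown"}


def tool_result_messages(tool_calls: list[dict]) -> list[dict]:
    """Run each requested tool and build the matching role="tool" messages."""
    registry = {"get_weather": get_weather}
    messages = []
    for call in tool_calls:
        fn = registry[call["function"]["name"]]
        # Arguments arrive as a JSON-encoded string, not a dict.
        args = json.loads(call["function"]["arguments"])
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return messages


# Hand-written stand-in for resp.choices[0].message.tool_calls.
calls = [{
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
}]
followup = (
    [{"role": "user", "content": "Weather in Paris?"}]
    + [{"role": "assistant", "content": None, "tool_calls": calls}]
    + tool_result_messages(calls)
)
print(followup[-1])
```

Passing `followup` as `messages` in a second `client.chat.completions.create(...)` call, with the same `tools`, lets the model turn the tool output into a final answer. With the SDK, the same fields are read as attributes (`call.id`, `call.function.name`) instead of dict keys.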

Structured output

resp = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "Reply only in valid JSON."},
        {"role": "user", "content": "Name 3 planets with diameters in km."},
    ],
    response_format={"type": "json_object"},
)
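The request above uses `json_object`, which asks for valid JSON but does not pin down its fields. For a fixed shape, the `json_schema` variant of `response_format` can be used instead. The sketch below assumes the inner keys (`name`, `schema`, `strict`) follow the OpenAI convention and are forwarded as-is; the schema, the name `planet_list`, and the sample content are illustrative.

```python
import json

planet_schema = {
    "type": "object",
    "properties": {
        "planets": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "diameter_km": {"type": "number"},
                },
                "required": ["name", "diameter_km"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["planets"],
    "additionalProperties": False,
}

# Inner keys ("name", "schema", "strict") follow the OpenAI convention (assumed).
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "planet_list", "schema": planet_schema, "strict": True},
}


def parse_planets(content: str) -> list[dict]:
    """Decode the message content and check the fields the schema promises."""
    data = json.loads(content)
    planets = data["planets"]
    for planet in planets:
        if not isinstance(planet["name"], str):
            raise ValueError("name must be a string")
        if not isinstance(planet["diameter_km"], (int, float)):
            raise ValueError("diameter_km must be a number")
    return planets


# Hand-written sample of what resp.choices[0].message.content might contain.
sample = '{"planets": [{"name": "Mars", "diameter_km": 6779}]}'
print(parse_planets(sample))
```

Pass `response_format=response_format` to `client.chat.completions.create(...)` and feed `resp.choices[0].message.content` to `parse_planets`. Checking the decoded result on the client side is a cheap safeguard in case a given backend enforces the schema less strictly.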

Request parameters

model
string
required
The model id (e.g. "llama-3.3-70b"). Must be a model deployed in your workspace. Get the list via GET /v1/models.
messages
object[]
required
Array of messages. Each has role ("system", "user", "assistant", "tool") and content (string). Tool calls use the same format as the OpenAI API.
stream
boolean
default:"false"
Return an SSE stream of delta chunks.
max_tokens
integer
default:"512"
Maximum tokens to generate.
temperature
number
default:"1"
Sampling temperature, 0-2. Lower is more deterministic.
top_p
number
default:"1"
Nucleus sampling cutoff.
stop
string | string[]
Up to 4 sequences that halt generation.
tools
object[]
Function-calling tool definitions. Forwarded unchanged.
tool_choice
string | object
"auto" | "none" | "required" | { type: "function", function: { name } }.
response_format
object
{ type: "json_object" } or { type: "json_schema", json_schema: {...} }.
seed
integer
Seed for deterministic sampling on supported models.
logprobs
boolean
Return token-level log probabilities.
top_logprobs
integer
Number of top logprobs per token (0-5).
n
integer
default:"1"
Number of completions to generate.
presence_penalty
number
default:"0"
Penalty for repeating tokens already in the output, from -2 to 2.
frequency_penalty
number
default:"0"
Penalty for frequent-token usage, from -2 to 2.
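The documented ranges can be checked on the client before a request is sent. This is a minimal sketch, assuming only the limits stated in the parameter list (temperature 0 to 2, top_logprobs 0 to 5, penalties -2 to 2, at most 4 stop sequences); the helper `check_params` is hypothetical and not part of any SDK, and the server remains the authority on what it accepts.

```python
def check_params(params: dict) -> list[str]:
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    for required in ("model", "messages"):
        if required not in params:
            problems.append(f"missing required field: {required}")

    def in_range(name: str, low: float, high: float) -> None:
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name} must be between {low} and {high}")

    in_range("temperature", 0, 2)
    in_range("top_logprobs", 0, 5)
    in_range("presence_penalty", -2, 2)
    in_range("frequency_penalty", -2, 2)

    stop = params.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most 4 sequences")
    return problems


request = {
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Count to 5."}],
    "temperature": 2.5,
    "stop": ["a", "b", "c", "d", "e"],
}
print(check_params(request))
```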

Response headers

Every successful response includes:
X-Request-Id: <uuid>            # correlation id for support
X-Model-Used: <upstream-model>  # resolved HF model id
X-Route-Model: <requested>      # model you asked for
X-RateLimit-Limit: 500          # your effective per-minute budget
X-RateLimit-Remaining: 498
X-RateLimit-Tier: pro           # your plan tier
X-Backend: runpod-vllm          # serving framework
X-Deployment-Id: <uuid>         # the deployment that served this
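These headers can be read on the client to track the remaining budget and to log the request id for support. The sketch below parses a plain header mapping; the `RateLimitInfo` class and `read_rate_limit` helper are hypothetical names, and the sample values are copied from the listing above.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class RateLimitInfo:
    limit: int
    remaining: int
    tier: str
    request_id: str

    @property
    def used_fraction(self) -> float:
        """Share of the per-minute budget already consumed."""
        return 1 - self.remaining / self.limit if self.limit else 1.0


def read_rate_limit(headers: Mapping[str, str]) -> RateLimitInfo:
    # HTTP header names are case-insensitive, so normalise before lookup.
    lower = {k.lower(): v for k, v in headers.items()}
    return RateLimitInfo(
        limit=int(lower["x-ratelimit-limit"]),
        remaining=int(lower["x-ratelimit-remaining"]),
        tier=lower.get("x-ratelimit-tier", ""),
        request_id=lower.get("x-request-id", ""),
    )


# Sample values copied from the header listing.
info = read_rate_limit({
    "X-Request-Id": "<uuid>",
    "X-RateLimit-Limit": "500",
    "X-RateLimit-Remaining": "498",
    "X-RateLimit-Tier": "pro",
})
print(info.remaining, info.tier)
```

With the openai Python SDK, one way to reach the headers is `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` alongside `.parse()` for the usual completion object.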