Streaming responses

What this does. Streams model tokens as they are generated, so users see the response appear word-by-word instead of waiting for the full answer.

When to use it. Any chat UI, long-running summaries, and agents that need to show reasoning progress.

Minimal code

# Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Tell me a short story"}],
    stream=True,
)

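for chunk in stream:
    # delta.content can be empty or None (for example on the final chunk),
    # so fall back to an empty string before printing.
    delta = chunk.choices[0].delta.content or ""
    # flush=True writes each token immediately instead of waiting for a newline.
    print(delta, end="", flush=True)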

// TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.runinfra.ai/v1",
  apiKey: "YOUR_RUNINFRA_API_KEY",
});

const stream = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [{ role: "user", content: "Tell me a short story" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0].delta.content ?? "");
}

# curl (-N turns off curl's output buffering so events print as they arrive)
curl -N https://api.runinfra.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_RUNINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role":"user","content":"Tell me a short story"}],
    "stream": true
  }'
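
The curl call above returns the raw server-sent event stream rather than parsed chunk objects. The sketch below shows one way to decode that stream by hand. It assumes RunInfra follows the OpenAI wire format, where each event is a `data: <json>` line and the stream ends with `data: [DONE]`; that is not confirmed here, so check the SSE event reference. The helper name `iter_sse_payloads` and the sample lines are illustrative, and skipping `:`-prefixed lines follows the general SSE convention for comments and heartbeats.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.strip()
        # Skip blank event separators and SSE comment lines (often heartbeats).
        if not line or line.startswith(":"):
            continue
        if not line.startswith("data:"):
            continue  # ignore other SSE fields such as `event:` or `id:`
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed OpenAI-style end-of-stream sentinel
            return
        yield json.loads(payload)


# Hand-written sample lines, not captured from a real response.
sample = [
    'data: {"choices":[{"index":0,"delta":{"content":"Once"}}]}',
    "",
    ": keep-alive",
    'data: {"choices":[{"index":0,"delta":{"content":" upon"}}]}',
    "",
    "data: [DONE]",
]

text = "".join(
    event["choices"][0]["delta"].get("content") or ""
    for event in iter_sse_payloads(sample)
)
print(text)
```

In practice the SDK clients shown earlier do this parsing for you; a hand-rolled parser is mainly useful for debugging proxies or for languages without an SDK.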

What to tune

Parameter	Effect
`stream_options.include_usage`	Emits a final `usage` chunk with token counts for billing
`temperature`	Higher values sample less likely tokens more often, so output varies more between runs
`max_tokens`	Hard cap on generated length
`stop`	Up to four stop sequences. Stream ends early on match
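
As a rough sketch of how these parameters fit together, the example below builds a request and consumes the stream, including the trailing usage chunk. It assumes RunInfra mirrors the OpenAI behaviour in which the usage chunk arrives with an empty `choices` list, so the loop guards against indexing into it. The helper names `consume` and `fake_chunk`, the parameter values, and the fake stream are all illustrative; the fake stream stands in for a real response so the example runs without a network call.

```python
from types import SimpleNamespace

# Illustrative values only; tune them for your own workload.
request_options = {
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Tell me a short story"}],
    "stream": True,
    "stream_options": {"include_usage": True},
    "temperature": 0.7,
    "max_tokens": 256,
    "stop": ["\n\n"],
}
# With a real client: stream = client.chat.completions.create(**request_options)


def consume(stream):
    """Collect streamed text and the trailing usage object, if one arrives."""
    parts, usage = [], None
    for chunk in stream:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        # Assumption: the usage chunk has no choices, as in the OpenAI API.
        if not chunk.choices:
            continue
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts), usage


def fake_chunk(text=None, finish_reason=None, usage=None):
    """Build a stand-in for an SDK chunk object (hypothetical test helper)."""
    choices = [] if usage is not None else [
        SimpleNamespace(
            index=0,
            delta=SimpleNamespace(content=text),
            finish_reason=finish_reason,
        )
    ]
    return SimpleNamespace(choices=choices, usage=usage)


fake_stream = [
    fake_chunk("Hello"),
    fake_chunk(", world"),
    fake_chunk(finish_reason="stop"),
    fake_chunk(usage=SimpleNamespace(prompt_tokens=5, completion_tokens=3, total_tokens=8)),
]

story, usage = consume(fake_stream)
print(story, usage.total_tokens)
```

Note that the minimal loop shown earlier indexes `chunk.choices[0]` unconditionally, which would fail on a chunk with no choices; a guard like the one in `consume` avoids that if you enable `include_usage`.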

Common mistakes

Forgetting flush=True in Python. Without it, stdout is buffered and tokens appear in bursts instead of one at a time.
Breaking the loop too early. The final chunk has a non-null choices[0].finish_reason and an empty delta.content; keep reading until the stream closes.
Streaming through a CDN or proxy that buffers. Some edges buffer SSE responses. Either connect the client directly to the API or disable buffering at the proxy layer.
Mixing n > 1 with streaming. Allowed, but each streamed choice carries an index from 0 to n - 1. Route tokens by index or the outputs will interleave.
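
The last two loop-related mistakes can be avoided with a pattern like the one sketched below: route every delta by its choice index, record each finish_reason without breaking out, and let the loop end only when the stream does. The helper names `route_by_index` and `fake_chunk` are illustrative, and the fake stream imitates the chunk shape the SDK returns so the example runs offline.

```python
from collections import defaultdict
from types import SimpleNamespace


def route_by_index(stream):
    """Return ({index: text}, {index: finish_reason}) for a multi-choice stream."""
    texts = defaultdict(list)
    finish_reasons = {}
    for chunk in stream:
        for choice in chunk.choices:
            if choice.delta.content:
                texts[choice.index].append(choice.delta.content)
            if choice.finish_reason is not None:
                # Record the reason but keep reading until the stream closes.
                finish_reasons[choice.index] = choice.finish_reason
    return {i: "".join(parts) for i, parts in texts.items()}, finish_reasons


def fake_chunk(index, text=None, finish_reason=None):
    """Build a stand-in for an SDK chunk object (hypothetical test helper)."""
    choice = SimpleNamespace(
        index=index,
        delta=SimpleNamespace(content=text),
        finish_reason=finish_reason,
    )
    return SimpleNamespace(choices=[choice])


# Two completions (n=2) whose tokens arrive interleaved.
fake_stream = [
    fake_chunk(0, "The"),
    fake_chunk(1, "A"),
    fake_chunk(0, " fox"),
    fake_chunk(1, " cat"),
    fake_chunk(0, finish_reason="stop"),
    fake_chunk(1, finish_reason="length"),
]

texts, reasons = route_by_index(fake_stream)
print(texts)
print(reasons)
```

Keeping one buffer per index means a UI can render each completion in its own pane, and the recorded finish reasons tell you which ones were cut off by `max_tokens`.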

Next steps

SSE event reference

Every event type RunInfra emits, with heartbeat and reconnection rules.

Tool calling

Stream assistant messages that also invoke tools.

Structured output

Stream JSON that parses incrementally.

OpenAI compatibility

The full contract for the API you’re using.
