What this does. Streams model tokens as they are generated, so users see the response appear word by word instead of waiting for the full answer.

When to use it. Any chat UI, long-running summaries, and agents that need to show reasoning progress.

Minimal code

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Tell me a short story"}],
    stream=True,
)

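for chunk in stream:
    # Some chunks (for example a usage-only chunk) may carry no choices.
    if not chunk.choices:
        continue
    # delta.content can be None (e.g. on the final chunk), so fall back to "".
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)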

What to tune

Parameter                      Effect
stream_options.include_usage   Emits a final usage chunk with token counts for billing
temperature                    Higher = more surprise in each token
max_tokens                     Hard cap on generated length
stop                           Up to four stop sequences. Stream ends early on match
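
The parameters above can be combined in a single request. The following is a minimal sketch, not a definitive implementation. It assumes RunInfra follows the OpenAI convention in which the usage chunk arrives last with an empty choices list. collect_stream and fake_chunk are hypothetical helpers introduced here for illustration, and the fake chunks only imitate the shape of real ones so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join streamed text and capture the usage chunk if one is sent.

    Assumption: chunks are OpenAI-style, and the usage-only chunk has an
    empty `choices` list with a non-null `usage` attribute.
    """
    parts, usage = [], None
    for chunk in stream:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if chunk.choices:  # skip the usage-only chunk, which has no choices
            parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts), usage


def fake_chunk(content=None, finish_reason=None, usage=None):
    """Build a stand-in chunk for offline testing (hypothetical helper)."""
    if usage is not None:
        return SimpleNamespace(choices=[], usage=usage)
    choice = SimpleNamespace(
        index=0,
        delta=SimpleNamespace(content=content),
        finish_reason=finish_reason,
    )
    return SimpleNamespace(choices=[choice], usage=None)


# Simulated stream: two text deltas, a finish chunk, then a usage chunk.
demo = [
    fake_chunk("Once "),
    fake_chunk("upon a time"),
    fake_chunk(None, finish_reason="stop"),
    fake_chunk(usage=SimpleNamespace(
        prompt_tokens=5, completion_tokens=4, total_tokens=9)),
]
text, usage = collect_stream(demo)
print(text)
print(usage.total_tokens)

# Against the live API (not run here), the request might look like this.
# The parameter values are illustrative only:
# stream = client.chat.completions.create(
#     model="default",
#     messages=[{"role": "user", "content": "Tell me a short story"}],
#     stream=True,
#     stream_options={"include_usage": True},
#     temperature=0.7,
#     max_tokens=256,
#     stop=["THE END"],
# )
# text, usage = collect_stream(stream)
```

Checking chunk.choices before indexing keeps the loop from raising IndexError when the usage chunk arrives.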

Common mistakes

  • Forgetting flush=True in Python. Without it, stdout buffers and tokens appear in chunks.
  • Breaking the loop too early. The final chunk has choices[0].finish_reason != null and an empty delta.content; keep reading until the stream closes.
  • Streaming through a CDN that buffers. Some edges buffer SSE responses, so tokens arrive in bursts or all at once. Either stream directly to the client or disable buffering on the proxy layer.
  • Mixing n > 1 with streaming. Allowed, but each streamed choice carries an index field (0 to n-1) identifying which completion it belongs to. Route tokens by index or the completions will interleave.
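
The last two loop-related points above (reading until the stream closes, and routing by index when n > 1) can be sketched together. This is a minimal illustration under the assumption that chunks follow the OpenAI chunk shape. route_by_index and fake_choice_chunk are hypothetical names, and the simulated chunks stand in for a real stream.

```python
from collections import defaultdict
from types import SimpleNamespace


def route_by_index(stream):
    """Accumulate text per choice index and record each finish_reason.

    The loop never breaks early: it reads until the stream is exhausted,
    so late chunks for any choice are still collected.
    """
    texts = defaultdict(list)
    finished = {}
    for chunk in stream:
        for choice in chunk.choices:
            texts[choice.index].append(choice.delta.content or "")
            if choice.finish_reason is not None:
                finished[choice.index] = choice.finish_reason
    joined = {i: "".join(parts) for i, parts in sorted(texts.items())}
    return joined, finished


def fake_choice_chunk(index, content=None, finish_reason=None):
    """Build a stand-in chunk with one choice (hypothetical helper)."""
    choice = SimpleNamespace(
        index=index,
        delta=SimpleNamespace(content=content),
        finish_reason=finish_reason,
    )
    return SimpleNamespace(choices=[choice])


# Simulated n=2 stream whose deltas arrive interleaved.
demo = [
    fake_choice_chunk(0, "Hel"),
    fake_choice_chunk(1, "Bon"),
    fake_choice_chunk(0, "lo"),
    fake_choice_chunk(1, "jour"),
    fake_choice_chunk(0, None, finish_reason="stop"),
    fake_choice_chunk(1, None, finish_reason="stop"),
]
texts, finished = route_by_index(demo)
print(texts)
print(finished)
```

Printing every delta directly to one output would mix the two completions here. Keying the buffers by index keeps them separate.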

Next steps

SSE event reference

Every event type RunInfra emits, with heartbeat and reconnection rules.

Tool calling

Stream assistant messages that also invoke tools.

Structured output

Stream JSON that parses incrementally.

OpenAI compatibility

The full contract for the API you’re using.