What this does. Streams model tokens as they are generated, so users see the response appear word-by-word instead of waiting for the full answer. When to use it. Any chat UI. Long-running summaries. Agents that need to show reasoning progress.Documentation Index
Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Minimal code
What to tune
| Parameter | Effect |
|---|---|
stream_options.include_usage | Emits a final usage chunk with token counts for billing |
temperature | Higher = more surprise in each token |
max_tokens | Hard cap on generated length |
stop | Up to four stop sequences. Stream ends early on match |
Common mistakes
- Forgetting
flush=Truein Python. Without it, stdout buffers and tokens appear in chunks. - Breaking the loop too early. The final chunk has
choices[0].finish_reason != nulland an emptydelta.content; keep reading until the stream closes. - Streaming inside a CDN with buffering. Some edges buffer SSE. Either stream direct from the client or disable buffering on the proxy layer.
- Mixing
n > 1with streaming. Allowed, but every delta carries achoices[0..n].index. Route tokens by index or you will interleave.
Next steps
SSE event reference
Every event type RunInfra emits, with heartbeat and reconnection rules.
Tool calling
Stream assistant messages that also invoke tools.
Structured output
Stream JSON that parses incrementally.
OpenAI compatibility
The full contract for the API you’re using.