Skip to main content

Documentation Index

Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

RunInfra streams long-running operations (optimization sessions, chat completions, runbook execution) over Server-Sent Events (SSE). This page documents every event type the engine emits and how clients should handle them.

Transport

All SSE responses follow the standard wire format:
event: <event-name>
data: <json-payload>

Required headers on the response:
HeaderValue
Content-Typetext/event-stream; charset=utf-8
Cache-Controlno-cache, no-transform
Connectionkeep-alive
X-Accel-Bufferingno (when behind a proxy)
The engine sends a heartbeat every 15 seconds to keep idle connections alive during long profiling and optimization waits. See Heartbeats below.

Chat completions stream

These events flow when stream: true is set on POST /v1/chat/completions. The transport is OpenAI-compatible.

data (chunk)

Standard OpenAI streaming chunk. Each event delivers a choices[].delta partial token.
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"}}]}

data: [DONE]

Sent as the final event when the stream is complete.
data: [DONE]

Optimization stream

These events flow when an optimization session is running. Subscribe via the chat panel’s SSE channel during a /start_optimization tool call, or directly against the optimization stream endpoint.

activity_step

A discrete tool step in the agent’s activity timeline. Used for tools that don’t render their own card (file edits, intake updates, status checks).
{
  "id": "step_abc123",
  "tool": "update_intake",
  "summary": "Updated constraints: max latency 200ms",
  "toolInput": { "constraints": { "p99_latency_ms": 200 } },
  "toolOutput": { "intake": { "constraints": "..." } },
  "status": "completed"
}

tool_card

A rich tool result that renders as its own card (kernel agent, quantization variants, KV cache, speculation, judge, serving, hardware, compatibility). The card type is in tool and the full payload is in result.
{
  "tool": "kernel_agent",
  "result": {
    "kernels_applied": ["flash_attention_2", "rms_norm_fused"],
    "speedup_pct": 18.4
  }
}

EXPERIMENT

Emitted at the end of an optimization session, one event per model variant tested. Each event carries the full benchmark result for that variant. Frontend persists these to rebuild the optimization summary on reload.
{
  "model_id": "meta-llama/Llama-3.1-8B-Instruct",
  "variant": "fp8",
  "gpu": "L40S",
  "p50_latency_ms": 142,
  "p99_latency_ms": 198,
  "tokens_per_sec": 64.2,
  "cost_per_million": 0.22
}

infra_log

Forwarded log lines from the engine and provisioning layer (Modal, RunPod). Used to render the live terminal in the chat panel during long-running profiling waits.
{
  "level": "info",
  "source": "modal",
  "message": "Pulling image meta-llama/Llama-3.1-8B-Instruct..."
}

Heartbeats

The engine sends a comment-style heartbeat every 15 seconds while a session is idle (waiting on GPU profiling, optimization runs, or external systems):
: heartbeat 2026-05-18T03:45:00Z

The leading : is the SSE comment syntax. Clients should:
  • Not parse the heartbeat as an event
  • Reset their idle timer when one arrives
  • Detect a stuck connection if three consecutive heartbeats are missed (45 s without traffic)

Reconnection

If the connection drops mid-session, the client should:
  1. Re-issue the request to the same session/job id endpoint
  2. Pass Last-Event-ID if the server emitted ids
  3. The server replays events from that id forward; if it can’t, it replays from the beginning of the still-running step
In the RunInfra dashboard, the active-job recovery system polls running optimization jobs after page reload and reconnects automatically. You only need to handle reconnection in custom SDK code.

Error events

If something fails mid-stream, the server emits an error event and closes the connection:
event: error
data: {"code":"GPU_UNAVAILABLE","message":"L40S quota exhausted for region IAD","retryable":true}

retryable: true means the client can resubmit. retryable: false means the input is rejected (bad request, auth failure, plan limit).

Reference example

import requests, json

with requests.post(
    "https://api.runinfra.ai/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_RUNINFRA_API_KEY", "Accept": "text/event-stream"},
    json={"model": "your-pipeline-id", "messages": [...], "stream": True},
    stream=True,
) as r:
    for raw in r.iter_lines(decode_unicode=True):
        if not raw or raw.startswith(":"):  # heartbeat
            continue
        if raw.startswith("data: "):
            payload = raw[6:]
            if payload == "[DONE]":
                break
            chunk = json.loads(payload)
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)