SSE events - RunInfra

RunInfra streams long-running operations (optimization sessions, chat completions, runbook execution) over Server-Sent Events (SSE). This page documents every event type the engine emits and how clients should handle them.

Transport

All SSE responses follow the standard wire format:

event: <event-name>
data: <json-payload>

Required headers on the response:

Header	Value
`Content-Type`	`text/event-stream; charset=utf-8`
`Cache-Control`	`no-cache, no-transform`
`Connection`	`keep-alive`
`X-Accel-Buffering`	`no` (when behind a proxy)

The engine sends a heartbeat every 15 seconds to keep idle connections alive during long profiling and optimization waits. See Heartbeats below.

Chat completions stream

These events flow when stream: true is set on POST /v1/chat/completions. The transport is OpenAI-compatible.

`data` (chunk)

Standard OpenAI streaming chunk. Each event delivers a choices[].delta partial token.

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"}}]}

`data: [DONE]`

Sent as the final event when the stream is complete.

data: [DONE]

Optimization stream

These events flow when an optimization session is running. Subscribe via the chat panel’s SSE channel during a /start_optimization tool call, or directly against the optimization stream endpoint.

`activity_step`

A discrete tool step in the agent’s activity timeline. Used for tools that don’t render their own card (file edits, intake updates, status checks).

{
  "id": "step_abc123",
  "tool": "update_intake",
  "summary": "Updated constraints: max latency 200ms",
  "toolInput": { "constraints": { "p99_latency_ms": 200 } },
  "toolOutput": { "intake": { "constraints": "..." } },
  "status": "completed"
}

`tool_card`

A rich tool result that renders as its own card (kernel agent, quantization variants, KV cache, speculation, judge, serving, hardware, compatibility). The card type is in tool and the full payload is in result.

{
  "tool": "kernel_agent",
  "result": {
    "kernels_applied": ["flash_attention_2", "rms_norm_fused"],
    "speedup_pct": 18.4
  }
}

`EXPERIMENT`

Emitted at the end of an optimization session, one event per model variant tested. Each event carries the full benchmark result for that variant. Frontend persists these to rebuild the optimization summary on reload.

{
  "model_id": "meta-llama/Llama-3.1-8B-Instruct",
  "variant": "fp8",
  "gpu": "L40S",
  "p50_latency_ms": 142,
  "p99_latency_ms": 198,
  "tokens_per_sec": 64.2,
  "cost_per_million": 0.22
}

`optimization_result`

Canonical per-engine verdict emitted when an optimization lane completes. When the verdict is produced by an accepted runbook execution, the payload includes executionId; clients should use it to scope Optimization-tab verdicts, hydrated snapshots, Test-tab targets, and deploy/export artifact provenance to the same run.

{
  "engine": "quant",
  "modelId": "meta-llama/Llama-3.1-8B-Instruct",
  "outcome": "win",
  "planPhaseId": "phase-quant",
  "executionId": "runbook-plan-123-1770000000000",
  "verdict": "FP8 dynamic passed.",
  "measuredAll": true
}

`optimization_verified`

Measured post-apply verification result emitted after an optimization is applied and a short GPU verification run completes. When scoped to an accepted runbook execution, the payload includes executionId; clients should only patch a matching Optimization-tab verdict when the verification executionId matches the verdict/run currently in view.

{
  "engine": "quant",
  "modelId": "meta-llama/Llama-3.1-8B-Instruct",
  "verifyId": "vf-mabc123",
  "executionId": "runbook-plan-123-1770000000000",
  "measured": true,
  "baselineMs": 240,
  "optimizedMs": 110,
  "speedup": 2.18,
  "coldLoad": false,
  "verdict": "FP8 verified on GPU."
}

`version_created`

Emitted when the engine persists a new optimization version. When the version is produced by an accepted runbook execution, the payload includes the same executionId stored on the optimization version row. Clients should use that id when hydrating saved version metrics, regenerating Code-tab artifacts, and selecting deploy/export provenance.

{
  "id": "7f6c4b8a-829f-47d9-8c9a-1c5a77f5a921",
  "version": 3,
  "priority": "balanced",
  "isActive": true,
  "executionId": "runbook-plan-123-1770000000000"
}

`infra_log`

Forwarded log lines from the engine and provisioning layer (Modal, RunPod). Used to render the live terminal in the chat panel during long-running profiling waits.

{
  "level": "info",
  "source": "modal",
  "message": "Pulling image meta-llama/Llama-3.1-8B-Instruct..."
}

Heartbeats

The engine sends a comment-style heartbeat every 15 seconds while a session is idle (waiting on GPU profiling, optimization runs, or external systems):

: heartbeat 2026-05-18T03:45:00Z

The leading : is the SSE comment syntax. Clients should:

Not parse the heartbeat as an event
Reset their idle timer when one arrives
Detect a stuck connection if three consecutive heartbeats are missed (45 s without traffic)

Reconnection

If the connection drops mid-session, the client should:

Re-issue the request to the same session/job id endpoint
Pass Last-Event-ID if the server emitted ids
The server replays events from that id forward; if it can’t, it replays from the beginning of the still-running step

In the RunInfra dashboard, reloading mid-execution re-attaches automatically. A read-only runbook execution stream resumes phases, status, events, and persisted cost telemetry within about a second of the page loading, replaying from the last event sequence the page already has, and a background snapshot poll covers the case where the stream cannot connect. You only need to handle reconnection in custom SDK code.

Error events

If something fails mid-stream, the server emits an error event and closes the connection:

event: error
data: {"code":"GPU_UNAVAILABLE","message":"L40S quota exhausted for region IAD","retryable":true}

retryable: true means the client can resubmit. retryable: false means the input is rejected (bad request, auth failure, plan limit).

Reference example

import requests, json

with requests.post(
    "https://api.runinfra.ai/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_RUNINFRA_API_KEY", "Accept": "text/event-stream"},
    json={"model": "your-pipeline-id", "messages": [...], "stream": True},
    stream=True,
) as r:
    for raw in r.iter_lines(decode_unicode=True):
        if not raw or raw.startswith(":"):  # heartbeat
            continue
        if raw.startswith("data: "):
            payload = raw[6:]
            if payload == "[DONE]":
                break
            chunk = json.loads(payload)
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)

​Transport

​Chat completions stream

​data (chunk)

​data: [DONE]

​Optimization stream

​activity_step

​tool_card

​EXPERIMENT

​optimization_result

​optimization_verified

​version_created

​infra_log

​Heartbeats

​Reconnection

​Error events

​Reference example