Monitoring - RunInfra

RunInfra gives you full visibility into your deployed endpoints through two dedicated pages: the Deployments page for real-time metrics and endpoint management, and the Cost page for cost breakdowns. Together they cover everything you need to understand performance, control spend, and respond to issues without leaving the RunInfra interface.

Metrics overview

The Deployments page opens with an overview of real-time metrics for all your endpoints in one place. You can filter by time period (7 days, 30 days, or 90 days) and drill into per-endpoint breakdowns.

Requests

Total request count, success rate, and error rate. Spot reliability issues at a glance.

Latency

Average, P50, P95, and P99 latency. Use percentiles to understand tail latency in production.

Throughput

Requests per second. Monitor traffic spikes and verify your replica count is sufficient.

Tokens

Input and output token counts. Useful for capacity planning and cost attribution.

Cost

Per-request and aggregate cost. Track spend against your budget in real time.

Cost

The Cost page gives you a detailed cost breakdown over the last 30 days.

Daily cost chart. Visualize spend trends and identify unexpected spikes
Request volume. Track how much traffic drove the spend
Average request cost. See blended cost per request across the selected period
Cost by model. See which models contribute most to your bill
Credit activity. Review top-ups, holds, charges, settlements, and refunds

Use this page to understand where your budget is going and which pipelines are worth optimizing further.

Deployments page

The Deployments page shows every endpoint at a glance so you can manage the full lifecycle without using the agent. Each row displays:

Pipeline name and model
GPU type and deployment mode (Flex or Active)
Current status (active, stopped, or provisioning)
Request count and month-to-date cost
Quick actions to stop, start, or reconfigure the endpoint

Metrics retention

How far back you can query metrics depends on your plan:

Plan	Retention
Core	1 year
Enterprise	Unlimited

Metrics older than your plan’s retention window are permanently deleted and cannot be recovered by upgrading later. Export data before downgrading if you need to preserve historical records.

Exporting metrics

CSV from dashboard
Workspace exports
Observability tools

The Deployments overview and Cost pages have a Download button that exports the current filter selection to CSV.

RunInfra traffic is OpenAI-compatible, so Arize, Langfuse, Helicone, and Datadog pick up RunInfra calls transparently when configured against your existing OpenAI trace pipeline. Point their base_url hook at the RunInfra URL and the entire call chain (prompt, response, latency, token counts) streams into your existing dashboards.

Known limitations

Metrics aggregate at 1-minute granularity. Per-request trace logs are not currently exposed in the dashboard.
Request/response body logging is opt-in and plan-gated (Core). Default retention for bodies is 24 hours; metadata (status code, tokens, latency) follows the table above.
Custom alerting (PagerDuty, Slack, email) is Enterprise. For Prometheus or Grafana workflows, contact RunInfra support for the supported integration path on Core and Enterprise plans.

Common questions

What's the difference between P50, P95, and P99?

P50 is the median latency (half of requests were faster). P95 is the 95th percentile (95% were faster). P99 shows tail latency (99% were faster). Optimize against P99 for user-facing chat, P50 for background processing.

Why are my cost numbers slightly higher than what the Deploy tab predicted?

Predicted cost uses your configured model + GPU + mode; actual cost adds the real input/output token counts from traffic, which depend on prompt length and response length. Long conversations and verbose system prompts drive the delta.

Can I see costs per API key or per user?

Per-API-key breakdowns are on Core. Per-end-user attribution requires you to pass the user field in your requests; RunInfra groups usage by that ID in the deployments metrics overview.

How quickly do metrics appear?

Requests and latency show within 10 to 30 seconds. Cost numbers refresh hourly to account for rate adjustments and overage calculations.

Does cold-start time show up in latency?

Yes, as part of the first request’s latency after a replica spin-up. The metrics overview has a separate Cold start p99 panel that isolates this for Flex deployments.

Can I alert on 5xx rate?

Enterprise plans get built-in PagerDuty and Slack integrations. All plans can set a Slack webhook manually via the dashboard’s Notifications settings.

Next steps

Deploy an endpoint

Flex scale-to-zero and Active always-on.

Re-optimize after traffic

Tighten constraints once you see real-world latency.

Rate limits

Per-key budgets and 429 behavior.

Troubleshoot

Diagnose 4xx, 5xx, and cold-start issues.

​Metrics overview

Requests

Latency

Throughput

Tokens

Cost

​Cost

​Deployments page

​Metrics retention

​Exporting metrics

​Known limitations

​Common questions

​Next steps

Deploy an endpoint

Re-optimize after traffic

Rate limits

Troubleshoot

Metrics overview

Cost

Deployments page

Metrics retention

Exporting metrics

Known limitations

Common questions

Next steps