Skip to main content

Documentation Index

Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

RunInfra gives you full visibility into your deployed endpoints through three dedicated pages: the Observe dashboard for real-time metrics, the Usage & Credits page for cost breakdowns, and the Deployments page for endpoint management. Together they cover everything you need to understand performance, control spend, and respond to issues without leaving the RunInfra interface.

Observe dashboard

The Observe dashboard shows real-time metrics for all your endpoints in one place. You can filter by time period (7 days, 30 days, or 90 days) and drill into per-endpoint breakdowns.

Requests

Total request count, success rate, and error rate. Spot reliability issues at a glance.

Latency

Average, P50, P95, and P99 latency. Use percentiles to understand tail latency in production.

Throughput

Requests per second. Monitor traffic spikes and verify your replica count is sufficient.

Tokens

Input and output token counts. Useful for capacity planning and cost attribution.

Cost

Per-request and aggregate cost. Track spend against your budget in real time.

Usage & Credits

The Usage & Credits page gives you a detailed cost breakdown over the last 30 days.
  • Daily cost chart. Visualize spend trends and identify unexpected spikes
  • Token breakdown. Input vs. output token split across all endpoints
  • Request breakdown. Successful requests vs. errors
  • Cost by model. See which models contribute most to your bill
  • Per-model table. Request count, total cost, and average latency for each model
Use this page to understand where your budget is going and which pipelines are worth optimizing further.

Deployments page

The Deployments page shows every endpoint at a glance so you can manage the full lifecycle without using the agent. Each row displays:
  • Pipeline name and model
  • GPU type and deployment mode (Flex or Active)
  • Current status (active, stopped, or provisioning)
  • Request count and month-to-date cost
  • Quick actions to stop, start, or reconfigure the endpoint

Metrics retention

How far back you can query metrics depends on your plan:
PlanRetention
Starter7 days
Pro90 days
Team1 year
EnterpriseUnlimited
Metrics older than your plan’s retention window are permanently deleted and cannot be recovered by upgrading later. Export data before downgrading if you need to preserve historical records.

Exporting metrics

Observe and Usage pages have a Download button that exports the current filter selection to CSV.

Known limitations

  • Metrics aggregate at 1-minute granularity. Per-request traces are available via the Logs tab for the last 24 hours (unlimited on Enterprise).
  • Request/response body logging is opt-in and plan-gated (Team+). Default retention for bodies is 24 hours; metadata (status code, tokens, latency) follows the table above.
  • Custom alerting (PagerDuty, Slack, email) is Enterprise. Self-hosted Prometheus / Grafana scraping is available via the observability API on Team+.

Common questions

P50 is the median latency (half of requests were faster). P95 is the 95th percentile (95% were faster). P99 shows tail latency (99% were faster). Optimize against P99 for user-facing chat, P50 for background processing.
Predicted cost uses your configured model + GPU + mode; actual cost adds the real input/output token counts from traffic, which depend on prompt length and response length. Long conversations and verbose system prompts drive the delta.
Per-API-key breakdowns are on Team+. Per-end-user attribution requires you to pass the user field in your requests; RunInfra groups usage by that ID in the Observe dashboard.
Requests and latency show within 10 to 30 seconds. Cost numbers refresh hourly to account for rate adjustments and overage calculations.
Yes, as part of the first request’s latency after a replica spin-up. The Observe dashboard has a separate Cold start p99 panel that isolates this for Flex deployments.
Enterprise plans get built-in PagerDuty and Slack integrations. All plans can set a Slack webhook manually via the dashboard’s Notifications settings.

Next steps

Deploy an endpoint

Flex scale-to-zero and Active always-on.

Re-optimize after traffic

Tighten constraints once you see real-world latency.

Rate limits

Per-key budgets and 429 behavior.

Troubleshoot

Diagnose 4xx, 5xx, and cold-start issues.