RunInfra gives you full visibility into your deployed endpoints through three dedicated pages: the Observe dashboard for real-time metrics, the Usage & Credits page for cost breakdowns, and the Deployments page for endpoint management. Together they cover everything you need to understand performance, control spend, and respond to issues without leaving the RunInfra interface.Documentation Index
Fetch the complete documentation index at: https://runinfra.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Observe dashboard
The Observe dashboard shows real-time metrics for all your endpoints in one place. You can filter by time period (7 days, 30 days, or 90 days) and drill into per-endpoint breakdowns.Requests
Total request count, success rate, and error rate. Spot reliability issues at a glance.
Latency
Average, P50, P95, and P99 latency. Use percentiles to understand tail latency in production.
Throughput
Requests per second. Monitor traffic spikes and verify your replica count is sufficient.
Tokens
Input and output token counts. Useful for capacity planning and cost attribution.
Cost
Per-request and aggregate cost. Track spend against your budget in real time.
Usage & Credits
The Usage & Credits page gives you a detailed cost breakdown over the last 30 days.- Daily cost chart. Visualize spend trends and identify unexpected spikes
- Token breakdown. Input vs. output token split across all endpoints
- Request breakdown. Successful requests vs. errors
- Cost by model. See which models contribute most to your bill
- Per-model table. Request count, total cost, and average latency for each model
Deployments page
The Deployments page shows every endpoint at a glance so you can manage the full lifecycle without using the agent. Each row displays:- Pipeline name and model
- GPU type and deployment mode (Flex or Active)
- Current status (active, stopped, or provisioning)
- Request count and month-to-date cost
- Quick actions to stop, start, or reconfigure the endpoint
Metrics retention
How far back you can query metrics depends on your plan:| Plan | Retention |
|---|---|
| Starter | 7 days |
| Pro | 90 days |
| Team | 1 year |
| Enterprise | Unlimited |
Exporting metrics
- CSV from dashboard
- Programmatic via SDK
- Observability tools
Observe and Usage pages have a Download button that exports the current filter selection to CSV.
Known limitations
- Metrics aggregate at 1-minute granularity. Per-request traces are available via the Logs tab for the last 24 hours (unlimited on Enterprise).
- Request/response body logging is opt-in and plan-gated (Team+). Default retention for bodies is 24 hours; metadata (status code, tokens, latency) follows the table above.
- Custom alerting (PagerDuty, Slack, email) is Enterprise. Self-hosted Prometheus / Grafana scraping is available via the observability API on Team+.
Common questions
What's the difference between P50, P95, and P99?
What's the difference between P50, P95, and P99?
P50 is the median latency (half of requests were faster). P95 is the 95th percentile (95% were faster). P99 shows tail latency (99% were faster). Optimize against P99 for user-facing chat, P50 for background processing.
Why are my cost numbers slightly higher than what the Deploy tab predicted?
Why are my cost numbers slightly higher than what the Deploy tab predicted?
Predicted cost uses your configured model + GPU + mode; actual cost adds the real input/output token counts from traffic, which depend on prompt length and response length. Long conversations and verbose system prompts drive the delta.
Can I see costs per API key or per user?
Can I see costs per API key or per user?
Per-API-key breakdowns are on Team+. Per-end-user attribution requires you to pass the
user field in your requests; RunInfra groups usage by that ID in the Observe dashboard.How quickly do metrics appear?
How quickly do metrics appear?
Requests and latency show within 10 to 30 seconds. Cost numbers refresh hourly to account for rate adjustments and overage calculations.
Does cold-start time show up in latency?
Does cold-start time show up in latency?
Yes, as part of the first request’s latency after a replica spin-up. The Observe dashboard has a separate Cold start p99 panel that isolates this for Flex deployments.
Can I alert on 5xx rate?
Can I alert on 5xx rate?
Enterprise plans get built-in PagerDuty and Slack integrations. All plans can set a Slack webhook manually via the dashboard’s Notifications settings.
Next steps
Deploy an endpoint
Flex scale-to-zero and Active always-on.
Re-optimize after traffic
Tighten constraints once you see real-world latency.
Rate limits
Per-key budgets and 429 behavior.
Troubleshoot
Diagnose 4xx, 5xx, and cold-start issues.