A chat assistant pipeline takes user messages, applies your system prompt and tool registry, and streams responses back. RunInfra ships the recipe with Llama, Hermes, or Qwen on a single GPU, FP8-quantized, with full OpenAI-compatible chat completions, including tool calling, structured output, and streaming.
Architecture
User -> Chat endpoint (POST /v1/chat/completions, stream=true)
     -> System prompt + tool registry
     -> LLM (Llama 3.1 8B / Hermes 3 8B / Qwen 2.5 7B, FP8 + compiled)
     -> Tool execution loop (if tools called)
     -> Streamed response
The pipeline is OpenAI-compatible by design. Drop in your existing OpenAI SDK client, change the base URL to https://api.runinfra.ai/v1, swap your key, and your existing chat app works against an open model on your infrastructure.
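The "tool execution loop" step in the diagram can be pictured with a minimal sketch. It assumes messages follow the OpenAI chat-completions shape; `TOOLS`, `run_tool_loop`, and `fake_complete` are hypothetical names for illustration, and `fake_complete` stands in for a real call to the chat endpoint so the example runs offline. It is not a description of how RunInfra implements the loop internally.

```python
import json
from typing import Any, Callable

def lookup_order(email: str) -> dict:
    # Stand-in for a real CRM lookup; returns canned data.
    return {"email": email, "status": "shipped"}

# Hypothetical tool registry: tool name -> local Python callable.
TOOLS: dict[str, Callable[..., Any]] = {"lookup_order": lookup_order}

def run_tool_loop(complete, messages, max_rounds=5):
    """Call complete(messages) until the model stops requesting tools.

    `complete` is any callable that returns an assistant message dict
    in the OpenAI chat-completions shape (an assumption of this sketch).
    """
    messages = list(messages)
    for _ in range(max_rounds):
        reply = complete(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply["content"], messages
        for call in tool_calls:
            fn = TOOLS[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"] or "{}")
            # Feed each tool result back as a "tool" message tied to its call id.
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(fn(**args)),
            })
    raise RuntimeError("tool loop did not finish within max_rounds")

def fake_complete(messages):
    # Offline stand-in for the model: request a tool first, then answer.
    if messages[-1]["role"] == "tool":
        status = json.loads(messages[-1]["content"])["status"]
        return {"role": "assistant", "content": "Order status: " + status}
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "lookup_order",
                "arguments": '{"email": "a@example.com"}',
            },
        }],
    }

answer, history = run_tool_loop(
    fake_complete, [{"role": "user", "content": "Where is my order?"}]
)
```

In a real client, `complete` would wrap `client.chat.completions.create(...)` and convert the returned message into a dict. The `max_rounds` cap is a design choice that keeps a misbehaving model from looping forever.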
What you get out of the box
- Full OpenAI compat: tools, streaming, structured output, logprobs, n > 1, temperature, top-p
- Tool calling with parallel tool calls and forced-call modes
- Streaming JSON for structured output with a JSON schema you supply
- Per-key rate limits and audit logs (Team+)
- Sub-200 ms P99 latency on Llama 3.1 8B FP8 with 32-token outputs on an L40S
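As a rough illustration of the forced-call and structured-output items above, the sketch below builds request parameters using the field names from the OpenAI Chat Completions API (`tool_choice`, `response_format`). The helper names, the `support_reply` schema, and the example messages are hypothetical. Whether a given pipeline honors every option exactly as OpenAI does is an assumption to verify against your deployment.

```python
def force_tool(name: str) -> dict:
    """tool_choice value that forces a call to one named function."""
    return {"type": "function", "function": {"name": name}}

def json_schema_format(name: str, schema: dict) -> dict:
    """response_format value asking for output that matches `schema`."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema, "strict": True},
    }

# Hypothetical schema for a short support reply.
reply_schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "needs_followup": {"type": "boolean"},
    },
    "required": ["summary", "needs_followup"],
    "additionalProperties": False,
}

# Request that forces the model to call lookup_order.
forced_call_request = {
    "model": "your-pipeline-id",
    "messages": [{"role": "user", "content": "Look up a@example.com"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "lookup_order",
            "parameters": {
                "type": "object",
                "properties": {"email": {"type": "string"}},
            },
        },
    }],
    "tool_choice": force_tool("lookup_order"),
}

# Request that streams JSON conforming to reply_schema.
structured_request = {
    "model": "your-pipeline-id",
    "messages": [{"role": "user", "content": "Summarize this ticket."}],
    "response_format": json_schema_format("support_reply", reply_schema),
    "stream": True,
}
```

Either dict can be passed to an OpenAI SDK client as `client.chat.completions.create(**forced_call_request)`. In the OpenAI API, `tool_choice` also accepts the strings `"auto"`, `"required"`, and `"none"`.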
Example prompt
In Pipes:
Build a chat assistant for our customer success team. Use Llama 3.1 8B with FP8.
I want streaming, tool calling (for our internal CRM lookup function),
and a system prompt that keeps responses under 3 sentences.
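With the OpenAI Python SDK:
from openai import OpenAI

# Point the standard OpenAI client at the RunInfra endpoint.
client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

# With stream=True the call returns an iterator of chunks, not a single response.
resp = client.chat.completions.create(
    model="your-pipeline-id",
    messages=[{"role": "user", "content": "What's the order status for jaber@runinfra.ai?"}],
    # Tool definition the model may call; parameters are a JSON schema.
    tools=[{
        "type": "function",
        "function": {
            "name": "lookup_order",
            "parameters": {"type": "object", "properties": {"email": {"type": "string"}}},
        }
    }],
    stream=True,
)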
Deeper details
For the marketing page with the full model matrix and use case examples, see runinfra.ai/use-cases/ai-assistant.