AI assistants - RunInfra

A chat assistant pipeline takes user messages, applies your system prompt and tool registry, and streams responses back. RunInfra ships this recipe with Llama, Hermes, or Qwen on a single GPU and exposes OpenAI-compatible chat completions, including tool calling, structured output, and streaming where the deployment supports them.

Architecture

User -> Chat endpoint (POST /v1/chat/completions, stream=true)
     -> System prompt + tool registry
     -> LLM (Llama 3.1 8B / Hermes 3 8B / Qwen 2.5 7B, FP8 + compiled)
     -> Tool execution loop (if tools called)
     -> Streamed response
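
The "tool execution loop" step in the diagram can be sketched roughly as follows. This is a minimal illustration, not RunInfra's implementation: `run_tool_loop`, `TOOL_REGISTRY`, and the canned `lookup_order` function are hypothetical names, and `complete` stands in for whatever function sends the messages to the chat endpoint and returns the assistant message as a dict in the OpenAI chat shape.

```python
import json
from typing import Any, Callable


def lookup_order(email: str) -> dict[str, Any]:
    # Hypothetical stand-in for a real CRM lookup; returns canned data.
    return {"email": email, "status": "shipped"}


# Hypothetical registry: tool name -> local Python function.
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {"lookup_order": lookup_order}


def run_tool_loop(complete, messages, max_rounds=4):
    """Call the model, run any requested tools, and repeat until it answers.

    `complete(messages)` is assumed to return one assistant message dict,
    optionally carrying a "tool_calls" list in the OpenAI chat format.
    Returns (final_text, full_message_history).
    """
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_rounds):
        reply = complete(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply.get("content") or "", messages
        for call in tool_calls:
            name = call["function"]["name"]
            args = json.loads(call["function"]["arguments"] or "{}")
            fn = TOOL_REGISTRY.get(name)
            # Report unknown tools back to the model instead of crashing.
            result = fn(**args) if fn else {"error": f"unknown tool: {name}"}
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError("tool loop did not finish within max_rounds")
```

Capping the number of rounds keeps a misbehaving model from calling tools indefinitely; the limit of 4 is an arbitrary choice for the sketch.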

The pipeline exposes OpenAI-compatible chat completions for supported fields. Configure the base URL as https://api.runinfra.ai/v1, swap in your RunInfra key, and pass a deployed model ID served by RunInfra.

What you get out of the box

OpenAI-compatible chat fields: tools, streaming, structured output, temperature, and top-p when the selected backend supports them
Tool calling with OpenAI-shaped tool definitions on compatible deployments
Structured output with JSON response formats when the selected backend supports them
Per-key rate limits (Core) and audit logs (Enterprise)
Sub-200 ms P99 latency on Llama 3.1 8B FP8 with 32-token outputs on an L40S
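
As an illustration of the structured output item above, the sketch below builds a request that asks for a JSON object and validates the reply. It assumes the deployment accepts the OpenAI-style `response_format={"type": "json_object"}` field, which depends on the selected backend. The helper names `build_json_request` and `parse_json_reply`, the `status` and `summary` keys, and the sampling values are hypothetical.

```python
import json


def build_json_request(model: str, user_text: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": "Reply with a JSON object with keys 'status' and 'summary'.",
            },
            {"role": "user", "content": user_text},
        ],
        # Assumption: the backend honors OpenAI-style JSON mode.
        "response_format": {"type": "json_object"},
        # Illustrative sampling values, not recommendations.
        "temperature": 0.2,
        "top_p": 0.9,
    }


def parse_json_reply(content: str, required=("status", "summary")) -> dict:
    """Parse the model's reply and check that the expected keys are present."""
    data = json.loads(content)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

With an OpenAI client pointed at RunInfra, this would be used as `resp = client.chat.completions.create(**build_json_request("your-model-id", "..."))` followed by `parse_json_reply(resp.choices[0].message.content)`. Validating on the client side is worthwhile even in JSON mode, since the mode only constrains the reply to be JSON, not to contain particular keys.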

Example prompt

In the dashboard:

Build a chat assistant for our customer success team. Use Llama 3.1 8B with FP8.
I want streaming, tool calling (for our internal CRM lookup function),
and a system prompt that keeps responses under 3 sentences.

Tool calling pattern

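from openai import OpenAI

# Point the standard OpenAI client at the RunInfra endpoint.
client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

# With stream=True the call returns an iterator of chunks rather than a
# single response; tool-call arguments arrive as fragments across chunks.
resp = client.chat.completions.create(
    model="your-model-id",  # ID of a model deployed on RunInfra
    messages=[{"role": "user", "content": "What's the order status for jaber@runinfra.ai?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "lookup_order",
            "parameters": {
                "type": "object",
                "properties": {"email": {"type": "string"}},
                "required": ["email"],  # the lookup needs an email to run
            },
        }
    }],
    stream=True,
)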
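
Because the request above sets `stream=True`, `resp` is an iterator of chunks, and any tool call arrives in pieces. The sketch below shows one way to reassemble it, assuming the chunks follow the OpenAI streaming shape (`choices[0].delta` with optional `content` and `tool_calls` entries keyed by `index`). `collect_stream` is a hypothetical helper name.

```python
def collect_stream(chunks):
    """Join streamed text and reassemble tool calls from delta fragments.

    Returns (text, tool_calls), where each tool call is a dict with
    "id", "name", and "arguments" (a JSON string, still unparsed).
    """
    text_parts = []
    calls = {}  # tool-call index -> partially assembled call
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices at all
        delta = chunk.choices[0].delta
        if getattr(delta, "content", None):
            text_parts.append(delta.content)
        for tc in getattr(delta, "tool_calls", None) or []:
            slot = calls.setdefault(tc.index, {"id": None, "name": None, "arguments": ""})
            if tc.id:
                slot["id"] = tc.id
            if tc.function and tc.function.name:
                slot["name"] = tc.function.name
            if tc.function and tc.function.arguments:
                # Arguments arrive as string fragments; concatenate in order.
                slot["arguments"] += tc.function.arguments
    return "".join(text_parts), [calls[i] for i in sorted(calls)]
```

Calling `text, tool_calls = collect_stream(resp)` yields the assistant text plus any complete tool calls; each `arguments` string can then be parsed with `json.loads` before invoking the matching function.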

Deeper details

For the marketing page with the full model matrix and use case examples, see runinfra.ai/use-cases/ai-assistant.
