A chat assistant pipeline takes user messages, applies your system prompt and tool registry, and streams responses back. RunInfra ships the recipe with Llama, Hermes, or Qwen on a single GPU, FP8 quantized, with full OpenAI-compatible chat completions including tool calling, structured output, and streaming.

Architecture

User -> Chat endpoint (POST /v1/chat/completions, stream=true)
     -> System prompt + tool registry
     -> LLM (Llama 3.1 8B / Hermes 3 8B / Qwen 2.5 7B, FP8 + compiled)
     -> Tool execution loop (if tools called)
     -> Streamed response

The pipeline is OpenAI-compatible by design. Drop in your existing OpenAI SDK client, change the base URL to https://api.runinfra.ai/v1, swap your key, and your existing chat app works against an open model on your infrastructure.
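
To make the "streamed response" step concrete: OpenAI-style streaming delivers server-sent events, one `data: <json>` line per chunk, terminated by `data: [DONE]`. The sketch below assumes RunInfra uses that same framing, which the compatibility claim implies but this page does not spell out. `iter_content_deltas` is a hypothetical helper name, and the sample events are hand-written rather than captured from a real response.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines.

    Assumes each event is a single `data: <json>` line and that the
    stream ends with `data: [DONE]`.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample events (assumption: same shape as OpenAI chunks).
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_content_deltas(sample)))
```

In practice the OpenAI SDK does this parsing for you; the sketch is only meant to show what travels over the connection when `stream=true`.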

What you get out of the box

  • Full OpenAI compatibility: tools, streaming, structured output, logprobs, n > 1, temperature, top-p
  • Tool calling with parallel tool calls and forced-call modes
  • Streaming JSON for structured output with a JSON schema you supply
  • Per-key rate limits and audit logs (Team+)
  • Sub-200 ms P99 latency on Llama 3.1 8B FP8 with 32-token outputs on an L40S
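
Two of the items above correspond to specific request fields in the OpenAI chat completions API: `response_format` with a `json_schema` for structured output, and `tool_choice` for forcing a particular tool call. The following is a minimal sketch of those payloads, assuming RunInfra accepts the fields unchanged. The helper names, the `ticket_summary` schema, and the sample message are invented for illustration.

```python
import json
from typing import Iterable


def structured_output_format(name: str, schema: dict) -> dict:
    """Build an OpenAI-style `response_format` value for a JSON schema."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema, "strict": True},
    }


def force_tool(name: str) -> dict:
    """Build an OpenAI-style `tool_choice` value that forces one function."""
    return {"type": "function", "function": {"name": name}}


def parse_streamed_json(fragments: Iterable[str]) -> dict:
    """Join streamed content fragments and parse them once complete."""
    return json.loads("".join(fragments))


# Hypothetical schema; replace with whatever shape your app needs.
ticket_schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["summary", "priority"],
    "additionalProperties": False,
}

# Keyword arguments you could pass to client.chat.completions.create(**request_kwargs).
request_kwargs = {
    "model": "your-pipeline-id",
    "messages": [{"role": "user", "content": "Summarize this ticket: refund not received."}],
    "response_format": structured_output_format("ticket_summary", ticket_schema),
    "stream": True,
}

print(json.dumps(request_kwargs["response_format"], indent=2))
print(force_tool("lookup_order"))
```

With streaming enabled, the JSON document arrives as content fragments, so it only parses once the last fragment is in; `parse_streamed_json` reflects that by joining before decoding.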

Example prompt

In Pipes:
Build a chat assistant for our customer success team. Use Llama 3.1 8B with FP8.
I want streaming, tool calling (for our internal CRM lookup function),
and a system prompt that keeps responses under 3 sentences.

Tool calling pattern

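from openai import OpenAI

client = OpenAI(
    base_url="https://api.runinfra.ai/v1",
    api_key="YOUR_RUNINFRA_API_KEY",
)

# With stream=True the call returns an iterator of chunks, not a single response.
stream = client.chat.completions.create(
    model="your-pipeline-id",
    messages=[{"role": "user", "content": "What's the order status for jaber@runinfra.ai?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "lookup_order",
            "description": "Look up the status of a customer's order by email address.",
            "parameters": {"type": "object", "properties": {"email": {"type": "string"}}, "required": ["email"]},
        }
    }],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
    for call in delta.tool_calls or []:
        if call.function:  # the name arrives once; arguments arrive as JSON fragments
            print(call.index, call.function.name, call.function.arguments)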
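
The "tool execution loop" step in the architecture can be sketched on the client side as follows. When streaming, each tool call arrives as fragments that share an `index`, so they need to be merged before the arguments can be parsed and the function run. This sketch assumes the client executes the tool, as with the OpenAI API; this page does not say whether the pipeline can also run registered tools itself. The helper names and the `lookup_order` stub are hypothetical, and the fragments are hand-written in the shape of `delta.tool_calls` entries (for example, as produced by `model_dump()` on SDK objects).

```python
import json
from typing import Callable, Dict, Iterable, List


def merge_tool_call_deltas(deltas: Iterable[dict]) -> List[dict]:
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls: Dict[int, dict] = {}
    for d in deltas:
        call = calls.setdefault(d["index"], {"id": None, "name": None, "arguments": ""})
        if d.get("id"):
            call["id"] = d["id"]
        fn = d.get("function") or {}
        if fn.get("name"):
            call["name"] = fn["name"]
        call["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


def run_tool_calls(calls: List[dict], registry: Dict[str, Callable[..., object]]) -> List[dict]:
    """Execute each call and return `role: tool` messages for the next request."""
    messages = []
    for call in calls:
        args = json.loads(call["arguments"] or "{}")
        result = registry[call["name"]](**args)
        messages.append(
            {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
        )
    return messages


# Hypothetical stand-in for the CRM lookup; a real one would query your system.
def lookup_order(email: str) -> dict:
    return {"email": email, "status": "shipped"}


# Hand-written fragments, not captured from a real response.
deltas = [
    {"index": 0, "id": "call_1", "function": {"name": "lookup_order", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"email": "jaber@'}},
    {"index": 0, "function": {"arguments": 'runinfra.ai"}'}},
]
calls = merge_tool_call_deltas(deltas)
tool_messages = run_tool_calls(calls, {"lookup_order": lookup_order})
print(tool_messages)
```

To finish the turn, you would append the assistant message that carried the tool calls, then these `tool` messages, to the conversation and call the endpoint again so the model can write its final answer.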

Deeper details

For the marketing page with the full model matrix and use case examples, see runinfra.ai/use-cases/ai-assistant.