A voice agent pipeline takes a microphone stream, transcribes it on the fly, runs the LLM, and streams synthesized speech back, all on a single GPU. RunInfra ships the recipe with Whisper for ASR, Llama or Qwen for the LLM, and an open TTS model (XTTS, Bark, or Kokoro depending on language and voice budget).

Architecture

Mic stream  -> Whisper (streaming ASR, real-time factor < 1)
            -> LLM (Llama 3.1 8B Instruct or Qwen 2.5 7B Instruct, FP8)
            -> TTS (XTTS-v2 streaming, chunked synthesis)
            -> Speaker

All three models live on one L40S with FP8 quantization on the LLM and compiled kernels on the TTS vocoder. The pipeline is throughput-balanced: ASR streams partial transcripts, the LLM starts decoding before ASR finishes, and TTS streams audio while the LLM is still emitting tokens.
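
The overlap between stages can be pictured as three concurrent tasks joined by bounded queues. The following is a minimal sketch only, not RunInfra's implementation: the stage functions are stand-ins that transform strings instead of calling Whisper, the LLM, or the TTS model, and every name in it (`asr_stage`, `llm_stage`, `tts_stage`, `run_pipeline`) is hypothetical.

```python
import asyncio

END = None  # sentinel marking the end of a stream


async def asr_stage(audio_frames, out_q):
    # Stand-in for streaming ASR: emit one partial transcript per audio frame.
    for frame in audio_frames:
        await asyncio.sleep(0)  # yield control, as a real model call would
        await out_q.put(f"word{frame}")
    await out_q.put(END)


async def llm_stage(in_q, out_q):
    # Stand-in for the LLM: emit a token as soon as a partial transcript arrives,
    # without waiting for the ASR stream to finish.
    while (partial := await in_q.get()) is not END:
        await out_q.put(partial.upper())
    await out_q.put(END)


async def tts_stage(in_q, audio_out):
    # Stand-in for chunked TTS: "synthesize" each token while the LLM keeps going.
    while (token := await in_q.get()) is not END:
        audio_out.append(f"audio({token})")


async def run_pipeline(audio_frames):
    # Bounded queues give backpressure, so a fast stage cannot run far ahead
    # of a slow one.
    asr_to_llm = asyncio.Queue(maxsize=4)
    llm_to_tts = asyncio.Queue(maxsize=4)
    audio_out = []
    await asyncio.gather(
        asr_stage(audio_frames, asr_to_llm),
        llm_stage(asr_to_llm, llm_to_tts),
        tts_stage(llm_to_tts, audio_out),
    )
    return audio_out


if __name__ == "__main__":
    print(asyncio.run(run_pipeline([1, 2, 3])))
```

Because all three tasks run under one `asyncio.gather`, the first audio chunk can be produced while later frames are still being transcribed, which is the property the paragraph above describes.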

What you get out of the box

  • Sub-600 ms turn-taking latency on an L40S, measured from first byte to first audio
  • Streaming ASR with VAD, partial transcripts, and barge-in detection
  • Token streaming from the LLM (OpenAI-compatible chat completions API)
  • Chunked TTS synthesis so audio starts within ~150ms of the first LLM token
  • OpenAI-compatible HTTP endpoint plus a WebSocket transport for real-time clients
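
Chunked synthesis depends on cutting the LLM token stream into pieces that are short enough to start speaking early but long enough to sound natural. One common way to do that, shown here as a rough sketch, is to flush a buffer at punctuation boundaries. The function name `chunk_for_tts`, the boundary set, and the `min_chars` default are assumptions made for illustration; the source does not say how RunInfra chunks text.

```python
from typing import Iterable, Iterator

# Punctuation treated as a safe place to cut (an assumed, English-oriented set).
BOUNDARIES = (".", "!", "?", ",", ";", ":")


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into speakable chunks.

    A chunk is emitted once the buffer ends at a punctuation boundary and
    holds at least `min_chars` characters, so very short fragments such as
    "Sure," are merged into the next clause.
    """
    buf = ""
    for tok in tokens:
        buf += tok
        text = buf.strip()
        if len(text) >= min_chars and text.endswith(BOUNDARIES):
            yield text
            buf = ""
    # Flush whatever is left when the LLM stops emitting tokens.
    if buf.strip():
        yield buf.strip()


if __name__ == "__main__":
    stream = ["Sure", ",", " I", " can", " help", ".", " What", " is",
              " your", " order", " number", "?"]
    for chunk in chunk_for_tts(stream):
        print(chunk)
```

A smaller `min_chars` lowers the time to first audio at the cost of choppier prosody; the right value depends on the TTS model and would need to be measured.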

Example prompt

In Pipes:
Build me a voice agent pipeline. I want to handle customer support calls in English.
The agent should be helpful and concise. Optimize for latency on a single GPU.

What the optimization run does

  1. Profiles Whisper variants (large-v3, distil-large-v3, turbo) on the chosen GPU
  2. Searches Llama and Qwen quantizations (AWQ, GPTQ, FP8) for the best speed at the configured quality threshold
  3. Picks the TTS model based on the languages you specify and the latency budget
  4. Tunes batch size, KV cache, and speculative decoding for the LLM
  5. Produces an editable receipt with the full stack and projected per-minute cost
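
Step 2 amounts to a constrained selection: among the quantized candidates that meet the quality threshold, keep the fastest. The sketch below illustrates that rule only. The `Candidate` type, the `pick_fastest` function, and all numbers are made-up placeholders, not benchmark results and not RunInfra's search procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    tokens_per_s: float  # measured decode speed (placeholder values below)
    quality: float       # score relative to the unquantized model, 0..1


def pick_fastest(candidates, min_quality):
    """Return the fastest candidate whose quality meets the threshold."""
    eligible = [c for c in candidates if c.quality >= min_quality]
    if not eligible:
        raise ValueError(f"no candidate reaches quality {min_quality}")
    return max(eligible, key=lambda c: c.tokens_per_s)


# Hypothetical profiling results, for illustration only.
CANDIDATES = [
    Candidate("awq", tokens_per_s=140.0, quality=0.96),
    Candidate("gptq", tokens_per_s=130.0, quality=0.97),
    Candidate("fp8", tokens_per_s=120.0, quality=0.99),
]

if __name__ == "__main__":
    print(pick_fastest(CANDIDATES, min_quality=0.97).name)
```

Raising the threshold shrinks the eligible set, so a stricter quality requirement can only keep the selected speed the same or lower it.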

Deeper details

For the full marketing page with benchmark tables, the model list, and the latest pricing per minute of audio, see runinfra.ai/use-cases/voice-agent.