Voice agents - RunInfra

A voice agent pipeline takes a microphone stream, transcribes it on the fly, feeds the transcript to an LLM, and streams synthesized speech back, all on a single GPU. The RunInfra recipe uses Whisper for ASR, Llama or Qwen for the LLM, and an open TTS model (XTTS, Bark, or Kokoro, depending on language and voice budget).

Architecture

Mic stream  -> Whisper (streaming ASR, real-time factor under 1)
            -> LLM (Llama 3.1 8B Instruct or Qwen 2.5 7B Instruct, FP8)
            -> TTS (XTTS-v2 streaming, chunked synthesis)
            -> Speaker

All three models live on one L40S with FP8 quantization on the LLM and compiled kernels on the TTS vocoder. The pipeline is throughput-balanced: ASR streams partial transcripts, the LLM starts decoding before ASR finishes, and TTS streams audio while the LLM is still emitting tokens.
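The overlap between stages can be illustrated with a minimal sketch, assuming each stage is a coroutine connected to the next by a bounded queue. The stage functions here (`asr`, `llm`, `tts`) are hypothetical stand-ins that only transform strings; they are not RunInfra APIs and do not call any real model.

```python
import asyncio

END = None  # sentinel marking the end of a stream


async def asr(audio_frames, out_q):
    """Stand-in for streaming ASR: emit one partial transcript per frame."""
    for frame in audio_frames:
        await asyncio.sleep(0)  # yield control, as a real model call would
        await out_q.put(frame.upper())
    await out_q.put(END)


async def llm(in_q, out_q):
    """Stand-in for the LLM: emit a token as soon as a partial arrives."""
    while (partial := await in_q.get()) is not END:
        await out_q.put(f"tok({partial})")
    await out_q.put(END)


async def tts(in_q, log):
    """Stand-in for chunked TTS: synthesize each token as it arrives."""
    while (token := await in_q.get()) is not END:
        log.append(f"audio[{token}]")


async def run_pipeline(audio_frames):
    # Bounded queues give backpressure: a fast stage waits for a slow one
    # instead of buffering without limit.
    asr_to_llm = asyncio.Queue(maxsize=1)
    llm_to_tts = asyncio.Queue(maxsize=1)
    log = []
    # All three stages run concurrently rather than one after another.
    await asyncio.gather(
        asr(audio_frames, asr_to_llm),
        llm(asr_to_llm, llm_to_tts),
        tts(llm_to_tts, log),
    )
    return log


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["hi", "there"])))
```

Because no stage waits for the previous one to finish its whole input, the first audio chunk can be produced while later frames are still being transcribed.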

What you get out of the box

Sub-600 ms turn-taking, measured first-byte-to-first-audio on an L40S
Streaming ASR with VAD, partial transcripts, and barge-in detection
Token streaming from the LLM (OpenAI-compatible chat completions API)
Chunked TTS synthesis, so audio starts within ~150 ms of the first LLM token
OpenAI-compatible HTTP endpoint plus a WebSocket transport for real-time clients
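Chunked TTS synthesis depends on cutting the LLM token stream into pieces small enough to synthesize early. The following is a minimal sketch of one common approach, flushing at sentence boundaries or at a length cap. The function name `chunk_for_tts` and the `max_chars` parameter are assumptions made for illustration; the chunking policy RunInfra actually uses is not described here.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def chunk_for_tts(tokens: Iterable[str], max_chars: int = 80) -> Iterator[str]:
    """Group streamed LLM tokens into chunks a TTS engine can start on early.

    A chunk is flushed at a sentence boundary, or once it reaches max_chars,
    so the first audio does not have to wait for the full reply.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        if text and (text.endswith(SENTENCE_END) or len(text) >= max_chars):
            yield text
            buffer = ""
    # Flush whatever is left when the LLM stops emitting tokens.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Sure", ",", " I", " can", " help", ".", " What", " is",
              " your", " order", " number", "?"]
    for chunk in chunk_for_tts(stream):
        print(chunk)
```

A smaller `max_chars` lowers the time to first audio but gives the TTS model less context for prosody, so the cap is a latency versus naturalness trade-off.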

Example prompt

In the dashboard:

Build me a voice agent pipeline. I want to handle customer support calls in English.
The agent should be helpful and concise. Optimize for latency on a single GPU.

What the optimization run does

Profiles Whisper variants (large-v3, distil-large-v3, turbo) on the chosen GPU
Searches Llama and Qwen quantizations (AWQ, GPTQ, FP8) for the fastest configuration that still meets the configured quality threshold
Picks the TTS model based on the languages you specify and the latency budget
Tunes batch size, KV cache, and speculative decoding for the LLM
Produces an editable receipt with the full stack and projected per-minute cost
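The quantization search above amounts to a constrained selection: keep only the candidates that meet the quality threshold, then take the fastest. A minimal sketch follows, assuming each candidate has already been profiled. The `Candidate` type, the `pick_fastest` function, and every number in the example are hypothetical placeholders, not RunInfra internals or measured results.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    model: str
    quantization: str
    tokens_per_second: float  # decode speed profiled on the target GPU
    quality: float            # evaluation score, higher is better


def pick_fastest(candidates: Sequence[Candidate],
                 min_quality: float) -> Optional[Candidate]:
    """Return the fastest candidate whose quality meets the threshold."""
    eligible = [c for c in candidates if c.quality >= min_quality]
    # default=None covers the case where nothing meets the threshold.
    return max(eligible, key=lambda c: c.tokens_per_second, default=None)


if __name__ == "__main__":
    # Placeholder values for illustration only; not benchmark results.
    profiled = [
        Candidate("llama-3.1-8b-instruct", "fp8", 100.0, 0.95),
        Candidate("llama-3.1-8b-instruct", "awq", 120.0, 0.90),
        Candidate("qwen-2.5-7b-instruct", "gptq", 110.0, 0.96),
    ]
    print(pick_fastest(profiled, min_quality=0.94))
```

Treating quality as a hard constraint and speed as the objective keeps the fastest but least accurate configuration from winning by default.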

Deeper details

For the full marketing page with benchmark tables, the model list, and the latest pricing per minute of audio, see runinfra.ai/use-cases/voice-agent.
