
Voice agent

Voice agents at sub-600ms on a single GPU.

Speech, chat, and TTS open models on one L4. Swap in any compatible Hugging Face model ID.

End-to-end p50
540ms
Throughput lift
3.4x
Starter GPU
1 L4
Accuracy budget
0%

Starter values; the agent retunes them per HF model and GPU.

[Model logos: Mistral, Qwen, Hermes, Meta, GPT, Kimi, Microsoft, DeepSeek]

What you actually own

The optimization knobs, the codebase, and the model choice. None of them locked away.

01

Tune every optimization on your terms.

Kernels, quantization, KV cache, speculation, serving config. The agent runs the full sweep and surfaces every trade-off.
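
A minimal sketch of what such a sweep looks like. The knob values and the run_benchmark harness are illustrative, not RunInfra's actual search space or API:

```python
# Illustrative knob sweep; values are examples, not RunInfra's
# actual search space or API.
from itertools import product

KNOBS = {
    "kernel": ["flash_attention", "flashinfer", "marlin"],
    "quant": ["fp8", "awq", "gptq"],
    "kv_cache": ["fp16", "fp8", "int4"],
    "speculation": ["none", "eagle3", "ngram"],
}

def run_benchmark(config):
    # Stand-in for a real deploy-and-measure step that returns
    # (p50_latency_ms, accuracy_delta_pct). Deterministic fake here.
    h = hash(tuple(sorted(config.items())))
    return 400 + h % 300, (h % 7) / 10

def sweep():
    results = []
    for combo in product(*KNOBS.values()):
        config = dict(zip(KNOBS, combo))
        p50, acc_delta = run_benchmark(config)
        results.append((config, p50, acc_delta))
    # Surface every trade-off: fastest first, accuracy delta visible.
    return sorted(results, key=lambda r: r[1])
```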

02

You own the full codebase, end to end.

Export the Dockerfile, serve script, and config. Run on managed RunInfra or your own GPUs. No proprietary runtime.

03

Any Hugging Face model fits.

Paste any compatible HF model ID. The agent retunes the kernels, quantization, and serving config around it.

What RunInfra tunes for every model

Six engines, every run, retuned per model and GPU.

Kernel sweep

FlashAttention, FlashInfer, Marlin, custom fusion. Correctness + speedup gates on every rewrite.
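
A sketch of what such a gate can look like, with `reference` and `candidate` as stand-ins for a baseline kernel and its rewrite; the tolerances and speedup margin are illustrative:

```python
# Correctness + speedup gate for a rewritten kernel (illustrative).
import time
import torch

def bench(fn, inputs, iters=50):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def gate(reference, candidate, inputs, rtol=1e-3, atol=1e-3, min_speedup=1.05):
    # Correctness gate: the rewrite must match the baseline within tolerance.
    if not torch.allclose(reference(*inputs), candidate(*inputs),
                          rtol=rtol, atol=atol):
        return False
    # Speedup gate: the rewrite must actually be faster by a margin.
    return bench(reference, inputs) / bench(candidate, inputs) >= min_speedup
```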

Quantization

FP8, AWQ, GPTQ, HQQ, INT8 SmoothQuant, NVFP4. Quality scored against your accuracy floor.
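
In pseudocode, the quality gate is a floor check. This sketch assumes a hypothetical evaluate(method) harness returning an accuracy score; a 0% budget means no measurable drop is accepted:

```python
# Score quantized variants against an accuracy floor (sketch).
# `evaluate` is a hypothetical eval harness returning an accuracy score.
def pick_quantization(evaluate, baseline_score, accuracy_budget_pct=0.0):
    candidates = ["fp8", "awq", "gptq", "hqq", "int8_smoothquant", "nvfp4"]
    floor = baseline_score * (1 - accuracy_budget_pct / 100)
    passing = []
    for method in candidates:
        score = evaluate(method)
        if score >= floor:  # quality gate: stay within the accuracy budget
            passing.append((method, score))
    return passing
```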

KV cache

FP8, INT4, and TurboQuant 3-4 bit KV compression. 60 to 75% VRAM savings.
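
The savings follow directly from the bit width. A back-of-envelope sizing, using an illustrative model shape rather than any specific checkpoint's dimensions:

```python
# Back-of-envelope KV cache sizing (illustrative model shape).
def kv_bytes_per_token(layers, kv_heads, head_dim, bits):
    # 2x for the K and V tensors at every layer.
    return 2 * layers * kv_heads * head_dim * bits // 8

fp16 = kv_bytes_per_token(28, 8, 128, bits=16)  # baseline
int4 = kv_bytes_per_token(28, 8, 128, bits=4)   # 4-bit compression
print(f"FP16: {fp16 / 1024:.0f} KiB/token, INT4: {int4 / 1024:.0f} KiB/token")
print(f"savings: {1 - int4 / fp16:.0%}")        # -> 75%
```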

Speculative decoding

EAGLE3, MTP, n-gram lookup, draft model. 1.3 to 2x decode speedup, weights untouched.
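
The core mechanic, sketched with hypothetical draft and target next-token functions and greedy verification. Real schemes like EAGLE3 and MTP differ in how drafts are produced, and engines verify in one batched forward pass rather than token by token:

```python
# Minimal draft-and-verify loop behind speculative decoding (sketch).
# `draft` and `target` are hypothetical next-token samplers.
def speculative_step(draft, target, context, k=4):
    # 1) Draft model proposes k tokens cheaply.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) Target model checks the proposal (token by token here for
    #    clarity); accept the longest prefix it agrees with.
    accepted, ctx = [], list(context)
    for tok in proposal:
        if target(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    # 3) Emit one token from the target itself, so greedy output is
    #    identical to target-only decoding: weights untouched, no drift.
    accepted.append(target(ctx))
    return accepted
```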

Serving config

Continuous batching, chunked prefill, PagedAttention, prefix cache. Tuned across vLLM, SGLang, TensorRT-LLM.
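
Several of these knobs map directly onto vLLM engine arguments. A sketch: the model ID below is hypothetical (substitute a real AWQ checkpoint), and argument support varies by vLLM version:

```python
# One possible vLLM setup touching these knobs. Continuous batching
# and PagedAttention are vLLM defaults.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/voice-chat-3b-awq",  # hypothetical AWQ checkpoint ID
    quantization="awq",                  # weight quantization
    kv_cache_dtype="fp8",                # KV cache compression
    enable_prefix_caching=True,          # prefix cache
    enable_chunked_prefill=True,         # chunked prefill
    gpu_memory_utilization=0.90,         # leave headroom on a 24 GB L4
    max_model_len=4096,
)
print(llm.generate(["Hello!"], SamplingParams(max_tokens=32)))
```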

Multi-cloud capacity

Pareto GPU selection across L4 to B200. Managed or exportable.
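
Pareto selection itself is a small dominance filter. A sketch with illustrative, not measured, latency and cost numbers:

```python
# Pareto front over benchmarked (p50 latency ms, $/hr) points.
# GPU numbers below are illustrative, not measured results.
def pareto_front(points):
    """Keep points not dominated on both latency and cost."""
    front = []
    for name, lat, cost in points:
        dominated = any(
            l <= lat and c <= cost and (l < lat or c < cost)
            for _, l, c in points
        )
        if not dominated:
            front.append((name, lat, cost))
    return front

gpus = [("L4", 540, 0.8), ("L40S", 410, 1.4),
        ("A100", 380, 2.5), ("B200", 290, 6.0)]
print(pareto_front(gpus))  # every GPU here trades latency for cost
```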

Any HF voice model works

Every voice-compatible model on Hugging Face runs through the same recipe. Search the live catalog above. The examples below are just a starting view.

Search the full Hugging Face catalog, then paste a compatible model ID in RunInfra to retune the recipe.

Three ways to ship voice

Most teams choose between speed and control. RunInfra keeps both in one workflow.

RunInfra

Fast path with model control and export.

Launch

Pick model, optimize, deploy

Start quickly and keep the production path open.

Model control

Bring the model ID

Keep model choice and serving decisions visible.

Tuning

Measured latency and GPU cost

Compare serving choices before deployment.

Export

Managed now, export when needed

Use the endpoint first and take the deploy package later.

Operations

Low until you choose to own it

Operate managed, then export with the same measured plan.

Try this pipeline

Edit the model, engine, or GPU inline. Send to retune the stack in the dashboard.

Questions founders usually ask

Need a custom configuration? Talk to our team.

Does this really fit on a single L4?

Yes. Whisper Large V3 in streaming mode uses about 3 GB of VRAM, Llama 3.2 3B Instruct in AWQ-int4 about 2.5 GB, and the Chatterbox vocoder about 2 GB. That totals roughly 7.5 GB of the L4's 24 GB, leaving about 16 GB of headroom for KV cache and concurrent calls. For higher concurrency, move to an L40S (48 GB).
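
A quick way to sanity-check the budget yourself; component sizes are the approximate figures quoted above, in GB:

```python
# The arithmetic behind the answer above (approximate figures).
components = {
    "whisper-large-v3 (streaming)": 3.0,
    "llama-3.2-3b-instruct (AWQ int4)": 2.5,
    "chatterbox vocoder": 2.0,
}
used = sum(components.values())
l4_vram = 24.0
print(f"weights: {used:.1f} GB, headroom: {l4_vram - used:.1f} GB")
# -> weights: 7.5 GB, headroom: 16.5 GB for KV cache + concurrency
```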

Deploy your first optimized model in under 5 minutes

Start Building for Free
RunInfra

Own your AI. We benchmark GPUs, optimize kernels, and deploy open-source models as production APIs.

Start building

© 2026 RunInfra. All rights reserved.