RunInfra exposes both halves of the OpenAI audio API.
Audio requests are billed. Send an X-Client-Request-Id header for tracing and an Idempotency-Key header to protect manual retries. If the first request settles, a retry with the same key will not run or charge a second inference.
POST /v1/audio/speech (TTS)
Generate audio from text. The response is binary: the body is raw audio bytes in the format the upstream model produced.
Parameters
model: TTS model id (e.g. "kokoro", "xtts-v2"). Must be deployed in your workspace.
input: Text to synthesize. Character count drives billing.
voice: Voice preset. Varies per model: alloy, echo, fable, onyx, nova, shimmer for OpenAI-compat models; model-specific speaker ids for others.
response_format: mp3 | wav | flac | opus. Upstream support varies.
speed: Playback speed multiplier, 0.25-4.0.
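As a rough illustration of the parameters above, the sketch below builds a speech request with the Python standard library. The field names follow the OpenAI speech API. The base URL, the API key placeholder, and Bearer-token authentication are assumptions rather than values taken from this page, and the model and voice are example values only.

```python
import json
import urllib.request

# Placeholders: the real base URL and key come from your RunInfra workspace.
BASE_URL = "https://runinfra.example/v1"  # hypothetical host, not from the docs
API_KEY = "YOUR_API_KEY"


def build_speech_request(text, model="kokoro", voice="alloy",
                         response_format="mp3", speed=1.0):
    """Build (but do not send) a POST /v1/audio/speech request."""
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if response_format not in {"mp3", "wav", "flac", "opus"}:
        raise ValueError(f"unsupported response_format: {response_format}")
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the raw audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        audio = resp.read()  # binary body, not JSON
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

Calling synthesize("Hello", "hello.mp3") would write the returned bytes straight to disk, since the body is audio rather than JSON.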
POST /v1/audio/transcriptions (ASR)
Speech-to-text. The request is multipart: file is the audio blob, and model plus any optional fields are sent as form fields.
Parameters
file: Audio file. Supported formats depend on the upstream model; Whisper handles mp3/mp4/mpeg/mpga/m4a/wav/webm.
model: ASR model id (e.g. "whisper-large-v3", "whisper-large-v3-turbo").
language: BCP-47 code (e.g. "en", "es"). If omitted, the model auto-detects.
prompt: Short text hint to bias decoding, useful for domain vocabulary.
response_format: json | text | srt | vtt | verbose_json.
temperature: Sampling temperature for the ASR decoder.
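To make the multipart layout concrete, here is a minimal sketch that encodes the form fields and the file part by hand using only the standard library. The helper name encode_multipart is hypothetical, and the field names are assumed to match the OpenAI transcription API. In practice an HTTP client or the OpenAI SDK does this encoding for you.

```python
import uuid


def encode_multipart(fields, file_name, file_bytes,
                     file_content_type="application/octet-stream"):
    """Encode form fields plus one `file` part as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields: model, language, prompt, response_format, ...
    for name, value in fields.items():
        head = (f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n')
        parts.append(head.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")
    # The audio blob goes in the part named "file".
    file_head = (f"--{boundary}\r\n"
                 f'Content-Disposition: form-data; name="file"; '
                 f'filename="{file_name}"\r\n'
                 f"Content-Type: {file_content_type}\r\n\r\n")
    parts.append(file_head.encode("utf-8") + file_bytes + b"\r\n")
    # Closing boundary marks the end of the body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


fields = {
    "model": "whisper-large-v3",
    "language": "en",
    "response_format": "verbose_json",
}
body, content_type = encode_multipart(fields, "meeting.wav", b"RIFF-placeholder")
```

The returned content_type value goes in the Content-Type header and body is the POST payload for /v1/audio/transcriptions.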
Response
response_format=verbose_json:
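A minimal sketch of reading such a response, assuming it mirrors OpenAI's verbose_json shape: top-level text, language, and duration, plus a segments list whose entries carry start, end, and text. The payload below is made up for illustration and was not captured from a real RunInfra response.

```python
def format_segments(transcript):
    """Render each segment as '[start - end] text', with times in seconds."""
    lines = []
    for seg in transcript.get("segments", []):
        lines.append(
            f"[{seg['start']:.2f} - {seg['end']:.2f}] {seg['text'].strip()}"
        )
    return lines


# Illustrative payload only; real responses may include more fields.
sample = {
    "task": "transcribe",
    "language": "en",
    "duration": 4.2,
    "text": "Hello there. How are you?",
    "segments": [
        {"id": 0, "start": 0.0, "end": 1.8, "text": " Hello there."},
        {"id": 1, "start": 1.8, "end": 4.2, "text": " How are you?"},
    ],
}

for line in format_segments(sample):
    print(line)
```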
Long-form audio
Longer files are chunked internally with overlap, transcribed in parallel batches on the GPU, then stitched back together with timestamp alignment. The OpenAI SDK contract is unchanged: you upload one file and get one transcript.
| Audio length | Behavior |
|---|---|
| Under 30 s | Single forward pass |
| 30 s to 30 min | Chunked with 5 s overlap, stitched |
| 30 min to 2 hr | Chunked + streamed to disk during decode |
| Over 2 hr | Recommended to pre-split client-side, then concatenate transcripts |
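For the over-2-hour case, the client-side split and the final concatenation could look roughly like the sketch below. The one-hour piece length is an arbitrary assumption, the function names are hypothetical, and the transcript dictionaries are assumed to carry text plus an optional segments list with start and end times in seconds. Cutting the audio itself (for example with an external tool) is outside the sketch.

```python
def plan_pieces(total_s, max_piece_s=3600.0):
    """Return (start, end) spans in seconds, each at most max_piece_s long."""
    total_s = float(total_s)
    if total_s <= 0 or max_piece_s <= 0:
        raise ValueError("durations must be positive")
    spans = []
    start = 0.0
    while start < total_s:
        end = min(start + max_piece_s, total_s)
        spans.append((start, end))
        start = end
    return spans


def concatenate(pieces):
    """Join per-piece transcripts into one.

    `pieces` is a list of (offset_s, transcript) pairs in playback order.
    Segment timestamps are shifted by each piece's offset so they refer to
    the original recording rather than to the piece.
    """
    text = " ".join(t["text"].strip() for _, t in pieces)
    segments = []
    for offset, t in pieces:
        for seg in t.get("segments", []):
            segments.append({**seg,
                             "start": seg["start"] + offset,
                             "end": seg["end"] + offset})
    return {"text": text, "segments": segments}
```

A 2.5-hour recording would be planned as three pieces with plan_pieces(9000), each piece transcribed separately, and the results passed to concatenate with their start offsets.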
Diarization and PII redaction
Pipelines built from the Transcription use case accept two extra body fields beyond the OpenAI contract:
| Field | Type | Effect |
|---|---|---|
| diarize | boolean | Adds speaker labels to each segment (Speaker 1, Speaker 2, …) |
| redact | boolean | Replaces detected emails, phones, and names with [REDACTED_*] tokens |
Pass them through extra_body on the OpenAI SDK, or as form fields on raw HTTP.
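Both routes can be sketched as follows. The first function expects an OpenAI Python SDK client that you have already pointed at your RunInfra base URL; the function names are hypothetical, and encoding booleans as the strings "true" and "false" in form fields is an assumption this page does not confirm.

```python
def transcribe_with_extras(client, audio_file, model="whisper-large-v3",
                           diarize=True, redact=True):
    """Call the transcription endpoint with the two RunInfra-specific fields.

    `client` is expected to behave like openai.OpenAI; extra_body carries
    fields that the SDK's typed signature does not include.
    """
    return client.audio.transcriptions.create(
        model=model,
        file=audio_file,
        response_format="verbose_json",
        extra_body={"diarize": diarize, "redact": redact},
    )


def raw_http_form_fields(model="whisper-large-v3", diarize=True, redact=True):
    """The same options as plain form fields for a raw multipart request.

    Assumption: booleans are sent as the strings "true" / "false".
    """
    return {
        "model": model,
        "diarize": str(diarize).lower(),
        "redact": str(redact).lower(),
    }


# Usage with the SDK (not executed here):
# from openai import OpenAI
# client = OpenAI(base_url="<your RunInfra base URL>", api_key="<key>")
# with open("call.wav", "rb") as f:
#     result = transcribe_with_extras(client, f)
```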
Reference-audio TTS
Some TTS models (XTTS-v2, Bark) accept a short reference audio clip to clone the voice. Pass a base64-encoded audio sample via the reference_audio field on a pipeline deployed from the voice-agent use case.
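A minimal sketch of preparing such a request body. It assumes reference_audio sits at the top level of the same JSON body used for POST /v1/audio/speech, which this page does not state explicitly; the helper name and the sample bytes are placeholders.

```python
import base64
import json


def build_clone_payload(text, sample_bytes, model="xtts-v2"):
    """JSON-ready body for a speech request that carries a voice sample."""
    return {
        "model": model,
        "input": text,
        # Base64 text is JSON-safe; raw audio bytes are not.
        "reference_audio": base64.b64encode(sample_bytes).decode("ascii"),
    }


# In real use, read the clip from disk: open("voice.wav", "rb").read()
payload = build_clone_payload("Welcome back.", b"\x00\x01fake-audio-bytes")
body = json.dumps(payload).encode("utf-8")
```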
Idempotency and retries
Audio requests are billed inference, so retries should be guarded with an Idempotency-Key header. A retry with the same key within 24 hours of a successful first call returns the cached response; no second inference runs.
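The retry pattern can be sketched as a small wrapper that generates one key per logical request and reuses it on every attempt. The send callable is a hypothetical stand-in for whatever performs the HTTP call, and reusing the same X-Client-Request-Id across attempts is a design choice of this sketch, not something this page prescribes.

```python
import uuid


def send_with_retries(send, payload, max_attempts=3):
    """Call send(payload, headers) until it succeeds, reusing one key.

    `send` is caller-supplied (hypothetical here): it performs the HTTP
    request and raises ConnectionError or TimeoutError on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    headers = {
        "Idempotency-Key": str(uuid.uuid4()),      # same value on every attempt
        "X-Client-Request-Id": str(uuid.uuid4()),  # for tracing
    }
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(payload, dict(headers))
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc  # keep the error and try again with the same key
    raise last_error
```

Because the key stays constant, a retry that follows a response lost in transit should return the cached result instead of running and charging a second inference. Production code would typically add backoff between attempts.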
Next steps
Voice agent use case: Streaming STT + LLM + TTS at sub-600ms turn-taking.
Transcription use case: Diarization, PII redaction, long-form export.
Models catalog: Whisper variants, XTTS, Kokoro, Bark.
Rate limits: Per-key budgets and the Retry-After header.