Audio - RunInfra

RunInfra exposes both halves of the OpenAI audio API: text-to-speech at POST /v1/audio/speech and speech-to-text at POST /v1/audio/transcriptions.

Audio requests are billed, and each one is sent exactly once. Send X-Client-Request-Id for tracing, and do not rely on Idempotency-Key to deduplicate audio requests.

POST /v1/audio/speech (TTS)

Generate audio from text. The response is binary: the body is the raw audio bytes in the format the upstream model produced.

from openai import OpenAI
client = OpenAI(base_url="https://api.runinfra.ai/v1", api_key="YOUR_RUNINFRA_API_KEY")

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="alloy",
    input="Welcome to RunInfra.",
) as resp:
    resp.stream_to_file("hello.mp3")

import fs from "fs";
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "https://api.runinfra.ai/v1", apiKey: "YOUR_RUNINFRA_API_KEY" });

const resp = await client.audio.speech.create({
  model: "kokoro",
  voice: "alloy",
  input: "Welcome to RunInfra.",
});
const buf = Buffer.from(await resp.arrayBuffer());
fs.writeFileSync("hello.mp3", buf);

curl https://api.runinfra.ai/v1/audio/speech \
  -H "Authorization: Bearer YOUR_RUNINFRA_API_KEY" \
  -H "X-Client-Request-Id: $(uuidgen)" \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","voice":"alloy","input":"Welcome to RunInfra."}' \
  --output hello.mp3

Parameters

Field	Type	Required	Default	Description
`model`	`string`	yes	-	TTS model id (e.g. "kokoro", "xtts-v2"). The model must be deployed in your workspace.
`input`	`string`	yes	-	Text to synthesize. Billing is based on character count.
`voice`	`string`	no	-	Voice preset; the available values depend on the model. OpenAI-compatible models accept alloy, echo, fable, onyx, nova, and shimmer; other models use model-specific speaker ids.
`response_format`	`string`	no	"mp3"	One of mp3, wav, flac, opus. Upstream support varies.
`speed`	`number`	no	1.0	Playback speed multiplier, from 0.25 to 4.0.
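Because every speech request is billed, it can help to check the documented ranges client-side before sending. The following is a minimal sketch, not part of any SDK: `build_speech_request` is a hypothetical helper that validates `response_format` and `speed` against the values listed above and returns keyword arguments suitable for `client.audio.speech.create`.

```python
# Hypothetical client-side validation; the allowed values mirror the parameter list above.
ALLOWED_FORMATS = {"mp3", "wav", "flac", "opus"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def build_speech_request(model, text, voice, response_format="mp3", speed=1.0):
    """Return kwargs for client.audio.speech.create, or raise ValueError."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }


params = build_speech_request("kokoro", "Welcome to RunInfra.", "alloy", speed=1.25)
# Billing is based on character count, so the input length is the billable size.
print(len(params["input"]), params["response_format"], params["speed"])
```

The returned dictionary can be unpacked into the SDK call, for example `client.audio.speech.create(**params)`. Since upstream support for formats varies, passing this check does not guarantee that a given model accepts the format.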

POST /v1/audio/transcriptions (ASR)

Speech-to-text. The request is multipart/form-data: file carries the audio, and model plus any optional parameters are sent as form fields.

with open("meeting.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        language="en",
    )
print(resp.text)

import fs from "fs";
const resp = await client.audio.transcriptions.create({
  model: "whisper-large-v3",
  file: fs.createReadStream("meeting.mp3"),
  language: "en",
});
console.log(resp.text);

curl https://api.runinfra.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_RUNINFRA_API_KEY" \
  -H "X-Client-Request-Id: $(uuidgen)" \
  -F "model=whisper-large-v3" \
  -F "file=@meeting.mp3" \
  -F "language=en"

Parameters

Field	Type	Required	Default	Description
`file`	`binary`	yes	-	Non-empty audio file. Supported formats depend on the upstream model; Whisper handles mp3, mp4, mpeg, mpga, m4a, wav, and webm. Native SDK clients reject empty ASR files before building the multipart upload.
`model`	`string`	yes	-	ASR model id (e.g. "whisper-large-v3", "whisper-large-v3-turbo").
`language`	`string`	no	-	BCP-47 code (e.g. "en", "es"). If omitted, the model auto-detects the language.
`prompt`	`string`	no	-	Short text hint that biases decoding; useful for domain vocabulary.
`response_format`	`string`	no	"json"	One of json, text, srt, vtt, verbose_json.
`temperature`	`number`	no	0	Sampling temperature for the ASR decoder.
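When calling the endpoint through a client that does not perform the empty-file check itself, a similar guard can be applied before the upload. This is a sketch under that assumption: `check_transcription_input` is a hypothetical helper, and the placeholder bytes in the demo are not real audio.

```python
import os
import tempfile

ASR_FORMATS = {"json", "text", "srt", "vtt", "verbose_json"}


def check_transcription_input(path, response_format="json"):
    """Raise ValueError for inputs that the parameter list above rules out."""
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is empty; the file field must be non-empty")
    if response_format not in ASR_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    return path


# Demo on throwaway files so the sketch runs without real audio.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "meeting.mp3")
    with open(good, "wb") as f:
        f.write(b"\x00" * 16)  # placeholder bytes, not real audio
    empty = os.path.join(tmp, "empty.mp3")
    open(empty, "wb").close()

    check_transcription_input(good, "verbose_json")  # passes silently
    try:
        check_transcription_input(empty)
    except ValueError as err:
        print("rejected:", err)
```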

Response

{ "text": "Welcome to the meeting. Today we'll cover..." }

With response_format=verbose_json:

{
  "task": "transcribe",
  "language": "english",
  "duration": 47.3,
  "segments": [
    { "id": 0, "start": 0.0, "end": 3.2, "text": "Welcome to the meeting." },
    ...
  ],
  "text": "..."
}
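The `start`, `end`, and `text` fields of each segment are enough to build subtitles client-side. The sketch below assumes the raw JSON shape shown above (plain dictionaries; SDK response objects may expose the same fields as attributes instead). `srt_timestamp` and `segments_to_srt` are illustrative names, not part of the API.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render verbose_json segments as SRT cues, numbered from 1."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        span = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{span}\n{seg['text'].strip()}")
    return "\n\n".join(cues) + "\n"


# First segment from the verbose_json example above.
segments = [{"id": 0, "start": 0.0, "end": 3.2, "text": "Welcome to the meeting."}]
print(segments_to_srt(segments))
```

For subtitles without client-side work, `response_format=srt` or `response_format=vtt` asks the endpoint for the same information directly.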

Long-form audio

Audio longer than 30 seconds is chunked internally with overlap, transcribed in parallel batches on the GPU, then stitched back together with timestamp alignment. The endpoint keeps the OpenAI-style upload shape: upload one file and get one transcript.

Audio length	Behavior
Under 30 s	Single forward pass
30 s to 30 min	Chunked with 5 s overlap, stitched
30 min to 2 hr	Chunked + streamed to disk during decode
Over 2 hr	Recommended to pre-split client-side, then concatenate transcripts
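The table can be turned into a small client-side planner. In this sketch, `transcription_mode` maps a duration to the matching row, and `plan_presplit` computes part boundaries for audio over 2 hr. Both names are hypothetical, the handling of exact boundary values (exactly 30 s, 30 min, or 2 hr) is an assumption, and so is the one-hour part size, since the table does not say how large pre-split parts should be.

```python
def transcription_mode(duration_s):
    """Map audio length in seconds to the behavior row in the table above.

    Which row an exact boundary value falls into is an assumption.
    """
    if duration_s < 30:
        return "single forward pass"
    if duration_s <= 30 * 60:
        return "chunked with 5 s overlap, stitched"
    if duration_s <= 2 * 60 * 60:
        return "chunked + streamed to disk during decode"
    return "pre-split client-side"


def plan_presplit(duration_s, part_s=60 * 60):
    """Split [0, duration_s) into consecutive parts of at most part_s seconds.

    part_s defaults to one hour, which is an assumed value.
    """
    if part_s <= 0:
        raise ValueError("part_s must be positive")
    parts = []
    start = 0
    while start < duration_s:
        end = min(start + part_s, duration_s)
        parts.append((start, end))
        start = end
    return parts


three_hours = 3 * 60 * 60
print(transcription_mode(three_hours))
for start, end in plan_presplit(three_hours):
    print(f"part from {start} s to {end} s")
```

After each part is transcribed separately, the transcripts are concatenated in order. If segment timestamps are needed, adding each part's `start` offset to its segments restores positions relative to the original recording.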

Diarization and PII redaction

Pipelines built from the Transcription use case accept two extra body fields beyond the OpenAI contract:

Field	Type	Effect
`diarize`	`boolean`	Adds `speaker` labels to each segment (Speaker 1, Speaker 2, …)
`redact`	`boolean`	Replaces detected emails, phones, and names with `[REDACTED_*]` tokens

Pass them via extra_body on the OpenAI SDK or as form fields on raw HTTP:

# Open the file in a context manager so the handle is closed after the upload.
with open("call.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        response_format="verbose_json",
        extra_body={"diarize": True, "redact": True},
    )
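With `diarize` enabled, consecutive segments from the same speaker can be merged into turns for display. This sketch assumes each segment is a dictionary carrying a `speaker` key next to `start`, `end`, and `text`; the exact response shape is not specified here, and the sample segments are made up for illustration.

```python
def speaker_turns(segments):
    """Merge consecutive segments that share a speaker label into turns."""
    turns = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append(
                {"speaker": seg["speaker"], "start": seg["start"], "end": seg["end"], "text": text}
            )
    return turns


# Hypothetical diarized segments, shaped like verbose_json plus a speaker field.
sample = [
    {"start": 0.0, "end": 2.0, "speaker": "Speaker 1", "text": "Thanks for calling."},
    {"start": 2.0, "end": 4.5, "speaker": "Speaker 1", "text": "How can I help?"},
    {"start": 4.5, "end": 7.0, "speaker": "Speaker 2", "text": "I have a billing question."},
]
for turn in speaker_turns(sample):
    print(f"{turn['speaker']} [{turn['start']:.1f}-{turn['end']:.1f}]: {turn['text']}")
```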

Reference-audio TTS

Some TTS models (XTTS-v2, Bark) accept a short reference audio clip to clone the voice. Pass a base64-encoded audio sample via the reference_audio field on a pipeline deployed from the voice-agent use case:

import base64

with open("speaker_sample.wav", "rb") as f:
    sample_b64 = base64.b64encode(f.read()).decode()

audio = client.audio.speech.create(
    model="xtts-v2",
    voice="cloned",
    input="Hello from a voice that did not exist five seconds ago.",
    extra_body={"reference_audio": sample_b64},
)

The reference clip should be 3 to 10 seconds of clean speech in the target voice. Longer clips are truncated; noisy clips degrade the output.
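The 3 to 10 second guideline can be checked locally before the clip is encoded and sent. This is a minimal sketch that uses the standard-library `wave` module, so it assumes an uncompressed PCM WAV file; `reference_clip_status` is a hypothetical helper, and it cannot detect noise.

```python
import os
import tempfile
import wave

MIN_CLIP_S, MAX_CLIP_S = 3.0, 10.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_clip_status(path):
    """Classify a clip against the 3 to 10 second guideline above."""
    duration = wav_duration_seconds(path)
    if duration < MIN_CLIP_S:
        return "too short"
    if duration > MAX_CLIP_S:
        return "will be truncated"
    return "ok"


# Demo: write 5 s of silence so the sketch runs without a real recording.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "speaker_sample.wav")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(8000)
        wav.writeframes(b"\x00\x00" * 8000 * 5)
    print(wav_duration_seconds(path), reference_clip_status(path))
```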

Retries

Audio requests are billed inference and are not replay-cached by the gateway. Native SDK clients send TTS and ASR requests once even when you provide an idempotency key. Do not rely on Idempotency-Key to deduplicate audio requests. Use X-Client-Request-Id for tracing, and make manual retries only when your application can tolerate a second inference.

curl https://api.runinfra.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_RUNINFRA_API_KEY" \
  -H "X-Client-Request-Id: $(uuidgen)" \
  -F "model=whisper-large-v3" \
  -F "file=@call.mp3"
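The same pattern can be applied from the OpenAI Python SDK, which retries certain failures automatically by default. The sketch below assumes `client` is an `openai.OpenAI` instance pointed at RunInfra; it disables the SDK's retries with `with_options(max_retries=0)` and attaches a fresh X-Client-Request-Id through `extra_headers`. `tracing_headers` and `transcribe_once` are hypothetical helper names.

```python
import uuid


def tracing_headers():
    """Build a fresh X-Client-Request-Id, the Python analogue of $(uuidgen)."""
    return {"X-Client-Request-Id": str(uuid.uuid4())}


def transcribe_once(client, path, model="whisper-large-v3"):
    """Send one transcription request with SDK-level automatic retries disabled.

    With max_retries=0, a transient failure surfaces as an exception instead
    of triggering a second billed inference.
    """
    headers = tracing_headers()
    with open(path, "rb") as f:
        result = client.with_options(max_retries=0).audio.transcriptions.create(
            model=model,
            file=f,
            extra_headers=headers,
        )
    # Return the request id so it can be logged next to the transcript.
    return headers["X-Client-Request-Id"], result


print(tracing_headers())
```

A call such as `request_id, transcript = transcribe_once(client, "call.mp3")` returns the id alongside the transcript. Whether to retry after an exception remains an application decision, since a retry is a second inference.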

Next steps

Voice agent use case

Streaming STT + LLM + TTS at sub-600ms turn-taking.

Transcription use case

Diarization, PII redaction, long-form export.

Models catalog

Whisper variants, XTTS, Kokoro, Bark.

Rate limits

Per-key budgets and the Retry-After header.