RunInfra exposes both halves of the OpenAI audio API.
Audio requests are billed. Send X-Client-Request-Id for tracing and Idempotency-Key to protect manual retries. If the first request settles, a retry with the same key does not run or charge a second inference.

POST /v1/audio/speech (TTS)

Generates audio from text. The response is binary: the body contains raw audio bytes in the format the upstream model produced.
from openai import OpenAI
client = OpenAI(base_url="https://api.runinfra.ai/v1", api_key="YOUR_RUNINFRA_API_KEY")

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="alloy",
    input="Welcome to RunInfra.",
) as resp:
    resp.stream_to_file("hello.mp3")

Parameters

model
string
required
TTS model id (e.g. "kokoro", "xtts-v2"). Must be deployed in your workspace.
input
string
required
Text to synthesize. Character-count drives billing.
voice
string
Voice preset. Options vary per model: alloy, echo, fable, onyx, nova, shimmer for OpenAI-compatible models; model-specific speaker ids for others.
response_format
string
default:"mp3"
mp3 | wav | flac | opus. Upstream support varies.
speed
number
default:"1.0"
Playback speed multiplier, 0.25-4.0.
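Because each speech request is billed, it can help to check the constraints above before sending anything. The following is a minimal sketch, not part of the RunInfra or OpenAI SDK: build_speech_request is a hypothetical helper, and it assumes the format list and speed range above are enforced as written and that the billable character count is simply the length of the input string.

```python
ALLOWED_FORMATS = {"mp3", "wav", "flac", "opus"}


def build_speech_request(model, text, voice=None, response_format="mp3", speed=1.0):
    """Validate TTS arguments and return (kwargs, character count).

    Hypothetical helper: the limits mirror the parameter list above, and
    len(text) is only an assumption about how characters are counted.
    """
    if not text:
        raise ValueError("input must be a non-empty string")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")

    kwargs = {
        "model": model,
        "input": text,
        "response_format": response_format,
        "speed": speed,
    }
    if voice is not None:
        kwargs["voice"] = voice
    return kwargs, len(text)
```

The returned kwargs can be unpacked into client.audio.speech.with_streaming_response.create(**kwargs). Since upstream support varies, a format that passes this check may still be rejected by a particular model.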

POST /v1/audio/transcriptions (ASR)

Speech-to-text. The request is multipart: file carries the audio blob, and model plus any optional fields are sent as form fields.
with open("meeting.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        language="en",
    )
print(resp.text)

Parameters

file
binary
required
Audio file. Supported formats depend on the upstream model; Whisper handles mp3, mp4, mpeg, mpga, m4a, wav, and webm.
model
string
required
ASR model id (e.g. "whisper-large-v3", "whisper-large-v3-turbo").
language
string
BCP-47 code (e.g. "en", "es"). If omitted, the model auto-detects the language.
prompt
string
Short text hint to bias decoding, useful for domain vocabulary.
response_format
string
default:"json"
json | text | srt | vtt | verbose_json.
temperature
number
default:"0"
Sampling temperature for the ASR decoder.

Response

{ "text": "Welcome to the meeting. Today we'll cover..." }
With response_format=verbose_json:
{
  "task": "transcribe",
  "language": "english",
  "duration": 47.3,
  "segments": [
    { "id": 0, "start": 0.0, "end": 3.2, "text": "Welcome to the meeting." },
    ...
  ],
  "text": "..."
}
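The segments list is the part of verbose_json most often post-processed. As an illustration, the sketch below turns it into SRT cues on the client. It assumes the transcript is available as a plain dict with the shape shown above; to_timestamp and segments_to_srt are hypothetical helper names, and requesting response_format=srt directly is the simpler route when no other processing is needed.

```python
def to_timestamp(seconds):
    """Format a number of seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(transcript):
    """Build SRT text from the "segments" list of a verbose_json transcript."""
    cues = []
    for number, seg in enumerate(transcript["segments"], start=1):
        start = to_timestamp(seg["start"])
        end = to_timestamp(seg["end"])
        cues.append(f"{number}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)


example = {
    "segments": [
        {"id": 0, "start": 0.0, "end": 3.2, "text": "Welcome to the meeting."},
    ],
}
print(segments_to_srt(example))
```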

Long-form audio

Files over a few minutes are chunked internally with overlap, transcribed in parallel batches on the GPU, then stitched back together with timestamp alignment. The OpenAI SDK contract is unchanged: you upload one file and get one transcript.
| Audio length | Behavior |
| --- | --- |
| Under 30 s | Single forward pass |
| 30 s to 30 min | Chunked with 5 s overlap, stitched |
| 30 min to 2 hr | Chunked + streamed to disk during decode |
| Over 2 hr | Recommended to pre-split client-side, then concatenate transcripts |
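For recordings over 2 hr, the client-side split and concatenation could look roughly like the sketch below. It is not RunInfra's internal stitching: plan_chunks and merge_transcripts are hypothetical helpers, the one-hour chunk size is an arbitrary assumption, the split is non-overlapping, and each part is assumed to be a verbose_json dict. Cutting the audio itself (for example with an external tool) is out of scope here.

```python
def plan_chunks(duration_s, chunk_s=3600.0):
    """Return (start, end) pairs in seconds covering the recording without overlap."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


def merge_transcripts(parts):
    """Concatenate per-chunk transcripts into one.

    parts is a list of (offset_s, transcript_dict) in playback order, where
    offset_s is the chunk's start time in the original recording. Segment
    timestamps are shifted by the offset and ids are renumbered.
    """
    texts, segments = [], []
    for offset, part in parts:
        texts.append(part["text"].strip())
        for seg in part["segments"]:
            segments.append({
                "id": len(segments),
                "start": round(seg["start"] + offset, 3),
                "end": round(seg["end"] + offset, 3),
                "text": seg["text"],
            })
    return {"text": " ".join(texts), "segments": segments}
```

A hard cut can split a word at the boundary, so choosing split points in pauses tends to give cleaner joins than fixed intervals.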

Diarization and PII redaction

Pipelines built from the Transcription use case accept two extra body fields beyond the OpenAI contract:
| Field | Type | Effect |
| --- | --- | --- |
| diarize | boolean | Adds speaker labels to each segment (Speaker 1, Speaker 2, …) |
| redact | boolean | Replaces detected emails, phones, and names with [REDACTED_*] tokens |
Pass them via extra_body on the OpenAI SDK or as form fields on raw HTTP:
# Open the file in a with-block so the handle is closed after the upload.
with open("call.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        response_format="verbose_json",
        extra_body={"diarize": True, "redact": True},
    )
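Once segments carry speaker labels, a common next step is to merge consecutive segments from the same speaker into turns. The sketch below shows one way to do that. The name of the label field is not specified on this page, so the "speaker" key used here is an assumption, as is the speaker_turns helper itself; adjust speaker_key to match the actual response.

```python
from itertools import groupby


def speaker_turns(segments, speaker_key="speaker"):
    """Merge consecutive segments that share a speaker label into turns.

    speaker_key is an assumed field name; segments without it are grouped
    under "Unknown".
    """
    turns = []
    for speaker, group in groupby(segments, key=lambda s: s.get(speaker_key, "Unknown")):
        group = list(group)
        turns.append({
            "speaker": speaker,
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(s["text"].strip() for s in group),
        })
    return turns


# Hand-written sample data, not real API output.
sample = [
    {"start": 0.0, "end": 2.0, "text": "Hi there.", "speaker": "Speaker 1"},
    {"start": 2.0, "end": 4.5, "text": "Thanks for joining.", "speaker": "Speaker 1"},
    {"start": 4.5, "end": 6.0, "text": "Happy to be here.", "speaker": "Speaker 2"},
]
for turn in speaker_turns(sample):
    print(f"{turn['speaker']}: {turn['text']}")
```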

Reference-audio TTS

Some TTS models (XTTS-v2, Bark) accept a short reference audio clip to clone the voice. Pass a base64-encoded audio sample via the reference_audio field on a pipeline deployed from the voice-agent use case:
import base64

with open("speaker_sample.wav", "rb") as f:
    sample_b64 = base64.b64encode(f.read()).decode()

audio = client.audio.speech.create(
    model="xtts-v2",
    voice="cloned",
    input="Hello from a voice that did not exist five seconds ago.",
    extra_body={"reference_audio": sample_b64},
)
The reference clip should be 3 to 10 seconds of clean speech in the target voice. Longer clips are truncated; noisy clips degrade the output.
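The length guideline can be checked locally before the clip is encoded and sent. This is a minimal sketch using only the Python standard library; prepare_reference_clip is a hypothetical helper, it only reads uncompressed WAV input, and the 3 s and 10 s thresholds are taken from the guidance above rather than from any documented server-side validation.

```python
import base64
import io
import wave


def prepare_reference_clip(wav_bytes, min_s=3.0, max_s=10.0):
    """Return (base64 string, duration in seconds, status) for a WAV clip.

    status is "ok", "too_short", or "will_be_truncated".
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()

    if duration < min_s:
        status = "too_short"
    elif duration > max_s:
        status = "will_be_truncated"
    else:
        status = "ok"

    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return encoded, duration, status
```

The encoded string is what would go into extra_body={"reference_audio": ...}. Noise cannot be detected this way, so listening to the clip remains the practical check for quality.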

Idempotency and retries

Audio requests are billed inference, so retries should be guarded with an Idempotency-Key header. A retry with the same key within 24 hours of a successful first call returns the cached response, and no second inference runs.
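# Generate the key once and reuse the same value on every retry of this request.
IDEMPOTENCY_KEY="$(uuidgen)"
curl https://api.runinfra.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_RUNINFRA_API_KEY" \
  -H "Idempotency-Key: $IDEMPOTENCY_KEY" \
  -F "model=whisper-large-v3" \
  -F "file=@call.mp3"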
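The protection only works if every retry of a request carries the same key. A minimal Python sketch of that pattern follows. call_with_idempotency is a hypothetical helper, not an SDK function; it assumes the caller supplies a send(headers) callable, reuses the same X-Client-Request-Id across attempts (an assumption), and retries on any exception, which real code would narrow to timeouts and server errors.

```python
import time
import uuid


def call_with_idempotency(send, attempts=3, backoff_s=1.0, sleep=time.sleep):
    """Call send(headers) up to `attempts` times with one shared Idempotency-Key."""
    headers = {
        # Generated once, then reused unchanged on every retry.
        "Idempotency-Key": str(uuid.uuid4()),
        # Tracing id; keeping it constant across retries is an assumption.
        "X-Client-Request-Id": str(uuid.uuid4()),
    }
    for attempt in range(1, attempts + 1):
        try:
            return send(headers)
        except Exception:
            if attempt == attempts:
                raise
            # Exponential backoff: backoff_s, 2 * backoff_s, 4 * backoff_s, ...
            sleep(backoff_s * 2 ** (attempt - 1))
```

With the OpenAI Python SDK, send can be a function that passes the headers through the extra_headers argument of client.audio.transcriptions.create. If the same file object is reused across attempts, it needs to be rewound with f.seek(0) before each call.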

Next steps

Voice agent use case

Streaming STT + LLM + TTS at sub-600ms turn-taking.

Transcription use case

Diarization, PII redaction, long-form export.

Models catalog

Whisper variants, XTTS, Kokoro, Bark.

Rate limits

Per-key budgets and the Retry-After header.