RunInfra serves text-to-image diffusion models through a Diffusers-based runtime with torchao FP8 dynamic quantization and torch.compile on Ada Lovelace (L4, L40S), Hopper (H100, H200), and Blackwell (B200) GPUs. Older Ampere cards (A10G, A100) fall back to FP16, because Ampere has no native FP8 kernels and FP8 would only be emulated there. Pick a model, choose a mode, and the agent provisions a RunPod endpoint with the right base image, scheduler, and step count for the model family. The OpenAI-compatible /v1/images/generations endpoint returns base64-encoded PNG bytes, in the same b64_json format as the OpenAI Images API.

Models

| Model | Provider | License | Parameter count | VRAM (FP16) | Default steps | Default guidance |
|---|---|---|---|---|---|---|
| FLUX.1 Schnell | Black Forest Labs | Apache-2.0 | 12B | 24GB | 4 | 0.0 |
| FLUX.1 Dev | Black Forest Labs | FLUX-NC-1.0.1 (non-commercial) | 12B | 24GB | 28 | 3.5 |
| Stable Diffusion 3.5 Large | Stability AI | Stability Community (gated above $1M ARR) | 8B | 16GB | 28 | 4.5 |
| SDXL Turbo | Stability AI | Stability Community | 3.5B | 7GB | 1 | 0.0 |
| SDXL Lightning (4-step) | ByteDance | OpenRAIL-M | 3.5B | 7GB | 4 | 0.0 |

FLUX.1 Dev is licensed for non-commercial use only. RunInfra hosts the weights, but you are responsible for license compliance. The config tab surfaces a banner when you pick FLUX-Dev so the obligation is hard to miss.

Mode picker

The pipeline config exposes two modes, designed so you don’t need to know what num_inference_steps means:
  • Realtime: picks a distilled few-step model (FLUX-Schnell or SDXL-Lightning at 4 steps, SDXL-Turbo at 1 step) with guidance_scale=0. Target latency: under 1.5 seconds on L40S with FP8 + compile.
  • Quality: picks a non-distilled model (FLUX-Dev, SD3.5-Large) with full step count and guidance. Target latency: 3 to 5 seconds on L40S.
Power users can open the Advanced disclosure to override num_inference_steps, guidance_scale, scheduler (FlowMatchEuler for FLUX/SD3, DPM++ for SDXL), seed, and negative prompt directly.
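
As a rough illustration of what the two modes resolve to, the sketch below maps a mode and a model to the step count and guidance listed in the Models table. `MODEL_DEFAULTS` and `resolve_mode` are hypothetical names made up for this example, not RunInfra identifiers; only the numeric values come from the table.

```python
# Hypothetical helper: names are illustrative, not part of the RunInfra API.
# Values are copied from the Models table (default steps, default guidance).
MODEL_DEFAULTS = {
    # model name: (mode, num_inference_steps, guidance_scale)
    "FLUX.1 Schnell": ("realtime", 4, 0.0),
    "SDXL Turbo": ("realtime", 1, 0.0),
    "SDXL Lightning (4-step)": ("realtime", 4, 0.0),
    "FLUX.1 Dev": ("quality", 28, 3.5),
    "Stable Diffusion 3.5 Large": ("quality", 28, 4.5),
}


def resolve_mode(mode: str, model: str) -> dict:
    """Return the step count and guidance that a mode implies for a model."""
    model_mode, steps, guidance = MODEL_DEFAULTS[model]
    if model_mode != mode:
        raise ValueError(f"{model} is a {model_mode} model, not {mode}")
    return {"num_inference_steps": steps, "guidance_scale": guidance}
```

For example, `resolve_mode("quality", "FLUX.1 Dev")` yields 28 steps at guidance 3.5, while asking for a realtime FLUX.1 Dev raises an error because that model is not distilled.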

Latency tiers (FLUX-Schnell, 1024x1024, 4 steps; quantization per GPU as listed)

| GPU | Family | Quant | First inference (cold) | Steady-state p50 |
|---|---|---|---|---|
| L4 | Ada | FP8 + offload | ~150s | ~2.5s |
| A10G | Ampere | FP16 (no native FP8) | ~120s | ~2.0s |
| L40S | Ada | FP8 + compile | ~90s | ~1.0s |
| A100-40GB | Ampere | FP16 + compile | ~100s | ~1.6s |
| H100 | Hopper | FP8 + compile | ~90s | ~0.6s |

Cold-start cost includes weights pull from Hugging Face (24GB for FLUX) plus the torch.compile warmup pass. After the first request, subsequent calls hit the warm path. The compile cache is mounted on the RunPod network volume, so a worker that scales back to zero and respawns reuses the prior compile artifacts.
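
To see how quickly the cold start amortizes, a back-of-the-envelope calculation can help. The sketch below assumes the first request on a fresh worker pays the cold figure and every later request takes roughly the steady-state p50; that is a simplification (p50 is a median, not a mean), and `mean_latency_s` is a hypothetical helper for this example.

```python
def mean_latency_s(cold_s: float, p50_s: float, n_requests: int) -> float:
    """Average per-request latency when only the first request is cold.

    Simplifying assumption: every warm request takes about p50_s seconds.
    """
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    return (cold_s + (n_requests - 1) * p50_s) / n_requests


# L40S row from the latency table: ~90s cold, ~1.0s steady-state p50.
print(mean_latency_s(90.0, 1.0, 100))  # (90 + 99 * 1.0) / 100 = 1.89
```

Under these assumptions, 100 requests on a single L40S worker average just under 2 seconds each, and the cold-start share keeps shrinking as the worker stays warm.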

API

Send a POST to the deployed endpoint:
curl https://YOUR-ENDPOINT.runinfra.ai/v1/images/generations \
  -H "Authorization: Bearer $RUNINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "black-forest-labs/FLUX.1-schnell",
    "prompt": "A photorealistic portrait of an astronaut on Mars at golden hour",
    "size": "1024x1024",
    "n": 1,
    "response_format": "b64_json"
  }'
The response follows the OpenAI Images API shape (created, data) and adds the top-level fields inference_ms, width, height, and model:
{
  "created": 1714512000,
  "data": [
    { "b64_json": "iVBORw0KGgoAAAANSUhEU..." }
  ],
  "inference_ms": 952.7,
  "width": 1024,
  "height": 1024,
  "model": "black-forest-labs/FLUX.1-schnell"
}
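
The same call can be made from Python. The following is a minimal sketch using only the standard library, assuming the request and response shapes shown above. The helper names (`build_request`, `decode_images`, `generate`) and the 300-second timeout are choices made for this example; the timeout is picked only so that a first request can outlast the 240-second cold-start budget.

```python
import base64
import json
import urllib.request


def build_request(endpoint: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build the POST request shown in the curl example."""
    body = {
        "model": "black-forest-labs/FLUX.1-schnell",
        "prompt": prompt,
        "size": "1024x1024",
        "n": 1,
        "response_format": "b64_json",
    }
    return urllib.request.Request(
        endpoint.rstrip("/") + "/v1/images/generations",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def decode_images(response: dict) -> list[bytes]:
    """Turn each b64_json entry in the response into raw PNG bytes."""
    return [base64.b64decode(item["b64_json"]) for item in response["data"]]


def generate(endpoint: str, api_key: str, prompt: str, out_prefix: str = "image") -> None:
    """Call the endpoint and write each returned image to <out_prefix>_<i>.png."""
    request = build_request(endpoint, api_key, prompt)
    # Assumed timeout: generous enough to cover a cold worker.
    with urllib.request.urlopen(request, timeout=300) as resp:
        payload = json.load(resp)
    for i, png in enumerate(decode_images(payload)):
        with open(f"{out_prefix}_{i}.png", "wb") as f:
            f.write(png)
```

Calling `generate("https://YOUR-ENDPOINT.runinfra.ai", api_key, "an astronaut on Mars")` with your own endpoint and key would write `image_0.png` to the working directory.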

RunInfra extensions

The runtime accepts these optional fields beyond the OpenAI spec:
| Field | Type | Bounds | Default |
|---|---|---|---|
| num_inference_steps | int | 1 to 50 | 4 (FLUX-Schnell), 28 (FLUX-Dev), 20 (SDXL), 1 (SDXL-Turbo), 4 (SDXL-Lightning) |
| guidance_scale | float | 0 to 20 | family default |
| scheduler | string | flow-match-euler, dpm++, euler-a, heun, ddim | family default |
| seed | int | 0 to 2^31 - 1 | unset (non-deterministic) |
| negative_prompt | string | up to 2000 chars | empty |
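
Checking these fields on the client before sending a request can save a round trip. The sketch below mirrors the documented bounds; `validate_extensions` and `SCHEDULERS` are hypothetical names for this example, and the server remains the source of truth for what it accepts.

```python
# Scheduler names as listed in the extensions table.
SCHEDULERS = {"flow-match-euler", "dpm++", "euler-a", "heun", "ddim"}


def validate_extensions(fields: dict) -> dict:
    """Check optional extension fields against the documented bounds.

    Returns the fields unchanged, or raises ValueError listing every problem.
    """
    errors = []
    steps = fields.get("num_inference_steps")
    if steps is not None and not 1 <= steps <= 50:
        errors.append("num_inference_steps must be 1 to 50")
    guidance = fields.get("guidance_scale")
    if guidance is not None and not 0 <= guidance <= 20:
        errors.append("guidance_scale must be 0 to 20")
    scheduler = fields.get("scheduler")
    if scheduler is not None and scheduler not in SCHEDULERS:
        errors.append(f"unknown scheduler {scheduler!r}")
    seed = fields.get("seed")
    if seed is not None and not 0 <= seed <= 2**31 - 1:
        errors.append("seed must be 0 to 2^31 - 1")
    negative = fields.get("negative_prompt")
    if negative is not None and len(negative) > 2000:
        errors.append("negative_prompt must be at most 2000 characters")
    if errors:
        raise ValueError("; ".join(errors))
    return fields
```

Fields that are absent are left alone, so the family defaults still apply on the server side.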

Optimization

When you trigger optimization on an image-gen pipeline, RunInfra runs a candidate sweep:
  1. Baseline FP16 is always measured.
  2. FP8 + compile is added when the target GPU is Ada / Hopper / Blackwell. The candidate generator skips it on Ampere because torchao FP8 falls back to BF16 W8A16 there with no speedup.
  3. The Pareto winner (lowest p50 latency under the steady-state batch) is selected and recorded in the optimization receipt.
Check the receipt chips after deploy:
4 steps / gs 0 / FlowMatch / FP8 dynamic / compiled
Each chip reflects a real measurement, not a label. The model-selection knobs (num_inference_steps, guidance_scale, scheduler) are deliberately NOT swept; those are product knobs you set, not optimization axes the system explores.
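
The sweep described above can be pictured with a small sketch. This is not RunInfra's implementation: `candidates` and `pick_winner` are hypothetical names, the candidate labels are invented for illustration, and the selection rule is reduced to "lowest measured p50".

```python
# GPU families where the sweep adds the FP8 + compile candidate.
FP8_FAMILIES = {"ada", "hopper", "blackwell"}


def candidates(gpu_family: str) -> list[str]:
    """Baseline FP16 is always a candidate; FP8 + compile only on FP8-capable GPUs."""
    result = ["fp16"]
    if gpu_family.lower() in FP8_FAMILIES:
        result.append("fp8+compile")
    return result


def pick_winner(p50_ms: dict[str, float]) -> str:
    """Return the candidate with the lowest measured steady-state p50."""
    return min(p50_ms, key=p50_ms.get)
```

On an Ampere GPU, `candidates("ampere")` returns only the FP16 baseline, so the baseline wins by default. On Hopper, both candidates are measured and `pick_winner` chooses whichever p50 is lower.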

Pricing

Image generation is priced per image, not per token. The deploy form shows the per-image cost based on the model + GPU pairing you select. Prompt input is included in the per-image price; you don’t pay separately for tokens.

Limits and constraints

  • Image dimensions: 64 to 2048 pixels per side. The runtime rejects requests outside this range with HTTP 422.
  • Steps: capped at 50. Distilled models reject anything above their training contract via the family-default ceiling.
  • n (images per request): capped at 4.
  • Cold-start budget: 240 seconds. If a worker can’t load weights and warm up within 4 minutes, the deploy is marked failed.
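
The size and n limits are easy to check before a request leaves the client. The sketch below assumes the "WIDTHxHEIGHT" size format from the API example and treats 1 as the minimum for n, which the list above does not state. `check_limits` is a hypothetical helper name.

```python
def check_limits(size: str, n: int = 1) -> tuple[int, int]:
    """Parse a 'WIDTHxHEIGHT' string and check it against the documented limits."""
    width_s, sep, height_s = size.lower().partition("x")
    if not sep or not width_s.isdecimal() or not height_s.isdecimal():
        raise ValueError(f"size must look like '1024x1024', got {size!r}")
    width, height = int(width_s), int(height_s)
    for side in (width, height):
        # Documented range: 64 to 2048 pixels per side.
        if not 64 <= side <= 2048:
            raise ValueError("each side must be 64 to 2048 pixels")
    # Documented cap is 4; the lower bound of 1 is an assumption.
    if not 1 <= n <= 4:
        raise ValueError("n must be 1 to 4")
    return width, height
```

A request that fails this check would be rejected by the runtime anyway (HTTP 422 for out-of-range dimensions), so the helper only moves the failure earlier.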

What’s not yet supported

  • Image-to-image and inpainting: not supported; the runtime is text-to-image only.
  • ControlNet, IP-Adapter, LoRA: not supported; base models only. Custom fine-tunes are on the roadmap.
  • Hyper-FLUX 8-step LoRA: research-only license; deferred until a commercial path is clear.
  • GGUF / INT8 quantization: FP16 and FP8 dynamic cover the production GPU fleet today.
  • ComfyUI graphs: the v1 runtime is Diffusers plus a thin FastAPI layer; ComfyUI is on the roadmap as an opt-in.

See also

  • Quickstart: pick a model, click deploy, hit the API.
  • Models: full catalog including LLMs, ASR, TTS, embeddings.
  • GPU pricing: per-GPU hourly rates for the active tier.
  • Optimization: how RunInfra picks the winning candidate per pipeline.