RunInfra serves text-to-image diffusion models through a Diffusers-based runtime with torchao FP8 dynamic quantization and torch.compile on Ada Lovelace (L4, L40S), Hopper (H100, H200), and Blackwell (B200) GPUs. Older Ampere cards (A10G, A100) fall back to FP16: they have no native FP8 kernel, so FP8 there would only be emulated.
Pick a model, choose a mode, and the agent provisions a RunPod endpoint with the right base image, the right scheduler, and the right step count for the model family. The OpenAI-compatible /v1/images/generations endpoint returns base64 PNG bytes, exactly like the OpenAI Images API.
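The FP8/FP16 split described above can be sketched as a small lookup. This is an illustration only: `pick_precision` and the two dictionaries are hypothetical names, not part of the RunInfra API, and the GPU lists are copied from this page.

```python
# Hypothetical helper: mirrors the precision split described on this page.
# GPU-to-family mapping is taken from the text above, not from a RunInfra API.
FP8_NATIVE = {
    "L4": "Ada Lovelace",
    "L40S": "Ada Lovelace",
    "H100": "Hopper",
    "H200": "Hopper",
    "B200": "Blackwell",
}
FP16_FALLBACK = {"A10G": "Ampere", "A100": "Ampere"}


def pick_precision(gpu: str) -> str:
    """Return "fp8" for GPUs with native FP8 kernels, "fp16" for Ampere."""
    # Drop a memory suffix such as "-40GB" so "A100-40GB" matches "A100".
    name = gpu.upper().split("-")[0]
    if name in FP8_NATIVE:
        return "fp8"
    if name in FP16_FALLBACK:
        return "fp16"
    raise ValueError(f"GPU not listed on this page: {gpu}")


print(pick_precision("L40S"))       # fp8
print(pick_precision("A100-40GB"))  # fp16
```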
Models
| Model | Provider | License | Parameter count | VRAM (FP16) | Default steps | Default guidance |
|---|---|---|---|---|---|---|
| FLUX.1 Schnell | Black Forest Labs | Apache-2.0 | 12B | 24GB | 4 | 0.0 |
| FLUX.1 Dev | Black Forest Labs | FLUX-NC-1.0.1 (non-commercial) | 12B | 24GB | 28 | 3.5 |
| Stable Diffusion 3.5 Large | Stability AI | Stability Community (gated above $1M ARR) | 8B | 16GB | 28 | 4.5 |
| SDXL Turbo | Stability AI | Stability Community | 3.5B | 7GB | 1 | 0.0 |
| SDXL Lightning (4-step) | ByteDance | OpenRAIL-M | 3.5B | 7GB | 4 | 0.0 |
FLUX.1 Dev is licensed for non-commercial use only. RunInfra hosts the weights, but you are responsible for license compliance. The config tab surfaces a banner when you pick FLUX-Dev so the obligation is hard to miss.
Mode picker
The pipeline config exposes two modes, designed so you don’t need to know what num_inference_steps means:
- Realtime: picks a few-step distilled model (FLUX-Schnell, SDXL-Turbo, SDXL-Lightning; 1 to 4 steps) with guidance_scale=0. Target latency: under 1.5 seconds on L40S with FP8 + compile.
- Quality: picks a non-distilled model (FLUX-Dev, SD3.5-Large) with full step count and guidance. Target latency: 3 to 5 seconds on L40S.
Power users can open the Advanced disclosure to override num_inference_steps, guidance_scale, scheduler (FlowMatchEuler for FLUX/SD3, DPM++ for SDXL), seed, and negative prompt directly.
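As a rough sketch of how a mode plus Advanced overrides could resolve to concrete parameters, assuming the defaults from the Models table and the scheduler pairing named above. `resolve_config`, `FAMILY_DEFAULTS`, and `MODE_MODELS` are hypothetical names for illustration, not the actual pipeline config schema.

```python
# Defaults copied from the Models table; scheduler pairing follows the text
# (FlowMatchEuler for FLUX/SD3, DPM++ for SDXL). All names are illustrative.
FAMILY_DEFAULTS = {
    "FLUX.1 Schnell": {"num_inference_steps": 4, "guidance_scale": 0.0, "scheduler": "flow-match-euler"},
    "SDXL Turbo": {"num_inference_steps": 1, "guidance_scale": 0.0, "scheduler": "dpm++"},
    "SDXL Lightning (4-step)": {"num_inference_steps": 4, "guidance_scale": 0.0, "scheduler": "dpm++"},
    "FLUX.1 Dev": {"num_inference_steps": 28, "guidance_scale": 3.5, "scheduler": "flow-match-euler"},
    "Stable Diffusion 3.5 Large": {"num_inference_steps": 28, "guidance_scale": 4.5, "scheduler": "flow-match-euler"},
}
MODE_MODELS = {
    "realtime": ("FLUX.1 Schnell", "SDXL Turbo", "SDXL Lightning (4-step)"),
    "quality": ("FLUX.1 Dev", "Stable Diffusion 3.5 Large"),
}


def resolve_config(mode: str, model: str, **overrides) -> dict:
    """Start from the family defaults, then apply any Advanced overrides."""
    if mode not in MODE_MODELS:
        raise ValueError(f"unknown mode: {mode}")
    if model not in MODE_MODELS[mode]:
        raise ValueError(f"{model} is not offered in {mode} mode")
    config = dict(FAMILY_DEFAULTS[model])
    # Overrides left as None mean "keep the family default".
    config.update({k: v for k, v in overrides.items() if v is not None})
    return config


print(resolve_config("realtime", "FLUX.1 Schnell"))
print(resolve_config("quality", "FLUX.1 Dev", seed=7, guidance_scale=None))
```

Keeping the defaults in one table and layering overrides on top means a user who never opens Advanced still gets a valid configuration.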
Latency tiers (FLUX-Schnell, 1024x1024, 4 steps, FP8 + compile)
| GPU | Family | Quant | First inference (cold) | Steady-state p50 |
|---|---|---|---|---|
| L4 | Ada | FP8 + offload | ~150s | ~2.5s |
| A10G | Ampere | FP16 (no native FP8) | ~120s | ~2.0s |
| L40S | Ada | FP8 + compile | ~90s | ~1.0s |
| A100-40GB | Ampere | FP16 + compile | ~100s | ~1.6s |
| H100 | Hopper | FP8 + compile | ~90s | ~0.6s |
Cold-start cost includes weights pull from Hugging Face (24GB for FLUX) plus the torch.compile warmup pass. After the first request, subsequent calls hit the warm path. The compile cache is mounted on the RunPod network volume, so a worker that scales back to zero and respawns reuses the prior compile artifacts.
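To get a feel for how the cold start amortizes, here is a back-of-the-envelope estimate using the table's figures. It treats the steady-state p50 as a typical per-request time, which is only an approximation, and `mean_latency_s` is a hypothetical helper.

```python
def mean_latency_s(cold_first_s: float, steady_p50_s: float, requests: int) -> float:
    """Rough mean seconds per image on a fresh worker.

    The first request pays the cold-start cost; the remaining ones are
    approximated by the steady-state p50 (a median, so this is a rough guide).
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    total = cold_first_s + (requests - 1) * steady_p50_s
    return total / requests


# L40S row from the table: ~90 s cold, ~1.0 s steady-state p50.
print(mean_latency_s(90.0, 1.0, 1))    # the single cold request: 90.0
print(mean_latency_s(90.0, 1.0, 100))  # about 1.89 s per image over 100 requests
```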
API
Send a POST to the deployed endpoint:
curl https://YOUR-ENDPOINT.runinfra.ai/v1/images/generations \
  -H "Authorization: Bearer $RUNINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "black-forest-labs/FLUX.1-schnell",
    "prompt": "A photorealistic portrait of an astronaut on Mars at golden hour",
    "size": "1024x1024",
    "n": 1,
    "response_format": "b64_json"
  }'
The response follows the OpenAI Images API shape (created, data), with additional RunInfra fields (inference_ms, width, height, model):
{
  "created": 1714512000,
  "data": [
    { "b64_json": "iVBORw0KGgoAAAANSUhEU..." }
  ],
  "inference_ms": 952.7,
  "width": 1024,
  "height": 1024,
  "model": "black-forest-labs/FLUX.1-schnell"
}
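The same call can be made from Python with only the standard library. This is a minimal sketch, not an official client: the function names are made up for illustration, the endpoint URL is the placeholder from the curl example, and the 300-second timeout is an assumption chosen to sit above the cold-start budget.

```python
import base64
import json
import urllib.request


def build_request(endpoint: str, api_key: str, prompt: str,
                  model: str = "black-forest-labs/FLUX.1-schnell",
                  size: str = "1024x1024", n: int = 1) -> urllib.request.Request:
    """Build the POST request for /v1/images/generations."""
    payload = {
        "model": model,
        "prompt": prompt,
        "size": size,
        "n": n,
        "response_format": "b64_json",
    }
    return urllib.request.Request(
        f"{endpoint.rstrip('/')}/v1/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def decode_images(response_body: dict) -> list[bytes]:
    """Turn each data[].b64_json entry into raw PNG bytes."""
    return [base64.b64decode(item["b64_json"]) for item in response_body["data"]]


def generate(endpoint: str, api_key: str, prompt: str, timeout: float = 300.0) -> list[bytes]:
    """Send the request and return decoded images (timeout is an assumed value)."""
    request = build_request(endpoint, api_key, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_images(json.load(response))
```

A caller would write each returned `bytes` object to a `.png` file, for example `open("out.png", "wb").write(images[0])`.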
RunInfra extensions
The runtime accepts these optional fields beyond the OpenAI spec:
| Field | Type | Bounds / allowed values | Default |
|---|---|---|---|
| num_inference_steps | int | 1 to 50 | 4 (FLUX-Schnell), 28 (FLUX-Dev), 20 (SDXL), 1 (SDXL-Turbo), 4 (SDXL-Lightning) |
| guidance_scale | float | 0 to 20 | family default |
| scheduler | string | flow-match-euler, dpm++, euler-a, heun, ddim | family default |
| seed | int | 0 to 2^31 - 1 | unset (non-deterministic) |
| negative_prompt | string | up to 2000 chars | empty |
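A client can check these fields before sending a request. The sketch below mirrors the bounds in the table; `validate_extensions` is a hypothetical client-side helper, and the server's own validation and error messages may differ.

```python
# Allowed scheduler names, copied from the table above.
SCHEDULERS = {"flow-match-euler", "dpm++", "euler-a", "heun", "ddim"}


def validate_extensions(fields: dict) -> list[str]:
    """Return a list of problems; an empty list means the fields look valid."""
    errors = []
    steps = fields.get("num_inference_steps")
    if steps is not None and not (isinstance(steps, int) and 1 <= steps <= 50):
        errors.append("num_inference_steps must be an int from 1 to 50")
    guidance = fields.get("guidance_scale")
    if guidance is not None and not (isinstance(guidance, (int, float)) and 0 <= guidance <= 20):
        errors.append("guidance_scale must be a number from 0 to 20")
    scheduler = fields.get("scheduler")
    if scheduler is not None and scheduler not in SCHEDULERS:
        errors.append("scheduler must be one of: " + ", ".join(sorted(SCHEDULERS)))
    seed = fields.get("seed")
    if seed is not None and not (isinstance(seed, int) and 0 <= seed <= 2**31 - 1):
        errors.append("seed must be an int from 0 to 2^31 - 1")
    negative = fields.get("negative_prompt")
    if negative is not None and not (isinstance(negative, str) and len(negative) <= 2000):
        errors.append("negative_prompt must be a string of at most 2000 chars")
    return errors


print(validate_extensions({"num_inference_steps": 4, "scheduler": "flow-match-euler", "seed": 42}))
print(validate_extensions({"num_inference_steps": 51, "seed": 2**31}))
```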
Optimization
When you trigger optimization on an image-gen pipeline, RunInfra runs a candidate sweep:
- Baseline FP16 is always measured.
- FP8 + compile is added when the target GPU is Ada / Hopper / Blackwell. The candidate generator skips it on Ampere because torchao FP8 falls back to BF16 W8A16 there with no speedup.
- The Pareto winner (lowest p50 latency under the steady-state batch) is crowned and persisted to the optimization receipt.
Check the receipt chips after deploy:
4 steps / gs 0 / FlowMatch / FP8 dynamic / compiled
Each chip reflects a real measurement, not a label. The model-selection knobs (num_inference_steps, guidance_scale, scheduler) are deliberately NOT swept; those are product knobs you set, not optimization axes the system explores.
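The candidate sweep can be pictured roughly as follows. This is a sketch of the selection logic as described on this page, not RunInfra's implementation: the function names and candidate labels are hypothetical, and the latencies in the usage lines are placeholder inputs rather than measured results.

```python
# GPU families that get the FP8 + compile candidate, per the list above.
FP8_FAMILIES = {"ada", "hopper", "blackwell"}


def candidates(gpu_family: str) -> list[str]:
    """Baseline FP16 is always included; FP8 + compile only on FP8-capable families."""
    result = ["fp16"]
    if gpu_family.lower() in FP8_FAMILIES:
        result.append("fp8+compile")
    return result


def pick_winner(measured_p50_ms: dict[str, float]) -> str:
    """Pick the candidate with the lowest measured steady-state p50 latency."""
    if not measured_p50_ms:
        raise ValueError("no measurements to compare")
    return min(measured_p50_ms, key=measured_p50_ms.get)


print(candidates("Ampere"))  # ['fp16']
print(candidates("Hopper"))  # ['fp16', 'fp8+compile']
# Placeholder numbers, only to show the selection step.
print(pick_winner({"fp16": 1500.0, "fp8+compile": 700.0}))  # fp8+compile
```

Note that step count, guidance, and scheduler do not appear as candidate axes, matching the statement that those are product knobs rather than things the sweep explores.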
Pricing
Image generation is priced per image, not per token. The deploy form shows the per-image cost based on the model + GPU pairing you select. Prompt input is included in the per-image price; you don’t pay separately for tokens.
Limits and constraints
- Image dimensions: 64 to 2048 pixels per side. The runtime rejects requests outside this range with HTTP 422.
- Steps: capped at 50. Distilled models enforce a lower, family-specific ceiling and reject step counts above what they were trained for.
- n (images per request): capped at 4.
- Cold-start budget: 240 seconds. If a worker can’t load weights and warm up within 4 minutes, the deploy is marked failed.
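The dimension and batch limits above can be checked on the client before a request is sent, which avoids a round trip that would end in HTTP 422. A minimal sketch, assuming the "WIDTHxHEIGHT" size format from the API example; `check_limits` is a hypothetical helper and the lower bound of 1 for n is an assumption.

```python
MIN_SIDE, MAX_SIDE = 64, 2048  # pixels per side, from the limits above
MAX_IMAGES = 4                 # cap on n, from the limits above


def check_limits(size: str, n: int = 1) -> tuple[int, int]:
    """Parse "WIDTHxHEIGHT" and raise ValueError if a documented limit is exceeded."""
    width_text, sep, height_text = size.lower().partition("x")
    if not sep or not width_text.isdecimal() or not height_text.isdecimal():
        raise ValueError(f"size must look like 1024x1024, got {size!r}")
    width, height = int(width_text), int(height_text)
    for side in (width, height):
        if not MIN_SIDE <= side <= MAX_SIDE:
            raise ValueError(f"each side must be {MIN_SIDE} to {MAX_SIDE} pixels, got {side}")
    # Assumption: at least one image must be requested.
    if not 1 <= n <= MAX_IMAGES:
        raise ValueError(f"n must be 1 to {MAX_IMAGES}, got {n}")
    return width, height


print(check_limits("1024x1024", n=1))  # (1024, 1024)
```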
What’s not yet supported
- Image-to-image and inpainting: the runtime is text-to-image only.
- ControlNet, IP-Adapter, LoRA: base models only. Custom fine-tunes are on the roadmap.
- Hyper-FLUX 8-step LoRA: research-only license; deferred until a commercial path is clear.
- GGUF / INT8 quantization: FP16 + FP8 dynamic cover the production GPU fleet today.
- ComfyUI graphs: the v1 runtime is Diffusers + a thin FastAPI layer; ComfyUI support is on the roadmap as an opt-in.
See also
- Quickstart: pick a model, click deploy, hit the API.
- Models: full catalog including LLMs, ASR, TTS, embeddings.
- GPU pricing: per-GPU hourly rates for the active tier.
- Optimization: how RunInfra picks the winning candidate per pipeline.