Image generation - RunInfra

RunInfra serves text-to-image diffusion models through a Diffusers-based runtime with torchao FP8 dynamic quantization and torch.compile on Ada Lovelace (L4, L40S), Hopper (H100, H200), and Blackwell (B200) GPUs. Older Ampere cards (A10G, A100) fall back to FP16; FP8 there has no native kernel and would only emulate. Pick a model, choose a mode, and the agent provisions a RunPod endpoint with the right base image, the right scheduler, and the right step count for the model family. The OpenAI-compatible /v1/images/generations endpoint returns base64 PNG bytes, exactly like the OpenAI Images API.

Models

Model	Provider	License	Parameter count	VRAM (FP16)	Default steps	Default guidance
FLUX.1 Schnell	Black Forest Labs	Apache-2.0	12B	24GB	4	0.0
FLUX.1 Dev	Black Forest Labs	FLUX-NC-1.0.1 (non-commercial)	12B	24GB	28	3.5
Stable Diffusion 3.5 Large	Stability AI	Stability Community (gated above $1M ARR)	8B	16GB	28	4.5
SDXL Turbo	Stability AI	Stability Community	3.5B	7GB	1	0.0
SDXL Lightning (4-step)	ByteDance	OpenRAIL-M	3.5B	7GB	4	0.0

FLUX.1 Dev is licensed for non-commercial use only. RunInfra hosts the weights, but you are responsible for license compliance. The config tab surfaces a banner when you pick FLUX-Dev so the obligation is hard to miss.

Mode picker

The pipeline config exposes two modes, designed so you don’t need to know what num_inference_steps means:

Realtime: picks a few-step distilled model (FLUX-Schnell, SDXL-Turbo, SDXL-Lightning; 1 to 4 steps depending on the model) with guidance_scale=0. Target latency: under 1.5 seconds on L40S with FP8 + compile.
Quality: picks a non-distilled model (FLUX-Dev, SD3.5-Large) with full step count and guidance. Target latency: 3 to 5 seconds on L40S.

Power users can open the Advanced disclosure to override num_inference_steps, guidance_scale, scheduler (FlowMatchEuler for FLUX/SD3, DPM++ for SDXL), seed, and negative prompt directly.
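
The mode picker can be pictured as a lookup over the Models table. The sketch below is illustrative only and is not RunInfra's implementation: `MODEL_DEFAULTS`, `MODE_MODELS`, and `resolve_generation_params` are hypothetical names, the short model labels are the ones used in this section, and the values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defaults:
    steps: int
    guidance: float


# Values copied from the Models table; the keys are illustrative labels.
MODEL_DEFAULTS = {
    "FLUX-Schnell": Defaults(4, 0.0),
    "FLUX-Dev": Defaults(28, 3.5),
    "SD3.5-Large": Defaults(28, 4.5),
    "SDXL-Turbo": Defaults(1, 0.0),
    "SDXL-Lightning": Defaults(4, 0.0),
}

# Which models each mode may pick, as described in the mode list.
MODE_MODELS = {
    "realtime": ("FLUX-Schnell", "SDXL-Turbo", "SDXL-Lightning"),
    "quality": ("FLUX-Dev", "SD3.5-Large"),
}


def resolve_generation_params(mode, model, overrides=None):
    """Return the family defaults for a mode/model pair, then apply overrides."""
    if mode not in MODE_MODELS:
        raise ValueError(f"unknown mode: {mode!r}")
    if model not in MODE_MODELS[mode]:
        raise ValueError(f"{model!r} is not offered in {mode!r} mode")
    defaults = MODEL_DEFAULTS[model]
    params = {
        "num_inference_steps": defaults.steps,
        "guidance_scale": defaults.guidance,
    }
    # Advanced overrides (seed, scheduler, ...) win over the defaults.
    params.update(overrides or {})
    return params
```

Under these assumptions, `resolve_generation_params("quality", "FLUX-Dev", {"seed": 7})` yields 28 steps, guidance 3.5, and the supplied seed.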

Latency tiers (FLUX-Schnell, 1024x1024, 4 steps, FP8 + compile)

GPU	Family	Quant	First inference (cold)	Steady-state p50
L4	Ada	FP8 + offload	~150s	~2.5s
A10G	Ampere	FP16 (no native FP8)	~120s	~2.0s
L40S	Ada	FP8 + compile	~90s	~1.0s
A100-40GB	Ampere	FP16 + compile	~100s	~1.6s
H100	Hopper	FP8 + compile	~90s	~0.6s

Cold-start cost includes the weight pull from Hugging Face (24GB for FLUX) plus the torch.compile warmup pass. Once the first request completes, later calls hit the warm path. The compile cache is mounted on the RunPod network volume, so a worker that scales to zero and respawns reuses the prior compile artifacts.
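
To see how quickly the cold start amortizes, here is a small worked calculation. It assumes the cold first-inference time counts as the latency of the first request on a fresh worker and that every later request takes roughly the steady-state p50; both are simplifications, and the function names are hypothetical.

```python
import math


def mean_latency_s(cold_first_s, warm_p50_s, n_requests):
    """Mean per-request latency on a fresh worker serving n_requests in a row."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    return (cold_first_s + (n_requests - 1) * warm_p50_s) / n_requests


def requests_to_reach(cold_first_s, warm_p50_s, target_mean_s):
    """Smallest request count whose mean latency is at or below the target."""
    if target_mean_s <= warm_p50_s:
        raise ValueError("target must be above the warm latency")
    needed = (cold_first_s - warm_p50_s) / (target_mean_s - warm_p50_s)
    return max(1, math.ceil(needed))
```

With the L40S row (about 90s cold, about 1.0s warm), `mean_latency_s(90, 1.0, 100)` is 1.89 seconds, and `requests_to_reach(90, 1.0, 2.0)` is 89 requests.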

API

Send a POST to the deployed endpoint:

curl https://YOUR-ENDPOINT.runinfra.ai/v1/images/generations \
  -H "Authorization: Bearer $RUNINFRA_GATEWAY_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "black-forest-labs/FLUX.1-schnell",
    "prompt": "A photorealistic portrait of an astronaut on Mars at golden hour",
    "size": "1024x1024",
    "n": 1,
    "response_format": "b64_json"
  }'
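
The same request can be built from Python with only the standard library. This is a sketch, not an official client: `build_image_request` and `send` are hypothetical helpers, the endpoint host and the payload fields are taken from the curl example, and the 300-second timeout is an assumption chosen to sit above the cold-start budget.

```python
import json
import urllib.request


def build_image_request(base_url, api_key, prompt,
                        model="black-forest-labs/FLUX.1-schnell",
                        size="1024x1024", n=1):
    """Build (but do not send) a POST to /v1/images/generations."""
    payload = {
        "model": model,
        "prompt": prompt,
        "size": size,
        "n": n,
        "response_format": "b64_json",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout_s=300):
    """Send the request and return the parsed JSON body."""
    # The timeout is an assumption: a cold worker may need minutes to warm up.
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)
```

A typical call would pass `https://YOUR-ENDPOINT.runinfra.ai` as `base_url` and read the key from the `RUNINFRA_GATEWAY_KEY` environment variable, then hand the result to `send`.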

The response follows the OpenAI Images API shape (created plus a data array) and adds the top-level fields inference_ms, width, height, and model:

{
  "created": 1714512000,
  "data": [
    { "b64_json": "iVBORw0KGgoAAAANSUhEU..." }
  ],
  "inference_ms": 952.7,
  "width": 1024,
  "height": 1024,
  "model": "black-forest-labs/FLUX.1-schnell"
}
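
Turning the `b64_json` entries back into PNG files takes a few lines. The sketch below assumes the response body has already been parsed into a dict with the shape shown above; `decode_images` and `save_images` are hypothetical helper names, and the signature check is an extra sanity test rather than something the API requires.

```python
import base64
from pathlib import Path

# Every PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_images(response_body):
    """Decode each data[i].b64_json entry into raw PNG bytes."""
    images = []
    for item in response_body["data"]:
        raw = base64.b64decode(item["b64_json"])
        if not raw.startswith(PNG_SIGNATURE):
            raise ValueError("decoded payload does not look like a PNG")
        images.append(raw)
    return images


def save_images(response_body, out_dir, stem="image"):
    """Write the decoded images to out_dir and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, raw in enumerate(decode_images(response_body)):
        path = out / f"{stem}-{index}.png"
        path.write_bytes(raw)
        paths.append(path)
    return paths
```

The `iVBORw0KGgo` prefix in the example response is simply the base64 encoding of that PNG signature.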

RunInfra extensions

The runtime accepts these optional fields beyond the OpenAI spec:

Field	Type	Bounds	Default
`num_inference_steps`	int	1 to 50	4 (FLUX-Schnell), 28 (FLUX-Dev), 28 (SD3.5-Large), 20 (SDXL), 1 (SDXL-Turbo), 4 (SDXL-Lightning)
`guidance_scale`	float	0 to 20	family default
`scheduler`	string	`flow-match-euler`, `dpm++`, `euler-a`, `heun`, `ddim`	family default
`seed`	int	0 to 2^31 - 1	unset (non-deterministic)
`negative_prompt`	string	up to 2000 chars	empty
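
A client can check these fields before sending a request. The following is a minimal sketch that mirrors the bounds in the table; `validate_extensions` is a hypothetical helper, and the server's actual validation and error messages may differ.

```python
ALLOWED_SCHEDULERS = {"flow-match-euler", "dpm++", "euler-a", "heun", "ddim"}
MAX_SEED = 2**31 - 1


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_extensions(fields):
    """Return a list of problems; an empty list means the fields are in bounds."""
    errors = []
    steps = fields.get("num_inference_steps")
    if steps is not None and not (_is_int(steps) and 1 <= steps <= 50):
        errors.append("num_inference_steps must be an int from 1 to 50")
    guidance = fields.get("guidance_scale")
    if guidance is not None and not (
        isinstance(guidance, (int, float))
        and not isinstance(guidance, bool)
        and 0 <= guidance <= 20
    ):
        errors.append("guidance_scale must be a number from 0 to 20")
    scheduler = fields.get("scheduler")
    if scheduler is not None and scheduler not in ALLOWED_SCHEDULERS:
        errors.append(f"scheduler must be one of {sorted(ALLOWED_SCHEDULERS)}")
    seed = fields.get("seed")
    if seed is not None and not (_is_int(seed) and 0 <= seed <= MAX_SEED):
        errors.append("seed must be an int from 0 to 2^31 - 1")
    negative = fields.get("negative_prompt")
    if negative is not None and not (isinstance(negative, str) and len(negative) <= 2000):
        errors.append("negative_prompt must be a string of at most 2000 chars")
    return errors
```

Omitted fields are treated as valid, since each one falls back to its default.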

Optimization

When you trigger optimization on an image-gen pipeline, RunInfra runs a candidate sweep:

Baseline FP16 is always measured.
FP8 + compile is added when the target GPU is Ada / Hopper / Blackwell. The candidate generator skips it on Ampere because torchao FP8 falls back to BF16 W8A16 there with no speedup.
The Pareto winner (the candidate with the lowest p50 latency under the steady-state batch) is selected and persisted to the optimization receipt.
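
The sweep logic above can be sketched in a few lines. This is an illustration under stated assumptions, not RunInfra's optimizer: the candidate labels, `candidates_for`, and `run_sweep` are hypothetical, and `measure` stands in for whatever benchmark produces steady-state latency samples for a candidate.

```python
import statistics

# GPU families where the FP8 + compile candidate is added, per the list above.
FP8_FAMILIES = frozenset({"ada", "hopper", "blackwell"})


def candidates_for(gpu_family):
    """The FP16 baseline is always measured; FP8 + compile only where native."""
    candidates = ["fp16"]
    if gpu_family.lower() in FP8_FAMILIES:
        candidates.append("fp8+compile")
    return candidates


def run_sweep(gpu_family, measure):
    """Measure each candidate and return (winner, p50 latency per candidate).

    `measure` is a callable that takes a candidate label and returns a
    sequence of latency samples in seconds.
    """
    p50 = {
        name: statistics.median(measure(name))
        for name in candidates_for(gpu_family)
    }
    winner = min(p50, key=p50.get)
    return winner, p50
```

On an Ampere target the sweep degenerates to the baseline alone, so FP16 wins by default.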

Check the receipt chips after deploy:

4 steps / gs 0 / FlowMatch / FP8 dynamic / compiled

Each chip reflects a real measurement, not a label. The model-selection knobs (num_inference_steps, guidance_scale, scheduler) are deliberately NOT swept; those are product knobs you set, not optimization axes the system explores.

Pricing

Image generation is priced per image, not per token. The deploy form shows the per-image cost based on the model + GPU pairing you select. Prompt input is included in the per-image price; you don’t pay separately for tokens.

Limits and constraints

Image dimensions: 64 to 2048 pixels per side. The runtime rejects requests outside this range with HTTP 422.
Steps: capped at 50. Distilled models additionally reject step counts above their family-default ceiling, because they are trained for a fixed, small number of steps.
n (images per request): capped at 4.
Cold-start budget: 240 seconds. If a worker can’t load weights and warm up within 4 minutes, the deploy is marked failed.
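
The size and `n` limits are easy to check on the client before a request goes out. The sketch below assumes `size` uses the `WIDTHxHEIGHT` form shown in the API example and that `n` must be at least 1 (the lower bound is not stated above); `parse_size` and `check_limits` are hypothetical names.

```python
import re

MIN_SIDE, MAX_SIDE = 64, 2048
MAX_IMAGES = 4

_SIZE_RE = re.compile(r"(\d+)x(\d+)", re.ASCII)


def parse_size(size):
    """Parse 'WIDTHxHEIGHT' into a (width, height) tuple of ints."""
    match = _SIZE_RE.fullmatch(size)
    if match is None:
        raise ValueError(f"size must look like '1024x1024', got {size!r}")
    return int(match.group(1)), int(match.group(2))


def check_limits(size, n=1):
    """Return a list of limit violations; an empty list means the request fits."""
    try:
        width, height = parse_size(size)
    except ValueError as exc:
        return [str(exc)]
    problems = []
    for label, side in (("width", width), ("height", height)):
        if not MIN_SIDE <= side <= MAX_SIDE:
            # The runtime answers out-of-range dimensions with HTTP 422.
            problems.append(f"{label} {side} is outside {MIN_SIDE} to {MAX_SIDE}")
    if not 1 <= n <= MAX_IMAGES:  # lower bound of 1 is an assumption
        problems.append(f"n must be from 1 to {MAX_IMAGES}")
    return problems
```

Catching these locally avoids a round trip that would end in a rejection.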

What’s not yet supported

Image-to-image and inpainting: text-to-image only.
ControlNet, IP-Adapter, LoRA: base models only. Custom fine-tunes are on the roadmap.
Hyper-FLUX 8-step LoRA: research-only license; deferred until the commercial path is clear.
GGUF / INT8 quantization: FP16 + FP8 dynamic cover the production GPU fleet today.
ComfyUI graphs: the v1 runtime is Diffusers + a thin FastAPI layer; ComfyUI is on the roadmap as an opt-in.
