What happened to TGI?

Hugging Face moved Text Generation Inference to maintenance mode in March 2026. We note it for context and did not benchmark it as a contender.

Does the ranking hold at fp8?

fp8 raised throughput by 23 to 31 percent for both vLLM and SGLang on the H100 and lowered cost per token. vLLM stayed ahead.

vLLM vs SGLang vs TensorRT-LLM: a reproducible benchmark

Everything below is scoped to what we tested: Llama-3.1-8B-Instruct, one H100 80GB and one L40S 48GB, bf16 and fp8. We do not generalize past it. Versions, as of June 20 2026: vLLM 0.23.0, SGLang 0.5.13, TensorRT-LLM 1.2.1. The harness and every raw number are open source.

The short version

There is no single fastest engine. The winner changes with how you load it.

Decision guide

If you care about	Use	On this benchmark
Lowest first-token latency at moderate load	TensorRT-LLM	235 ms TTFT at c32 vs vLLM 514 ms
Highest throughput at saturation	vLLM	5,333 tok/s on one H100
Lowest cost per token	vLLM at fp8	$0.158 per 1M tokens
Shared-prefix caching	a big win for both	throughput ~2.7x at 90% hit rate

The rule for these configs: TensorRT-LLM for latency-bound serving below its throughput ceiling, vLLM for the highest throughput and the lowest cost. SGLang tracked vLLM within a few percent.

We ran every engine on the same weights, same GPU, and same request stream, and we open-sourced the harness so you can check us. The one clear surprise that holds: the cheap L40S GPU costs about twice as much per token as the H100.

Why we ran this

Hugging Face put TGI into maintenance mode in March 2026, so a lot of teams are re-choosing a serving engine right now. The comparisons you can find are point-in-time snapshots on someone else's workload. So we ran it ourselves, on our own fleet, and published the harness. If our numbers do not match yours, open the repo and find out why.

How we measured

One client drives all three engines with identical request streams, so the timing definitions are the same everywhere. TTFT is the time to the first streamed token. Throughput is output tokens per second. We warm up, then time three repeats per operating point and chart the mean; the 95 percent confidence intervals are tight and live in the committed raw data. Prefix caching is off for the unique-prompt sweeps so nothing gets a free cache hit.
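
The two timing definitions above can be sketched in a few lines. This is an illustration, not the harness code: `RequestTrace`, its field names, and the helper names are hypothetical, and the use of a Student-t interval for the three repeats is our assumption, since the text only says that 95 percent confidence intervals exist.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class RequestTrace:
    """Client-side timestamps for one streamed request, in seconds (hypothetical shape)."""
    sent_at: float
    token_times: list[float]  # arrival time of each streamed output token


def ttft_ms(trace: RequestTrace) -> float:
    """Time from sending the request to the first streamed token, in milliseconds."""
    return (trace.token_times[0] - trace.sent_at) * 1000.0


def output_throughput(traces: list[RequestTrace]) -> float:
    """Output tokens per second over the wall-clock span of the whole run."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


# Two-sided 95% Student-t critical value for 2 degrees of freedom (three repeats).
T_CRIT_95_DF2 = 4.303


def mean_and_ci95(repeats: list[float]) -> tuple[float, float]:
    """Mean and 95% confidence half-width; assumes exactly three repeats."""
    if len(repeats) != 3:
        raise ValueError("this sketch hard-codes the t value for three repeats")
    half_width = T_CRIT_95_DF2 * stdev(repeats) / len(repeats) ** 0.5
    return mean(repeats), half_width
```

Measuring both values on the client, from the same timestamps, is what keeps the definitions identical across engines: no engine's own metrics endpoint is involved.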

Throughput vs concurrency

We swept concurrency from 1 to 256 at 1,024 input and 256 output tokens. At concurrency 1 the three engines are within four percent of each other. As load rises, TensorRT-LLM is competitive or best through concurrency 64, then flattens. vLLM keeps scaling and reaches the highest peak, 5,333 tok/s.
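
A sweep like this is usually a closed loop: a fixed number of requests stay in flight, and a new one starts only when an old one finishes. The sketch below shows that shape with a stub in place of a real engine. `fake_request`, `run_point`, and `sweep_levels` are hypothetical names, and the power-of-two spacing is an assumption; the text states only the 1 to 256 range.

```python
import asyncio
import time


async def fake_request(output_tokens: int) -> int:
    """Stand-in for a real streaming call: sleeps briefly, returns the token count."""
    await asyncio.sleep(0.01)
    return output_tokens


async def run_point(concurrency: int, num_requests: int, output_tokens: int = 256) -> float:
    """Run num_requests with at most `concurrency` in flight; return output tok/s."""
    limiter = asyncio.Semaphore(concurrency)

    async def one() -> int:
        async with limiter:  # blocks until an in-flight slot frees up
            return await fake_request(output_tokens)

    started = time.perf_counter()
    counts = await asyncio.gather(*(one() for _ in range(num_requests)))
    elapsed = time.perf_counter() - started
    return sum(counts) / elapsed


def sweep_levels(lo: int = 1, hi: int = 256) -> list[int]:
    """Powers of two from lo to hi inclusive (the spacing is an assumption)."""
    levels = []
    c = lo
    while c <= hi:
        levels.append(c)
        c *= 2
    return levels
```

With a real backend, `fake_request` would be replaced by the streaming client call, and each level would get a warm-up pass before the timed repeats.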

Output throughput vs concurrency, Llama-3.1-8B bf16, one H100

The latency vs throughput frontier

The throughput table hides the more useful picture. Plot first-token latency against throughput and the tradeoff is clear. TensorRT-LLM holds the lowest TTFT across the whole mid range and keeps that edge up to about 4,000 tok/s, then hits its ceiling and TTFT climbs steeply. vLLM owns the high-throughput end: it reaches 5,333 tok/s at a lower TTFT than either of the other two shows as it approaches that rate.

Latency vs throughput frontier (each point is a concurrency level)

So the frontier has two owners. TensorRT-LLM for latency-sensitive serving below its ceiling, vLLM for maximum throughput. That is the whole point: the right engine is a function of your operating point, not a constant.
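
Choosing by operating point can be written down directly: keep the points that no other point beats on both axes, then pick the lowest-TTFT point that still meets your throughput target. The sketch below is illustrative only. The engine names "A" and "B" and every number in `TOY_POINTS` are made up to exercise the functions; they are not measurements from this benchmark.

```python
from typing import NamedTuple, Optional


class Point(NamedTuple):
    engine: str
    concurrency: int
    throughput_tok_s: float
    ttft_ms: float


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep points that no other point beats on throughput and TTFT together."""
    frontier = []
    for p in points:
        dominated = any(
            q.throughput_tok_s >= p.throughput_tok_s
            and q.ttft_ms <= p.ttft_ms
            and (q.throughput_tok_s > p.throughput_tok_s or q.ttft_ms < p.ttft_ms)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.throughput_tok_s)


def best_for_target(points: list[Point], min_throughput: float) -> Optional[Point]:
    """Lowest-TTFT point that sustains at least min_throughput, or None."""
    ok = [p for p in points if p.throughput_tok_s >= min_throughput]
    return min(ok, key=lambda p: p.ttft_ms) if ok else None


# Toy values for illustration only; NOT results from the benchmark.
TOY_POINTS = [
    Point("A", 32, 3000.0, 200.0),
    Point("A", 128, 4500.0, 900.0),
    Point("B", 32, 2900.0, 450.0),
    Point("B", 256, 5300.0, 1200.0),
]
```

On the toy data, a 4,000 tok/s target selects engine A and a 5,000 tok/s target selects engine B, which is the two-owner pattern described above.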

TensorRT-LLM: we benchmarked the no-compile path

Read this before you weigh the throughput numbers. We ran TensorRT-LLM on its PyTorch backend, which loads a Hugging Face checkpoint and serves it with no ahead-of-time engine build. That is the simplest path and the NVIDIA-recommended default, and it is what most people reach for first.

It is not the maximum-throughput path. A compiled TensorRT engine, built offline with trtllm-build, targets peak tokens per second at the cost of a multi-minute compile step that is locked to one GPU architecture. We did not build one. So the 4,813 tok/s we measured is the PyTorch-backend peak, not TensorRT-LLM's ceiling, and we have no compile-time number to report. We make no claim that TensorRT-LLM cannot win raw throughput. What we can say: its PyTorch backend had the lowest first-token latency of the three across the mid range, and among the configs we ran, vLLM reached the highest throughput.

Cost per token

We turn throughput into dollars with one formula: cost per 1M tokens equals the GPU hourly price divided by sustained output tokens per hour, counted in millions. The H100 is $3.95 an hour and the L40S is $1.95 an hour on Modal, observed June 20 2026.
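
The formula is short enough to check by hand. The sketch below applies it to the one pairing of price and throughput that the text states outright, the H100 at the 5,333 tok/s bf16 peak; the function name is ours.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_second: float) -> float:
    """GPU hourly price divided by millions of output tokens sustained per hour."""
    million_tokens_per_hour = tokens_per_second * 3600 / 1_000_000
    return gpu_price_per_hour / million_tokens_per_hour


# H100 at $3.95/hour and the 5,333 tok/s bf16 peak, both taken from the text.
h100_bf16 = cost_per_million_tokens(3.95, 5333)
print(f"${h100_bf16:.3f} per 1M output tokens")
```

That works out to about $0.206 per 1M output tokens, the H100 bf16 figure quoted in this section.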

Cost per 1M output tokens at each config's saturation (lower is better)

Two things stand out. fp8 is the cheapest way to serve this model: vLLM at fp8 costs $0.158 per 1M tokens, 23 percent less than bf16, and it is faster too.

The second is the one to remember. The L40S is half the hourly price of the H100, so it looks like the budget option. It is not. The L40S is about four times slower on this model, so it costs roughly twice as much per token, $0.425 versus $0.206. For Llama-3.1-8B at load, the H100 is both faster and cheaper per token. The cheap GPU is the expensive choice.
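
The arithmetic behind that claim is a break-even test: a cheaper GPU wins per token only if its slowdown is smaller than its price discount. The sketch below uses only the prices and per-token costs quoted in this section. The throughputs it derives are implied by those figures, not separately measured values, and it assumes both costs refer to the same precision.

```python
H100_PRICE, L40S_PRICE = 3.95, 1.95   # USD per hour, from the text
H100_COST, L40S_COST = 0.206, 0.425   # USD per 1M output tokens, from the text


def implied_tokens_per_second(price_per_hour: float, cost_per_million: float) -> float:
    """Invert the cost formula to get the throughput a cost figure implies."""
    return price_per_hour / cost_per_million * 1_000_000 / 3600


price_ratio = H100_PRICE / L40S_PRICE  # how much more the H100 costs per hour
speed_ratio = (
    implied_tokens_per_second(H100_PRICE, H100_COST)
    / implied_tokens_per_second(L40S_PRICE, L40S_COST)
)  # how much faster the H100 is, as implied by the costs
cost_ratio = L40S_COST / H100_COST     # how much more the L40S costs per token

# The L40S would be cheaper per token only if the H100's speed advantage
# were smaller than its price premium.
l40s_is_cheaper = speed_ratio < price_ratio
```

The H100 costs about 2.03 times as much per hour and is about 4.2 times as fast, so the L40S ends up about 2.06 times as expensive per token. It would have to be less than about 2.03 times slower to break even.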

Shared-prefix workloads

RAG and agent traffic reuse a long system prompt across requests, so a prefix cache should help. We ran this two ways, because the shape of the reuse matters.

One shared prefix

First the simple case: one 2,048-token prefix sent to a fraction of requests, with that fraction varied from 0 to 90 percent, caching on and off. Both caches work. With caching off, throughput is flat. With caching on, it climbs about 2.7x as the hit rate reaches 90 percent.
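
A workload of this shape can be generated as follows. This is a sketch, not the harness: prompts are lists of random token ids, and `build_workload`, the 64-token suffix length, and the 32,000-id range are hypothetical choices made only to keep the example self-contained.

```python
import random


def build_workload(
    num_requests: int,
    hit_fraction: float,
    prefix_tokens: int = 2048,
    suffix_tokens: int = 64,     # hypothetical; the text does not give a suffix length
    vocab_size: int = 32_000,    # arbitrary id range for the sketch
    seed: int = 0,
) -> tuple[list[list[int]], list[int]]:
    """Return (prompts, shared_prefix); hit_fraction of prompts start with the prefix."""
    rng = random.Random(seed)

    def random_tokens(n: int) -> list[int]:
        return [rng.randrange(vocab_size) for _ in range(n)]

    shared_prefix = random_tokens(prefix_tokens)
    num_shared = round(num_requests * hit_fraction)

    prompts = []
    for i in range(num_requests):
        # Shared requests reuse the prefix; the rest get a fresh one of equal length,
        # so prompt length stays constant as the hit rate changes.
        prefix = shared_prefix if i < num_shared else random_tokens(prefix_tokens)
        prompts.append(prefix + random_tokens(suffix_tokens))

    rng.shuffle(prompts)  # interleave cache hits and misses in the request stream
    return prompts, shared_prefix
```

Keeping every prompt the same length matters: otherwise a higher hit rate would also change the amount of prefill work, and the caching effect could not be read off the curve cleanly.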

Single shared prefix: throughput vs hit rate, caching on vs off (c32)

vLLM and SGLang were within a few percent, vLLM slightly ahead. We do not read a winner into that gap. A single shared prefix is the easy case that any block-level cache handles well. It is not the workload SGLang's RadixAttention is built for.

Many distinct prefixes, the RadixAttention case

RadixAttention is designed for many distinct prefixes held in a tree at once, the shape you get from branching conversations and a pool of system prompts. So we ran that: 256 requests spread across a growing number of distinct 2,048-token prefixes, 8 then 32 then 128 of them, both engines with caching on.
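
It helps to know how much reuse this workload allows at all. The first request for each prefix has nothing cached to reuse, so with a fixed 256 requests the best-case hit rate falls as the number of prefixes grows. The sketch below computes that ceiling. It assumes an even spread of requests over prefixes and no eviction, neither of which the text states, and the function names are ours.

```python
def requests_per_prefix(num_requests: int, num_prefixes: int) -> list[int]:
    """Spread requests over prefixes as evenly as possible (an assumption)."""
    base, extra = divmod(num_requests, num_prefixes)
    return [base + (1 if i < extra else 0) for i in range(num_prefixes)]


def reuse_ceiling(num_requests: int, num_prefixes: int) -> float:
    """Best-case fraction of requests that can hit a cached prefix.

    The first request for each prefix is always a cold miss; this ignores
    eviction and concurrent first arrivals, so it is an upper bound.
    """
    if not 0 < num_prefixes <= num_requests:
        raise ValueError("need 0 < num_prefixes <= num_requests")
    return (num_requests - num_prefixes) / num_requests


for n in (8, 32, 128):
    share = requests_per_prefix(256, n)[0]
    print(f"{n:>3} prefixes: {share} requests each, reuse ceiling {reuse_ceiling(256, n):.1%}")
```

At 8, 32, and 128 prefixes the ceiling is about 96.9, 87.5, and 50 percent, so some throughput drop at higher prefix counts is expected from the workload alone, independent of the cache design.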

Many distinct prefixes: throughput vs number of prefixes (c32, both caches on)

RadixAttention worked. SGLang's logs show the prefixes served straight from its tree cache. But vLLM's prefix cache stayed ahead on throughput by 6 to 8 percent at every prefix count, with lower TTFT. So on Llama-3.1-8B on one H100 we did not reproduce a RadixAttention throughput win, even on the workload it targets. Both engines cache shared prefixes well; vLLM was a few percent faster, in line with its general edge on this hardware. RadixAttention may pull ahead with longer prefixes, deeper trees, or heavier eviction pressure than we tested. On our test it did not, and we are not going to claim otherwise.

Reproduce it yourself

The harness is open source under Apache-2.0 and every raw CSV is committed. One command runs the matrix under a hard cost cap; another re-plots from committed data with no GPU. Found a result that disagrees with yours? Open an issue with your run id and row. Corrections become a documented caveat or a v2.

If you would rather not run this matrix for every model you ship, that is what RunInfra does. Point it at any Hugging Face model and it benchmarks the engines, picks the config for your latency and cost target, and deploys it serverless.

FAQ

vLLM or SGLang?

On plain unique-prompt traffic they are within a few percent, with vLLM slightly ahead on throughput, latency, and cost in our tests. Use vLLM as the default; reach for SGLang for its programmable frontend or structured-output features.

Is TensorRT-LLM worth it?

For latency-sensitive serving below its throughput ceiling, yes. It held the lowest time to first token across the mid range, 235 ms at concurrency 32 versus vLLM's 514 ms. We ran the PyTorch backend, which skips the engine compile.

Which GPU should I use?

For Llama-3.1-8B at load, the H100. It is twice the hourly price of the L40S but about four times faster, so it costs roughly half as much per token. The L40S looks cheaper and is not.

Method notes and caveats

One H100 80GB and one L40S 48GB on Modal, single GPU, tensor-parallel size 1.
Same Llama-3.1-8B-Instruct weights and same request stream across engines. Three timed repeats; the 95 percent confidence intervals are tight and live in the committed raw data.
Latency here is TTFT and per-request percentiles, not goodput at a fixed SLO. Whether 514 ms first-token at concurrency 32 is acceptable depends on your use case; an SLO-goodput sweep is future work.
Numbers are self-reported on our harness. We do not cross-calibrate against an external suite like MLPerf, so the open harness is the check: re-run it and compare.
TensorRT-LLM ran the PyTorch backend (no engine compile), so there is no compile time to report and its peak number may differ from a built TensorRT engine.
The L40S sweep stopped at concurrency 128 (its saturation), the H100 at 256.
Engines move weekly. These numbers are vLLM 0.23.0, SGLang 0.5.13, TensorRT-LLM 1.2.1, as of June 20 2026.
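
For readers who want the SLO view mentioned in the notes above, one common definition of goodput is the rate of requests that met the latency target. The sketch below uses that definition, which is our assumption and not something the benchmark measured; the function names and the sample values are hypothetical.

```python
def slo_attainment(ttfts_ms: list[float], slo_ms: float) -> float:
    """Fraction of requests whose TTFT met the SLO."""
    return sum(1 for t in ttfts_ms if t <= slo_ms) / len(ttfts_ms)


def goodput(ttfts_ms: list[float], duration_s: float, slo_ms: float) -> float:
    """Requests per second that met the TTFT SLO over a run of duration_s."""
    met = sum(1 for t in ttfts_ms if t <= slo_ms)
    return met / duration_s


# Toy values for illustration only; NOT results from the benchmark.
sample_ttfts_ms = [100.0, 300.0, 600.0, 250.0]
print(goodput(sample_ttfts_ms, duration_s=2.0, slo_ms=500.0))  # requests/s within SLO
```

Applied to the per-request TTFT values in the raw CSVs, the same two functions would turn each operating point into a goodput number at whatever SLO your use case sets.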

Written by

Jaber Jaber, Founder and researcher, RunInfra
