Everything below is scoped to what we tested: Llama-3.1-8B-Instruct, one H100 80GB and one L40S 48GB, bf16 and fp8. We do not generalize past it. Versions, as of June 20 2026: vLLM 0.23.0, SGLang 0.5.13, TensorRT-LLM 1.2.1. The harness and every raw number are open source.
The short version
There is no single fastest engine. The winner changes with how you load it.
| If you care about | Use | On this benchmark |
|---|---|---|
| Lowest first-token latency at moderate load | TensorRT-LLM | 235 ms TTFT at c32 vs vLLM 514 ms |
| Highest throughput at saturation | vLLM | 5,333 tok/s on one H100 |
| Lowest cost per token | vLLM at fp8 | $0.158 per 1M tokens |
| Shared-prefix caching | vLLM or SGLang, a big win for both | throughput ~2.7x at 90% hit rate |
The rule for these configs: TensorRT-LLM for latency-bound serving below its throughput ceiling, vLLM for the highest throughput and the lowest cost. SGLang tracked vLLM within a few percent.
We ran every engine on the same weights, same GPU, and same request stream, and we open-sourced the harness so you can check us. The one clear surprise that holds: the L40S, the cheaper GPU by the hour, costs about twice as much per token as the H100.
Why we ran this
Hugging Face put TGI into maintenance mode in March 2026, so a lot of teams are re-choosing a serving engine right now. The comparisons you can find are point-in-time snapshots on someone else's workload. So we ran it ourselves, on our own fleet, and published the harness. If our numbers do not match yours, open the repo and find out why.
How we measured
One client drives all three engines with identical request streams, so the timing definitions are the same everywhere. TTFT is the time to the first streamed token. Throughput is output tokens per second. We warm up, then time three repeats per operating point and chart the mean; the 95 percent confidence intervals are tight and live in the committed raw data. Prefix caching is off for the unique-prompt sweeps so nothing gets a free cache hit.
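The definitions above are small enough to write down. This is a minimal sketch, not the harness code: the function names are ours for illustration, and the Student-t interval is an assumption, since the text says only that the 95 percent intervals come from three repeats.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 2 degrees of freedom,
# which is what three repeats give you. Assumed method, not confirmed.
T_95_DF2 = 4.303


def ttft_ms(request_sent_s, first_token_s):
    """Time to the first streamed token, in milliseconds."""
    return (first_token_s - request_sent_s) * 1000.0


def output_throughput(total_output_tokens, wall_time_s):
    """Output tokens per second over one timed run."""
    return total_output_tokens / wall_time_s


def mean_ci95(repeats):
    """Mean and 95% CI half-width for exactly three timed repeats."""
    if len(repeats) != 3:
        raise ValueError("this sketch hard-codes the t value for three repeats")
    mean = statistics.mean(repeats)
    half_width = T_95_DF2 * statistics.stdev(repeats) / math.sqrt(len(repeats))
    return mean, half_width
```

With only three repeats the t multiplier is large, so an interval that still comes out tight means the repeats agreed closely.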
Throughput vs concurrency
We swept concurrency from 1 to 256 at 1,024 input and 256 output tokens. At concurrency 1 the three engines are within four percent of each other. As load rises, TensorRT-LLM is competitive or best through concurrency 64, then flattens. vLLM keeps scaling and reaches the highest peak, 5,333 tok/s.
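One way to hold a fixed concurrency is a semaphore that caps in-flight requests. The sketch below is an illustration under assumptions, not the harness: `fake_request` stands in for a real streaming call, and the power-of-two steps in `CONCURRENCY_LEVELS` are a guess, since the text gives only the 1 to 256 range.

```python
import asyncio

# Assumed sweep steps; the article states only the range 1 to 256.
CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 32, 64, 128, 256]


async def run_at_concurrency(send_request, prompts, concurrency):
    """Send every prompt with at most `concurrency` requests in flight."""
    gate = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak_in_flight = 0
    results = []

    async def one(prompt):
        nonlocal in_flight, peak_in_flight
        async with gate:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
            try:
                results.append(await send_request(prompt))
            finally:
                in_flight -= 1

    await asyncio.gather(*(one(p) for p in prompts))
    return results, peak_in_flight


async def fake_request(prompt):
    """Hypothetical stand-in for an engine call; returns a fake token count."""
    await asyncio.sleep(0)
    return 256
```

Calling `asyncio.run(run_at_concurrency(fake_request, prompts, 32))` once per level in `CONCURRENCY_LEVELS` gives one operating point per level.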
The latency vs throughput frontier
The throughput table hides the more useful picture. Plot first-token latency against throughput and the tradeoff is clear. TensorRT-LLM holds the lowest TTFT across the whole mid range and keeps that edge up to about 4,000 tok/s, then hits its ceiling and TTFT climbs steeply. vLLM owns the high-throughput end: it reaches 5,333 tok/s at a lower TTFT than the other two show as they approach that rate.
So the frontier has two owners. TensorRT-LLM for latency-sensitive serving below its ceiling, vLLM for maximum throughput. That is the whole point: the right engine is a function of your operating point, not a constant.
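If you want to find the frontier owners in your own sweep, a Pareto filter over (throughput, TTFT) points does it. This is a sketch under assumptions: the `Point` type and the numbers in `POINTS` are invented placeholders, not measurements from this benchmark.

```python
from typing import NamedTuple


class Point(NamedTuple):
    engine: str
    throughput: float  # output tok/s
    ttft_ms: float


def pareto_frontier(points):
    """Keep points that no other point beats on both throughput and TTFT."""
    frontier = []
    best_ttft = float("inf")
    # Walk from the highest throughput down. A point survives only if its
    # TTFT is lower than every point at equal or higher throughput.
    for p in sorted(points, key=lambda p: (-p.throughput, p.ttft_ms)):
        if p.ttft_ms < best_ttft:
            frontier.append(p)
            best_ttft = p.ttft_ms
    return sorted(frontier, key=lambda p: p.throughput)


# Invented placeholder values, NOT numbers from this benchmark.
POINTS = [
    Point("engine_a", 1000, 100), Point("engine_b", 1000, 150),
    Point("engine_a", 3000, 200), Point("engine_b", 3000, 350),
    Point("engine_a", 4000, 400), Point("engine_b", 5000, 600),
]

for p in pareto_frontier(POINTS):
    print(p.engine, p.throughput, p.ttft_ms)
```

In this toy data, `engine_a` owns the frontier up to its ceiling and `engine_b` owns the top end, the same two-owner shape described above.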
TensorRT-LLM: we benchmarked the no-compile path
Read this before you weigh the throughput numbers. We ran TensorRT-LLM on its PyTorch backend, which loads a Hugging Face checkpoint and serves it with no ahead-of-time engine build. That is the simplest path and the NVIDIA-recommended default, and it is what most people reach for first.
It is not the maximum-throughput path. A compiled TensorRT engine, built offline with trtllm-build, chases peak tokens per second, at the cost of a multi-minute, GPU-architecture-locked compile step. We did not build it. So the 4,813 tok/s we measured is the PyTorch-backend peak, not TensorRT-LLM's ceiling, and we got no compile-time number. We make no claim that TensorRT-LLM cannot win raw throughput. What we can say: its PyTorch backend had the lowest first-token latency of the three across the mid range, and among the configs we ran, vLLM reached the highest throughput.
Cost per token
We turn throughput into dollars with one formula: cost per 1M tokens equals the GPU hourly price divided by the millions of tokens sustained per hour. The H100 is $3.95 an hour and the L40S is $1.95 an hour on Modal, observed June 20 2026.
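The formula is one line of arithmetic. A minimal sketch, with a function name of our choosing, using the H100 price and the 5,333 tok/s vLLM peak quoted in this post:

```python
def cost_per_million_tokens(gpu_price_per_hour, tokens_per_second):
    """Dollars per 1M output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# H100 at $3.95/hour, vLLM bf16 peak of 5,333 tok/s.
h100_bf16 = cost_per_million_tokens(3.95, 5333)
print(round(h100_bf16, 3))  # 0.206
```

Swap in your own hourly price and sustained throughput to price a different GPU or provider.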
Two things stand out. fp8 is the cheapest way to serve this model: vLLM at fp8 costs $0.158 per 1M tokens, 23 percent less than bf16, and it is faster too.
The second is the one to remember. The L40S is half the hourly price of the H100, so it looks like the budget option. It is not. The L40S is about four times slower on this model, so it costs roughly twice as much per token, $0.425 versus $0.206. For Llama-3.1-8B at load, the H100 is both faster and cheaper per token. The cheap GPU is the expensive choice.
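The same arithmetic shows how far the L40S is from breaking even. This sketch uses only the prices and per-token costs quoted above; the helper names are ours, and the L40S throughput is backed out from its cost figure rather than quoted from the raw data.

```python
def percent_cheaper(cost, baseline):
    """How much cheaper `cost` is than `baseline`, in percent."""
    return (1 - cost / baseline) * 100


def breakeven_throughput(price_per_hour, rival_price_per_hour, rival_tok_s):
    """Throughput a GPU needs to match a rival's cost per token."""
    return rival_tok_s * price_per_hour / rival_price_per_hour


def implied_throughput(price_per_hour, cost_per_million):
    """Back out sustained tok/s from hourly price and cost per 1M tokens."""
    return price_per_hour / cost_per_million * 1_000_000 / 3600


# fp8 vs bf16 on the H100: about 23 percent cheaper.
print(round(percent_cheaper(0.158, 0.206), 1))
# L40S throughput needed to tie the H100 per token: about 2,633 tok/s.
print(round(breakeven_throughput(1.95, 3.95, 5333)))
# L40S throughput implied by $0.425 per 1M tokens: about 1,275 tok/s.
print(round(implied_throughput(1.95, 0.425)))
```

At half the hourly price, the L40S would need at least half the H100's throughput to tie. It delivers roughly a quarter, which is where the 2x cost per token comes from.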
Shared-prefix workloads
RAG and agent traffic reuse a long system prompt across requests, so a prefix cache should help. We ran this two ways, because the shape of the reuse matters.
One shared prefix
First the simple case: one 2,048-token prefix sent to a fraction of requests, with that fraction varied from 0 to 90 percent, caching on and off. Both caches work. With caching off, throughput is flat. With caching on, it climbs about 2.7x as the hit rate reaches 90 percent.
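A request stream like this is easy to build. The sketch below is an assumption about the shape, not the harness code: prompts are plain strings, `prefix` stands in for the 2,048-token prefix, and the unique requests get a short placeholder context rather than a length-matched one.

```python
import random


def build_shared_prefix_requests(n_requests, share_fraction, prefix, seed=0):
    """Return prompts where a fixed fraction start with the shared prefix."""
    rng = random.Random(seed)
    n_shared = round(n_requests * share_fraction)
    flags = [True] * n_shared + [False] * (n_requests - n_shared)
    rng.shuffle(flags)  # interleave shared and unique requests
    return [
        (prefix if shared else f"unique context {i}. ") + f"question {i}"
        for i, shared in enumerate(flags)
    ]


# Hypothetical placeholder for the long shared system prompt.
requests = build_shared_prefix_requests(100, 0.9, "SYSTEM PROMPT. ")
print(sum(r.startswith("SYSTEM PROMPT. ") for r in requests))  # 90
```

Fixing the seed keeps the stream identical across engines and across the caching-on and caching-off runs.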
vLLM and SGLang were within a few percent, vLLM slightly ahead. We do not read a winner into that gap. A single shared prefix is the easy case that any block-level cache handles well. It is not the workload SGLang's RadixAttention is built for.
Many distinct prefixes, the RadixAttention case
RadixAttention is designed for many distinct prefixes held in a tree at once, the shape you get from branching conversations and a pool of system prompts. So we ran that: 256 requests spread across a growing number of distinct 2,048-token prefixes, 8 then 32 then 128 of them, both engines with caching on.
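It helps to see how much reuse each prefix count allows. This sketch assumes the 256 requests are spread evenly (round-robin), the cache starts cold, and nothing is evicted; the text does not confirm any of the three.

```python
def prefix_reuse(n_requests, n_prefixes):
    """Requests per prefix and the best-case fraction of requests that can hit."""
    per_prefix = n_requests / n_prefixes
    # The first request for each prefix must miss on a cold cache.
    max_hit_fraction = (n_requests - n_prefixes) / n_requests
    return per_prefix, max_hit_fraction


def assign_prefixes(n_requests, n_prefixes):
    """Round-robin assignment of requests to prefix ids (assumed spread)."""
    return [i % n_prefixes for i in range(n_requests)]


for k in (8, 32, 128):
    per_prefix, hit = prefix_reuse(256, k)
    print(k, per_prefix, hit)
```

Under those assumptions each prefix is reused 32, 8, and 2 times, so at 128 prefixes at most half the requests can hit the cache at all.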
RadixAttention worked. SGLang's logs show the prefixes served straight from its tree cache. But vLLM's prefix cache stayed ahead on throughput by 6 to 8 percent at every prefix count, with lower TTFT. So on Llama-3.1-8B on one H100 we did not reproduce a RadixAttention throughput win, even on the workload it targets. Both engines cache shared prefixes well; vLLM was a few percent faster, in line with its general edge on this hardware. RadixAttention may pull ahead with longer prefixes, deeper trees, or heavier eviction pressure than we tested. On our test it did not, and we are not going to claim otherwise.
Reproduce it yourself
The harness is open source under Apache-2.0 and every raw CSV is committed. One command runs the matrix under a hard cost cap; another re-plots from committed data with no GPU. Found a result that disagrees with yours? Open an issue with your run id and row. Corrections become a documented caveat or a v2.
If you would rather not run this matrix for every model you ship, that is what RunInfra does. Point it at any Hugging Face model and it benchmarks the engines, picks the config for your latency and cost target, and deploys it serverless.
FAQ
On plain unique-prompt traffic vLLM and SGLang are within a few percent of each other, with vLLM slightly ahead on throughput, latency, and cost in our tests. Use vLLM as the default; reach for SGLang for its programmable frontend or structured-output features.
For latency-sensitive serving below its throughput ceiling, TensorRT-LLM is the engine to pick. It held the lowest time to first token across the mid range, 235 ms at concurrency 32 versus vLLM's 514 ms. We ran the PyTorch backend, which skips the engine compile.
For Llama-3.1-8B at load, the H100 is the cheaper GPU per token. It is twice the hourly price of the L40S but about four times faster, so it costs roughly half as much per token. The L40S looks cheaper and is not.
Method notes and caveats
- One H100 80GB and one L40S 48GB on Modal, single GPU, tensor-parallel size 1.
- Same Llama-3.1-8B-Instruct weights and same request stream across engines. Three timed repeats; the 95 percent confidence intervals are tight and live in the committed raw data.
- Latency here is TTFT and per-request percentiles, not goodput at a fixed SLO. Whether 514 ms first-token at concurrency 32 is acceptable depends on your use case; an SLO-goodput sweep is future work.
- Numbers are self-reported on our harness. We do not cross-calibrate against an external suite like MLPerf, so the open harness is the check: re-run it and compare.
- TensorRT-LLM ran the PyTorch backend (no engine compile), so there is no compile time to report and its peak number may differ from a built TensorRT engine.
- The L40S sweep stopped at concurrency 128 (its saturation), the H100 at 256.
- Engines move weekly. These numbers are vLLM 0.23.0, SGLang 0.5.13, TensorRT-LLM 1.2.1, as of June 20 2026.
