The ecosystem of LLM inference frameworks has been growing rapidly. As models become larger and more capable, the frameworks that power them are forced to keep pace, optimizing for everything from latency to throughput to memory efficiency. For developers, researchers, and enterprises alike, the choice of framework can dramatically affect both performance and cost.
In this blog, we bring those considerations together by comparing SGLang, vLLM, and TensorRT-LLM. We evaluate how each performs when serving GPT-OSS-120B on 2x NVIDIA H100 GPUs. The results highlight the unique strengths of each framework and offer practical guidance on which to choose based on your workload and hardware.
SGLang: SGLang was designed around the idea of structured generation. It brings unique abstractions like RadixAttention and specialized state management that allow it to deliver low latency for interactive applications. This makes SGLang especially appealing when the workload requires precise control over outputs, such as when generating structured data formats or working with agentic workflows.
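To make the structured-generation angle concrete, here is a minimal sketch using SGLang's Python frontend against a locally launched server. Treat it as illustrative: the model path, port, tensor-parallel flag, and regex constraint are assumptions for this example, not the exact benchmark configuration.

```python
# Launch the server separately, e.g. (flags may vary by SGLang version):
#   python -m sglang.launch_server --model-path openai/gpt-oss-120b --tp 2 --port 30000
import sglang as sgl

@sgl.function
def extract_city(s, text):
    s += sgl.user("Extract the city mentioned in: " + text)
    # Constrained decoding: the regex keeps the answer in a fixed JSON shape.
    s += sgl.assistant(sgl.gen("record", max_tokens=64, regex=r'\{"city": "[A-Za-z ]+"\}'))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = extract_city.run(text="The meetup is in Berlin next week.")
print(state["record"])  # e.g. {"city": "Berlin"}
```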
vLLM: vLLM has established itself as one of the leading open-source inference frameworks for serving large language models at scale. Its key advantage lies in throughput, powered by continuous batching and efficient memory management through PagedAttention. It also provides broad support for quantization techniques like INT8, INT4, GPTQ, AWQ, and FP8, making it a versatile choice for those who need to maximize tokens per second across many concurrent requests.
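As a point of comparison, here is a minimal offline-inference sketch with vLLM's Python API. The Hugging Face model id and the two-way tensor parallelism mirror the benchmark setup below, but exact arguments may differ in your environment.

```python
from vllm import LLM, SamplingParams

# Shard the model across the two H100s with tensor parallelism.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize the benefits of continuous batching."], params)
print(outputs[0].outputs[0].text)

# For an OpenAI-compatible server instead:
#   vllm serve openai/gpt-oss-120b --tensor-parallel-size 2
```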
TensorRT-LLM: TensorRT-LLM is NVIDIA’s TensorRT-based inference runtime, purpose-built to extract maximum performance from NVIDIA GPUs. It is deeply optimized for Hopper and Blackwell architectures, which means it takes full advantage of hardware features in the H100 and B200. The result is higher efficiency, faster response times, and better scaling as workloads increase. While it requires a bit more setup and tuning compared to other frameworks, TensorRT-LLM represents NVIDIA’s vision for production-grade inference performance.
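TensorRT-LLM also ships a high-level Python LLM API on top of its engine-building workflow. The sketch below assumes a recent release where tensorrt_llm.LLM can build or load an engine directly from a Hugging Face checkpoint; argument names and defaults change between versions, so treat it as a rough outline rather than a drop-in recipe.

```python
from tensorrt_llm import LLM, SamplingParams

# Assumption: recent TensorRT-LLM releases build (or load) an engine for the
# checkpoint on first use and shard it across the two GPUs.
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=2)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what a mixture-of-experts model is."], params)
print(outputs[0].outputs[0].text)
```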
Framework | Design Focus | Key Strengths |
---|---|---|
SGLang | Structured generation, RadixAttention | Low latency, efficient token generation |
vLLM | Continuous batching, PagedAttention | High throughput, supports quantization |
TensorRT-LLM | TensorRT optimizations | GPU-level efficiency, best low-concurrency latency, strongest on B200 |
To evaluate the three frameworks fairly, we ran GPT-OSS-120B on 2x NVIDIA H100 GPUs under a variety of conditions. The GPT-OSS-120B model is a large mixture-of-experts model that pushes the boundaries of open-weight performance. Its size and complexity make it a demanding benchmark, which is exactly why it is ideal for testing inference frameworks and hardware.
We measured three main categories of performance: time to first token, per-token latency during decoding, and overall throughput in tokens per second.
Let's start with latency. When you care about responsiveness, two things matter most: the time to first token and the per-token latency once decoding begins.
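All three frameworks can expose an OpenAI-compatible HTTP endpoint, so both metrics can be measured the same way: stream a completion and time the first chunk separately from the rest. The snippet below is a rough sketch of that idea, not the exact harness behind the numbers that follow; the endpoint URL, port, and model name are assumptions.

```python
import time
from openai import OpenAI

# Assumes one of the three servers is running with an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        n_chunks += 1

end = time.perf_counter()
print(f"TTFT: {first_token_at - start:.3f}s")
print(f"Per-chunk latency: {(end - first_token_at) / max(n_chunks - 1, 1):.4f}s")
```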
Here's how the three frameworks stacked up:
Time to First Token (seconds)
Concurrency | vLLM | SGLang | TensorRT-LLM |
---|---|---|---|
1 | 0.053 | 0.125 | 0.177 |
10 | 1.91 | 1.155 | 2.496 |
50 | 7.546 | 3.08 | 4.14 |
100 | 1.87 | 8.991 | 5.467 |
Per-Token Latency (seconds)
Concurrency | vLLM | SGLang | TensorRT-LLM |
---|---|---|---|
1 | 0.005 | 0.004 | 0.004 |
10 | 0.011 | 0.01 | 0.009 |
50 | 0.021 | 0.015 | 0.018 |
100 | 0.019 | 0.021 | 0.049 |
What this shows: at a single request, vLLM posts by far the fastest time to first token (0.053 s), while TensorRT-LLM is the slowest to start responding. At moderate concurrency (10 and 50 requests), SGLang takes the lead on time to first token. Per-token latency is nearly identical across the three frameworks at low concurrency, but at 100 concurrent requests TensorRT-LLM degrades sharply to 0.049 s per token while vLLM and SGLang stay near 0.02 s.
When it comes to serving lots of requests, throughput is the number to watch. Here's how the three frameworks performed as concurrency increased:
Overall Throughput (tokens/second)
Concurrency | vLLM | SGLang | TensorRT-LLM |
---|---|---|---|
1 | 187.15 | 230.96 | 242.79 |
10 | 863.15 | 988.18 | 867.21 |
50 | 2211.85 | 3108.75 | 2162.95 |
100 | 4741.62 | 3221.84 | 1942.64 |
One of the most important findings was that vLLM achieved the highest throughput at 100 concurrent requests, reaching 4,741 tokens per second. SGLang showed the strongest performance at moderate to high concurrency (10 and 50 requests), while TensorRT-LLM delivered the best single-request throughput but scaled poorly at extreme concurrency, with its throughput actually declining between 50 and 100 requests.
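A throughput-style measurement follows the same pattern as the latency sketch above: fire many requests at once and divide the total completion tokens by wall-clock time. Here is a simplified sketch that relies on the usage counts the servers report; the concurrency level, prompt, and endpoint are assumptions, and this is not the exact benchmark client.

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": "List five uses of inference benchmarks."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens  # tokens actually generated

async def main(concurrency: int = 50) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens) / elapsed:.1f} tokens/sec at concurrency {concurrency}")

asyncio.run(main())
```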
SGLang
Strengths: Stable per-token latency, strong throughput at moderate concurrency, good overall balance.
Weaknesses: Slower time-to-first-token for single requests, and throughput gains flatten at 100 concurrent requests, falling well behind vLLM.
Best For: Moderate to high-throughput applications, scenarios requiring consistent token generation timing.
vLLM
Strengths: Fastest time-to-first-token at single-request and 100-request concurrency, highest throughput at extreme concurrency, excellent scaling.
Weaknesses: Slightly higher per-token latency at high loads.
Best For: Interactive applications, high-concurrency deployments, scenarios prioritizing fast initial responses and maximum throughput scaling.
TensorRT-LLM
Strengths: Best single-request throughput, competitive per-token latency at low concurrency, hardware-optimized performance.
Weaknesses: Slowest time-to-first-token at low concurrency, poor scaling at high concurrency, significantly degraded per-token latency at 100 requests.
Best For: Single-user or low-concurrency applications, scenarios where hardware optimization matters more than scaling.
No single framework outperforms the others across every category. Instead, each has been optimized for different goals, and the right choice depends on your workload and infrastructure.
The key takeaway is that choosing the right framework depends on workload type and hardware availability, rather than looking for a universal winner. Running GPT-OSS-120B on NVIDIA H100 GPUs with these optimized inference frameworks unlocks powerful options for building and deploying AI applications at scale.
It's worth noting that these performance characteristics can shift dramatically depending on your GPU hardware. We also extended the benchmarks to B200 GPUs, where TensorRT-LLM consistently outperformed both SGLang and vLLM across all metrics, thanks to its deeper optimization for NVIDIA's latest hardware architecture.
This highlights how framework selection isn't just about software capabilities—it's equally about matching the right framework to your specific hardware to unlock maximum performance potential.
You can explore the full set of benchmark results here.
Bonus: Serve a Model with Your Preferred Framework
Getting started with these frameworks is simple. With Clarifai's Compute Orchestration, you can serve GPT-OSS-120B, other open-weight models, or your own custom models with your preferred inference engine, whether that is SGLang, vLLM, or TensorRT-LLM.
From setting up the runtime to deploying a production-ready API, you can quickly go from model to application. The best part is that you are not locked into a single framework. You can experiment with different runtimes, and choose the one that best aligns with your performance and cost requirements.
This flexibility makes it easy to integrate cutting-edge frameworks into your workflows and ensures you are always getting the best possible performance from your hardware. Check out the documentation to learn how to upload your own models.