August 29, 2025

Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B


Introduction

The ecosystem of LLM inference frameworks has been growing rapidly. As models become larger and more capable, the frameworks that power them are forced to keep pace, optimizing for everything from latency to throughput to memory efficiency. For developers, researchers, and enterprises alike, the choice of framework can dramatically affect both performance and cost.

In this blog, we bring those considerations together by comparing SGLang, vLLM, and TensorRT-LLM. We evaluate how each performs when serving GPT-OSS-120B on 2x NVIDIA H100 GPUs. The results highlight the unique strengths of each framework and offer practical guidance on which to choose based on your workload and hardware.

Overview of the Frameworks

SGLang: SGLang was designed around the idea of structured generation. It brings unique abstractions like RadixAttention and specialized state management that allow it to deliver low latency for interactive applications. This makes SGLang especially appealing when the workload requires precise control over outputs, such as when generating structured data formats or working with agentic workflows.
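
To make that concrete, here is a minimal sketch of requesting schema-constrained JSON from an SGLang deployment. It assumes a server is already running locally on SGLang's default port with an OpenAI-compatible endpoint, and that the build supports JSON-schema constrained decoding; the model ID, schema, and launch command in the comments are illustrative rather than our exact benchmark setup.

```python
# Hedged sketch: request schema-constrained JSON from a running SGLang server.
# Assumes the server was launched separately (for example with
# "python -m sglang.launch_server --model-path openai/gpt-oss-120b --tp 2")
# and that the build supports JSON-schema constrained decoding via response_format.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")  # default SGLang port assumed

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Return the largest city in Japan as JSON."}],
    response_format={"type": "json_schema",
                     "json_schema": {"name": "city_info", "schema": schema}},
    max_tokens=128,
)
print(resp.choices[0].message.content)  # output is constrained to match the schema
```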

vLLM: vLLM has established itself as one of the leading open-source inference frameworks for serving large language models at scale. Its key advantage lies in throughput, powered by continuous batching and efficient memory management through PagedAttention. It also provides broad support for quantization techniques like INT8, INT4, GPTQ, AWQ, and FP8, making it a versatile choice for those who need to maximize tokens per second across many concurrent requests.
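
As a quick illustration of how vLLM is typically driven in Python, the sketch below loads GPT-OSS-120B with tensor parallelism across two GPUs using vLLM's offline API. The model ID, prompt, and sampling settings are illustrative; this is not the exact configuration behind the benchmarks in this post.

```python
# Minimal vLLM sketch: offline generation with tensor parallelism across two GPUs.
# Model ID, prompt, and sampling settings are illustrative, not our benchmark config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",  # open-weight checkpoint on Hugging Face
    tensor_parallel_size=2,       # shard the model across the 2x H100 setup
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in one paragraph."], params)

for out in outputs:
    print(out.outputs[0].text)
```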

TensorRT-LLM: TensorRT-LLM is NVIDIA’s TensorRT-based inference runtime, purpose-built to extract maximum performance from NVIDIA GPUs. It is deeply optimized for Hopper and Blackwell architectures, which means it takes full advantage of hardware features in the H100 and B200. The result is higher efficiency, faster response times, and better scaling as workloads increase. While it requires a bit more setup and tuning compared to other frameworks, TensorRT-LLM represents NVIDIA’s vision for production-grade inference performance.
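
For comparison, here is a hedged sketch using TensorRT-LLM's high-level Python LLM API, which mirrors the vLLM example above. It assumes a recent TensorRT-LLM release that ships this API and that the engine build for GPT-OSS-120B succeeds on your hardware; exact arguments and defaults may vary between versions.

```python
# Hedged sketch: TensorRT-LLM's high-level LLM API (available in recent releases).
# Argument names and engine-build behavior can differ between versions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",  # checkpoint is compiled into a TensorRT engine on first load
    tensor_parallel_size=2,       # split across the two H100s
)

params = SamplingParams(temperature=0.7, max_tokens=128)
for out in llm.generate(["What does TensorRT optimize for?"], params):
    print(out.outputs[0].text)
```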

| Framework | Design Focus | Key Strengths |
| --- | --- | --- |
| SGLang | Structured generation, RadixAttention | Low latency, efficient token generation |
| vLLM | Continuous batching, PagedAttention | High throughput, supports quantization |
| TensorRT-LLM | TensorRT optimizations | GPU-level efficiency, lowest latency on H100/B200 |

Benchmark Setup and Results

To evaluate the three frameworks fairly, we ran GPT-OSS-120B on 2x NVIDIA H100 GPUs under a variety of conditions. The GPT-OSS-120B model is a large mixture-of-experts model that pushes the boundaries of open-weight performance. Its size and complexity make it a demanding benchmark, which is exactly why it is ideal for testing inference frameworks and hardware.

We measured three main categories of performance (a brief measurement sketch follows the list):

  • Latency – How fast the model generates the first token (TTFT) and how quickly it produces subsequent tokens.
  • Throughput – How many tokens per second can be generated under varying levels of concurrency.
  • Concurrency scaling – How well each framework holds up as the number of simultaneous requests increases.
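
For reference, the snippet below shows one simple way numbers like these can be collected against any of the three servers, assuming each exposes an OpenAI-compatible streaming endpoint. The URL and model name are placeholders, and per-token latency is approximated from streamed chunks, so treat it as a sketch of the methodology rather than our exact harness.

```python
# Hedged sketch of measuring TTFT and per-token latency over a streaming request.
# Assumes an OpenAI-compatible endpoint; URL and model name are placeholders, and
# "tokens" are approximated by streamed chunks rather than exact tokenizer counts.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
chunk_times = []
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize PagedAttention in three sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunk_times.append(time.perf_counter())

ttft = chunk_times[0] - start  # time to first (content-bearing) chunk
per_token = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
print(f"TTFT: {ttft:.3f}s, approx per-token latency: {per_token:.4f}s")
```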

Latency Results

Let's start with latency. When you care about responsiveness, two things matter most: the time to first token and the per-token latency once decoding begins.

Here's how the three frameworks stacked up:

Time to First Token (seconds)

| Concurrency | vLLM | SGLang | TensorRT-LLM |
| --- | --- | --- | --- |
| 1 | 0.053 | 0.125 | 0.177 |
| 10 | 1.91 | 1.155 | 2.496 |
| 50 | 7.546 | 3.08 | 4.14 |
| 100 | 1.87 | 8.991 | 5.467 |

Per-Token Latency (seconds)

| Concurrency | vLLM | SGLang | TensorRT-LLM |
| --- | --- | --- | --- |
| 1 | 0.005 | 0.004 | 0.004 |
| 10 | 0.011 | 0.01 | 0.009 |
| 50 | 0.021 | 0.015 | 0.018 |
| 100 | 0.019 | 0.021 | 0.049 |

What this shows:

  • vLLM delivered the fastest time to first token at single-request load and again at 100 concurrent requests, while SGLang was quickest at 10 and 50 concurrent requests.
  • SGLang had the most stable per-token latency, staying within roughly 4–21 ms across all load levels.
  • TensorRT-LLM had the slowest time to first token at low concurrency, but its per-token latency remained competitive up to 50 concurrent requests.

Throughput Results

When it comes to serving lots of requests, throughput is the number to watch. Here's how the three frameworks performed as concurrency increased:

Overall Throughput (tokens/second)

| Concurrency | vLLM | SGLang | TensorRT-LLM |
| --- | --- | --- | --- |
| 1 | 187.15 | 230.96 | 242.79 |
| 10 | 863.15 | 988.18 | 867.21 |
| 50 | 2211.85 | 3108.75 | 2162.95 |
| 100 | 4741.62 | 3221.84 | 1942.64 |


One of the most notable findings was that vLLM achieved the highest throughput at 100 concurrent requests, reaching 4,741 tokens per second. SGLang led at 10 and 50 concurrent requests, while TensorRT-LLM delivered the best single-request throughput but scaled the least at extreme concurrency.
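
A concurrency sweep like the one above can be approximated with a small asynchronous driver: launch N requests at once and divide the total completion tokens by wall-clock time. The sketch below assumes an OpenAI-compatible endpoint that reports token usage and uses placeholder values throughout; it illustrates the measurement approach rather than the exact benchmark code behind these tables.

```python
# Hedged sketch of a concurrency sweep: fire N requests at once and divide total
# completion tokens by wall-clock time. Assumes an OpenAI-compatible endpoint that
# reports token usage; endpoint, model name, and prompt are placeholders.
import asyncio
import time
from openai import AsyncOpenAI

async def one_request(client: AsyncOpenAI) -> int:
    resp = await client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens  # tokens generated for this request

async def sweep(client: AsyncOpenAI, concurrency: int) -> float:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
    return sum(counts) / (time.perf_counter() - start)  # overall tokens/second

async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    for level in (1, 10, 50, 100):
        print(f"concurrency={level}: {await sweep(client, level):.1f} tokens/s")

asyncio.run(main())
```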

Framework Analysis and Recommendations

SGLang

  • Strengths: Stable per-token latency, strong throughput at moderate concurrency, good overall balance.

  • Weaknesses: Slower time-to-first-token at single requests; throughput scaling flattens out at 100 concurrent requests.

  • Best For: Moderate to high-throughput applications, scenarios requiring consistent token generation timing.

vLLM

  • Strengths: Fastest time-to-first-token at single-request and 100-request loads, highest throughput at extreme concurrency, excellent scaling.

  • Weaknesses: Slightly higher per-token latency at high loads.

  • Best For: Interactive applications, high-concurrency deployments, scenarios prioritizing fast initial responses and maximum throughput scaling.

TensorRT-LLM

  • Strengths: Best single-request throughput, competitive per-token latency at low concurrency, hardware-optimized performance.

  • Weaknesses: Slowest time-to-first-token at low concurrency, limited throughput scaling at high concurrency, significantly degraded per-token latency at 100 requests.

  • Best For: Single-user or low-concurrency applications, scenarios where hardware optimization matters more than scaling.

Conclusion

There is no single framework that outperforms across all categories. Instead, each has been optimized for different goals, and the right choice depends on workload and infrastructure.

  • Use vLLM for interactive applications and high-concurrency deployments requiring fast responses and maximum throughput scaling.
  • Choose SGLang when moderate throughput and consistent performance are needed.
  • Deploy TensorRT-LLM for single-user applications or when maximizing hardware efficiency at low concurrency is the priority.

The key takeaway is that choosing the right framework depends on workload type and hardware availability, rather than looking for a universal winner. Running GPT-OSS-120B on NVIDIA H100 GPUs with these optimized inference frameworks unlocks powerful options for building and deploying AI applications at scale.

It's worth noting that these performance characteristics can shift dramatically depending on your GPU hardware. We also extended the benchmarks to B200 GPUs, where TensorRT-LLM consistently outperformed both SGLang and vLLM across all metrics, thanks to its deeper optimization for NVIDIA's latest hardware architecture.

This highlights how framework selection isn't just about software capabilities—it's equally about matching the right framework to your specific hardware to unlock maximum performance potential.

 

You can explore the full set of benchmark results here.

Bonus: Serve a Model with Your Preferred Framework

Getting started with these frameworks is simple. With Clarifai's Compute Orchestration, you can serve GPT-OSS-120B, other open-weight models, or your own custom models with your preferred inference engine, whether that is SGLang, vLLM, or TensorRT-LLM.

From setting up the runtime to deploying a production-ready API, you can quickly go from model to application. The best part is that you are not locked into a single framework. You can experiment with different runtimes, and choose the one that best aligns with your performance and cost requirements.

This flexibility makes it easy to integrate cutting-edge frameworks into your workflows and ensures you are always getting the best possible performance from your hardware. Check out the documentation to learn how to upload your own models.