Large Language Models (LLMs) are at the forefront of AI innovation, offering remarkable capabilities in natural language processing tasks. However, their impressive performance comes with a significant trade-off: inference efficiency, which impacts both cost and time for model owners and users. To address these challenges, extensive research has focused on optimizing caching techniques, memory allocation, GPU kernel performance, and more. Among open-source solutions, frameworks like vLLM, LMDeploy, and SGLang stand out, delivering exceptional performance compared to others. In this blog, we will explore the foundations of these frameworks, provide sample code, and compare their performance.
The attention algorithm lies at the heart of the remarkable capabilities of LLMs, revolutionizing natural language processing by addressing the limitations of earlier sequential techniques like RNNs and LSTMs. These older methods struggled with handling long contexts, were slow to train, and lacked scalability. Attention effectively overcomes these challenges.
However, as one book puts it, "Life is essentially an endless series of problems. The solution to one problem is merely the creation of another." While attention offers significant advantages, it also introduces new considerations, such as increased computational demands. The algorithm requires extensive matrix calculations and caching of processed tensors for the decoding step, which can lead to slower inference times.
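To make that cost concrete, here is a minimal sketch of scaled dot-product attention in PyTorch (shapes and sizes are illustrative, not tied to any particular model): the score matrix grows with the square of the sequence length, and during decoding the keys and values of every previous token must be kept around or recomputed.

```python
# Minimal sketch of scaled dot-product attention; shapes are illustrative.
import math
import torch

def attention(q, k, v):
    # q, k, v: [batch, seq_len, head_dim]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # [batch, seq_len, seq_len]
    weights = torch.softmax(scores, dim=-1)
    return weights @ v  # [batch, seq_len, head_dim]

q, k, v = (torch.randn(1, 1024, 64) for _ in range(3))
out = attention(q, k, v)  # the seq_len x seq_len score matrix is the quadratic part
```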
Common approaches to improve LLM efficiency include running models with lower precision formats, such as FP16 or even more compact formats like INT8 or 4-bit quantization, instead of the standard FP32, and utilizing more powerful hardware. However, these methods do not fundamentally address the inherent inefficiencies of the algorithm itself.
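As an illustration, here is a minimal sketch of loading a model in FP16 with Hugging Face Transformers. The model ID is just an example, and `device_map="auto"` assumes the `accelerate` package is installed; halving the precision reduces memory and speeds up inference, but it does not change the attention algorithm itself.

```python
# Sketch: load a causal LM in half precision instead of the default FP32.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # FP16 weights: ~half the memory of FP32
    device_map="auto",          # requires the accelerate package
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```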
A more effective alternative focuses on optimizing one of the core bottlenecks: the KV cache in LLMs. Key strategies include:
Smarter Cache Management: Efficiently manage caching across batched requests to minimize memory waste.
Optimized Memory Allocation: Structure memory usage to store more data within limited memory capacity.
Enhanced Processing Efficiency: If memory is not the constraint, leverage system resources to accelerate processing.
Optimized Kernel Implementations: Replace naive Torch implementations with robust, inference-optimized kernels.
And there’s much more to explore in this domain.
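As a rough illustration of the caching idea these strategies build on, here is a minimal, framework-agnostic sketch of a naive KV cache during greedy decoding. Names and shapes are invented for illustration; real engines such as vLLM replace the contiguous concatenation below with paged, block-based storage to avoid memory waste and fragmentation.

```python
# Sketch: keys/values of past tokens are stored once and reused at each decode
# step, so each new token attends over cached tensors instead of recomputing them.
import torch

class NaiveKVCache:
    def __init__(self):
        self.keys, self.values = None, None

    def update(self, k, v):
        # k, v: [batch, new_tokens, head_dim]
        if self.keys is None:
            self.keys, self.values = k, v
        else:
            self.keys = torch.cat([self.keys, k], dim=1)
            self.values = torch.cat([self.values, v], dim=1)
        return self.keys, self.values

cache = NaiveKVCache()
for _ in range(4):                    # pretend decode loop
    k_new = torch.randn(1, 1, 64)     # key for the newly generated token
    v_new = torch.randn(1, 1, 64)     # value for the newly generated token
    k_all, v_all = cache.update(k_new, v_new)
    # attention for the new token now runs only against k_all / v_all
```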
A key pioneer in addressing LLM inefficiency is vLLM, followed by LMDeploy and SGLang. While these frameworks share common foundational ideas to tackle inefficiencies in LLMs, each employs distinct, customized methods to achieve its goals.
vLLM optimizes LLMs by enhancing memory efficiency and enabling parallel computation. It reduces the overhead associated with large-scale model inference, allowing for faster processing and better resource utilization without compromising accuracy.
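For context, a minimal offline-inference sketch with vLLM's Python API might look like the following; the model name and sampling settings are just examples.

```python
# Sketch: offline batch inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain KV caching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```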
LMDeploy focuses on simplifying the deployment process of LLMs at scale. It integrates model parallelism and fine-tuning techniques, improving the speed and scalability of deploying models for real-world applications, particularly in distributed settings.
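A comparable sketch with LMDeploy's pipeline API, using the TurboMind engine, might look like this; the model name and engine options are illustrative.

```python
# Sketch: local inference with LMDeploy's pipeline API (TurboMind backend).
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "Qwen/Qwen2-7B-Instruct",
    backend_config=TurbomindEngineConfig(tp=1),  # tensor parallelism degree
)
responses = pipe(["Explain continuous batching in one sentence."])
print(responses[0].text)
```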
SGLang leverages structured programming techniques to optimize LLMs by focusing on efficient resource management and computation. It introduces specialized language abstractions and tools for fine-grained control over model execution, leading to enhanced performance in specific tasks or environments.
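And here is a minimal sketch of SGLang's frontend language, assuming an SGLang server is already running locally; the endpoint URL and prompt are assumptions for illustration.

```python
# Sketch: SGLang's structured frontend talking to a local SGLang server.
import sglang as sgl

@sgl.function
def qa(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=64)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="What does RadixAttention reuse?")
print(state["answer"])
```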
The table below provides an overview of vLLM, LMDeploy and SGLang, including their specifications, supported architectures and GPU compatibility.
| Framework | Specs | Supported architectures | Supported GPUs |
|---|---|---|---|
| LMDeploy | Delivers up to 1.8x higher request throughput than vLLM by introducing key features like persistent batching (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels. Ships two inference engines: PyTorch and TurboMind. | | Nvidia |
| vLLM | A fast and easy-to-use library for LLM inference and serving. | | |
| SGLang | Builds upon open-source LLM engines like LightLLM, vLLM, and Guidance, incorporating high-performance CUDA kernels from FlashInfer and torch.compile from gpt-fast. Introduces innovations like RadixAttention for KV cache reuse and a compressed state machine for fast constrained decoding. Its Python-based batch scheduler is highly efficient, often matching or outperforming C++-based systems. | Almost all transformer-based models | Nvidia, AMD (recently supported) |
The benchmarks were run on the following hardware:

| CPU | RAM (GB) | GPU | VRAM (GB) |
|---|---|---|---|
| AMD EPYC 7J13 64-Core Processor | 216 | A100-SXM4 | 40 |
We utilized standard metrics to benchmark these frameworks, including:

Generated Output Tokens per Second: assesses the overall speed of token generation by the model with the framework, both in total and per request (higher is better).

The benchmarking was conducted using the open-source test framework llmperf, with a custom fork, llmperf multimodal, to enable testing of multimodal models.

Models were served via Docker Compose services, utilizing the latest Docker images provided by the framework authors.

Test config:

Input tokens (i): 2048

Output tokens (o): 2048

Number of concurrent requests: 1 and 100
Models: To ensure that the test candidate models were not overly optimized for a specific framework, we evaluated them using a variety of architectures:
Qwen/Qwen2-7B-Instruct
meta-llama/Llama-3.1-8B-Instruct
mistralai/Mistral-7B-Instruct-v0.3
These are all mid-size models (or small, depending on your definition).
We also used TGI (Text Generation Inference) as a baseline for the test.
Single request (c1)
With one request at a time, SGLang performs best in terms of TTFS, running 22.3% faster than the slowest framework (lmdeploy-pytorch). On the other hand, lmdeploy-turbomind leads in throughput with 88.6 tok/s on average, 8.12% better than the worst performer (vllm).
100 concurrent requests (c100)
In this blog, we have benchmarked various models using different inference frameworks. SGLang demonstrates strong performance in handling single requests efficiently, excelling in TTFS and showing notable speed advantages over its slowest competitor. However, its optimization appears architecture-specific, as it struggles with the Mistral model under concurrent load. Meanwhile, lmdeploy-turbomind consistently leads in throughput across both single and concurrent request scenarios, proving to be the most robust framework overall. TGI, on the other hand, faces stability issues with Out-Of-Memory (OOM) errors for certain architectures, indicating potential limitations in resource management for high-demand scenarios.
Clarifai makes it simple to deploy any model, whether as a serverless function or a dedicated instance, using an intuitive command-line interface (CLI). Whether you're working on a small project or scaling up for enterprise needs, Clarifai streamlines the process so you can focus on what matters most—building and innovating.
If you're looking to deploy an LLM, you can leverage our examples repository to get started quickly. For instance, to deploy an LLM using LMDeploy, clone the examples repository and navigate to the corresponding folder, which contains a ready-to-use example.
Install the Clarifai SDK with `pip install clarifai`, or skip this step if it's already installed.
Update config.yaml with your model details, compute settings, and checkpoints.
For detailed information, check out the documentation here.
Ready to Take Control of Your AI Infrastructure?
Clarifai’s Compute Orchestration gives you the tools to deploy, manage, and scale models across any compute environment, whether it’s serverless, dedicated, on-premises, or multi-cloud. With full control over performance, cost, and security, you can focus on building AI solutions while we handle the infrastructure complexity.
Sign up for the public preview to see how we can help transform the way you deploy, manage, and scale your AI models.