Large Language Models (LLMs) are at the forefront of AI innovation, offering remarkable capabilities in natural language processing tasks. However, their impressive performance comes with a significant trade-off: inference efficiency, which impacts both cost and time for model owners and users. To address these challenges, extensive research has focused on optimizing caching techniques, memory allocation, GPU kernel performance, and more. Among open-source solutions, frameworks like vLLM, LMDeploy, and SGLang stand out, delivering exceptional performance compared to others. In this blog, we will explore the foundations of these frameworks, provide sample code, and compare their performance.
The attention algorithm lies at the heart of the remarkable capabilities of LLMs, revolutionizing natural language processing by addressing the limitations of earlier sequential techniques like RNNs and LSTMs. These older methods struggled with handling long contexts, were slow to train, and lacked scalability. Attention effectively overcomes these challenges.
However, as one book puts it, "Life is essentially an endless series of problems. The solution to one problem is merely the creation of another." While attention offers significant advantages, it also introduces new considerations, such as increased computational demands. The algorithm requires extensive matrix calculations and caching of processed tensors for the decoding step, which can lead to slower inference times.
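To make that cost concrete, here is a minimal sketch of scaled dot-product attention in PyTorch (shapes and sizes are illustrative, not tied to any particular model): the score matrix grows with the square of the sequence length, and during decoding the keys and values of every previous token must be kept around or recomputed.

```python
# Minimal sketch of scaled dot-product attention; shapes are illustrative.
import math
import torch

def attention(q, k, v):
    # q, k, v: [batch, seq_len, head_dim]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # [batch, seq_len, seq_len]
    weights = torch.softmax(scores, dim=-1)
    return weights @ v  # [batch, seq_len, head_dim]

q, k, v = (torch.randn(1, 1024, 64) for _ in range(3))
out = attention(q, k, v)  # the seq_len x seq_len score matrix is the quadratic part
```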
Common approaches to improve LLM efficiency include running models with lower precision formats, such as FP16 or even more compact formats like INT8 or 4-bit quantization, instead of the standard FP32, and utilizing more powerful hardware. However, these methods do not fundamentally address the inherent inefficiencies of the algorithm itself.
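As an illustration, here is a minimal sketch of loading a model in FP16 with Hugging Face Transformers. The model ID is just an example, and `device_map="auto"` assumes the `accelerate` package is installed; halving the precision reduces memory and speeds up inference, but it does not change the attention algorithm itself.

```python
# Sketch: load a causal LM in half precision instead of the default FP32.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # FP16 weights: ~half the memory of FP32
    device_map="auto",          # requires the accelerate package
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```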
A more effective alternative focuses on optimizing one of the core bottlenecks: the KV cache in LLMs. Key strategies include:
Smarter Cache Management: Efficiently manage caching across batched requests to minimize memory waste.
Optimized Memory Allocation: Structure memory usage to store more data within limited memory capacity.
Enhanced Processing Efficiency: If memory is not the constraint, leverage system resources to accelerate processing.
Optimized Kernel Implementations: Replace naive Torch implementations with robust, inference-optimized kernels.
And there’s much more to explore in this domain.
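As a rough illustration of the caching idea these strategies build on, here is a minimal, framework-agnostic sketch of a naive KV cache during greedy decoding. Names and shapes are invented for illustration; real engines such as vLLM replace the contiguous concatenation below with paged, block-based storage to avoid memory waste and fragmentation.

```python
# Sketch: keys/values of past tokens are stored once and reused at each decode
# step, so each new token attends over cached tensors instead of recomputing them.
import torch

class NaiveKVCache:
    def __init__(self):
        self.keys, self.values = None, None

    def update(self, k, v):
        # k, v: [batch, new_tokens, head_dim]
        if self.keys is None:
            self.keys, self.values = k, v
        else:
            self.keys = torch.cat([self.keys, k], dim=1)
            self.values = torch.cat([self.values, v], dim=1)
        return self.keys, self.values

cache = NaiveKVCache()
for _ in range(4):                    # pretend decode loop
    k_new = torch.randn(1, 1, 64)     # key for the newly generated token
    v_new = torch.randn(1, 1, 64)     # value for the newly generated token
    k_all, v_all = cache.update(k_new, v_new)
    # attention for the new token now runs only against k_all / v_all
```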
A key pioneer in addressing LLM inefficiency is vLLM, followed by LMDeploy and SGLang. While these frameworks share common foundational ideas to tackle inefficiencies in LLMs, each employs distinct, customized methods to achieve its goals.
vLLM optimizes LLMs by enhancing memory efficiency and enabling parallel computation. It reduces the overhead associated with large-scale model inference, allowing for faster processing and better resource utilization without compromising accuracy.
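For context, a minimal offline-inference sketch with vLLM's Python API might look like the following; the model name and sampling settings are just examples.

```python
# Sketch: offline batch inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain KV caching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```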
LMDeploy focuses on simplifying the deployment process of LLMs at scale. It integrates model parallelism and fine-tuning techniques, improving the speed and scalability of deploying models for real-world applications, particularly in distributed settings.
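A comparable sketch with LMDeploy's pipeline API, using the TurboMind engine, might look like this; the model name and engine options are illustrative.

```python
# Sketch: local inference with LMDeploy's pipeline API (TurboMind backend).
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "Qwen/Qwen2-7B-Instruct",
    backend_config=TurbomindEngineConfig(tp=1),  # tensor parallelism degree
)
responses = pipe(["Explain continuous batching in one sentence."])
print(responses[0].text)
```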
SGLang leverages structured programming techniques to optimize LLMs by focusing on efficient resource management and computation. It introduces specialized language abstractions and tools for fine-grained control over model execution, leading to enhanced performance in specific tasks or environments.
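And here is a minimal sketch of SGLang's frontend language, assuming an SGLang server is already running locally; the endpoint URL and prompt are assumptions for illustration.

```python
# Sketch: SGLang's structured frontend talking to a local SGLang server.
import sglang as sgl

@sgl.function
def qa(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=64)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="What does RadixAttention reuse?")
print(state["answer"])
```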
The table below provides an overview of vLLM, LMDeploy and SGLang, including their specifications, supported architectures and GPU compatibility.
| Framework | Specs | Supported architectures | Supported GPUs |
|---|---|---|---|
| LMDeploy | Delivers up to 1.8x higher request throughput than vLLM by introducing key features like persistent batching (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels. Ships two inference engines: PyTorch and TurboMind. | | Nvidia |
| vLLM | A fast and easy-to-use library for LLM inference and serving. | | |
| SGLang | Builds upon open-source LLM engines like LightLLM, vLLM, and Guidance, incorporating high-performance CUDA kernels from FlashInfer and torch.compile from gpt-fast. Introduces innovations like RadixAttention for KV cache reuse and a compressed state machine for fast constrained decoding. Its Python-based batch scheduler is highly efficient, often matching or outperforming C++-based systems. | Almost all transformer-based models | Nvidia, AMD (recently supported) |
The benchmarks were run on the following hardware:

| CPU | RAM (GB) | GPU | VRAM (GB) |
|---|---|---|---|
| AMD EPYC 7J13 64-Core Processor | 216 | A100-SXM4 | 40 |
We utilized standard metrics to benchmark these frameworks, including:

Generated Output Tokens per Second: assesses the overall speed of token generation by the model with the framework, both in total and per request (higher is better).

The benchmarking was conducted using the open-source test framework llmperf, with a custom fork, llmperf multimodal, to enable testing of multimodal models.

Models were served via Docker Compose services, utilizing the latest Docker images provided by the framework authors.

Test config:

Input tokens (i): 2048

Output tokens (o): 2048

Number of concurrent requests: 1 and 100
Models: To ensure that the test candidate models were not overly optimized for a specific framework, we evaluated them using a variety of architectures:
Qwen/Qwen2-7B-Instruct
meta-llama/Llama-3.1-8B-Instruct
mistralai/Mistral-7B-Instruct-v0.3
These are all mid-size models (or small, depending on your definition).
We also used TGI (Text Generation Inference) as a baseline for the test.
Single request (c1)
With one request at a time, SGLang performs best in terms of TTFS, running 22.3% faster than the slowest framework (lmdeploy-pytorch). On the other hand, lmdeploy-turbomind leads in throughput with 88.6 tok/s on average, 8.12% better than the worst performer (vllm).
100 concurrent requests (c100)
In this blog, we have benchmarked various models using different inference frameworks. SGLang demonstrates strong performance in handling single requests efficiently, excelling in TTFS and showing notable speed advantages over its slowest competitor. However, its optimization appears architecture-specific, as it struggles with the Mistral model under concurrent load. Meanwhile, lmdeploy-turbomind consistently leads in throughput across both single and concurrent request scenarios, proving to be the most robust framework overall. TGI, on the other hand, faces stability issues with Out-Of-Memory (OOM) errors for certain architectures, indicating potential limitations in resource management for high-demand scenarios.
Clarifai makes it simple to deploy any model, whether as a serverless function or a dedicated instance, using an intuitive command-line interface (CLI). Whether you're working on a small project or scaling up for enterprise needs, Clarifai streamlines the process so you can focus on what matters most—building and innovating.
If you're looking to deploy an LLM, you can leverage our examples repository to get started quickly. For instance, to deploy an LLM using LMDeploy, clone the examples repository and navigate to the corresponding folder, which contains a ready-to-use example.
Install the Clarifai SDK with `pip install clarifai`, or skip this step if it's already installed.
Update config.yaml with your model details, compute settings, and checkpoints.
For detailed information, check out the documentation here.
Ready to Take Control of Your AI Infrastructure?
Clarifai’s Compute Orchestration gives you the tools to deploy, manage, and scale models across any compute environment, whether it’s serverless, dedicated, on-premises, or multi-cloud. With full control over performance, cost, and security, you can focus on building AI solutions while we handle the infrastructure complexity.
Sign up for the public preview to see how we can help transform the way you deploy, manage, and scale your AI models.