October 24, 2025

Run vLLM Models Locally with a Secure Public API



Introduction

vLLM is a high-throughput, open-source inference and serving engine for large language models (LLMs). It provides fast, memory-efficient inference through GPU optimizations such as PagedAttention and continuous batching, making it well suited to serving LLMs on GPUs at scale.

In this tutorial, we will show how to run LLMs with vLLM entirely on your local machine and expose them through a secure public API. This approach gives you GPU-accelerated inference at local speed and full control over your environment, without moving the model or its execution to the cloud.

Clarifai Local Runners make this process simple. You can serve AI models or agents directly from your laptop, workstation, or internal server through a secure public API. You do not need to upload your model or manage infrastructure. The Local Runner routes API requests to your machine, executes them locally, and returns the results to the client, while all computation stays on your hardware.

Let's see how to set that up.

Running Models via vLLM Locally

The vLLM Toolkit in the Clarifai CLI lets you initialize, configure, and run models via vLLM locally while exposing them through a secure public API. You can test, integrate, and iterate directly from your machine without standing up any infrastructure.

Step 1: Prerequisites

Install the Clarifai CLI
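
The CLI ships with the Clarifai Python package. A minimal setup sketch is shown below; the clarifai login step assumes you have already created a Personal Access Token (PAT) in your Clarifai account settings.

```bash
# Install the Clarifai Python package, which includes the CLI
pip install --upgrade clarifai

# Authenticate the CLI with your Clarifai Personal Access Token (PAT)
clarifai login
```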

vLLM supports models from the Hugging Face Hub. If you’re using private repositories, you’ll need a Hugging Face access token.
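
One common way to make the token available to vLLM and the Hugging Face libraries is through an environment variable; the token value below is a placeholder.

```bash
# Expose a Hugging Face access token for gated or private repositories
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```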

Step 2: Initialize a Model

Use the Clarifai CLI to scaffold a vLLM-based model directory. This will prepare all required files for local execution and integration with Clarifai.
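
A sketch of the init command, assuming the vLLM toolkit flag; run it from the folder where you want the files generated.

```bash
# Scaffold a vLLM-backed model directory in the current folder
clarifai model init --toolkit vllm
```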

If you want to work with a specific model, use the --model-name flag:
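
For example, to scaffold the directory around a specific Hugging Face checkpoint (the model ID below is only an illustration, swap in the one you want):

```bash
# Scaffold the directory pre-configured for a specific Hugging Face model
clarifai model init --toolkit vllm --model-name Qwen/Qwen2.5-1.5B-Instruct
```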

Note: Some models are large and require significant memory. Ensure your machine meets the model’s requirements.

After initialization, the generated folder structure looks like this:

  • model.py – Contains logic that runs the vLLM server locally and handles inference.

  • config.yaml – Defines metadata, runtime, checkpoints, and compute settings.

  • requirements.txt – Lists Python dependencies.

Step 3: Customize model.py

The scaffold includes a VLLMModel class extending OpenAIModelClass. It defines how your Local Runner interacts with vLLM’s OpenAI-compatible server.

Key methods:

  • load_model() – Launches vLLM’s local runtime, loads checkpoints, and connects to the OpenAI-compatible API endpoint.

  • predict() – Handles single-prompt inference with optional parameters like max_tokens, temperature, and top_p. Returns the complete response.

  • generate() – Streams generated tokens in real time for interactive outputs.

You can use these implementations as-is or customize them to fit your preferred request/response structures. 
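
To make the structure concrete, here is a heavily simplified skeleton of the generated model.py. The class and method names come from the scaffold described above; the import path and the details of how the vLLM server is launched vary between SDK versions, so treat this as an illustration rather than a drop-in replacement for the generated file.

```python
# Simplified sketch of the scaffolded model.py; the generated file also handles
# server startup, health checks, and a richer set of inference parameters.
from clarifai.runners.models.openai_class import OpenAIModelClass  # import path may differ by SDK version


class VLLMModel(OpenAIModelClass):
    def load_model(self):
        # Launch vLLM's OpenAI-compatible server for the checkpoint named in
        # config.yaml and keep a client pointed at its local endpoint.
        ...

    def predict(self, prompt: str, max_tokens: int = 256,
                temperature: float = 0.7, top_p: float = 0.95) -> str:
        # Single-prompt inference: forward the request to the local vLLM
        # server and return the complete response text.
        ...

    def generate(self, prompt: str, max_tokens: int = 256):
        # Streaming inference: yield tokens as vLLM produces them.
        ...
```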

Step 4: Configure config.yaml

The config.yaml file defines the model identity, runtime, checkpoints, and compute metadata:
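
An illustrative configuration is shown below. It loosely follows the fields the scaffold generates; the IDs and the checkpoint repo are placeholders, and the authoritative key names are the ones in your generated config.yaml.

```yaml
model:
  id: my-vllm-model
  user_id: your_user_id
  app_id: your_app_id
  model_type_id: text-to-text

checkpoints:
  type: huggingface
  repo_id: Qwen/Qwen2.5-1.5B-Instruct   # placeholder checkpoint
  hf_token: hf_xxxxxxxxxxxxxxxx         # only needed for gated or private repos

# Optional for Local Runners; used when deploying on Clarifai dedicated compute
inference_compute_info:
  cpu_limit: "4"
  cpu_memory: 16Gi
  num_accelerators: 1
  accelerator_type: ["NVIDIA-*"]
  accelerator_memory: 24Gi
```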

Note: For local execution, inference_compute_info is optional — the model runs entirely on your machine using local CPU/GPU resources. If deploying on Clarifai’s dedicated compute, you can specify accelerators and resource limits.

Step 5: Start the Local Runner

Start a Local Runner that connects to the vLLM runtime:
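
From inside the scaffolded model directory, starting the runner is a single command; the subcommand name below reflects the current CLI and may differ in older versions.

```bash
# Run from the scaffolded model directory
clarifai model local-runner
```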

If any configuration is missing, the CLI will prompt you to define it. After startup, you will receive a public Clarifai URL for your model. Requests sent to this endpoint route securely to your machine, run through vLLM, and return to the client.

Step 6: Run Inference with Local Runner

Once your model is running locally and exposed via the Clarifai Local Runner, you can send inference requests using the OpenAI-compatible API or the Clarifai SDK.

OpenAI-Compatible API
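
Any OpenAI-compatible client can call the public endpoint. The sketch below assumes Clarifai's OpenAI-compatible base URL and uses your PAT as the API key; the model string is a placeholder for the public URL the CLI printed for your runner.

```python
import os

from openai import OpenAI

# Clarifai exposes an OpenAI-compatible endpoint; the PAT acts as the API key.
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key=os.environ["CLARIFAI_PAT"],
)

response = client.chat.completions.create(
    # Placeholder model URL; use the one printed when your Local Runner started
    model="https://clarifai.com/your_user_id/your_app_id/models/my-vllm-model",
    messages=[{"role": "user", "content": "What is vLLM and why is it fast?"}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```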

Clarifai Python SDK
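
With the Clarifai SDK, you address the model by its URL and call the same methods defined in model.py. The URL and prompt below are placeholders, and the exact keyword arguments depend on the predict() signature in your model.py.

```python
import os

from clarifai.client import Model

# Point the client at the public URL printed when the Local Runner started.
model = Model(
    url="https://clarifai.com/your_user_id/your_app_id/models/my-vllm-model",
    pat=os.environ["CLARIFAI_PAT"],
)

# Calls the predict() method defined in model.py
response = model.predict(
    prompt="Summarize what vLLM does in two sentences.",
    max_tokens=128,
)
print(response)
```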

You can also experiment with the generate() method for real-time token streaming.
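
A streaming sketch under the same assumptions as the SDK example above; the shape of each chunk depends on how generate() is implemented in your model.py.

```python
# Stream partial outputs as the generate() method yields them
for chunk in model.generate(prompt="Write a haiku about local inference."):
    print(chunk, end="", flush=True)
```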

Conclusion

Local Runners give you full control over where your models execute without sacrificing integration, security, or flexibility. You can prototype, test, and serve real workloads on your own hardware, while Clarifai handles routing, authentication, and the public endpoint.

You can try Local Runners for free with the Free Tier, or upgrade to the Developer Plan at $1 per month for the first year to connect up to 5 Local Runners with unlimited hours.