Most AI development begins locally. You experiment with model architectures, fine-tune them on small datasets, and iterate until the results look promising. But when it’s time to test the model in a real-world pipeline, things quickly become complicated.
You usually have two choices: upload the model to the cloud even for simple testing, or set up your own API, managing routing, authentication, and security just to run it locally.
Neither approach works well if you’re:
Working on smaller or resource-limited projects
Needing access to local files or private data
Building for edge or on-prem environments where cloud access isn’t practical
Local Runners let you serve AI models, MCP servers, or agents directly from your laptop, workstation, or internal server, securely and seamlessly via a Public API. You don’t need to upload your model or manage any infrastructure. Simply run it locally, and Clarifai takes care of the API handling, routing, and integration.
Once running, the Local Runner establishes a secure connection to Clarifai's control plane. Any API request sent to your model is routed to your machine, processed locally, and returned to the client. From the outside, it behaves like a Clarifai-hosted model, while all computation occurs on your local hardware.
With Local Runners, you can develop, test, and serve real models entirely on your own hardware while still exposing them through a secure, public Clarifai API.
Now that you understand the benefits and capabilities of Local Runners, let’s see how you can run Hugging Face models locally and expose them securely.
The Hugging Face Toolkit in Clarifai CLI enables you to download, configure, and run Hugging Face models locally while exposing them securely through a public API. You can test, integrate, and iterate on models directly from your local environment without managing any external infrastructure.
First, install the Clarifai Package. This also provides the Clarifai CLI:
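```bash
# Installs the Clarifai Python SDK along with the clarifai CLI
pip install clarifai
```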
Next, log in to Clarifai to link your local environment to your account. This allows you to manage and expose your models.
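```bash
clarifai login
```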
Follow the prompts to enter your User ID and Personal Access Token (PAT). If you need help obtaining these, refer to the documentation.
If you plan to access private Hugging Face models or repositories, generate a token from your Hugging Face account settings and set it as an environment variable:
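For example, on macOS or Linux, assuming the standard `HF_TOKEN` variable that the Hugging Face libraries read (substitute your actual token):

```bash
export HF_TOKEN="YOUR_HUGGING_FACE_TOKEN"
```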
Finally, install the Hugging Face Hub library to enable model downloads and integration:
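```bash
pip install huggingface_hub
```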
With these steps complete, your environment is ready to initialize and run Hugging Face models locally with Clarifai.
Use the Clarifai CLI to initialize and configure any supported Hugging Face model locally with the Toolkit:
By default, this command downloads and sets up the unsloth/Llama-3.2-1B-Instruct model in your current directory. If you want to use a different model, specify it with the --model-name flag and pass the full model name from Hugging Face. For example:
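The commands look roughly like this (the `--toolkit` flag shown here is an assumption; run `clarifai model init --help` to confirm the exact options in your CLI version):

```bash
# Scaffold the default model (unsloth/Llama-3.2-1B-Instruct) in the current directory
clarifai model init --toolkit huggingface

# Or scaffold a specific Hugging Face model instead
clarifai model init --toolkit huggingface --model-name "Qwen/Qwen2.5-1.5B-Instruct"
```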
Note: Some models can be very large and require significant memory or GPU resources. Make sure your machine has enough compute capacity to load and run the model locally before initializing it.
Now, once you run the above command, the CLI will scaffold the project for you. The generated directory structure will look like this:
model.py – Contains the logic for loading the model and running predictions.
config.yaml – Holds model metadata, compute resources, and checkpoint configuration.
requirements.txt – Lists the Python dependencies required for your model.
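Roughly, the generated layout looks like this (the top-level folder name depends on the model you initialized, and in some CLI versions model.py sits inside a numbered version subfolder such as 1/):

```
your-model/
├── model.py          # model loading and inference logic
├── config.yaml       # metadata, checkpoints, and compute settings
└── requirements.txt  # Python dependencies
```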
model.py
Once your project scaffold is ready, the next step is to configure your model’s behavior in model.py. By default, this file includes a class called MyModel that extends ModelClass from Clarifai. Inside this class, you’ll find four main methods ready for use:
load_model() – Loads checkpoints from Hugging Face, initializes the tokenizer, and sets up streaming for real-time output.
predict() – Handles single-prompt inference and returns responses. You can adjust parameters such as max_tokens, temperature, and top_p.
generate() – Streams outputs token by token, useful for live previews.
chat() – Manages multi-turn conversations and returns structured responses.
You can use these methods as-is, or customize them to fit your specific model behavior. The scaffold ensures that all core functionality is already implemented, so you can get started with minimal setup.
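As a rough sketch of that shape, assuming a Transformers-based text model: this is not the exact generated code, and the ModelClass import path and the @ModelClass.method decorator are assumptions based on the scaffold, so keep whatever the CLI produced and only edit the method bodies.

```python
# Simplified sketch of a scaffolded model.py -- illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Import path assumed from the generated scaffold; keep the one the CLI wrote.
from clarifai.runners.models.model_class import ModelClass


class MyModel(ModelClass):
    def load_model(self):
        """Download checkpoints and initialize the tokenizer/model once at startup."""
        repo_id = "unsloth/Llama-3.2-1B-Instruct"
        self.tokenizer = AutoTokenizer.from_pretrained(repo_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            repo_id, torch_dtype=torch.float16, device_map="auto"
        )

    @ModelClass.method  # decorator name assumed from the scaffold
    def predict(self, prompt: str, max_tokens: int = 256,
                temperature: float = 0.7, top_p: float = 0.9) -> str:
        """Single-prompt inference: encode, generate, decode."""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output = self.model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
        )
        # Strip the prompt tokens and return only the newly generated text.
        return self.tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
```

The scaffolded generate() and chat() methods follow the same pattern, with token-by-token streaming and chat-template handling layered on top.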
config.yaml
The config.yaml file defines model metadata and compute requirements. For Local Runners, most defaults work, but it’s important to understand each section:
checkpoints – Specifies the Hugging Face repository and token for private models.
inference_compute_info – Defines compute requirements. For Local Runners, you can typically use defaults. When deploying on dedicated infrastructure, you can customize accelerators, memory, and CPU based on the model requirements.
model – Contains metadata such as app_id, model_id, model_type_id, and user_id. Replace YOUR_USER_ID with your own Clarifai user ID.
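As an illustration only, the file has roughly this shape; the field names and values below are approximate, and the config.yaml the CLI generated is the source of truth:

```yaml
model:
  id: "llama-3_2-1b-instruct"
  user_id: "YOUR_USER_ID"
  app_id: "local-runner-app"
  model_type_id: "text-to-text"

checkpoints:
  type: "huggingface"
  repo_id: "unsloth/Llama-3.2-1B-Instruct"
  hf_token: "YOUR_HF_TOKEN"   # only needed for private or gated repos

inference_compute_info:
  cpu_limit: "1"              # defaults are usually fine for Local Runners
  cpu_memory: "8Gi"
  num_accelerators: 0
```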
Finally, the requirements.txt file lists all Python dependencies required for your model. You can add any additional packages your model needs to run.
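For a Transformers-based text model like this one, it typically contains something along these lines (the scaffold pins exact versions):

```
torch
transformers
accelerate
```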
Once your model is configured, you can launch it locally using the Clarifai CLI:
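In recent CLI releases the command is along the lines of the following; if it differs in your version, check `clarifai model --help`:

```bash
# Starts a Local Runner serving the model in the current directory
clarifai model local-runner
```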
This command starts a Local Runner instance on your machine. The CLI automatically handles all necessary setup, so you don’t need to manually configure infrastructure.
After the Local Runner starts, you’ll receive a public Clarifai URL. This URL acts as a secure gateway to your locally running model. Any requests made to this endpoint are routed to your local environment, processed by your model, and returned through the same endpoint.
Once your Hugging Face model is running locally and exposed via the Clarifai Local Runner, you can send inference requests to it from anywhere — using either the OpenAI-compatible endpoint or the Clarifai SDK.
Use the OpenAI client to send a request to your locally running Hugging Face model:
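A minimal sketch, assuming Clarifai's OpenAI-compatible endpoint at https://api.clarifai.com/v2/ext/openai/v1, your Personal Access Token as the API key, and the public model URL printed when your runner starts:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # Clarifai's OpenAI-compatible endpoint
    api_key="YOUR_CLARIFAI_PAT",
)

response = client.chat.completions.create(
    # Use the public model URL shown when the Local Runner starts.
    model="https://clarifai.com/YOUR_USER_ID/YOUR_APP_ID/models/YOUR_MODEL_ID",
    messages=[{"role": "user", "content": "What is the future of AI?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```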
You can also interact directly through the Clarifai SDK, which provides a lightweight interface for inference:
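A minimal sketch with the Clarifai Python SDK; the predict call maps to the predict() method defined in model.py, and its exact keyword arguments here are an assumption:

```python
from clarifai.client import Model

model = Model(
    url="https://clarifai.com/YOUR_USER_ID/YOUR_APP_ID/models/YOUR_MODEL_ID",
    pat="YOUR_CLARIFAI_PAT",
)

# Calls the predict() method defined in model.py.
result = model.predict(prompt="Explain Local Runners in one sentence.")
print(result)
```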
You can also experiment with generate() for real-time streaming output and chat() for multi-turn interactions.
With this setup, your Hugging Face model runs entirely on your local hardware — yet remains accessible via Clarifai’s secure public API.
Local Runners give you full control over where your models run — without sacrificing integration, security, or flexibility.
You can prototype, test, and serve real workloads on your own hardware while still using Clarifai’s platform to route traffic, handle authentication, and scale when needed.
You can try Local Runners for free with the Free Tier, or upgrade to the Developer Plan at $1/month for the first year to connect up to 5 Local Runners with unlimited hours. Read more in the documentation to get started.