This blog post focuses on new features and improvements. For a comprehensive list, including bug fixes, please see the release notes.
Artificial Analysis has benchmarked Clarifai’s Compute Orchestration with the GPT-OSS-120B model—one of the most advanced open-source large language models available today. The results underscore Clarifai’s position as one of the top hardware- and GPU-agnostic inference engines for AI workloads where speed, flexibility, efficiency, and reliability matter most.
What the benchmark shows (P50, last 72h; single query, 1k-token prompt):
High throughput: 313 output tokens per second—among the very fastest measured in this configuration.
Low latency: 0.27s time-to-first-token (TTFT), so responses begin streaming almost instantly.
Compelling price/performance: Placed in the benchmark’s “most attractive quadrant” (high speed + low price).
Clarifai offers GPT-OSS-120B at $0.09 per 1M input tokens and $0.36 per 1M output tokens. Artificial Analysis displays a blended price (3:1 input:output) of just $0.16 per 1M tokens, placing Clarifai significantly below the $0.26–$0.28 cluster of competitors while matching or exceeding their performance.
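The blended figure follows directly from those rates: (3 × $0.09 + 1 × $0.36) / 4 = $0.1575 per 1M tokens, which rounds to the displayed $0.16.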
Below is a comparison of output speed versus price across major providers for GPT-OSS-120B. Clarifai stands out in the “most attractive quadrant,” combining high throughput with competitive pricing.
Output Speed vs. Price
This chart compares latency (time to first token) against output speed. Clarifai demonstrates one of the lowest latencies while maintaining top-tier throughput—placing it among the best-in-class providers.
Latency vs. Output Speed
As one of the leading open-source “GPT-OSS” models, GPT-OSS-120B represents the growing demand for transparent, community-driven alternatives to closed-source LLMs. Running a model of this scale requires infrastructure that can not only deliver high speed and low latency, but also keep costs under control at production scale. That’s exactly where Clarifai’s Compute Orchestration makes a difference.
These results are more than numbers—they show how Clarifai has engineered every layer of the stack to optimize GPU utilization. With Compute Orchestration, multiple models can run on the same GPUs, workloads scale elastically, and enterprises can squeeze more value out of every accelerator. The payoff is fast, reliable, and cost-efficient inference that supports both experimentation and large-scale deployment.
Check the full benchmarks on Artificial Analysis here.
Here’s a quick demo of how to access the GPT-OSS-120B model in the Playground.
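If you’d rather call the model programmatically, here’s a minimal sketch using the Clarifai Python SDK. The model URL is a placeholder (copy the real one from the Playground), and `CLARIFAI_PAT` is assumed to hold your personal access token.

```python
import os
from clarifai.client.model import Model

# Placeholder URL -- copy the actual GPT-OSS-120B model URL from the Playground
model = Model(
    url="https://clarifai.com/<user-id>/<app-id>/models/gpt-oss-120b",
    pat=os.environ["CLARIFAI_PAT"],  # personal access token
)

# Send a text prompt and print the streamed-back completion
response = model.predict_by_bytes(
    b"Summarize the benefits of GPU-agnostic inference in two sentences.",
    input_type="text",
)
print(response.outputs[0].data.text.raw)
```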
Local Runners let you develop and run models on your own hardware—laptops, workstations, edge boxes—while making them callable through Clarifai’s cloud API. Clarifai handles the public URL, routing, and authentication; your model executes locally and your data stays on your machine. It behaves like any other Clarifai‑hosted model.
Why teams use Local Runners
Build where your data and tools live. Keep models close to local files, internal databases, and OS‑level utilities.
No custom networking. Start a runner and get a public URL—no port‑forwarding or reverse proxies.
Use your own compute. Bring your GPUs and custom setups; the platform still provides the API, workflows, and governance around them.
We’ve added an Ollama Toolkit to the Clarifai CLI so you can initialize an Ollama‑backed model directory in one command (and choose any model from the Ollama library). It pairs perfectly with Local Runners—download, run, and expose an Ollama model via a public API with a minimal setup.
The CLI supports `--toolkit ollama` plus flags like `--model-name`, `--port`, and `--context-length`, making it trivial to target specific Ollama models.
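For example, scaffolding a project might look like this (a sketch; the port and context-length values are illustrative defaults, not requirements):

```bash
# Scaffold an Ollama-backed model directory; flag values here are illustrative
clarifai model init --toolkit ollama \
  --model-name gemma3:270m \
  --port 11434 \
  --context-length 8192
```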
Pick a model in Ollama.
Gemma 3 270M (tiny, fast; 32K context): `gemma3:270m`.
GPT-OSS 20B (OpenAI open-weight, optimized for local use): `gpt-oss:20b`.
Initialize the project with the Ollama Toolkit.
Use the command above, swapping `--model-name` for your pick (e.g., `gpt-oss:20b`). This will create a new model directory structure that is compatible with the Clarifai platform, sketched below. You can customize or optimize the generated model by modifying the `1/model.py` file as needed.
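The generated layout typically looks something like this (a sketch; exact files can vary by CLI version):

```
my-ollama-model/
├── 1/
│   └── model.py       # model class wrapping the Ollama server; edit to customize
├── config.yaml        # model metadata and runtime configuration
└── requirements.txt   # Python dependencies installed for the runner
```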
Start your Local Runner.
From the model directory:
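A sketch of the command, assuming the CLI’s local-runner subcommand:

```bash
# Registers the model with Clarifai and exposes it via a public URL
clarifai model local-runner
```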
The runner registers with Clarifai and exposes your local model via a public URL; the CLI prints a ready‑to‑run client snippet.
Call it like any Clarifai model.
For example (Python SDK):
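A minimal sketch; the model URL below is a placeholder for the one the CLI prints when the runner starts:

```python
import os
from clarifai.client.model import Model

# URL printed by the Local Runner on startup (placeholder shown here)
model = Model(
    url="https://clarifai.com/<user-id>/<app-id>/models/<model-id>",
    pat=os.environ["CLARIFAI_PAT"],  # personal access token
)

# The request is routed through Clarifai's API to the model running on your machine
response = model.predict_by_bytes(b"Hello from my Local Runner!", input_type="text")
print(response.outputs[0].data.text.raw)
```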
Behind the scenes, the API call is routed to your machine; results return to the caller over Clarifai’s secure control plane.
Deep dive: We published a step‑by‑step guide that walks through running Ollama models locally and exposing them with Local Runners. Check it out here.
You can start for free, or use the Developer Plan—$1/month for the first year—which includes up to 5 Local Runners and unlimited runner hours.
Check out the full example and setup guide in the documentation here.
We’ve made billing more transparent and flexible with this release. Monthly spending limits have been introduced: $100 for Developer and Essential plans, and $500 for the Professional plan. If you need higher limits, you can reach out to our team.
We’ve also added a new credit card pre-authorization process. A temporary charge is applied to verify card validity and available funds — $50 for Developer, $100 for Essential, and $500 for Professional plans. The amount is automatically refunded within seven days, ensuring a seamless verification experience.
With Local Runners, you can now serve models, MCP servers, or agents directly from your own hardware without uploading model weights or managing infrastructure. It’s the fastest way to test, iterate, and securely run models from your laptop, workstation, or on-prem server. You can read the documentation to get started, or check out the blog to see how to run Ollama models locally and expose them via a public API.