This blog post focuses on new features and improvements. For a comprehensive list, including bug fixes, please see the release notes.
OpenAI has released gpt-oss-120b and gpt-oss-20b, a new generation of open-weight reasoning models under the Apache 2.0 license. Built for robust instruction following, powerful tool use, and advanced reasoning, these models are designed for next-generation agentic workflows.
With a Mixture of Experts (MoE) design, extended context length of 131K tokens, and quantization that allows the 120b model to run on a single 80 GB GPU, GPT-OSS combines massive scale with practical deployment. Developers can adjust reasoning levels from low to high to optimize speed, cost, or accuracy, and use built-in browsing, code execution, and custom tools for complex workflows.
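To give a feel for the reasoning levels, here is a minimal sketch of selecting one on a locally served copy of the model. It assumes gpt-oss-120b is being served through vLLM's OpenAI-compatible endpoint (for example via `vllm serve openai/gpt-oss-120b`); the gpt-oss models read their reasoning level from the system prompt.

```python
# Minimal sketch: picking a reasoning level on gpt-oss served by vLLM.
# Assumes `vllm serve openai/gpt-oss-120b` is running locally, exposing
# an OpenAI-compatible API on the default port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# gpt-oss reads its reasoning level (low / medium / high) from the system prompt.
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Plan a three-step approach to debugging a flaky test."},
    ],
)
print(response.choices[0].message.content)
```

Dropping the level to `low` trades depth of reasoning for latency and cost, which is handy for simple extraction or classification prompts.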
Our research team benchmarked gpt-oss-120b across NVIDIA B200 and H100 GPUs using vLLM, SGLang, and TensorRT-LLM. Tests covered single-request scenarios and high-concurrency workloads with 50–100 requests. Key findings include:
Single request speed: B200 with TensorRT-LLM delivers a 0.023s time-to-first-token (TTFT), outperforming dual-H100 setups in several cases.
High concurrency: B200 sustains 7,236 tokens/sec at maximum load with lower per-token latency than the H100 configurations.
Efficiency: One B200 can replace two H100s for equal or better performance, with lower power use and less complexity.
Performance gains: Some workloads see up to 15x faster inference compared to a single H100.
For detailed benchmarks on throughput, latency, time to first token, and other metrics, read our full blog on NVIDIA B200 vs H100.
If you are looking to deploy GPT-OSS models on H100s, you can do it today on Clarifai across multiple clouds. Support for B200s is coming soon, giving you access to the latest NVIDIA GPUs for testing and production.
Last month we launched Local Runners, and the response from developers has been incredible. Developers from AI hobbyists to production teams have been eager to run open-source models locally on their own hardware while still taking advantage of the Clarifai platform. With Local Runners, you can run and test models on your own machines, then access them through a public API for integration into any application.
Now, with the arrival of the latest GPT-OSS models including gpt-oss-20b, you can run these advanced reasoning models locally with full control of your compute and the ability to deploy agentic workflows instantly.
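Once a Local Runner is serving gpt-oss-20b on your machine, the model can be called through its public Clarifai endpoint like any other hosted model. Here is a minimal sketch using the Clarifai Python SDK's predict-by-bytes interface; the model URL and PAT below are placeholders for your own.

```python
# Minimal sketch: calling a model served by a Local Runner through its
# public Clarifai API endpoint. The model URL and PAT are placeholders.
from clarifai.client.model import Model

model = Model(
    url="https://clarifai.com/your-user-id/your-app/models/gpt-oss-20b",
    pat="YOUR_PAT",
)

prompt = "Summarize the tradeoffs between low and high reasoning levels."
response = model.predict_by_bytes(prompt.encode(), input_type="text")
print(response.outputs[0].data.text.raw)
```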
To make it even easier, we are introducing the Developer Plan at a promotional price of just $1/month. It includes everything in the Community Plan, plus:
Connect up to 5 Local Runners
Unlimited runner hours
Check out the Developer Plan and start running your own models locally today. If you are ready to run GPT-OSS-20b on your hardware, follow our step-by-step tutorial here.
We have expanded our model library with new open-weight and specialized models that are ready to use in your workflows.
The latest additions include:
GPT-OSS-120b – an open-weight language model designed for strong reasoning, advanced tool use, and efficient on-device deployment. It supports extended context lengths and variable reasoning levels, making it ideal for complex agentic applications.
GPT-5, GPT-5 Mini, and GPT-5 Nano – GPT-5 is the flagship model for the most demanding reasoning and generative tasks. GPT-5 Mini offers a faster, cost-effective alternative for real-time applications. GPT-5 Nano delivers ultra-low-latency inference for edge and budget-sensitive deployments.
Qwen3-Coder-30B-A3B-Instruct – a high-efficiency coding model with long-context support and strong agentic capabilities, well-suited for code generation, refactoring, and development automation.
You can start exploring these models directly in the Clarifai Playground or access them via API to integrate into your applications.
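If you want a quick feel for the speed and quality tradeoffs across the GPT-5 family, you can loop over the variants from a single script. This is a rough sketch that assumes Clarifai's OpenAI-compatible endpoint; the model URLs are placeholders, so swap in the identifiers from each model's page.

```python
# Rough sketch: timing the same prompt across GPT-5 variants on Clarifai.
# Assumes Clarifai's OpenAI-compatible endpoint; model URLs are placeholders
# taken from each model's page on clarifai.com.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key="YOUR_PAT",  # your Clarifai Personal Access Token
)

models = [
    "https://clarifai.com/openai/chat-completion/models/gpt-5",
    "https://clarifai.com/openai/chat-completion/models/gpt-5-mini",
    "https://clarifai.com/openai/chat-completion/models/gpt-5-nano",
]

prompt = "Explain context windows in two sentences."
for model in models:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    print(f"{model}\n  {elapsed:.2f}s  {response.choices[0].message.content[:80]}")
```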
Ollama makes it simple to download and run powerful open-source models directly on your machine. With Clarifai Local Runners, you can now expose those locally running models via a secure public API.
We’ve also added an Ollama toolkit to the Clarifai CLI, letting you download, run, and expose Ollama models with a single command.
Read our step-by-step guide on running Ollama models locally and making them accessible via API.
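As a rough outline of the flow (the guide covers the details): pull and serve a model with Ollama, then point any OpenAI-compatible client at it. The sketch below talks to Ollama's local API directly; once the model is exposed through a Local Runner, the same call pattern works against its public Clarifai endpoint instead.

```python
# Rough sketch: talking to a model that Ollama is serving locally.
# Assumes you have already run, e.g.:
#   ollama pull gpt-oss:20b
#   ollama serve   # exposes an OpenAI-compatible API on port 11434
from openai import OpenAI

# Ollama ignores the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
)
print(response.choices[0].message.content)
```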
You can now compare multiple models side by side in the Playground instead of testing them one at a time. Quickly spot differences in output, speed, and quality to choose the best fit for your use case.
We’ve also added enhanced inference controls, Pythonic support, and model version selectors for smoother experimentation.
Python SDK:
Improved logging, pipeline handling, authentication, Local Runner support, and code validation.
Added live logging, verbose output, and integration with GitHub repositories for flexible model initialization.
Platform:
Token-based billing improvements for Community models.
Enhanced workflow pricing visibility and settings navigation.
Clarifai Organizations:
Improved invites, token visibility, and onboarding prompts.
With Clarifai's Compute Orchestration, you can deploy GPT-OSS, Qwen3-Coder, and other open-source models, as well as your own custom models, on dedicated GPUs like NVIDIA B200s and H100s, on-prem or in the cloud. Serve models, MCP servers, or full agentic workflows directly from your hardware with full control over performance, cost, and security.