October 10, 2025

Top LLM Inference Providers Compared - GPT-OSS-120B

TL;DR

In this post, we explore how leading inference providers perform on the GPT-OSS-120B model, using benchmarks from Artificial Analysis. You will learn what matters most when evaluating inference platforms, including throughput, time to first token, and cost efficiency. We compare Vertex AI, Azure, AWS, Databricks, Clarifai, Together AI, Fireworks, Nebius, CompactifAI, and Hyperbolic on performance and deployment efficiency.

Introduction

Large language models (LLMs) like GPT-OSS-120B, an open-weight 120-billion-parameter mixture-of-experts model, are designed for advanced reasoning and multi-step generation. Reasoning workloads consume tokens rapidly and place heavy demands on compute, so deploying these models in production requires inference infrastructure that delivers low latency, high throughput, and low cost.

Differences in hardware, software optimizations, and resource allocation strategies can lead to large variations in latency, efficiency, and cost. These differences directly affect real-world applications such as reasoning agents, document understanding systems, or copilots, where even small delays can impact overall responsiveness and throughput.

To evaluate these differences objectively, independent benchmarks have become essential. Instead of relying on internal performance claims, open and data-driven evaluations now offer a more transparent way to assess how different platforms perform under real workloads.

In this post, we compare leading GPU-based inference providers using the GPT-OSS-120B model as a reference benchmark. We examine how each platform performs across key inference metrics such as throughput, time to first token, and cost efficiency, and how these trade-offs impact performance and scalability for reasoning-heavy workloads.

Before diving into the results, let’s take a quick look at Artificial Analysis and how their benchmarking framework works.

Artificial Analysis Benchmarks

Artificial Analysis (AA) is an independent benchmarking initiative that runs standardized tests across inference providers to measure how models like GPT-OSS-120B perform in real conditions. Their evaluations focus on realistic workloads involving long contexts, streaming outputs, and reasoning-heavy prompts rather than short, synthetic samples.

You can explore the full GPT-OSS-120B benchmark results here.

Artificial Analysis evaluates a range of performance metrics, but here we focus on the three key factors that matter when choosing an inference platform for GPT-OSS-120B: time to first token, throughput, and cost per million tokens.

  • Time to First Token (TTFT)
    The time between sending a prompt and receiving the model’s first token. Lower TTFT means output starts streaming sooner, which is critical for interactive applications and multi-step reasoning where delays can disrupt the flow.
  • Throughput (tokens per second)
    The rate at which tokens are generated once streaming begins. Higher throughput shortens total completion time for long outputs and allows more concurrent requests, directly affecting scalability for large-context or multi-turn workloads.
  • Cost per million tokens (blended cost)
    A combined metric that accounts for both input and output token pricing. This provides a clear view of operational costs for extended contexts and streaming workloads, helping teams plan for predictable expenses. (All three metrics are illustrated in the sketch after this list.)
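To make these definitions concrete, here is a minimal sketch of how TTFT and generation throughput can be measured against any OpenAI-compatible streaming endpoint, plus a helper for blended cost. The base URL, API key variable, and model name are placeholders rather than values from the benchmark, and counting stream chunks only approximates token counts.

```python
import os
import time

from openai import OpenAI  # pip install openai

# Placeholder endpoint, key, and model name -- substitute your provider's values.
client = OpenAI(
    base_url=os.environ.get("PROVIDER_BASE_URL", "https://example.com/v1"),
    api_key=os.environ["PROVIDER_API_KEY"],
)

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="gpt-oss-120b",  # placeholder model identifier
    messages=[{"role": "user", "content": "Explain speculative decoding."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first content arrives
        chunks += 1  # stream chunks roughly approximate output tokens

end = time.perf_counter()
assert first_token_at is not None and chunks > 1, "stream produced too little output"

ttft = first_token_at - start
throughput = (chunks - 1) / (end - first_token_at)  # generation-phase rate

def blended_cost(input_price: float, output_price: float, ratio: float = 3.0) -> float:
    """Blended $/1M tokens; Artificial Analysis blends input:output at a 3:1 ratio."""
    return (ratio * input_price + output_price) / (ratio + 1)

print(f"TTFT: {ttft:.2f} s, throughput: {throughput:.0f} tokens/s")
```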

Benchmark Methodology

  • Prompt Size: The benchmarks covered in this post are based on Artificial Analysis runs with a 1,000-token input prompt, reflecting a typical real-world scenario such as a chatbot query or a reasoning-heavy instruction. Benchmarks for substantially longer prompts are also available and can be explored for reference here.
  • Median Measurements: The reported values are the median (p50) over the last 72 hours, capturing sustained performance trends rather than single-point spikes or dips (see the illustration after this list). For the most up-to-date benchmark results, visit the Artificial Analysis GPT‑OSS‑120B model providers page here.
  • Metrics Focus: This summary highlights time to first token (TTFT), throughput, and blended cost to provide a practical view for workload planning. Other metrics, such as end-to-end response time, latency by input token count, and time to first answer token, are also measured by Artificial Analysis but are not included in this overview.
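As a quick illustration of why the median matters, a single outlier run barely moves the p50 but visibly skews the mean. A minimal sketch with hypothetical TTFT samples:

```python
import statistics

# Hypothetical TTFT samples (seconds) collected over a 72-hour window.
ttft_samples = [0.31, 0.29, 0.35, 1.90, 0.33, 0.30, 0.32]

# The p50 (median) ignores the 1.90 s spike that would inflate the mean.
print(statistics.median(ttft_samples))  # 0.32
print(statistics.mean(ttft_samples))    # ~0.54, skewed by the single outlier
```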

With this methodology in mind, we can now compare how different GPU-based platforms perform on GPT‑OSS‑120B and what these results imply for reasoning-heavy workloads.

Provider Comparison (GPT‑OSS‑120B)

Clarifai

  • Time to First Token: 0.32 s

  • Throughput: 544 tokens/s

  • Blended Cost: $0.16 per 1M tokens

  • Notes: Highest measured throughput in this comparison; low latency; cost-efficient; a strong choice for reasoning-heavy workloads.

Key Features:

  • GPU fractioning and autoscaling options for efficient compute usage
  • Local runners to execute models locally on your own hardware for testing and development
  • On-prem, VPC, and multi-site deployment options
  • Control Center for monitoring and managing usage and performance

Google Vertex AI

  • Time to First Token: 0.40 s

  • Throughput: 392 tokens/s

  • Blended Cost: $0.26 per 1M tokens

  • Notes: Moderate latency and throughput; suitable for general-purpose reasoning workloads.

Key Features:

  • Integrated AI tools (AutoML, training, deployment, monitoring)

  • Scalable cloud infrastructure for batch and online inference

  • Enterprise-grade security and compliance

Microsoft Azure

  • Time to First Token: 0.48 s

  • Throughput: 348 tokens/s

  • Blended Cost: $0.26 per 1M tokens

  • Notes: Slightly higher latency; balanced performance and cost for standard workloads.

Key Features:

  • Comprehensive AI services (ML, cognitive services, custom bots)

  • Deep integration with Microsoft ecosystem

  • Global enterprise-grade infrastructure

Hyperbolic

  • Time to First Token: 0.52 s

  • Throughput: 395 tokens/s

  • Blended Cost: $0.30 per 1M tokens

  • Notes: Higher cost than peers; good throughput for reasoning-heavy tasks.

Key Features:

  • High-performance GPU infrastructure

  • Customizable deployment options

AWS

  • Time to First Token: 0.64 s

  • Throughput: 252 tokens/s

  • Blended Cost: $0.26 per 1M tokens

  • Notes: Lower throughput and higher latency; suitable for less time-sensitive workloads.

Key Features:

  • Broad AI/ML service portfolio (Bedrock, SageMaker)

  • Global cloud infrastructure

  • Enterprise-grade security and compliance

Databricks

  • Time to First Token: 0.36 s

  • Throughput: 195 tokens/s

  • Blended Cost: $0.26 per 1M tokens

  • Notes: Lower throughput; acceptable latency; better for batch or background tasks.

Key Features:

  • Unified analytics platform (Spark + ML + notebooks)

  • Collaborative workspace for teams

  • Scalable compute for large ML/AI workloads

Together AI

  • Time to First Token: 0.25 s

  • Throughput: 248 tokens/s

  • Blended Cost: $0.26 per 1M tokens

  • Notes: Very low latency; moderate throughput; good for real-time reasoning-heavy applications.

Key Features:

  • Real-time inference and training

  • Cloud/VPC-based deployment orchestration

  • Flexible and secure platform

Fireworks AI

  • Time to First Token: 0.44 s

  • Throughput: 482 tokens/s

  • Blended Cost: $0.26 per 1M tokens

  • Notes: High throughput and balanced latency; suitable for interactive applications.

Key Features:

  • Custom model hosting

  • Scalable infrastructure for high concurrency

  • API/CLI access for inference and management

CompactifAI

  • Time to First Token: 0.29 s

  • Throughput: 186 tokens/s

  • Blended Cost: $0.10 per 1M tokens

  • Notes: Low cost; lower throughput; best for cost-sensitive workloads with smaller concurrency needs.

Key Features:

  • Efficient, compressed models for cost savings

  • Simplified deployment on AWS

  • Optimized for high-throughput batch inference

Nebius Base

  • Time to First Token: 0.66 s

  • Throughput: 165 tokens/s

  • Blended Cost: $0.26 per 1M tokens

  • Notes: Significantly lower throughput and higher latency; may struggle with reasoning-heavy or interactive workloads.

Key Features:

  • Basic AI service endpoints

  • Standard cloud infrastructure

  • Suitable for steady-demand workloads

Best Providers Based on Price and Throughput

Selecting the right inference provider for GPT‑OSS‑120B requires evaluating time to first token, throughput, and cost against your workload. Platforms like Clarifai offer high throughput, low latency, and competitive cost, making them well suited for reasoning-heavy or interactive tasks. Other providers, such as CompactifAI, prioritize lower cost but come with reduced throughput, which may fit cost-sensitive or batch-oriented workloads better. The optimal choice depends on which trade-offs matter most for your application; a simple way to score these trade-offs is sketched after the lists below.

Best for Price

  • CompactifAI: Lowest cost at $0.10 per 1M tokens, suitable for cost-sensitive projects.

  • Clarifai: Offers a low blended cost of $0.16 per 1M tokens while maintaining very high throughput.

Best for Throughput

  • Clarifai: Highest throughput at 544 tokens/s with low first-chunk latency.

  • Fireworks AI: Strong throughput at 482 tokens/s and moderate latency.

  • Hyperbolic: Good throughput at 395 tokens/s; higher cost but viable for heavy workloads.
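One way to operationalize these recommendations is a weighted score over the benchmark figures. The sketch below reuses the numbers from the comparison table later in this post; the normalization and weights are illustrative and should be tuned to your workload.

```python
# Benchmark figures from the Artificial Analysis results summarized in this post.
providers = {
    "Clarifai":         {"tps": 544, "ttft": 0.32, "cost": 0.16},
    "Fireworks AI":     {"tps": 482, "ttft": 0.44, "cost": 0.26},
    "Hyperbolic":       {"tps": 395, "ttft": 0.52, "cost": 0.30},
    "Google Vertex AI": {"tps": 392, "ttft": 0.40, "cost": 0.26},
    "Microsoft Azure":  {"tps": 348, "ttft": 0.48, "cost": 0.26},
    "AWS":              {"tps": 252, "ttft": 0.64, "cost": 0.26},
    "Together AI":      {"tps": 248, "ttft": 0.25, "cost": 0.26},
    "Databricks":       {"tps": 195, "ttft": 0.36, "cost": 0.26},
    "CompactifAI":      {"tps": 186, "ttft": 0.29, "cost": 0.10},
    "Nebius Base":      {"tps": 165, "ttft": 0.66, "cost": 0.26},
}

# Example weights for an interactive, reasoning-heavy workload:
# throughput and TTFT matter more than price. Tune for your use case.
W_TPS, W_TTFT, W_COST = 0.5, 0.3, 0.2

def score(p):
    # Normalize each metric to (0, 1]; higher is better for every term.
    tps = p["tps"] / 544     # best observed throughput
    ttft = 0.25 / p["ttft"]  # best observed TTFT
    cost = 0.10 / p["cost"]  # lowest observed blended cost
    return W_TPS * tps + W_TTFT * ttft + W_COST * cost

for name, s in sorted(((n, score(p)) for n, p in providers.items()),
                      key=lambda x: -x[1]):
    print(f"{name:18s} {s:.2f}")
```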

Performance and Flexibility

Along with price and throughput, flexibility is critical for real-world workloads. Teams often need control over scaling behavior, GPU utilization, and deployment environments to manage cost and efficiency.

Clarifai, for example, supports fractional GPU utilization, autoscaling, and local runners — features that can improve efficiency and reduce infrastructure overhead.

These capabilities extend beyond GPT‑OSS‑120B. With the Clarifai Reasoning Engine, custom or open-weight reasoning models can run with consistent performance and reliability. The engine also adapts to workload patterns over time, gradually improving speed for repetitive tasks without sacrificing accuracy.
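For teams evaluating this in practice, below is a minimal sketch of streaming GPT-OSS-120B through an OpenAI-compatible client. The Clarifai base URL and model path are assumptions based on its OpenAI-compatible endpoint and should be verified against Clarifai's current documentation.

```python
import os

from openai import OpenAI  # pip install openai

# Assumed OpenAI-compatible endpoint and model path -- verify against
# Clarifai's current documentation before use.
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key=os.environ["CLARIFAI_PAT"],  # personal access token
)

stream = client.chat.completions.create(
    model="https://clarifai.com/openai/chat-completion/models/gpt-oss-120b",
    messages=[{"role": "user",
               "content": "Outline a plan to reconcile two conflicting reports."}],
    stream=True,  # streaming keeps perceived latency close to the TTFT figure
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```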

Benchmark Summary

So far, we’ve compared providers on throughput, latency, and cost using the Artificial Analysis benchmarks. To see how these trade-offs play out in practice, here is a visual summary of the results across the different providers. The charts below are taken directly from Artificial Analysis.

The first chart highlights output speed vs price, while the second chart compares latency vs output speed.

[Chart: Output Speed vs. Price, GPT-OSS-120B providers (8 Oct 2025), via Artificial Analysis]

[Chart: Latency vs. Output Speed, GPT-OSS-120B providers (8 Oct 2025), via Artificial Analysis]

Below is a detailed comparison table summarizing the key metrics for GPT-OSS-120B inference across providers.

Provider            Throughput (tokens/s)   Time to First Token (s)   Blended Cost ($ per 1M tokens)
Clarifai                   544                      0.32                        0.16
Google Vertex AI           392                      0.40                        0.26
Microsoft Azure            348                      0.48                        0.26
Hyperbolic                 395                      0.52                        0.30
AWS                        252                      0.64                        0.26
Databricks                 195                      0.36                        0.26
Together AI                248                      0.25                        0.26
Fireworks AI               482                      0.44                        0.26
CompactifAI                186                      0.29                        0.10
Nebius Base                165                      0.66                        0.26
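To translate these figures into workload terms, consider a hypothetical service that generates 50 million output tokens per day. The rough estimate below applies the blended rate to all tokens (a simplification, since blended pricing mixes input and output) and uses single-stream generation time only as a relative speed indicator:

```python
# Rough workload estimate using the benchmark figures above.
# The daily volume is hypothetical; more than 24 stream-hours/day simply
# means the workload needs multiple concurrent streams.
DAILY_TOKENS = 50_000_000  # output tokens generated per day

def daily_estimate(tps: float, cost_per_m: float) -> tuple[float, float]:
    stream_hours = DAILY_TOKENS / tps / 3600          # single-stream time
    dollars = DAILY_TOKENS / 1_000_000 * cost_per_m   # blended-rate cost
    return stream_hours, dollars

for name, tps, cost in [("Clarifai", 544, 0.16),
                        ("Fireworks AI", 482, 0.26),
                        ("CompactifAI", 186, 0.10)]:
    hours, dollars = daily_estimate(tps, cost)
    print(f"{name:14s} ~{hours:5.1f} stream-hours/day  ~${dollars:.2f}/day")
```

At these rates, Clarifai's 544 tokens/s works out to roughly 25.5 stream-hours and about $8 per day, while CompactifAI trades speed (about 74.7 stream-hours) for the lowest daily cost at about $5.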

Conclusion

Choosing an inference provider for GPT‑OSS‑120B involves balancing throughput, latency, and cost. Each provider handles these trade-offs differently, and the best choice depends on the specific workload and performance requirements.

Providers with high throughput excel at reasoning-heavy or interactive tasks, while those with lower median throughput may be more suitable for batch or background processing where speed is less critical. Latency also plays a key role: low time-to-first-token improves responsiveness for real-time applications, whereas slightly higher latency may be acceptable for less time-sensitive tasks.

Cost considerations remain important. Some providers offer strong performance at low blended costs, while others trade efficiency for price. Benchmarks covering throughput, time to first token, and blended cost provide a clear basis for understanding these trade-offs.

Ultimately, the right provider depends on the engineering problem, workload characteristics, and which trade-offs matter most for the application.

 

Learn more about Clarifai's reasoning engine

The Fastest AI Inference and Reasoning on GPUs.

Verified by Artificial Analysis