In this post, we explore how leading inference providers perform when serving the GPT-OSS-120B model, using benchmarks from Artificial Analysis. You will learn what matters most when evaluating inference platforms, including throughput, time to first token, and cost efficiency. We compare Vertex AI, Azure, AWS, Databricks, Clarifai, Together AI, Fireworks, Nebius, CompactifAI, and Hyperbolic on performance and deployment efficiency.
Large language models (LLMs) like GPT-OSS-120B, an open-weight 120-billion-parameter mixture-of-experts model, are designed for advanced reasoning and multi-step generation. Reasoning workloads consume tokens rapidly and place high demands on compute, so deploying these models in production requires inference infrastructure that delivers low latency, high throughput, and low cost per token.
Differences in hardware, software optimizations, and resource allocation strategies can lead to large variations in latency, efficiency, and cost. These differences directly affect real-world applications such as reasoning agents, document understanding systems, or copilots, where even small delays can impact overall responsiveness and throughput.
To evaluate these differences objectively, independent benchmarks have become essential. Instead of relying on internal performance claims, open and data-driven evaluations now offer a more transparent way to assess how different platforms perform under real workloads.
In this post, we compare leading GPU-based inference providers using the GPT-OSS-120B model as a reference benchmark. We examine how each platform performs on key inference metrics such as throughput, time to first token, and cost efficiency, and what those trade-offs mean for reasoning-heavy workloads at scale.
Before diving into the results, let’s take a quick look at Artificial Analysis and how their benchmarking framework works.
Artificial Analysis (AA) is an independent benchmarking initiative that runs standardized tests across inference providers to measure how models like GPT-OSS-120B perform in real conditions. Their evaluations focus on realistic workloads involving long contexts, streaming outputs, and reasoning-heavy prompts rather than short, synthetic samples.
You can explore the full GPT-OSS-120B benchmark results on the Artificial Analysis website.
Artificial Analysis evaluates a range of performance metrics, but here we focus on the three key factors that matter when choosing an inference platform for GPT-OSS-120B: time to first token, throughput, and cost per million tokens.
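To make these metrics concrete, here is a minimal sketch of how you could measure time to first token and output throughput yourself against any OpenAI-compatible endpoint, and how a blended price is typically derived from separate input and output prices. The base URL, API key, model id, and prices below are placeholders rather than values from the benchmark, and the 3:1 input-to-output blend is only the commonly used ratio; check the Artificial Analysis methodology for its exact definition.

```python
# Minimal sketch: measure TTFT and output throughput against an
# OpenAI-compatible endpoint, then compute a blended price.
# base_url, api_key, model id, and prices are placeholder assumptions.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://example-provider.com/v1",  # assumption: provider exposes an OpenAI-compatible API
    api_key="YOUR_API_KEY",
)

start = time.perf_counter()
ttft = None
pieces = []

stream = client.chat.completions.create(
    model="gpt-oss-120b",  # assumption: the provider's model id for GPT-OSS-120B
    messages=[{"role": "user", "content": "Explain the causes of inflation in three steps."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        pieces.append(delta)

elapsed = time.perf_counter() - start
# Rough throughput estimate: ~4 characters per token is a common approximation.
approx_tokens = len("".join(pieces)) / 4
throughput = approx_tokens / (elapsed - (ttft or 0.0))

# Blended cost: input and output prices combined, typically at a 3:1
# input-to-output ratio (verify against the Artificial Analysis methodology).
input_price, output_price = 0.10, 0.40  # illustrative $ per 1M tokens
blended = (3 * input_price + 1 * output_price) / 4
print(f"TTFT: {ttft:.2f}s, ~{throughput:.0f} tok/s, blended ${blended:.2f} per 1M tokens")
```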
With this methodology in mind, we can now compare how different GPU-based platforms perform on GPT‑OSS‑120B and what these results imply for reasoning-heavy workloads.
Clarifai
Time to First Token: 0.32 s
Throughput: 544 tokens/s
Blended Cost: $0.16 per 1M tokens
Notes: Extremely high throughput; low latency; cost-efficient; strong choice for reasoning-heavy workloads.
Key Features:
Fractional GPU utilization and autoscaling for efficient compute use
Local runners for flexible deployment
Clarifai Reasoning Engine for custom and open-weight reasoning models
Google Vertex AI
Time to First Token: 0.40 s
Throughput: 392 tokens/s
Blended Cost: $0.26 per 1M tokens
Notes: Moderate latency and throughput; suitable for general-purpose reasoning workloads.
Key Features:
Integrated AI tools (AutoML, training, deployment, monitoring)
Scalable cloud infrastructure for batch and online inference
Enterprise-grade security and compliance
Microsoft Azure
Time to First Token: 0.48 s
Throughput: 348 tokens/s
Blended Cost: $0.26 per 1M tokens
Notes: Slightly higher latency; balanced performance and cost for standard workloads.
Key Features:
Comprehensive AI services (ML, cognitive services, custom bots)
Deep integration with Microsoft ecosystem
Global enterprise-grade infrastructure
Hyperbolic
Time to First Token: 0.52 s
Throughput: 395 tokens/s
Blended Cost: $0.30 per 1M tokens
Notes: Higher cost than peers; good throughput for reasoning-heavy tasks.
Key Features:
High-performance GPU infrastructure
Customizable deployment options
AWS
Time to First Token: 0.64 s
Throughput: 252 tokens/s
Blended Cost: $0.26 per 1M tokens
Notes: Lower throughput and higher latency; suitable for less time-sensitive workloads.
Key Features:
Broad AI/ML service portfolio (Bedrock, SageMaker)
Global cloud infrastructure
Enterprise-grade security and compliance
Databricks
Time to First Token: 0.36 s
Throughput: 195 tokens/s
Blended Cost: $0.26 per 1M tokens
Notes: Lower throughput; acceptable latency; better for batch or background tasks.
Key Features:
Unified analytics platform (Spark + ML + notebooks)
Collaborative workspace for teams
Scalable compute for large ML/AI workloads
Together AI
Time to First Token: 0.25 s
Throughput: 248 tokens/s
Blended Cost: $0.26 per 1M tokens
Notes: Very low latency; moderate throughput; good for real-time reasoning-heavy applications.
Key Features:
Real-time inference and training
Cloud/VPC-based deployment orchestration
Flexible and secure platform
Fireworks AI
Time to First Token: 0.44 s
Throughput: 482 tokens/s
Blended Cost: $0.26 per 1M tokens
Notes: High throughput and balanced latency; suitable for interactive applications.
Key Features:
Custom model hosting
Scalable infrastructure for high concurrency
API/CLI access for inference and management
CompactifAI
Time to First Token: 0.29 s
Throughput: 186 tokens/s
Blended Cost: $0.10 per 1M tokens
Notes: Low cost; lower throughput; best for cost-sensitive workloads with smaller concurrency needs.
Key Features:
Efficient, compressed models for cost savings
Simplified deployment on AWS
Optimized for high-throughput batch inference
Nebius Base
Time to First Token: 0.66 s
Throughput: 165 tokens/s
Blended Cost: $0.26 per 1M tokens
Notes: Significantly lower throughput and higher latency; may struggle with reasoning-heavy or interactive workloads.
Key Features:
Basic AI service endpoints
Standard cloud infrastructure
Suitable for steady-demand workloads
Selecting the right inference provider for GPT‑OSS‑120B requires evaluating time to first token, throughput, and cost based on your workload. Platforms like Clarifai offer high throughput, low latency, and competitive cost, making them well-suited for reasoning-heavy or interactive tasks. Other providers, such as CompactifAI, prioritize lower cost but come with reduced throughput, which may be more suitable for cost-sensitive or batch-oriented workloads. The optimal choice depends on which trade-offs matter most for your applications.
CompactifAI: Lowest cost at $0.10 per 1M tokens, suitable for cost-sensitive projects.
Clarifai: Highest throughput at 544 tokens/s with low first-chunk latency.
Fireworks AI: Strong throughput at 482 tokens/s and moderate latency.
Hyperbolic: Good throughput at 395 tokens/s; higher cost but viable for heavy workloads.
Along with price and throughput, flexibility is critical for real-world workloads. Teams often need control over scaling behavior, GPU utilization, and deployment environments to manage cost and efficiency.
Clarifai, for example, supports fractional GPU utilization, autoscaling, and local runners — features that can improve efficiency and reduce infrastructure overhead.
These capabilities extend beyond GPT‑OSS‑120B. With the Clarifai Reasoning Engine, custom or open-weight reasoning models can run with consistent performance and reliability. The engine also adapts to workload patterns over time, gradually improving speed for repetitive tasks without sacrificing accuracy.
So far, we’ve compared providers on throughput, latency, and cost using the Artificial Analysis benchmarks. To see how these trade-offs play out in practice, here is a visual summary of the results across providers. The charts below are taken directly from Artificial Analysis.
The first chart highlights output speed vs price, while the second chart compares latency vs output speed.
Output Speed vs. Price
Latency vs. Output Speed
Below is a detailed comparison table summarizing the key metrics for GPT-OSS-120B inference across providers.
| Provider | Throughput (tokens/s) | Time to First Token (s) | Blended Cost ($ / 1M tokens) |
|---|---|---|---|
| Clarifai | 544 | 0.32 | 0.16 |
| Google Vertex AI | 392 | 0.40 | 0.26 |
| Microsoft Azure | 348 | 0.48 | 0.26 |
| Hyperbolic | 395 | 0.52 | 0.30 |
| AWS | 252 | 0.64 | 0.26 |
| Databricks | 195 | 0.36 | 0.26 |
| Together AI | 248 | 0.25 | 0.26 |
| Fireworks AI | 482 | 0.44 | 0.26 |
| CompactifAI | 186 | 0.29 | 0.10 |
| Nebius Base | 165 | 0.66 | 0.26 |
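One way to use these numbers is to fold them into a single weighted score that reflects your own priorities. The sketch below does this with the figures from the table above; the weights and the normalization scheme are illustrative choices, not part of the Artificial Analysis methodology.

```python
# Rank providers from the table above under a simple weighted score.
# Weights are illustrative -- tune them to match your workload.
providers = {
    "Clarifai":         {"tps": 544, "ttft": 0.32, "cost": 0.16},
    "Google Vertex AI": {"tps": 392, "ttft": 0.40, "cost": 0.26},
    "Microsoft Azure":  {"tps": 348, "ttft": 0.48, "cost": 0.26},
    "Hyperbolic":       {"tps": 395, "ttft": 0.52, "cost": 0.30},
    "AWS":              {"tps": 252, "ttft": 0.64, "cost": 0.26},
    "Databricks":       {"tps": 195, "ttft": 0.36, "cost": 0.26},
    "Together AI":      {"tps": 248, "ttft": 0.25, "cost": 0.26},
    "Fireworks AI":     {"tps": 482, "ttft": 0.44, "cost": 0.26},
    "CompactifAI":      {"tps": 186, "ttft": 0.29, "cost": 0.10},
    "Nebius Base":      {"tps": 165, "ttft": 0.66, "cost": 0.26},
}

def score(m, w_tps=0.5, w_ttft=0.3, w_cost=0.2):
    # Normalize each metric to the best value observed, then combine.
    # Higher throughput is better; lower latency and cost are better.
    tps_max = max(p["tps"] for p in providers.values())
    ttft_min = min(p["ttft"] for p in providers.values())
    cost_min = min(p["cost"] for p in providers.values())
    return (w_tps * m["tps"] / tps_max
            + w_ttft * ttft_min / m["ttft"]
            + w_cost * cost_min / m["cost"])

for name, metrics in sorted(providers.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{name:18s} score={score(metrics):.2f}")
```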
Choosing an inference provider for GPT‑OSS‑120B involves balancing throughput, latency, and cost. Each provider handles these trade-offs differently, and the best choice depends on the specific workload and performance requirements.
Providers with high throughput excel at reasoning-heavy or interactive tasks, while those with lower throughput may be better suited to batch or background processing where speed is less critical. Latency also plays a key role: a low time to first token improves responsiveness for real-time applications, whereas slightly higher latency may be acceptable for less time-sensitive tasks.
Cost considerations remain important. Some providers offer strong performance at low blended costs, while others trade efficiency for price. Benchmarks covering throughput, time to first token, and blended cost provide a clear basis for understanding these trade-offs.
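As a rough illustration of how blended prices translate into spend, the sketch below prices the same hypothetical workload at a few of the blended rates from the comparison table; the request volume and token counts are assumptions, not benchmark data.

```python
# Back-of-the-envelope monthly cost at different blended rates.
# Workload size is an illustrative assumption.
tokens_per_request = 4_000          # combined input + output tokens per request (assumption)
requests_per_month = 1_000_000      # assumption

monthly_tokens = tokens_per_request * requests_per_month  # 4B tokens/month
for provider, price_per_m in [("Clarifai", 0.16), ("CompactifAI", 0.10), ("typical peer", 0.26)]:
    cost = monthly_tokens / 1_000_000 * price_per_m
    print(f"{provider}: ${cost:,.0f}/month")
```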
Ultimately, the right provider depends on the engineering problem, workload characteristics, and which trade-offs matter most for the application.