April 28, 2026

NVIDIA Nemotron 3 Nano Omni on Clarifai Reasoning Engine: Day-0 Support at 400 Tokens Per Second


We’re excited to announce day-0 support for NVIDIA Nemotron 3 Nano Omni on Clarifai. Available now on Clarifai Reasoning Engine, Nano Omni brings fast multimodal reasoning to developers building agentic systems, delivering throughput of 400+ tokens per second.

NVIDIA Nemotron 3 Nano Omni is a 30B A3B multimodal reasoning model built for workloads that span documents, images, video, and audio. With a 256K context window and support for text, image, video, and audio inputs with text output, it gives developers a single model for handling rich multimodal context inside agentic workflows.

That makes it a strong fit for sub-agents in workflows where multimodal understanding and speed need to go together.

A Multimodal Model for Specialized Sub-Agents

As agent systems grow more capable, they also become more specialized. Different models and components take on planning, execution, retrieval, and verification, each operating within a broader workflow. In that architecture, the model handling multimodal inputs has to do more than process isolated inputs. It has to interpret multiple modalities together, preserve context across steps, and respond fast enough to stay within the operational loop.

As a lightweight multimodal model for sub-agents, Nemotron 3 Nano Omni can reason across screens, documents, charts, audio, and video without routing each modality through a separate stack. Rather than splitting vision, speech, and language across multiple models, it gives developers a more unified way to handle multimodal reasoning while keeping the overall system easier to manage.

Built for Computer Use, Documents, and Audio-Video Reasoning

Nano Omni is especially relevant for the kinds of workloads that are becoming central to enterprise agentic systems.

For computer use, agents need to read interfaces, track UI state over time, and verify whether actions completed as expected. For document intelligence, they need to reason across text, tables, charts, screenshots, scanned pages, and mixed visual structure in the same pass. For audio and video workflows, they need to connect what was said, what was shown, and what changed over time.

These are all cases where multimodal capability has to work reliably in production, with a model that can handle multiple modalities efficiently without splitting the workflow across separate models.

The model represents a significant jump in capability over previous models in the Nemotron family. Improvements on benchmarks such as OCRBenchV2, OCR_Reasoning, MathVista_MINI, and OSWorld reflect stronger performance on the real-world workloads today's agents are likely to serve.

[Chart: Nemotron 3 Nano Omni multimodal accuracy benchmarks]

That is where Nano Omni fits naturally, giving developers a single multimodal reasoning stream for the tasks sub-agents are increasingly expected to handle.

Agent-Friendly Tokenomics

In agent systems, sub-agents take on recurring tasks across documents, screens, audio, and video within a larger workflow. Each invocation adds to the cost, throughput, and infrastructure demands of the overall system. NVIDIA Nemotron 3 Nano Omni consolidates vision, speech, and language into a single multimodal model, reducing inference hops, orchestration logic, and cross-model synchronization compared with separate perception stacks.

Nano Omni delivers roughly 2x higher throughput on average, along with about 2.5x lower compute for video reasoning through temporal-aware perception and efficient video sampling. For multimodal agent workflows, that means higher throughput and lower compute overhead without adding complexity to the stack.

The model uses a hybrid Mixture-of-Experts architecture with a Transformer-Mamba design, along with 3D convolution layers and Efficient Video Sampling for temporal and video inputs. It can run on a single H100, H200, or B200, making it practical to deploy multimodal sub-agents without stretching infrastructure requirements.

High-Throughput Inference on Clarifai

On Clarifai Reasoning Engine, NVIDIA Nemotron 3 Nano Omni runs at 400+ tokens per second, giving developers the throughput needed for production multimodal agent workflows. That matters in systems where sub-agents are called repeatedly to process documents, interfaces, audio, and video as part of an ongoing workflow.

Clarifai Reasoning Engine is built for inference acceleration, combining optimized kernels, speculative decoding, and adaptive performance techniques to improve throughput for reasoning models without compromising accuracy.

Getting Started on Clarifai

Developers can try NVIDIA Nemotron 3 Nano Omni in the Clarifai Playground and can also access it via an OpenAI-compatible API, making it easier to integrate into existing applications, tools, and agentic frameworks.
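As a rough sketch of what that integration looks like, the snippet below builds an OpenAI-style multimodal chat request (text plus an image) and sends it through the standard `openai` Python client. The base URL, model identifier, and `CLARIFAI_PAT` environment variable here are illustrative assumptions; check Clarifai's documentation for the exact endpoint and model name for Nemotron 3 Nano Omni.

```python
# Sketch: calling a multimodal model through an OpenAI-compatible endpoint.
# The base URL and model id below are assumptions for illustration only.
import os


def build_multimodal_messages(question: str, image_url: str) -> list:
    """Build an OpenAI-style chat message combining text and an image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


if __name__ == "__main__" and os.environ.get("CLARIFAI_PAT"):
    from openai import OpenAI  # pip install openai

    client = OpenAI(
        base_url="https://api.clarifai.com/v2/ext/openai/v1",  # assumed endpoint
        api_key=os.environ["CLARIFAI_PAT"],  # Clarifai personal access token
    )
    resp = client.chat.completions.create(
        model="nvidia/nemotron-3-nano-omni",  # illustrative model id
        messages=build_multimodal_messages(
            "Summarize the chart in this screenshot.",
            "https://example.com/chart.png",
        ),
    )
    print(resp.choices[0].message.content)
```

Because the endpoint follows the OpenAI chat-completions shape, the same message structure drops into existing tools and agent frameworks that already speak that API.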

For larger-scale or more controlled deployments, Clarifai provides a direct path to production with Compute Orchestration. Developers can run Nano Omni on Clarifai Reasoning Engine or deploy it across their own cloud, VPC, on-prem or air-gapped environments while managing deployments through a unified control plane.

NVIDIA Nemotron 3 Nano Omni is available on Clarifai today.

If you have any questions about accessing NVIDIA Nemotron 3 Nano Omni on Clarifai, join our Discord.