The Clarifai platform has evolved significantly. Earlier generations of the platform relied on many small, task-specific models for visual classification, detection, OCR, text classification and segmentation. These legacy models were built on older architectures that were sensitive to domain shift, required separate training pipelines and did not generalize well outside their original conditions.
As model architectures have matured, large language models and vision-language models are now trained on broader multimodal data, cover multiple tasks within a single model family, and deliver more stable performance across different input types. As part of the platform upgrade, we are standardizing around these newer model types.
With this update, several legacy task-specific models are being deprecated and will no longer be available after December 31. Their functionality is still fully supported on the platform, but is now provided through more capable and general model families. These newer models are deployed and scaled using Compute Orchestration, which ensures consistent behavior across open source and custom model deployments without requiring changes to existing infrastructure.
This blog outlines the core task categories supported today, the recommended models for each and how to use them within the platform. It also clarifies which older models are being retired and how their capabilities map to the current model families.
For a quick reference, here are the recommended model families for each workload.
Visual Classification and Recognition
Use MM-Poly-8B for prompt-driven visual classification and moderation. It supports image, text, audio, and video inputs and works well with explicit taxonomies and conservative, policy-based rules, making it suitable for zero-shot classification, concept extraction, and safety-critical moderation workflows where consistent decision-making matters.
Document Intelligence and OCR
Use DeepSeek OCR for open-source, multilingual OCR. For higher accuracy and complex layouts, use multimodal models like Gemini 2.5 Pro, Claude Opus 4.5, or GPT-5.1.
Text Classification and NLP
Use reasoning-capable language models such as GPT-OSS-120B, along with models like Gemma 3 (12B), MiniCPM-4 8B, and Qwen3-14B, for zero-shot text classification, routing, and moderation workflows.
Custom or Specialized Models
Deploy your own models or Community models using Compute Orchestration when you need custom behavior or full control in production.
With that overview in mind, let's walk through each task category in more detail, including why these models are recommended, which legacy models are being retired, and how to use the current options effectively on the platform.
Visual classification and recognition involve identifying objects, scenes and concepts in an image. These tasks power product tagging, content moderation, semantic search, retrieval indexing and general scene understanding.
Modern vision-language models handle these tasks well in zero-shot mode. Instead of training separate classifiers, you define the taxonomy in the prompt and the model returns labels directly, which reduces the need for task-specific training and simplifies updates.
The models below offer strong visual understanding and perform well for classification, recognition, concept extraction and image moderation workflows, including sensitive-safety taxonomy setups.
Qwen2.5-VL-7B-Instruct
Optimized for visual recognition, localized reasoning and structured visual understanding. Strong at identifying concepts in images with multiple objects and extracting structured information.
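To make this concrete, here is a minimal Python sketch of zero-shot classification with a prompt-defined taxonomy against Clarifai's OpenAI-compatible endpoint. The base URL, the exact model URL for Qwen2.5-VL-7B-Instruct, and the sample image URL are assumptions; copy the real values from the model page and documentation.

```python
from openai import OpenAI

# Assumed base URL for Clarifai's OpenAI-compatible endpoint; confirm it in the docs.
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key="YOUR_CLARIFAI_PAT",  # your Clarifai personal access token
)

# The taxonomy lives entirely in the prompt, so there is no classifier to train or maintain.
taxonomy = ["electronics", "apparel", "furniture", "food", "other"]

response = client.chat.completions.create(
    # Hypothetical model URL for Qwen2.5-VL-7B-Instruct; copy the real one from the model page.
    model="https://clarifai.com/qwen/qwen-VL/models/Qwen2_5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Classify this product photo into exactly one of: "
                     + ", ".join(taxonomy) + ". Reply with the label only."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/product-photo.jpg"}},  # placeholder image
        ],
    }],
)

print(response.choices[0].message.content)  # e.g. "apparel"
```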
MM-Poly-8B
MM-Poly-8B is Clarifai’s multimodal model designed for prompt-driven analysis across images, text, audio, and video. It is particularly well suited for classification and moderation workflows where the decision logic must be explicit, conservative, and consistently applied.
Unlike general-purpose vision-language models that focus on broad recognition, MM-Poly-8B performs best when given clearly defined taxonomies, rules, and examples. This makes it a strong fit for production systems that require predictable behavior under strict policies rather than open-ended interpretation.
A significant portion of real-world visual classification workloads involve moderation. These use cases often require determining whether content is safe, sensitive, or disallowed according to a specific policy. Moderation differs from general classification in that it demands conservative thresholds, careful handling of edge cases, and a bias toward safety. MM-Poly-8B is well suited for these scenarios.
Key capabilities:
Accepts image, text, audio, and video inputs
Handles detailed taxonomies and multi-level decision logic
Supports example-driven and rule-based prompting
Produces consistent outputs for safety-critical workflows
Behaves predictably when policies require conservative interpretation
Because MM-Poly-8B is tuned to follow instructions faithfully, it is appropriate for moderation scenarios where false negatives carry higher risk and models must err on the side of caution. It can be prompted to classify content using a defined policy, identify violations, return structured reasoning, or generate confidence-based outputs.
“Evaluate this image according to the categories Safe, Suggestive, Explicit, Drug and Gore. Apply a strict safety policy and classify the image into the most appropriate category.”

For more advanced use cases, you can provide the model with a detailed set of moderation rules, decision criteria and examples that define how each category should be applied. This allows you to verify how the model behaves under stricter, policy-driven conditions and how it can be integrated into production-grade moderation pipelines.
MM-Poly-8B is available on the platform and can be used through the Playground or accessed programmatically via the OpenAI-compatible API.
Note: If you want to access models such as Qwen2.5-VL-7B-Instruct or MM-Poly-8B directly, you can deploy them to your own dedicated compute on the Platform and access them via the API just like any other model.
All models described above can be accessed through Clarifai’s OpenAI-compatible API. Send an image and a prompt in a single request and receive either plain text or structured JSON, which is useful when you need consistent labels or want to feed the results into downstream pipelines.
For details on structured JSON output, check the documentation here.
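As a rough sketch of that flow, the snippet below sends the moderation prompt from earlier together with an image to MM-Poly-8B and asks for a JSON response. The base URL, the MM-Poly-8B model URL and the image URL are placeholders to verify against the documentation.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # assumed endpoint; confirm in docs
    api_key="YOUR_CLARIFAI_PAT",
)

policy_prompt = (
    "Evaluate this image according to the categories Safe, Suggestive, Explicit, Drug and Gore. "
    "Apply a strict safety policy and classify the image into the most appropriate category. "
    "Return a JSON object with the fields 'category' and 'reason'."
)

response = client.chat.completions.create(
    # Hypothetical model URL for MM-Poly-8B; copy the real one from the model page.
    model="https://clarifai.com/clarifai/main/models/mm-poly-8b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": policy_prompt},
            {"type": "image_url", "image_url": {"url": "https://example.com/user-upload.jpg"}},
        ],
    }],
    response_format={"type": "json_object"},  # request structured JSON instead of plain text
)

print(response.choices[0].message.content)
# e.g. {"category": "Safe", "reason": "No policy violations detected."}
```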
For most users migrating from legacy visual classifiers, zero-shot prompting with modern vision-language models will be sufficient. In many cases, you can replace a task-specific classifier by defining your labels or taxonomy directly in the prompt, without retraining or maintaining a custom model.
Fine-tuning is recommended only when your use case involves highly domain-specific labels, visual concepts that are not well represented in general-purpose models, or strict accuracy requirements that cannot be met through prompting alone.
If fine-tuning is needed, Clarifai provides visual classification templates with configurable training pipelines and adjustable hyperparameters, allowing you to train models tailored to your dataset and domain.
Available templates include:
MMClassification ResNet 50 RSB A1
Clarifai InceptionBatchNorm
Clarifai InceptionV2
Clarifai ResNeXt
Clarifai InceptionTransferEmbedNorm
You can upload your dataset, configure hyperparameters and train your own classifier through the UI or API. Check out the Fine-tuning Guide on the platform.
Document intelligence covers OCR, layout understanding and structured field extraction across scanned pages, forms and text-heavy images. The legacy OCR pipeline on the platform relied on language-specific PaddleOCR variants. These models were narrow in scope, sensitive to formatting issues and required separate maintenance for each language. They are now being decommissioned.
Models being decommissioned
These legacy PaddleOCR variants were single-language engines with limited robustness. Modern OCR and multimodal systems support multilingual extraction by default and handle noisy scans, mixed formats and documents that combine text and visual elements without requiring separate pipelines.
High-volume, cost-sensitive OCR:
Use DeepSeek OCR when you need multilingual extraction at scale and can tolerate lower accuracy on complex layouts or noisy scans.
Complex documents and reasoning:
Use multimodal models like Gemini 2.5 Pro, Claude Opus 4.5, or GPT-5.1 when documents include tables, mixed layouts, or require semantic understanding and structured extraction.
Let's get into them in detail:
DeepSeek OCR
DeepSeek OCR is the primary open-source option. It supports multilingual documents, processes noisy scans reasonably well and can handle structured and unstructured documents. However, it is not perfect. Benchmarks show inconsistent accuracy on messy handwriting, irregular layouts and low-resolution scans. It also has input size constraints that can limit performance on large documents or multi-page flows. While it is stronger than the earlier language-specific engines, it is not the best option for high-stakes extraction on complex documents.
The platform also supports several multimodal models that combine OCR with visual reasoning. These models can extract text, interpret tables, identify key fields and summarize content even when structure is complex. They are more capable than DeepSeek OCR, especially for long documents or workflows requiring reasoning.
Gemini 2.5 Pro
Handles text-heavy documents, receipts, forms and complex layouts with strong multimodal reasoning.
Claude Opus 4.5
Performs well on dense, complex documents, including table interpretation and structured extraction.
Claude Sonnet 4.5
A faster option that still produces reliable field extraction and summarization for scanned pages.
GPT-5.1
Reads documents, extracts fields, interprets tables and summarizes multi-section pages with strong semantic accuracy.
Gemini 2.5 Flash
Lightweight and optimized for speed. Suitable for common forms, receipts and straightforward document extraction.
These models perform well across languages, handle complex layouts and understand document context. The tradeoffs matter. They are closed-source, require third-party inference and are more expensive to operate at scale compared to an open-source OCR engine. They are ideal for high-accuracy extraction and reasoning, but not always cost-efficient for large batch OCR workloads.
Using the Playground
Upload your document image or scanned page in the Playground and run it with DeepSeek OCR or any of the multimodal models listed above. These models return Markdown-formatted text, which preserves structure such as headings, paragraphs, lists or table-like formatting. This makes it easier to render the extracted content directly or process it in downstream document workflows.

Using the API (OpenAI-compatible)
All these models are also accessible through Clarifai’s OpenAI-compatible API. Send the image and prompt in one request, and the model returns the extracted content in Markdown. This makes it easy to use directly in downstream pipelines. Check out the detailed guide on accessing DeepSeek OCR via the API.
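Here is a minimal sketch of that request in Python. The base URL and the DeepSeek OCR model URL are assumptions to verify against the guide, and the document URL is just a placeholder.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # assumed endpoint; confirm in docs
    api_key="YOUR_CLARIFAI_PAT",
)

response = client.chat.completions.create(
    # Hypothetical model URL for DeepSeek OCR; copy the real one from the model page.
    model="https://clarifai.com/deepseek-ai/ocr/models/deepseek-ocr",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract all text from this document and preserve headings, lists "
                     "and tables as Markdown."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/scanned-invoice.png"}},  # placeholder scan
        ],
    }],
)

markdown_text = response.choices[0].message.content  # Markdown-formatted extraction
print(markdown_text)
```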
Text classification is used in moderation, topic labeling, intent detection, routing, and broader text understanding. These tasks require models that follow instructions reliably, generalize across domains, and support multilingual input without needing task-specific retraining.
Instruction-tuned language models make this practical at scale. They can perform classification in a zero-shot manner, where you define the classes or rules directly in the prompt and the model returns the label without requiring a dedicated classifier. This allows teams to update policies, adjust categories, and reuse the same logic across languages without retraining. When stricter structure or domain alignment is required, these models can also be fine-tuned or constrained through structured outputs.
Models suited for text classification and moderation
GPT-OSS-120B is a 120-billion-parameter open-weight Mixture-of-Experts (MoE) reasoning model from OpenAI, excelling in instruction following, chain-of-thought reasoning, and tool integration for agentic workflows like coding, math, and multi-step problem-solving.
While primarily built for broad reasoning with adjustable effort levels (low/medium/high) and a 128k context window, GPT-OSS-120B can handle text classification via explicit prompting: you describe the rules, categories, and constraints directly and rely on its strong instruction adherence. It performs strongly on safety benchmarks and produces transparent, step-by-step outputs, making it viable for structured tasks beyond simple labeling, such as:
Agentic classification with reasoning traces (e.g., intent detection via multi-step logic)
Safety evaluations combining tools and explicit policies
Custom text routing with prompt-defined categories
Workflows needing explanations alongside JSON outputs
Below is an example of using GPT-OSS-120B via Clarifai's OpenAI-compatible API for a text moderation workflow that returns structured JSON output.
In this example, GPT-OSS-120B evaluates user-generated text against a fixed moderation taxonomy and returns a structured JSON response that conforms to a predefined schema. The taxonomy includes the following categories:
safe
suggestive
explicit
drug
gore
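The sketch below shows one way to wire this up with the OpenAI Python client and a JSON schema that constrains the output to the taxonomy above. The base URL, the GPT-OSS-120B model URL, and the schema field names are illustrative; check the documentation for the exact structured-output options supported.

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # assumed endpoint; confirm in docs
    api_key="YOUR_CLARIFAI_PAT",
)

# Schema constraining the response to the fixed moderation taxonomy.
moderation_schema = {
    "name": "moderation_result",
    "schema": {
        "type": "object",
        "properties": {
            "category": {
                "type": "string",
                "enum": ["safe", "suggestive", "explicit", "drug", "gore"],
            },
            "reasoning": {"type": "string"},
        },
        "required": ["category", "reasoning"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    # Hypothetical model URL for GPT-OSS-120B; copy the real one from the model page.
    model="https://clarifai.com/openai/chat-completion/models/gpt-oss-120b",
    messages=[
        {"role": "system",
         "content": "You are a content moderator. Classify the user's text into exactly one "
                    "taxonomy category and explain the decision briefly."},
        {"role": "user",
         "content": "Text to moderate: 'Anyone know where to buy this new energy drink?'"},
    ],
    response_format={"type": "json_schema", "json_schema": moderation_schema},
)

result = json.loads(response.choices[0].message.content)
print(result["category"], "-", result["reasoning"])
```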
Next, let’s look at other models that are well suited for text classification and general NLP tasks.
MiniCPM-4 8B
A compact, high-performing model built for instruction following. Works well on classification, QA, and general-purpose language tasks with competitive performance at lower latency.
Gemma 3 (12B)
A recent open model from Google, tuned for efficiency and high-quality language understanding. Strong at zero-shot classification, multilingual reasoning, and following prompt instructions across varied classification tasks.
Qwen3-14B
A multilingual model trained on a wide range of language tasks. Excels at zero-shot classification, text routing, and multi-language moderation and topic identification.
Note: If you want to access the above open-source models like Gemma 3, MiniCPM-4 or Qwen3 directly, you can deploy them to your own dedicated compute using the Platform and access them via API just like any other model on the platform.
There are also many additional third-party and open-source models available in the Community section, including GPT-5.1 family variants, Gemini 2.5 Pro, and other high-quality options. You can explore these based on your scale and domain-specific needs.
For teams migrating off deprecated models who need more control over model behavior or deployment, the platform also supports deploying your own models or open-source alternatives using Compute Orchestration (CO).
CO handles the operational details required to serve models reliably. It containerizes models automatically, applies GPU fractioning so multiple models can share the same hardware, manages autoscaling and uses optimized scheduling to reduce latency under load. This lets you scale custom or open source models without needing to manage the underlying infrastructure.
CO supports deployment on multiple cloud environments such as AWS, Azure and GCP, which helps avoid vendor lock-in and gives you flexibility in how and where your models run. Check out the guide here on uploading and deploying your own custom models.
The model families outlined in this guide represent the most reliable and scalable way to handle visual classification, detection, moderation, OCR and text-understanding workloads on the platform today. By consolidating these tasks around stronger multimodal and language-model architectures, developers can avoid maintaining many narrow, task-specific legacy models and instead work with tools that generalize well, support zero-shot instructions and adapt cleanly to new use cases.
You can explore additional open source and third-party models in the Community section and use the documentation to get started with the Playground, API or fine-tuning workflows. If you need help planning a migration or selecting the right model for your workload, you can reach out to us on Discord or contact our support team here.