The Clarifai platform has evolved significantly. Earlier generations of the platform relied on many small, task-specific models for visual classification, detection, OCR, text classification and segmentation. These legacy models were built on older architectures that were sensitive to domain shift, required separate training pipelines and did not generalize well outside their original conditions.
The ecosystem has moved on. Modern large language models and vision-language models are trained on broader multimodal data, cover multiple tasks within a single model family and deliver more stable performance across different input types. As part of the platform upgrade, we are standardizing around these newer model types.
With this update, several legacy task-specific models are being deprecated and will no longer be available. Their functionality is still fully supported on the platform, but is now provided through more capable and general model families. Compute Orchestration manages scheduling, scaling and resource allocation for these models so that workloads behave consistently across open source and custom model deployments.
This blog outlines the core task categories supported today, the recommended models for each and how to use them within the platform. It also clarifies which older models are being retired and how their capabilities map to the current model families.
Visual classification and recognition involve identifying objects, scenes and concepts in an image. These tasks power product tagging, content moderation, semantic search, retrieval indexing and general scene understanding.
Modern vision-language models handle these tasks well in zero-shot mode. Instead of training separate classifiers, you define the taxonomy in the prompt and the model returns labels directly, which reduces the need for task-specific training and simplifies updates.
The models below offer strong visual understanding and perform well for classification, recognition, concept extraction and image moderation workflows, including sensitive-safety taxonomy setups.
MiniCPM-o 2.6
A compact VLM that handles images, video and text. Performs well for flexible classification workloads where speed, cost efficiency and coverage need to be balanced.
Qwen2.5-VL-7B-Instruct
Optimized for visual recognition, localized reasoning and structured visual understanding. Strong at identifying concepts in images with multiple objects and extracting structured information.
Moderation with MM-Poly-8B
A large portion of real-world visual classification work involves moderation. Many customer workloads are built around determining whether an image is safe, sensitive or banned according to a specific policy. Unlike general classification, moderation requires strict taxonomy, conservative thresholds and consistent rule-following. This is where MM-Poly-8B is particularly effective.
MM-Poly-8B is Clarifai’s multimodal model designed for detailed, prompt-driven analysis across images, text, audio and video. It performs well when the classification logic needs to be explicit and tightly controlled. Moderation teams often rely on layered instructions, examples and edge-case handling. MM-Poly-8B supports this pattern directly and behaves predictably when given structured policies or rule sets.
Key capabilities:
Accepts image, text, audio and video inputs
Handles detailed taxonomies and multi-level decision logic
Supports example-driven prompting
Produces consistent classifications for safety-critical use cases
Works well when the moderation policy requires conservative interpretation and bias toward safety
Because MM-Poly-8B is tuned to follow instructions faithfully, it is suited for moderation scenarios where false negatives carry higher risk and models must err on the side of caution. It can be prompted to classify content using your policy, identify violations, return structured reasoning or generate confidence-based outputs.
If you want to demonstrate a moderation workflow, you can prompt the model with a clear taxonomy and ruleset. For example:
“Evaluate this image according to the categories Safe, Suggestive, Explicit, Drug and Gore. Apply a strict safety policy and classify the image into the most appropriate category.”

For more advanced use cases, you can provide the model with a detailed set of moderation rules, decision criteria and examples that define how each category should be applied. This allows you to verify how the model behaves under stricter, policy-driven conditions and how it can be integrated into production-grade moderation pipelines.
MM-Poly-8B is available on the platform and can be used through the Playground or accessed programmatically via the OpenAI-compatible API.
Note: If you want to use models like MiniCPM-o 2.6 and Qwen2.5-VL-7B-Instruct directly, you can deploy them to your own dedicated compute on the platform and access them via the API just like any other model.
All models described above can be accessed through Clarifai’s OpenAI-compatible API. Send an image and a prompt in a single request and receive either plain text or structured JSON, which is useful when you need consistent labels or want to feed the results into downstream pipelines.
For details on structured JSON output, check the documentation here.
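To make this concrete, here is a minimal sketch of a moderation request using the openai Python client against the OpenAI-compatible endpoint. The base URL, model path and environment variable name are illustrative placeholders, so verify them against the API documentation linked above before using this in production.

```python
# Minimal sketch: classify an image against a moderation taxonomy through
# Clarifai's OpenAI-compatible API. The base URL, model path and env var
# are placeholders; check the platform docs for the exact values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # OpenAI-compatible endpoint (verify in docs)
    api_key=os.environ["CLARIFAI_PAT"],                     # personal access token
)

response = client.chat.completions.create(
    model="https://clarifai.com/clarifai/main/models/MM-Poly-8B",  # placeholder model path
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Evaluate this image according to the categories Safe, "
                        "Suggestive, Explicit, Drug and Gore. Apply a strict safety "
                        "policy and return a JSON object with the fields "
                        "'category' and 'reasoning'."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image-to-moderate.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)  # e.g. {"category": "Safe", "reasoning": "..."}
```

The same pattern extends to stricter policies: move the rules, decision criteria and examples into a system message and keep the per-image user message short. For formally structured JSON output, use the options described in the documentation above rather than relying on the prompt alone.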
If your application requires domain-specific labels, industry-specific concepts or a dataset that differs from general web imagery, you can train a custom classifier using Clarifai’s visual classification templates. These templates provide configurable training pipelines with adjustable hyperparameters, allowing you to build models tailored to your use case.
Available templates include:
MMClassification ResNet 50 RSB A1
Clarifai InceptionBatchNorm
Clarifai InceptionV2
Clarifai ResNeXt
Clarifai InceptionTransferEmbedNorm
You can upload your dataset, configure hyperparameters and train your own classifier through the UI or API. Check out the Fine-tuning Guide on the platform.
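As a rough illustration of the SDK flow (create a model, pick a training template, adjust hyperparameters, start training), a sketch like the following is possible. The method names, IDs and template identifier here are assumptions to confirm against the Fine-tuning Guide.

```python
# Rough sketch of training a custom visual classifier with the Clarifai Python SDK.
# IDs, the template identifier and exact method signatures are assumptions; verify
# them against the Fine-tuning Guide before running.
from clarifai.client.app import App

app = App(app_id="my-app", user_id="my-user-id", pat="YOUR_PAT")

# Create a visual classifier backed by one of the training templates above
model = app.create_model(model_id="my-classifier", model_type_id="visual-classifier")

# Fetch the default hyperparameters for a template, point it at your dataset, then train
params = model.get_params(template="MMClassification_ResNet_50_RSB_A1", save_to="params.yaml")
model.update_params(dataset_id="my-dataset", concepts=["shoe", "bag", "jacket"])

version_id = model.train()
print("Training started, model version:", version_id)
```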
Document intelligence covers OCR, layout understanding and structured field extraction across scanned pages, forms and text-heavy images. The legacy OCR pipeline on the platform relied on language-specific PaddleOCR variants. These models were narrow in scope, sensitive to formatting issues and required separate maintenance for each language. They are now being decommissioned.
Models being decommissioned
These models were single-language engines with limited robustness. Modern OCR and multimodal systems support multilingual extraction by default and handle noisy scans, mixed formats and documents that combine text and visual elements without requiring separate pipelines.
DeepSeek OCR
DeepSeek OCR is the primary open-source option. It supports multilingual documents, processes noisy scans reasonably well and can handle structured and unstructured documents. However, it is not perfect. Benchmarks show inconsistent accuracy on messy handwriting, irregular layouts and low-resolution scans. It also has input size constraints that can limit performance on large documents or multi-page flows. While it is stronger than the earlier language-specific engines, it is not the best option for high-stakes extraction on complex documents.
The platform also supports several multimodal models that combine OCR with visual reasoning. These models can extract text, interpret tables, identify key fields and summarize content even when structure is complex. They are more capable than DeepSeek OCR, especially for long documents or workflows requiring reasoning.
Gemini 2.5 Pro
Handles text-heavy documents, receipts, forms and complex layouts with strong multimodal reasoning.
Claude Opus 4.5
Performs well on dense, complex documents, including table interpretation and structured extraction.
Claude Sonnet 4.5
A faster option that still produces reliable field extraction and summarization for scanned pages.
GPT-5.1
Reads documents, extracts fields, interprets tables and summarizes multi-section pages with strong semantic accuracy.
Gemini 2.5 Flash
Lightweight and optimized for speed. Suitable for common forms, receipts and straightforward document extraction.
These models perform well across languages, handle complex layouts and understand document context. The tradeoffs matter. They are closed-source, require third-party inference and are more expensive to operate at scale compared to an open-source OCR engine. They are ideal for high-accuracy extraction and reasoning, but not always cost-efficient for large batch OCR workloads.
Using the Playground
Upload your document image or scanned page in the Playground and run it with DeepSeek OCR or any of the multimodal models listed above. These models return Markdown-formatted text, which preserves structure such as headings, paragraphs, lists or table-like formatting. This makes it easier to render the extracted content directly or process it in downstream document workflows.

Using the API (OpenAI-compatible)
All these models are also accessible through Clarifai’s OpenAI-compatible API. Send the image and prompt in one request, and the model returns the extracted content in Markdown. This makes it easy to use directly in downstream pipelines. Check out the detailed guide on accessing DeepSeek OCR via the API.
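A minimal sketch of that request is shown below, with a locally scanned page passed as a base64 data URL. The base URL and model path are placeholders; the guide linked above has the exact identifiers.

```python
# Minimal sketch: OCR a scanned page through the OpenAI-compatible API and get
# Markdown back. Base URL and model path are placeholders; see the DeepSeek OCR guide.
import base64
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # verify endpoint in docs
    api_key=os.environ["CLARIFAI_PAT"],
)

with open("invoice_scan.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="https://clarifai.com/deepseek-ai/deepseek/models/DeepSeek-OCR",  # placeholder path
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract all text from this document as Markdown, preserving headings and tables.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)

markdown_text = response.choices[0].message.content
print(markdown_text)
```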
Text classification is used in moderation, topic labeling, intent detection, routing, and broader text understanding. These tasks require models that follow instructions reliably, generalize across domains, and support multilingual input without needing task-specific retraining.
Instruction-tuned language models make this much easier. They can perform classification in a zero-shot manner, where you define the classes or rules directly in the prompt and the model returns the label without needing a dedicated classifier. This makes it easy to update categories, experiment with different label sets and deploy the same logic across multiple languages. If you need deeper domain alignment, these models can also be fine-tuned.
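As a sketch of what this looks like in practice, the snippet below defines the taxonomy entirely in the prompt, so changing categories is just a prompt edit. The endpoint and model path are placeholders for whichever instruction-tuned model you choose from the list that follows.

```python
# Illustrative zero-shot text classification via the OpenAI-compatible API.
# Base URL and model path are placeholders; swap in the model you deploy.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # verify endpoint in docs
    api_key=os.environ["CLARIFAI_PAT"],
)

ticket = "My card was charged twice for the same order, please refund one of them."

response = client.chat.completions.create(
    model="https://clarifai.com/qwen/qwenLM/models/Qwen3-14B",  # placeholder model path
    messages=[
        {
            "role": "user",
            "content": (
                "Classify the following support message into exactly one of: "
                "Billing, Shipping, Product Question, Abuse, Other. "
                "Return only the label.\n\n" + ticket
            ),
        }
    ],
)

print(response.choices[0].message.content)  # e.g. "Billing"
```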
Below are some of the stronger models on the platform for text classification and NLP:
Gemma 3 (12B)
A recent open model from Google, tuned for efficiency and high-quality language understanding. Strong at zero-shot classification, multilingual reasoning, and following prompt instructions across varied classification tasks.
MiniCPM-4 8B
A compact, high-performing model built for instruction following. Works well on classification, QA, and general-purpose language tasks with competitive performance at lower latency.
Qwen3-14B
A multilingual model trained on a wide range of language tasks. Excels at zero-shot classification, text routing, and multi-language moderation and topic identification.
Note: If you want to use open-source models like Gemma 3, MiniCPM-4 or Qwen3 directly, you can deploy them to your own dedicated compute and access them via the API just like any other model on the platform.
There are also many additional third-party and open-source models available in the Community section, including GPT-5.1 family variants, Gemini 2.5 Pro and several other high-quality options. You can explore these based on your scale and domain-specific needs.
In addition to the models listed above, the platform also lets you bring your own models or deploy open source models from the Community using Compute Orchestration (CO). This is helpful when you need a model that isn’t already available on the platform, or when you want full control over how a model runs in production.
CO handles the operational details required to serve models reliably. It containerizes models automatically, applies GPU fractioning so multiple models can share the same hardware, manages autoscaling and uses optimized scheduling to reduce latency under load. This lets you scale custom or open source models without needing to manage the underlying infrastructure.
CO supports deployment on multiple cloud environments such as AWS, Azure and GCP, which helps avoid vendor lock-in and gives you flexibility in how and where your models run. Check out the guide here on uploading and deploying your own custom models.
The model families outlined in this guide represent the most reliable and scalable way to handle visual classification, detection, moderation, OCR and text-understanding workloads on the platform today. By consolidating these tasks around stronger multimodal and language-model architectures, developers can avoid maintaining many narrow, task-specific legacy models and instead work with tools that generalize well, support zero-shot instructions and adapt cleanly to new use cases.
You can explore additional open source and third-party models in the Community section and use the documentation to get started with the Playground, API or fine-tuning workflows. If you need help planning a migration or selecting the right model for your workload, you can reach out to us on Discord or contact our support team here.