December 31, 2021

Clarifai Release 8.0


The Clarifai platform has now been updated to Release 8.0! We've added incredible new state-of-the-art models for image and video captioning, as well as multilingual and handwritten optical character recognition, that you can start using today.


Multilingual OCR "language-aware-multilingual-ocr-multiplex" Model

A shortcoming of current open-source OCR libraries such as EasyOCR and PaddleOCR is that you need to know the language beforehand. EasyOCR does support some specific combinations of languages (e.g. Japanese and English), but it doesn't allow arbitrary combinations of languages (e.g. English, Japanese, and Arabic).

Recent advances in OCR have shown that an end-to-end (E2E) training pipeline that includes both detection and recognition leads to the best results. However, many existing methods focus primarily on Latin-alphabet languages, often only on case-insensitive English characters. This model implements an E2E approach, Multiplexed Multilingual Mask TextSpotter, that performs script identification at the word level and handles different scripts with different recognition heads, all while maintaining a unified loss that simultaneously optimizes script identification and the multiple recognition heads. This method outperforms a single-head model with a similar number of parameters on end-to-end recognition tasks and achieves state-of-the-art results on the MLT17 and MLT19 joint text detection and script identification benchmarks.
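The core idea — identify the script of each detected word, then route it to a script-specific recognition head — can be sketched in plain Python. Everything below is illustrative stand-in code, not the actual model's API: the real script classifier and recognition heads are learned Transformer components, not Unicode lookups.

```python
# Hypothetical sketch of the multiplexed idea: a script-identification step
# routes each detected word crop to a per-script recognition head.
# All names here are illustrative, not the real model's interface.

def identify_script(word):
    """Toy script classifier based on Unicode ranges (stand-in for the learned head)."""
    for ch in word:
        if '\u3040' <= ch <= '\u30ff':   # Hiragana / Katakana
            return "japanese"
        if '\u0600' <= ch <= '\u06ff':   # Arabic
            return "arabic"
    return "latin"

# One "recognition head" per script; here each just transforms the input text.
RECOGNITION_HEADS = {
    "latin": lambda crop: crop.upper(),
    "japanese": lambda crop: crop,
    "arabic": lambda crop: crop,
}

def recognize(word_crop):
    """Dispatch a word crop to the recognition head matching its script."""
    script = identify_script(word_crop)
    return script, RECOGNITION_HEADS[script](word_crop)

print(recognize("hello"))       # ('latin', 'HELLO')
print(recognize("こんにちは"))   # ('japanese', 'こんにちは')
```

Because the dispatch happens per word, a single image can mix arbitrary scripts — which is exactly the limitation of fixed language combinations that this model removes.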


Multilingual OCR "paddleocr-multilingual" Model

PaddleOCR aims to create multilingual, practical, and leading OCR tools that help users train better models and apply them in practice.

  • PP-OCR series of high-quality pre-trained models, with accuracy comparable to commercial products
    • Ultra lightweight PP-OCRv2 series models: detection (3.1M) + direction classifier (1.4M) + recognition (8.5M) = 13.0M
    • Ultra lightweight PP-OCR mobile series models: detection (3.0M) + direction classifier (1.4M) + recognition (5.0M) = 9.4M
    • General PP-OCR server series models: detection (47.1M) + direction classifier (1.4M) + recognition (94.9M) = 143.4M
    • Support for Chinese, English, and digit recognition, as well as vertical text and long text recognition
    • Support for multi-language recognition: about 80 languages, including Korean, Japanese, German, and French
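Each series total above is simply the sum of its three component sizes (in MB). A quick sketch confirming the quoted arithmetic:

```python
# The PP-OCR model-size figures quoted above, in MB.
PP_OCR_MODELS = {
    "PP-OCRv2 ultra lightweight": {"detection": 3.1, "direction_classifier": 1.4, "recognition": 8.5},
    "PP-OCR mobile": {"detection": 3.0, "direction_classifier": 1.4, "recognition": 5.0},
    "PP-OCR server": {"detection": 47.1, "direction_classifier": 1.4, "recognition": 94.9},
}

for name, parts in PP_OCR_MODELS.items():
    total = round(sum(parts.values()), 1)  # round away float noise
    print(f"{name}: {total} MB")
# PP-OCRv2 ultra lightweight: 13.0 MB
# PP-OCR mobile: 9.4 MB
# PP-OCR server: 143.4 MB
```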

Image-to-text "general-english-image-caption-clip" Model

We've implemented OpenAI's CLIP neural network, which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the "zero-shot" capabilities of GPT-2 and GPT-3.
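The zero-shot mechanism can be sketched without the model itself: CLIP embeds the image and each candidate label into a shared vector space, then picks the label whose embedding is most similar to the image's. The vectors below are made-up stand-ins for real encoder outputs, not actual CLIP embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose embedding is most similar to the image embedding."""
    return max(label_embeddings,
               key=lambda label: cosine(image_embedding, label_embeddings[label]))

# Toy vectors standing in for CLIP's image and text encoders.
image_vec = [0.9, 0.1, 0.2]
labels = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.3],
}
print(zero_shot_classify(image_vec, labels))  # a photo of a dog
```

Because the categories are just text, you can swap in a new label set at inference time with no retraining — that is what makes the approach "zero-shot."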

Read more about it here.

"CLIP" text-embedder and visual-embedder Models

These are the text and visual encoders that make up CLIP: they map text and images into the same shared embedding space, so either can be compared against the other.


Microsoft's TrOCR is an encoder-decoder model consisting of an image Transformer encoder and a text Transformer decoder, delivering state-of-the-art optical character recognition (OCR) on single-line text images. This particular model is fine-tuned on IAM, a dataset of annotated handwritten images.
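The encoder-decoder pattern TrOCR follows can be illustrated with toy stand-ins: an image encoder produces features once, and a text decoder then emits tokens autoregressively until an end-of-sequence token. Every component below is a deliberately simplified placeholder, not the real Transformer layers.

```python
def image_encoder(pixels):
    """Stand-in for the image Transformer encoder: pixels -> feature vector."""
    return [sum(row) for row in pixels]

def text_decoder(features, tokens_so_far, vocab=("h", "i", "<eos>")):
    """Stand-in for the text Transformer decoder: predict the next token.

    A real decoder attends to the encoder features and the tokens generated
    so far; this toy version just steps through a fixed vocabulary.
    """
    return vocab[min(len(tokens_so_far), len(vocab) - 1)]

def ocr_line(pixels, max_len=10):
    """Autoregressive generation loop: decode tokens until <eos>."""
    features = image_encoder(pixels)
    tokens = []
    for _ in range(max_len):
        token = text_decoder(features, tokens)
        if token == "<eos>":
            break
        tokens.append(token)
    return "".join(tokens)

print(ocr_line([[0, 1], [1, 0]]))  # hi
```

The key structural point is that the output is generated token by token conditioned on the image features, which is what lets the model read free-form handwriting rather than matching fixed character templates.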

See a demo of it here.