DeepSeek-OCR is the latest open-weight OCR model from DeepSeek, built to extract structured text, formulas, and tables from complex documents with high accuracy. It combines a vision encoder (based on SAM and CLIP) and a Mixture-of-Experts decoder (DeepSeek-3B-MoE) for efficient text generation.
You can try DeepSeek-OCR directly on Clarifai — no separate API key or setup required.
Playground: Test DeepSeek-OCR interactively in the Clarifai Playground.
API Access: Use Clarifai’s OpenAI-compatible endpoint. Authenticate with your Personal Access Token (PAT) and specify the DeepSeek-OCR model URL.
DeepSeek-OCR is a multi-modal model designed to convert complex images such as invoices, scientific papers, and handwritten notes into accurate, structured text.
Unlike traditional OCR systems that rely purely on convolutional networks for text detection and recognition, DeepSeek-OCR uses a transformer-based encoder-decoder architecture. This allows it to handle dense documents, tables, and mixed visual content more effectively while keeping GPU usage low.
Key features:
Processes images as vision tokens using a hybrid SAM + CLIP encoder.
Compresses visual data by up to 10× with minimal accuracy loss.
Uses a 3B-parameter Mixture-of-Experts decoder, activating only 6 of 64 experts during inference for high efficiency.
Can process up to 200K pages per day on a single A100 GPU due to its optimized token compression and activation strategy.
You can access DeepSeek-OCR in two simple ways: through the Clarifai Playground or via the API.
The Playground provides a fast, interactive environment to test and explore model behavior. You can select the DeepSeek-OCR model directly from the community, upload an image such as an invoice, scanned document, or handwritten page, and add a relevant prompt describing what you want the model to extract or analyze. The output text is displayed in real time, allowing you to quickly verify accuracy and formatting.

Clarifai provides an OpenAI-compatible endpoint that allows you to call DeepSeek-OCR using the same Python or TypeScript client libraries you already use. Once you set your Personal Access Token (PAT) as an environment variable, you can call the model directly by specifying its URL.
Below are two ways to send an image input — either from a local file or via an image URL.
Option 1: Using a Local Image File
This example reads a local file (e.g., document.jpeg), encodes it in base64, and sends it to the model for OCR extraction.
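Here is a minimal sketch using the official openai Python client. The base URL points at Clarifai's OpenAI-compatible endpoint; the exact model URL and the CLARIFAI_PAT environment variable name are illustrative assumptions, so confirm both on the model page before running it.

```python
import base64
import os

from openai import OpenAI

# Clarifai's OpenAI-compatible endpoint, authenticated with your PAT.
# The base URL, env var name, and model URL below are assumptions -- check the docs.
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key=os.environ["CLARIFAI_PAT"],
)

# Read the local image and base64-encode it so it can be embedded in the request.
with open("document.jpeg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR",  # hypothetical model URL
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text from this document as markdown."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```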
Option 2: Using an Image URL
If your image is hosted online, you can directly pass its URL to the model.
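The same assumptions apply in this sketch; only the image content part changes, passing the hosted URL (a placeholder address here) instead of base64 data.

```python
import os

from openai import OpenAI

# Same assumed Clarifai OpenAI-compatible endpoint and model URL as above.
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key=os.environ["CLARIFAI_PAT"],
)

response = client.chat.completions.create(
    model="https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR",  # hypothetical model URL
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text from this document as markdown."},
                # Pass the hosted image URL directly (placeholder address).
                {"type": "image_url", "image_url": {"url": "https://example.com/invoice.jpeg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```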
You can use Clarifai’s OpenAI-compatible API with any TypeScript or JavaScript SDK. For example, the snippets below show how to use the Vercel AI SDK to access the DeepSeek-OCR model.
Option 1: Using a Local Image File
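A minimal sketch using the Vercel AI SDK (the ai and @ai-sdk/openai packages). As in the Python examples, the model URL and the CLARIFAI_PAT variable name are illustrative assumptions.

```typescript
import { readFileSync } from "node:fs";
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

// Point an OpenAI-compatible provider at Clarifai's endpoint.
// The base URL and model URL are assumptions -- confirm them on the model page.
const clarifai = createOpenAI({
  baseURL: "https://api.clarifai.com/v2/ext/openai/v1",
  apiKey: process.env.CLARIFAI_PAT,
});

async function main() {
  const { text } = await generateText({
    model: clarifai("https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR"), // hypothetical model URL
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Extract all text from this document as markdown." },
          // Read the local file and send its raw bytes; the SDK handles the encoding.
          { type: "image", image: readFileSync("document.jpeg") },
        ],
      },
    ],
  });
  console.log(text);
}

main();
```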
Option 2: Using an Image URL
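The same setup, passing a hosted image URL (placeholder address) instead of local file bytes.

```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

// Same assumed Clarifai base URL and model URL as in the previous sketch.
const clarifai = createOpenAI({
  baseURL: "https://api.clarifai.com/v2/ext/openai/v1",
  apiKey: process.env.CLARIFAI_PAT,
});

async function main() {
  const { text } = await generateText({
    model: clarifai("https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR"), // hypothetical model URL
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Extract all text from this document as markdown." },
          // Pass the hosted image as a URL (placeholder address).
          { type: "image", image: new URL("https://example.com/invoice.jpeg") },
        ],
      },
    ],
  });
  console.log(text);
}

main();
```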
Clarifai’s OpenAI-compatible API lets you access DeepSeek-OCR using any language or SDK that supports the OpenAI format. You can experiment in the Clarifai Playground or integrate it directly into your applications. Learn more about the OpenAI-compatible API in the documentation.
DeepSeek-OCR is built from the ground up using a two-stage vision-language architecture that combines a powerful vision encoder and a Mixture-of-Experts (MoE) text decoder. This setup enables efficient and accurate text extraction from complex documents.

Image Source: DeepSeek-OCR Research Paper
The DeepEncoder is a 380M-parameter vision backbone that transforms raw images into compact visual embeddings.
Patch Embedding: The input image is divided into 16×16 patches.
Local Attention (SAM - ViTDet):
SAM applies local attention to capture fine-grained features such as font style, handwriting, edges, and texture details within each region of the image. This helps preserve spatial precision at a local level.
Downsampling: The patch embeddings are downsampled 16× via convolution to reduce the total number of visual tokens and improve efficiency.
Global Attention (CLIP - ViT):
CLIP introduces global attention, enabling the model to understand document layout, structure, and semantic relationships across sections of the image.
Compact Visual Embeddings:
The encoder produces a sequence of vision tokens that is roughly 10× shorter than the equivalent text-token sequence, yielding high compression and faster decoding (a rough token-count example follows this list).
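As a back-of-the-envelope illustration of that pipeline, the sketch below works through the token counts for an assumed 1024×1024 input (the input resolution is an assumption for illustration; the 16×16 patches and 16× downsampling come from the description above).

```python
# Rough token math for the DeepEncoder, under illustrative assumptions:
# a 1024x1024 input, 16x16 patches, and the 16x convolutional downsampling described above.
image_size = 1024
patch_size = 16
downsample_factor = 16

patch_tokens = (image_size // patch_size) ** 2     # 64 * 64 = 4096 patches seen by SAM's local attention
vision_tokens = patch_tokens // downsample_factor  # 4096 / 16 = 256 tokens passed to CLIP's global attention

print(patch_tokens, vision_tokens)  # 4096 256
```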
The encoded visual tokens are passed to a Mixture-of-Experts Transformer Decoder, which converts them into readable text.
Expert Activation: 6 out of 64 experts are activated per token, along with 2 shared experts (about 570M active parameters).
Text Generation: Transformer layers decode the visual embeddings into structured text sequences, capturing plain text, formulas, tables, and layout information.
Efficiency and Scale: Although the total model size is 3B parameters, only a fraction is active during inference, delivering 3B-scale performance at an active cost of under 600M parameters (a toy routing sketch follows this list).
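To make the routing idea concrete, here is a toy, self-contained sketch of top-k expert selection. It is not DeepSeek's implementation; the hidden dimension and random weights are arbitrary illustrative values.

```python
import numpy as np

# Toy sketch of top-k expert routing: each token's router scores all 64 experts,
# only the top 6 are activated, and 2 shared experts are always used.
num_experts, top_k, num_shared = 64, 6, 2
hidden_dim = 1280  # illustrative size, not the real model dimension

rng = np.random.default_rng(0)
token = rng.standard_normal(hidden_dim)
router_weights = rng.standard_normal((num_experts, hidden_dim))

scores = router_weights @ token        # one routing score per expert
active = np.argsort(scores)[-top_k:]   # indices of the 6 selected experts
gates = np.exp(scores[active]) / np.exp(scores[active]).sum()  # softmax over selected experts

print(f"routed experts: {sorted(active.tolist())}, plus {num_shared} shared experts")
print(f"gate weights: {np.round(gates, 3)}")
print(f"fraction of routed experts used per token: {top_k / num_experts:.3f}")
```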
DeepSeek-OCR is more than a breakthrough in document understanding. It redefines how multimodal models process visual information by combining SAM’s fine-grained visual precision, CLIP’s global layout reasoning, and a Mixture-of-Experts decoder for efficient text generation. Through Clarifai, you can experiment with DeepSeek-OCR in the Playground or integrate it directly via the OpenAI-compatible API.