October 27, 2025

Run DeepSeek-OCR with an API

TL;DR

DeepSeek-OCR is the latest open-weight OCR model from DeepSeek, built to extract structured text, formulas, and tables from complex documents with high accuracy. It combines a vision encoder (based on SAM and CLIP) and a Mixture-of-Experts decoder (DeepSeek-3B-MoE) for efficient text generation.

You can try DeepSeek-OCR directly on Clarifai — no separate API key or setup required.

  • Playground: Test DeepSeek-OCR directly in the Clarifai Playground here.

  • API Access: Use Clarifai’s OpenAI-compatible endpoint. Authenticate with your Personal Access Token (PAT) and specify the DeepSeek-OCR model URL.

Introduction

DeepSeek-OCR is a multi-modal model designed to convert complex images such as invoices, scientific papers, and handwritten notes into accurate, structured text.

Unlike traditional OCR systems that rely purely on convolutional networks for text detection and recognition, DeepSeek-OCR uses a transformer-based encoder-decoder architecture. This allows it to handle dense documents, tables, and mixed visual content more effectively while keeping GPU usage low.

Key features:

  • Processes images as vision tokens using a hybrid SAM + CLIP encoder.

  • Compresses visual data by up to 10× with minimal accuracy loss.

  • Uses a 3B-parameter Mixture-of-Experts decoder, activating only 6 of 64 experts during inference for high efficiency.

  • Can process up to 200K pages per day on a single A100 GPU due to its optimized token compression and activation strategy.

Run DeepSeek-OCR

You can access DeepSeek-OCR in two simple ways: through the Clarifai Playground or via the API.

Playground

The Playground provides a fast, interactive environment to test and explore model behavior. You can select the DeepSeek-OCR model directly from the community, upload an image such as an invoice, scanned document, or handwritten page, and add a relevant prompt describing what you want the model to extract or analyze. The output text is displayed in real time, allowing you to quickly verify accuracy and formatting.

DeepSeek-OCR via API

Clarifai provides an OpenAI-compatible endpoint that allows you to call DeepSeek-OCR using the same Python or TypeScript client libraries you already use. Once you set your Personal Access Token (PAT) as an environment variable, you can call the model directly by specifying its URL.

Below are two ways to send an image input — either from a local file or via an image URL.

Option 1: Using a Local Image File

This example reads a local file (e.g., document.jpeg), encodes it in base64, and sends it to the model for OCR extraction.
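
Below is a minimal TypeScript sketch of that flow using the official OpenAI Node SDK. The base URL, the model URL, and the CLARIFAI_PAT environment variable name are assumptions; copy the exact endpoint and model URL from Clarifai's documentation and the DeepSeek-OCR model page.

```typescript
// Sketch: OCR a local image via Clarifai's OpenAI-compatible endpoint.
// The baseURL and model URL below are assumed placeholders.
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.clarifai.com/v2/ext/openai/v1", // assumed Clarifai endpoint
  apiKey: process.env.CLARIFAI_PAT,                     // your Personal Access Token
});

async function main() {
  // Read the local image and encode it as a base64 data URL.
  const base64Image = fs.readFileSync("document.jpeg").toString("base64");

  const response = await client.chat.completions.create({
    // Placeholder: use the DeepSeek-OCR model URL from the Clarifai community.
    model: "https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Extract all text, tables, and formulas from this document." },
          { type: "image_url", image_url: { url: `data:image/jpeg;base64,${base64Image}` } },
        ],
      },
    ],
  });

  console.log(response.choices[0].message.content);
}

main();
```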

Option 2: Using an Image URL

If your image is hosted online, you can directly pass its URL to the model.
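
A matching sketch for a hosted image, with the same assumed endpoint and placeholder model URL, only swaps the base64 data URL for the image's public URL.

```typescript
// Sketch: OCR an image hosted at a public URL (placeholder URL shown).
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.clarifai.com/v2/ext/openai/v1", // assumed Clarifai endpoint
  apiKey: process.env.CLARIFAI_PAT,
});

const response = await client.chat.completions.create({
  model: "https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR", // placeholder
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract the text from this invoice as structured markdown." },
        { type: "image_url", image_url: { url: "https://example.com/invoice.png" } },
      ],
    },
  ],
});

console.log(response.choices[0].message.content);
```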

You can use Clarifai’s OpenAI-compatible API with any TypeScript or JavaScript SDK. For example, the snippets below show how to use the Vercel AI SDK to access DeepSeek-OCR.

Option 1: Using a Local Image File
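
A minimal sketch using the AI SDK's generateText call, reading the file from disk. The endpoint and model URL are the same assumed placeholders as in the examples above.

```typescript
// Sketch: Vercel AI SDK with a local image (Buffer passed as the image part).
import fs from "node:fs";
import { generateText } from "ai";
import { createOpenAI } from "@ai-sdk/openai";

const clarifai = createOpenAI({
  baseURL: "https://api.clarifai.com/v2/ext/openai/v1", // assumed Clarifai endpoint
  apiKey: process.env.CLARIFAI_PAT,
});

const { text } = await generateText({
  model: clarifai.chat("https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR"), // placeholder
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract all text and tables from this document." },
        { type: "image", image: fs.readFileSync("document.jpeg") },
      ],
    },
  ],
});

console.log(text);
```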

Option 2: Using an Image URL
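
For a hosted image, the image content part can be a URL object instead of a Buffer; everything else stays the same.

```typescript
// Sketch: Vercel AI SDK with a hosted image URL (placeholder URL shown).
import { generateText } from "ai";
import { createOpenAI } from "@ai-sdk/openai";

const clarifai = createOpenAI({
  baseURL: "https://api.clarifai.com/v2/ext/openai/v1", // assumed Clarifai endpoint
  apiKey: process.env.CLARIFAI_PAT,
});

const { text } = await generateText({
  model: clarifai.chat("https://clarifai.com/deepseek-ai/deepseek-ocr/models/DeepSeek-OCR"), // placeholder
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Transcribe this page, preserving tables and formulas." },
        { type: "image", image: new URL("https://example.com/scanned-page.png") },
      ],
    },
  ],
});

console.log(text);
```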

Clarifai’s OpenAI-compatible API lets you access DeepSeek-OCR from any language or SDK that supports the OpenAI format. You can experiment in the Clarifai Playground or integrate it directly into your applications. Learn more about the OpenAI-compatible API in the documentation here.

How DeepSeek-OCR Works

DeepSeek-OCR is built from the ground up using a two-stage vision-language architecture that combines a powerful vision encoder and a Mixture-of-Experts (MoE) text decoder. This setup enables efficient and accurate text extraction from complex documents.

Architecture overview of DeepSeek-OCR (image source: DeepSeek-OCR research paper).

DeepEncoder (Vision Encoder)

The DeepEncoder is a 380M-parameter vision backbone that transforms raw images into compact visual embeddings.

  • Patch Embedding: The input image is divided into 16×16 patches.

  • Local Attention (SAM - ViTDet):
    SAM applies local attention to capture fine-grained features such as font style, handwriting, edges, and texture details within each region of the image. This helps preserve spatial precision at a local level.

  • Downsampling: The patch embeddings are downsampled 16× via convolution to reduce the total number of visual tokens and improve efficiency.

  • Global Attention (CLIP - ViT):
    CLIP introduces global attention, enabling the model to understand document layout, structure, and semantic relationships across sections of the image.

  • Compact Visual Embeddings:
    The encoder produces a sequence of vision tokens roughly 10× smaller than the equivalent text-token count, resulting in high compression and faster decoding (see the token-count sketch after this list).
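
To make the compression figures above concrete, here is a rough back-of-the-envelope token count, assuming a square input, 16×16 patches, and the 16× convolutional downsampling described in the paper (illustrative only, not exact model behavior):

```typescript
// Rough token-budget estimate for the DeepEncoder pipeline described above.
function visionTokenEstimate(imageSize: number, patchSize = 16, downsample = 16): number {
  const patchTokens = (imageSize / patchSize) ** 2; // tokens entering SAM's local attention
  return patchTokens / downsample;                  // tokens passed to CLIP and the decoder
}

// A 1024×1024 page: 4096 patch tokens -> 256 vision tokens.
console.log(visionTokenEstimate(1024)); // 256
```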

DeepSeek-3B-MoE Decoder

The encoded visual tokens are passed to a Mixture-of-Experts Transformer Decoder, which converts them into readable text.

  • Expert Activation: 6 out of 64 experts are activated per token, along with 2 shared experts (about 570M active parameters).

  • Text Generation: Transformer layers decode the visual embeddings into structured text sequences, capturing plain text, formulas, tables, and layout information.

  • Efficiency and Scale: Although the full model has 3B parameters, only a fraction is active for any given token, delivering 3B-scale quality at the compute cost of fewer than 600M active parameters (a generic routing sketch follows this list).
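
The routing idea itself is easy to sketch. The snippet below is a generic illustration of top-k expert selection, not DeepSeek's actual routing code: a router scores every expert for each token, and only the k highest-scoring experts (plus the shared experts) are executed.

```typescript
// Generic top-k expert routing: only the k best-scoring experts run per token.
function selectExperts(routerScores: number[], k = 6): number[] {
  return routerScores
    .map((score, expertId) => ({ score, expertId }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((e) => e.expertId);
}

// 64 routed experts, 6 selected per token; shared experts would always run.
const scores = Array.from({ length: 64 }, () => Math.random());
console.log(selectExperts(scores)); // e.g. [17, 42, 3, 58, 9, 31]
```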

Conclusion

DeepSeek-OCR is more than a breakthrough in document understanding. It redefines how multimodal models process visual information by combining SAM’s fine-grained visual precision, CLIP’s global layout reasoning, and a Mixture-of-Experts decoder for efficient text generation. Through Clarifai, you can experiment with DeepSeek-OCR in the Playground or integrate it directly via the OpenAI-compatible API.

Learn more: