January 30, 2024

Multimodal AI with Cross-Modal Search



Cross-modal search is an emerging frontier in the world of information retrieval and data science. It represents a paradigm shift from traditional search methods, allowing users to query across diverse data types, such as text, images, audio, and video. It breaks down the barriers between different data modalities, offering a more holistic and intuitive search experience. This blog post aims to explore the concept of cross-modal search and its potential applications, and dive into the technical intricacies that make it possible. As the digital world continues to expand and diversify, cross-modal search technology is paving the way for more advanced, flexible, and accurate data retrieval.

Understanding Search Modalities: Unimodal, Cross-Modal, and Multimodal Search Explained

Unimodal, cross-modal, and multimodal search are distinguished by the types of data used in the query and in the content being searched. Here’s a brief explanation of each:

  • Unimodal search involves a single mode, or type, of data: the query and the content being searched share the same modality. For example, you provide a short text description of what you are looking for and receive a ranked list of short paragraphs. Looking up recipes, answers on Quora, or a short history lesson on Wikipedia is a unimodal search (in this case, with text). The same applies to image-to-image search, such as using Pinterest Lens to find similar apparel designs. Unimodal search is the simplest form of search and is widely used in traditional search engines and databases.


Example Wikipedia article search on “vector quantization”

  • Cross-modal search refers to the ability to search across different modalities: the query is expressed in one modality, and the content to be retrieved is in a different one. Imagine using a text description to search over images within your personal photo album. That would save so much scrolling time!

  • Multimodal search involves using two or more modalities in the search query and the retrieval process. This could mean combining text, images, audio, video, and other data types in the search. Multimodal search is important because it reflects the rich and complex nature of human communication.

With Clarifai, you can use the “General” workflow for image-to-image search and the “Text” workflow for text-to-text search, both of which are unimodal. Previously, to mimic text-to-image (cross-modal) search, we leveraged the 9000+ concepts in the General model as our vocabulary. Now, with the advent of visual-language models like CLIP, we have launched the “Universal” workflow to enable anyone to use natural language to search over images.
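To make the idea concrete, here is a minimal sketch of how text-to-image retrieval can work once a visual-language model has placed text and images in a shared embedding space. The three-dimensional vectors and file names below are hypothetical stand-ins for real model outputs, and cosine similarity is assumed as the ranking metric; this illustrates the general technique, not the internals of the Universal workflow.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_embedding, image_embeddings):
    """Return (image_name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_embedding, vector))
        for name, vector in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical image embeddings; a real model would produce hundreds of dimensions.
IMAGE_EMBEDDINGS = {
    "beach_pineapple.jpg": [0.9, 0.1, 0.2],
    "city_skyline.jpg": [0.1, 0.9, 0.3],
    "forest_trail.jpg": [0.2, 0.3, 0.9],
}

# Hypothetical embedding of a text query, produced by the same model.
QUERY_EMBEDDING = [0.8, 0.2, 0.1]

ranking = rank_images(QUERY_EMBEDDING, IMAGE_EMBEDDINGS)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the query and the images live in the same vector space, the text never has to be matched against tags or captions; the image whose embedding points in the most similar direction ranks first.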

How to perform Text-to-Image search with Clarifai

These operations can be done via the API or the portal UI. First, log in to your account or sign up here for free.

Using the API

In this example, we will use Clarifai’s Python SDK to keep the code to as few lines as possible. Before you get started, get your Personal Access Token (PAT) by following these steps. Also follow the homepage instructions to install the SDK in one step. Use this notebook to follow along in your development environment or in Google Colab.

1. Create a new app with the default workflow specified as the “Universal” workflow

2. Upload the following 3 example images. Since this is a short demo, we directly ingest the inputs into the app. For production purposes, we recommend using datasets to organize your inputs. The SDK currently supports uploading from a CSV file and from a folder, and you can find the details in the examples.


3. Perform the search by calling the query method and passing in a ranking.

4. The response is a generator. See the results by checking the “hits” attribute of each item it yields.
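The four steps above can be sketched roughly as follows. This is an outline based on our understanding of the Clarifai Python SDK at the time of writing, not a copy of the notebook: the user ID, app ID, and image URLs are hypothetical placeholders, `build_text_rank` and `run_demo` are helper names introduced here for illustration, and the SDK calls (`User.create_app`, `upload_from_url`, `app.search`, `query`) should be checked against the current SDK documentation. The sketch also assumes your PAT is exported as the `CLARIFAI_PAT` environment variable.

```python
import os
from typing import Dict, List, Optional

# Hypothetical placeholders: replace with your own IDs and image URLs.
USER_ID = "your_user_id"
APP_ID = "cross_modal_demo"
IMAGE_URLS = [
    "https://example.com/image-1.jpg",
    "https://example.com/image-2.jpg",
    "https://example.com/image-3.jpg",
]

def build_text_rank(query: str) -> List[Dict[str, str]]:
    """Build the ranking passed to the query method: a single raw-text rank."""
    return [{"text_raw": query}]

def run_demo(query: str) -> Optional[str]:
    """Create the app, upload the images, search by text, return the top image URL."""
    # Imported here so the rest of the sketch loads without the SDK installed.
    from clarifai.client.user import User

    if "CLARIFAI_PAT" not in os.environ:
        raise RuntimeError("Set the CLARIFAI_PAT environment variable first.")

    # Step 1: create an app whose default workflow is "Universal".
    client = User(user_id=USER_ID)
    app = client.create_app(app_id=APP_ID, base_workflow="Universal")

    # Step 2: ingest the example images directly into the app.
    inputs = app.inputs()
    for i, url in enumerate(IMAGE_URLS):
        inputs.upload_from_url(input_id=f"input{i}", image_url=url)

    # Step 3: run the search with a text ranking, keeping only the best match.
    search = app.search(top_k=1)
    response = search.query(ranks=build_text_rank(query))

    # Step 4: the response is a generator; each item exposes its matches as "hits".
    for page in response:
        if page.hits:
            return page.hits[0].input.data.image.url
    return None
```

Calling `run_demo("Red pineapples on the beach")` with valid credentials should return the URL of the best-matching image. Newly uploaded inputs may take a moment to be indexed, so an immediate search can come back empty.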

Using the UI

1. Create a new app by clicking the “+ Create” button in the top right corner of the portal screen. By default, “Start with a Blank App” is selected for you. For “Primary Input Type”, leave the default “Image/Video” selected, as it sets the app’s base workflow to the Universal workflow. To verify that, click on “Advanced Settings”. Once the App ID and the short description have been filled in, click “Create App”.

2. You’ll then be automatically navigated to the app you just created. At this time, you might see the following “Add a model” pop-up. Click “Cancel” on the bottom left corner as we do not need this for our tutorial.

3. Upload images! On the left sidebar, click “Inputs”. Then click the blue button “Upload Inputs” on the top right. We can enter the image URLs line by line. Alternatively, we can upload them via a CSV file with a specific format. Here we use the following URLs. Copy and paste these into the box without new lines. 

4. After the upload is complete, you should see all 3 images. In the search bar, enter a text query and hit enter. Here we have used “Red pineapples on the beach” as an example, and indeed, the search returns a ranked list with the most semantically similar image first. 


The choice between unimodal, cross-modal, and multimodal search depends on the nature of your data and the goals of your search. If you need to find information across different types of data, a cross-modal search is necessary. As AI technology advances, there is a growing trend towards multimodal and cross-modal systems due to their ability to provide richer and more contextually relevant search results.

Try it out on the Clarifai platform today! Can’t find what you need? Consult our Docs Page or send us a message in our Community Discord channel.