
Multimodal Deep Learning Approaches and Applications

By Dan Marasco, Senior Research Scientist

Combining Multiple Modes of Data with Sequential Relationships Between Words and Images

Deep learning techniques are generally developed to reason from a specific type of data. Natural language processing (NLP) techniques typically handle text, while computer vision (CV) algorithms deal with imagery. Widely used deep learning techniques rely on architectures tailored to the distinct structures of these different forms of data. In contrast, the human mind can simultaneously process different types of data for a given task, leveraging information from one domain to enhance comprehension of another.


For example, a person primarily senses the hardness of an object through touch, but can often recognize that an object is hard by sight and sound alone. Multimodal learning research focuses on developing models that combine multiple modes of data with varying structures, such as the sequential relationships between words in natural language and the spatial relationships between pixels in images. These models aspire to create joint representations of the input data that provide richer features for downstream tasks than models leveraging a single mode of data. In this post we will introduce multimodal learning approaches as well as possible applications.

The Need for Suitable Multimodal Representations in Deep Learning

Multimodal data sources are very common. As discussed by Gao et al. (2020), a sports news article on a specific match uses images to present moments of excitement and text to record the sequence of events. Presenting these two raw forms of data together gives the reader a better understanding of the match. During inference, processing multiple sources of data would give machine learning models a similar advantage over their unimodal counterparts, especially when one data source is incomplete.


A primary deep learning task that could benefit from multimodal data fusion is feature extraction. Modern deep learning techniques typically involve developing and training deep neural network architectures for a discriminative task like classification. These models, when trained on large amounts of data, can be adapted for other downstream tasks. Intermediate features of the model, or embeddings, can be extracted and used as a representation of the input data. These embeddings are powerful because they can provide richer information than the rule-based or one-hot-encoded vectors typically used to represent categorical data in traditional machine learning algorithms, increasing performance and reducing additional data requirements for models.


The values in these embeddings encode high level information about the language or imagery that were used to create them (e.g. sentiment for language and object classes for imagery). Embeddings created from large pre-trained models like BERT (Devlin et al. 2018) are prominent in industry and academia for searching, clustering, and categorizing data. Although all of these widely used embeddings are unimodal, there has been progress in producing multimodal embeddings. Similar to their unimodal counterparts, multimodal embeddings can be used as features for various downstream classification, search, and clustering tasks but may provide a richer representation.
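The idea of reusing an intermediate layer as an embedding can be illustrated with a toy network. This is a minimal sketch with random stand-in weights (a real model like BERT would have pre-trained weights and a far larger architecture); the network shapes here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network; in practice these weights come from pre-training
# on a large discriminative task (illustrative stand-ins here).
W1 = rng.normal(size=(10, 4))   # input -> hidden layer
W2 = rng.normal(size=(4, 3))    # hidden layer -> class logits (the "head")

def forward(x):
    hidden = np.tanh(x @ W1)    # intermediate representation
    logits = hidden @ W2        # task-specific output
    return hidden, logits

x = rng.normal(size=(10,))      # raw input features
embedding, _ = forward(x)       # discard the head, keep the embedding
print(embedding.shape)          # (4,)
```

The downstream task then consumes `embedding` instead of the raw input, which is exactly how pre-trained text and image encoders are used as feature extractors.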


For example, a classifier trained on images and captions may provide a stronger representation of context, even when one of its inputs is missing or incomplete during inference. Dense unimodal text and image embeddings have had a significant impact on the field, so it is reasonable to expect that multimodal embeddings could produce similar results.
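One simple way a multimodal model can keep working with a missing input is to substitute a neutral placeholder for the absent modality so the fused representation keeps a fixed shape. The sketch below assumes zero-vector substitution for illustration; a trained model might instead use a learned "missing modality" embedding.

```python
import numpy as np

EMB_DIM = 4  # illustrative embedding size
rng = np.random.default_rng(0)

def fuse(text_emb, image_emb):
    # Substitute a zero vector for a missing modality so the fused
    # vector always has the same shape for the downstream classifier.
    if text_emb is None:
        text_emb = np.zeros(EMB_DIM)
    if image_emb is None:
        image_emb = np.zeros(EMB_DIM)
    return np.concatenate([text_emb, image_emb])

full    = fuse(rng.normal(size=EMB_DIM), rng.normal(size=EMB_DIM))
partial = fuse(rng.normal(size=EMB_DIM), None)  # caption-only input
print(full.shape, partial.shape)  # (8,) (8,)
```

Because both calls return a vector of the same shape, a single downstream model can handle complete and incomplete inputs alike.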

Creating Multimodal Embeddings Directly

One approach to creating multimodal representations is training models from scratch using multiple sources of data and architectures specific to those modalities. Many supervised and unsupervised architectures have been proposed, though many designs are customized for the specific task and data modalities available, limiting their potential to generalize to other tasks. Ngiam et al. (2011) presented an unsupervised approach in which layers of Boltzmann machines are used as an autoencoder to recreate the raw data.


For these models, text or images are used as input and passed through a deep Boltzmann network. Since they are unsupervised models, they do not require labeled training data. In a similar fashion, there are examples where a transformer architecture with an attention mechanism is used so that the importance of particular parts of an image is measured against the corresponding text tokens (Pramanik et al. 2019). Gao et al. (2020) provide an extensive survey of different trained architectures and applications.
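The attention idea above can be sketched as scaled dot-product cross-attention, where each text token attends over a set of image-region features. All shapes and the random features below are illustrative assumptions, not the architecture of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # shared feature dimension (illustrative)
tokens  = rng.normal(size=(5, d))    # 5 text-token features (queries)
regions = rng.normal(size=(9, d))    # 9 image-region features (keys/values)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Each token scores every image region; the softmax weights express how
# important each region is with respect to that token.
scores = tokens @ regions.T / np.sqrt(d)   # (5, 9)
weights = softmax(scores, axis=-1)         # rows sum to 1
attended = weights @ regions               # (5, d): region info per token

print(attended.shape)  # (5, 8)
```

Each row of `attended` is a token representation enriched with the image regions most relevant to it, which is the joint text-image interaction the transformer-based approaches exploit.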

Incorporating and Fusing Pre-trained Unimodal Embeddings

While creating multimodal embeddings from scratch can be useful, fusing pre-trained unimodal embeddings allows models to leverage rich feature extractors developed on large quantities of unimodal data to create multimodal representations.


Fusion allows the outputs of multiple models (e.g. pre-trained text and image embeddings) to be incorporated into new models for downstream tasks. One commonly used pre-trained model is BERT (Devlin et al. 2018). Training BERT (or similarly large models) from scratch is not only expensive and time consuming, but also requires large amounts of training data. Being able to reuse state-of-the-art image and text embeddings is therefore cost effective and allows high-quality information to be packed into the embedding vectors. One limitation, however, is that unimodal embeddings do not have any features jointly based on multiple modalities of data, so the eventual multimodal features may not be as rich.


As discussed by Zhang et al. (2019), there are many approaches to fusing embeddings. In all of these approaches, the fusion model takes the unimodal embeddings as input and returns output for some multimodal task. A simple method is to use basic operations like concatenation or weighted sums. Alternatively, a neural network can be trained on the fused unimodal outputs to find a pseudo-optimal configuration (i.e. optimal without altering the upstream unimodal feature extractors).
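The two basic fusion operations just mentioned are one-liners. The tiny vectors below are illustrative stand-ins for pre-trained embeddings, and the weight `alpha` is an assumed fixed value (in practice it could be learned).

```python
import numpy as np

text_emb  = np.array([0.2, -0.5, 0.1])   # stand-in pre-trained text embedding
image_emb = np.array([0.7,  0.3, -0.2])  # stand-in pre-trained image embedding

# Concatenation: preserves both vectors intact, doubling the dimensionality.
fused_concat = np.concatenate([text_emb, image_emb])   # shape (6,)

# Weighted sum: keeps the dimensionality but requires equal-sized
# embeddings; the weight could be fixed or learned downstream.
alpha = 0.6
fused_sum = alpha * text_emb + (1 - alpha) * image_emb  # shape (3,)
```

Either fused vector can then be fed to a small neural network trained for the downstream task, which is the "pseudo-optimal" configuration described above.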


Alternatively, a model can pass unimodal embeddings through a transformer in a purely attention-based fashion so that the resulting feature vector is a joint representation of multiple modalities. Attention mechanisms can also be combined with recurrent neural network layers, limiting the number of features to be attended to. Zhang et al. also discuss using pooling operations, common in convolutional neural network architectures, to create more expressive embeddings. Since this pooling amounts to the outer product of the unimodal vectors, however, the result is a large M x N matrix, which drastically increases computational complexity and training cost. In all of these approaches, the last layer of the network before the output can typically be used as a multimodal embedding.
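The size blow-up from outer-product pooling is easy to see concretely. The embedding dimensions below (300 for text, 512 for images) are illustrative choices, not taken from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
text_emb  = rng.normal(size=300)   # M = 300 (illustrative)
image_emb = rng.normal(size=512)   # N = 512 (illustrative)

# Outer-product pooling: every pairwise interaction between the two
# embeddings, yielding an M x N matrix.
bilinear = np.outer(text_emb, image_emb)   # (300, 512)
fused = bilinear.flatten()                 # 153,600 features

print(bilinear.shape, fused.shape)
```

A 300-dimensional and a 512-dimensional vector fuse into 153,600 features, versus 812 for simple concatenation, which is why this operation is expensive to train on.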


Multimodal embeddings can be valuable anywhere multiple modes of input data can inform a downstream task, especially when some modalities are incomplete. One obvious example is image captioning when the images are accompanied by text. In image captioning, a brief textual description is created from an image.


This has many uses including aiding the visually impaired or creating accurate, searchable, descriptions of the ever-growing body of visual media available on the web. For this task, we can leverage not only the image itself, but any accompanying text like a news article or a reactionary social media post.


For this example, the image and accompanying text would first be encoded by an image and a text embedding model, respectively. The resulting embeddings could then be fused and passed through a purpose-trained recurrent model to generate the caption or alternative text. Ideally, this model would function with partial data and show improved performance compared to its unimodal counterpart due to its richer understanding of the input data. This example is just one of many possibilities for multimodal deep learning. Visual question answering and visual reasoning are some of the more methodologically interesting applications researchers are working on (e.g. given an image and a question, provide a textual answer). Each application faces its own challenges, but learning to create multimodal embeddings and developing suitable architectures are important steps forward. As deep learning continues to permeate technologies in modern society, it will be increasingly important for these models to be able to process multiple, often incomplete, sources of information.
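The captioning pipeline described above (encode, fuse, decode) can be sketched end to end. Everything here is an illustrative stand-in: the "encoders" are random vectors, the decoder is a minimal untrained recurrence, and the tiny vocabulary is invented, so the generated tokens are arbitrary rather than a meaningful caption.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<start>", "a", "dog", "runs", "<end>"]  # toy vocabulary (assumed)
d = 6                                             # illustrative embedding size

# Stand-ins for outputs of pre-trained image and text encoders.
image_emb = rng.normal(size=d)
text_emb  = rng.normal(size=d)
fused = np.concatenate([image_emb, text_emb])     # multimodal context, (12,)

# Minimal recurrent decoder with untrained (random) weights.
Wh = rng.normal(size=(2 * d, 2 * d)) * 0.1        # hidden -> hidden
Wo = rng.normal(size=(2 * d, len(vocab))) * 0.1   # hidden -> vocab logits

h = np.tanh(fused)                                # condition on fused context
caption = []
for _ in range(4):                                # generate a few tokens
    h = np.tanh(h @ Wh + fused)                   # recurrence + context
    caption.append(vocab[int(np.argmax(h @ Wo))]) # greedy decoding
print(caption)
```

A trained version of this decoder would learn weights that map the fused context to fluent text, and the same skeleton extends to visual question answering by fusing a question embedding instead of a caption embedding.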


Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng, "Multimodal Deep Learning," in Proc. ICML, 2011


Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805, 2018


Chao Zhang, Zichao Yang, Xiaodong He, and Li Deng, "Multimodal Intelligence: Representation Learning, Information Fusion, and Applications," arXiv:1911.03977, 2019


Jing Gao, Peng Li, Zhikui Chen, and Jianing Zhang, "A Survey on Deep Learning for Multimodal Data Fusion," Neural Computation, vol. 32, no. 5, pp. 829-864, 2020



