Multimodality representation learning: a survey on evolution, pretraining and its applications

Arslan Manzoor, M., AlBarri, S., Xian, Z., Meng, Z., Nakov, P. and Liang, S. (2024) Multimodality representation learning: a survey on evolution, pretraining and its applications. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(3), 74. (doi: 10.1145/3617833)

304530.pdf - Accepted Version (1MB)

Abstract

Multimodality Representation Learning, a technique for learning to embed information from different modalities and their correlations, has achieved remarkable success in a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision-Language Retrieval (VLR). In these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task optimally, e.g., to understand, recognize, retrieve, or generate. Researchers have proposed diverse methods for these tasks, and variants of transformer-based architectures have performed extraordinarily well across multiple modalities. This survey presents a comprehensive review of the literature on the evolution and enhancement of deep learning multimodal architectures that handle textual, visual, and audio features for diverse cross-modal and modern multimodal tasks. This study summarizes (i) recent task-specific deep learning methodologies, (ii) pretraining types and multimodal pretraining objectives, (iii) the progression from state-of-the-art pretrained multimodal approaches to unifying architectures, and (iv) multimodal task categories and possible future improvements that can be devised for better multimodal learning. Moreover, we prepare a dataset section for new researchers that covers most of the benchmarks for pretraining and fine-tuning. Finally, major challenges, gaps, and potential research topics are explored. A continuously updated paper list related to our survey is maintained at https://github.com/marslanm/multimodality-representation-learning.

Item Type:Articles
Keywords:Multimodality, representation learning, pretrained models, multimodal methods, multimodal applications.
Status:Published
Refereed:Yes
Glasgow Author(s) Enlighten ID:Meng, Dr Zaiqiao
Authors:Arslan Manzoor, M., AlBarri, S., Xian, Z., Meng, Z., Nakov, P., and Liang, S.
College/School:College of Science and Engineering > School of Computing Science
Journal Name:ACM Transactions on Multimedia Computing, Communications, and Applications
Publisher:Association for Computing Machinery (ACM)
ISSN:1551-6857
ISSN (Online):1551-6865
Published Online:29 August 2023
Copyright Holders:Copyright © 2023 held by the owner/author(s)
First Published:First published in ACM Transactions on Multimedia Computing, Communications, and Applications 20(3):74
Publisher Policy:Reproduced in accordance with the publisher copyright policy
