Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval

Ge, X., Chen, F., Jose, J. M., Ji, Z., Wu, Z. and Liu, X. (2021) Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval. In: 29th ACM International Conference on Multimedia (MM '21), Chengdu, China, 20-24 Oct 2021, pp. 5185-5193. ISBN 9781450386517 (doi: 10.1145/3474085.3475634)

253061.pdf - Accepted Version



The current state-of-the-art image-sentence retrieval methods implicitly align visual-textual fragments, such as regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, retrieval performance remains unsatisfactory because the two modalities lack consistent representations in both the semantic and the structural spaces. In this work, we address this issue from two aspects: (i) constructing the intrinsic structure (along with relations) among the fragments of each modality, e.g., "dog → play → ball" as the semantic structure of an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. We propose a novel Structured Multi-modal Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. To jointly and explicitly learn the visual-textual embedding and the cross-modal alignment, SMFEA introduces a multi-modal structured module with a shared context-aware referral tree. In particular, the relations among visual and textual fragments are modeled by a Visual Context-aware Structured Tree encoder (VCS-Tree) and a Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels, from which visual and textual features can be jointly learned and optimized. We utilize the multi-modal tree structure to explicitly align the heterogeneous image-sentence data by maximizing the semantic and structural similarity between corresponding inter-modal tree nodes. Extensive experiments on the Microsoft COCO and Flickr30K benchmarks demonstrate the superiority of the proposed model over state-of-the-art methods.
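The alignment objective described above, maximizing similarity between corresponding inter-modal tree nodes, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the dictionary-based tree representation, and the use of mean cosine similarity over shared node labels are all simplifying assumptions for illustration only.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def tree_alignment_score(visual_nodes, textual_nodes):
    """Mean cosine similarity over corresponding tree nodes.

    `visual_nodes` and `textual_nodes` (hypothetical names) map shared
    node labels, e.g. "dog", "play", "ball", to embedding vectors;
    only labels present in both trees are compared. A training loop
    would maximize this score for matching image-sentence pairs.
    """
    shared = set(visual_nodes) & set(textual_nodes)
    if not shared:
        return 0.0
    return sum(cosine(visual_nodes[k], textual_nodes[k])
               for k in shared) / len(shared)

# Toy example: identical node embeddings give a score close to 1.0.
v = {"dog": [1.0, 0.0], "play": [0.0, 1.0], "ball": [1.0, 1.0]}
t = {"dog": [1.0, 0.0], "play": [0.0, 1.0], "ball": [1.0, 1.0]}
print(tree_alignment_score(v, t))
```

In the actual model the node embeddings would come from the VCS-Tree and TCS-Tree encoders rather than fixed vectors, and the similarity would feed a learning objective alongside the structural term.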

Item Type:Conference Proceedings
Additional Information:This work was supported by National Key R&D Program of China, under Grant No. 2020AAA0104500.
Glasgow Author(s) Enlighten ID:Jose, Professor Joemon and Ge, Ms Xuri
Authors: Ge, X., Chen, F., Jose, J. M., Ji, Z., Wu, Z., and Liu, X.
College/School:College of Science and Engineering > School of Computing Science
Journal Name:Computing Research Repository
Copyright Holders:Copyright © 2021 The Authors
First Published:First published in 29th ACM International Conference on Multimedia (MM '21): 5185-5193
Publisher Policy:Reproduced in accordance with the publisher copyright policy
