C4AV: learning cross-modal representations from transformers

Luo, S., Dai, H., Shao, L. and Ding, Y. (2021) C4AV: learning cross-modal representations from transformers. In: Bartoli, A. and Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops. Series: Lecture notes in computer science, 12536. Springer, pp. 33-38. ISBN 9783030660963 (doi: 10.1007/978-3-030-66096-3_3)

Full text not currently available from Enlighten.


In this paper, we focus on the object referral problem in the autonomous driving setting. We propose a novel framework that learns cross-modal representations with transformers. To extract the linguistic feature, we feed the input command to a transformer encoder; meanwhile, a ResNet backbone learns the image features. The image features are flattened and used as the query inputs to a transformer decoder, where they are aggregated with the linguistic features. Region-of-interest (RoI) alignment is applied to the feature map output by the transformer decoder to crop the RoI features for the region proposals. Finally, a multi-layer classifier performs object referral from the features of the proposal regions.
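The pipeline described in the abstract can be sketched roughly as follows. This is an illustrative PyTorch sketch, not the authors' implementation: all layer sizes, depths, and the simplified RoI pooling (plain average pooling over integer feature-map boxes, standing in for RoI alignment) and the small convolutional stem (standing in for the ResNet backbone) are assumptions for compactness.

```python
import torch
import torch.nn as nn


def simple_roi_pool(fmap, boxes):
    # Simplified stand-in for RoI alignment: average-pool each box region.
    # boxes: list of (x1, y1, x2, y2) in integer feature-map coordinates.
    feats = [fmap[:, :, y1:y2, x1:x2].mean(dim=(2, 3)) for (x1, y1, x2, y2) in boxes]
    return torch.stack(feats, dim=1)  # B x num_boxes x C


class C4AVSketch(nn.Module):
    """Illustrative cross-modal pipeline: text encoder + image backbone +
    transformer decoder fusing the two, then per-region classification."""

    def __init__(self, vocab_size=1000, d_model=64, nhead=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Small conv stem standing in for the ResNet backbone (assumption).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
        )
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.fusion_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        # Multi-layer classifier scoring each proposal region.
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, command_ids, image, boxes):
        text = self.text_encoder(self.embed(command_ids))   # linguistic features
        fmap = self.backbone(image)                         # B x C x H x W
        B, C, H, W = fmap.shape
        # Flattened image features act as the decoder queries.
        queries = fmap.flatten(2).transpose(1, 2)           # B x (H*W) x C
        # Cross-attention in the decoder aggregates image and text features.
        fused = self.fusion_decoder(queries, text)          # B x (H*W) x C
        fused_map = fused.transpose(1, 2).reshape(B, C, H, W)
        roi_feats = simple_roi_pool(fused_map, boxes)       # B x num_boxes x C
        return self.classifier(roi_feats).squeeze(-1)       # score per region


# Dummy usage: 2 commands, 2 images, 2 candidate regions per image.
model = C4AVSketch()
cmd = torch.randint(0, 1000, (2, 6))
img = torch.randn(2, 3, 32, 32)
boxes = [(0, 0, 4, 4), (2, 2, 8, 8)]
scores = model(cmd, img, boxes)  # shape (2, 2): one score per region
```

The referred object would then be the region with the highest score (e.g. `scores.argmax(dim=1)`); a real system would train this with a cross-entropy or ranking loss over the proposal regions.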

Item Type: Book Sections
Additional Information: Print ISBN: 9783030660956
Glasgow Author(s) Enlighten ID: Dai, Dr Hang
Authors: Luo, S., Dai, H., Shao, L., and Ding, Y.
College/School: College of Science and Engineering > School of Computing Science
