Image Captioning through Image Transformer

He, S., Liao, W., Tavakoli, H. R., Yang, M., Rosenhahn, B. and Pugeault, N. (2021) Image Captioning through Image Transformer. In: 15th Asian Conference on Computer Vision, 30 Nov-04 Dec 2020, pp. 153-169. ISBN 9783030695378 (doi: 10.1007/978-3-030-69538-5_10)

Text: 223861.pdf - Accepted Version (2MB)

Abstract

Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous works have proposed the transformer architecture for image captioning. However, the structure between the semantic units in images (usually the regions detected by an object detection model) and in sentences (each single word) is different. Limited work has been done to adapt the transformer’s internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationship between image regions. Our design widens the original transformer layer’s inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both MSCOCO offline and online testing benchmarks. The code is available at https://github.com/wtliao/ImageTransformer.

Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Pugeault, Dr Nicolas
Authors: He, S., Liao, W., Tavakoli, H. R., Yang, M., Rosenhahn, B., and Pugeault, N.
College/School: College of Science and Engineering > School of Computing Science
ISSN: 0302-9743
ISBN: 9783030695378
Published Online: 25 February 2021
Copyright Holders: Copyright © 2021 Springer Nature Switzerland AG
First Published: First published in Computer Vision – ACCV 2020. ACCV 2020. Lecture Notes in Computer Science, vol 12625, pp 153-169
Publisher Policy: Reproduced in accordance with the copyright policy of the publisher