Human Attention in Image Captioning: Dataset and Analysis

He, S., Tavakoli, H. R., Borji, A. and Pugeault, N. (2019) Human Attention in Image Captioning: Dataset and Analysis. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, 27 Oct - 02 Nov 2019, pp. 8528-8537. ISBN 9781728148038 (doi: 10.1109/ICCV.2019.00862)

Abstract

In this work, we present a novel dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences in human attention during free-viewing and image captioning tasks. We look into the relationship between human attention and language constructs during perception and sentence articulation. We also analyse attention deployment mechanisms in the top-down soft attention approach that is argued to mimic human attention in captioning tasks, and investigate whether visual saliency can help image captioning. Our study reveals that (1) human attention behaviour differs in free-viewing and image description tasks: humans tend to fixate on a greater variety of regions under the latter task; (2) there is a strong relationship between described objects and attended objects (97% of the described objects are attended); (3) a convolutional neural network as feature encoder accounts for human-attended regions during image captioning to a great extent (around 78%); (4) the soft-attention mechanism differs from human attention, both spatially and temporally, and there is low correlation between caption scores and attention consistency scores, indicating a large gap between humans and machines with regard to top-down attention; and (5) by integrating the soft attention model with image saliency, we can significantly improve the model's performance on the Flickr30k and MSCOCO benchmarks. The dataset can be found at: https://github.com/SenHe/Human-Attention-in-Image-Captioning.
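As a rough illustration of the kind of saliency integration described in point (5), the sketch below modulates standard additive soft-attention weights with a saliency prior before pooling the region features. The variable names, parameter shapes, and the multiplicative fusion are assumptions for illustration only, not the paper's actual architecture.

    import numpy as np

    def soft_attention_with_saliency(features, hidden, saliency, W_f, W_h, w_a):
        """Minimal sketch of saliency-modulated soft attention.

        features : (K, D) array of K region features from a CNN encoder
        hidden   : (H,) decoder hidden state at the current time step
        saliency : (K,) non-negative saliency prior for each region
        W_f      : (D, A), W_h : (H, A), w_a : (A,) attention parameters

        Returns the attended context vector (D,) and attention weights (K,).
        """
        # Standard additive (soft) attention scores per region.
        scores = np.tanh(features @ W_f + hidden @ W_h) @ w_a   # (K,)
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                                     # softmax over regions

        # Hypothetical fusion: re-weight the attention distribution by the
        # bottom-up saliency prior and renormalise.
        alpha = alpha * (saliency + 1e-8)
        alpha /= alpha.sum()

        context = alpha @ features                               # (D,) weighted sum
        return context, alpha

    # Toy usage with random arrays, just to show the expected shapes.
    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        K, D, H, A = 196, 512, 256, 128
        ctx, att = soft_attention_with_saliency(
            rng.normal(size=(K, D)), rng.normal(size=H), rng.random(K),
            rng.normal(size=(D, A)), rng.normal(size=(H, A)), rng.normal(size=A))
        print(ctx.shape, att.shape)  # (512,) (196,)

In practice the saliency prior would come from a separate saliency predictor; the multiplicative re-weighting shown here is just one simple way to bias the learned top-down attention towards bottom-up salient regions.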

Item Type:Conference Proceedings
Additional Information:This research is supported by the EPSRC project DEVA (EP/N035399/1). Dr Pugeault is supported by the Alan Turing Institute (EP/N510129/1).
Status:Published
Refereed:Yes
Glasgow Author(s) Enlighten ID:Pugeault, Dr Nicolas
Authors: He, S., Tavakoli, H. R., Borji, A., and Pugeault, N.
College/School:College of Science and Engineering > School of Computing Science
Publisher:IEEE
ISSN:2380-7504
ISBN:9781728148038
Published Online:27 February 2020
Copyright Holders:Copyright © 2019 IEEE
First Published:First published in 2019 IEEE/CVF International Conference on Computer Vision (ICCV): 8528-8537
Publisher Policy:Reproduced under a Creative Commons License