CMRE-UoG team at ImageCLEFmedical Caption 2022: Concept Detection and Image Captioning

Dalla Serra, F., Deligianni, F., Dalton, J. and O'Neil, A. Q. (2022) CMRE-UoG team at ImageCLEFmedical Caption 2022: Concept Detection and Image Captioning. In: CLEF 2022: Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5–8, 2022, pp. 1381-1390.

274427.pdf - Published Version, 1MB
Available under License Creative Commons Attribution.

Publisher's URL: http://ceur-ws.org/Vol-3180/

Abstract

This work presents our team's solutions for the ImageCLEFmedical Caption 2022 task [1]. The task is structured as two subtasks: (1) the Concept Detection subtask, which consists of detecting the Concept Unique Identifiers (CUIs) from the Unified Medical Language System (UMLS) [2] attributed to each image; and (2) the Caption Prediction subtask, which involves generating an accurate description of the content of the image, based on the concepts detected in the first subtask. For both subtasks, the dataset is a subset of the Radiology Objects in COntext (ROCO) dataset [3]. In the Concept Detection subtask, we experiment with two strategies: a) supervised learning, in which we train a Convolutional Neural Network (CNN) [4, 5] to classify the full set of CUIs; and b) image retrieval, in which we retrieve the top K most “similar” images from the training set based on the cosine similarity between image representations (extracted from the last average pooling layer), and combine the associated CUIs by soft majority voting, similar to the winning approach at ImageCLEFmed Caption 2021 [6]. Our best submission uses the second strategy, image retrieval, with an ensemble of five different CNNs. This approach ranked 2nd with an F1 score of 0.451, approximately 5 × 10⁻⁴ behind the 1st position. In the Caption Prediction subtask, we adopt an image encoder-decoder Transformer model [7], which takes as input the image representation, generated by a CNN image encoder, and produces a text caption describing the image. We also consider a multimodal encoder-decoder Transformer model, which differs from the previous model in that it additionally takes as input the CUIs extracted in the first subtask, alongside the image representation. Our multimodal approach ranked 6th with a BLEU score [8] of 0.291, and 1st in terms of ROUGE [9] (the secondary metric for this subtask), with a score of 0.201.
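
For context on the retrieval strategy, the following is a minimal sketch of the soft majority voting step, assuming precomputed image embeddings (e.g. from the CNN's last average pooling layer). The function name, the value of K, and the vote threshold are illustrative, not the paper's values.

```python
import numpy as np

def detect_concepts(query_emb, train_embs, train_cuis, k=5, threshold=0.5):
    """Retrieval-based concept detection via soft majority voting (sketch)."""
    # Normalise so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q                        # (n,) cosine similarities

    top = np.argsort(sims)[::-1][:k]    # indices of the top-K neighbours

    # Each neighbour votes for its CUIs, weighted by its similarity score.
    votes = {}
    for i in top:
        for cui in train_cuis[i]:
            votes[cui] = votes.get(cui, 0.0) + sims[i]

    # Keep CUIs whose normalised vote mass clears the threshold.
    total = sims[top].sum()
    return {cui for cui, v in votes.items() if v / total >= threshold}
```

An ensemble, as in the submitted run, could repeat this with embeddings from each CNN and pool the resulting votes.

The multimodal Caption Prediction model can similarly be pictured as an encoder-decoder Transformer whose encoder input concatenates the image representation with embeddings of the detected CUIs. The PyTorch sketch below is only illustrative: the dimensions, vocabulary sizes, and the use of a plain nn.Transformer are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalCaptioner(nn.Module):
    """Caption decoder conditioned on image features and detected CUIs (sketch)."""

    def __init__(self, img_dim=2048, d_model=512, cui_vocab=9000, txt_vocab=30000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)       # project CNN features
        self.cui_emb = nn.Embedding(cui_vocab, d_model)   # embed detected CUIs
        self.txt_emb = nn.Embedding(txt_vocab, d_model)   # embed caption tokens
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, txt_vocab)      # next-token logits

    def forward(self, img_feats, cui_ids, caption_ids):
        # img_feats: (B, P, img_dim) CNN features; cui_ids: (B, C) concept IDs;
        # caption_ids: (B, T) caption tokens for teacher-forced decoding.
        # (Positional encodings omitted for brevity.)
        src = torch.cat([self.img_proj(img_feats), self.cui_emb(cui_ids)], dim=1)
        tgt = self.txt_emb(caption_ids)
        mask = self.transformer.generate_square_subsequent_mask(caption_ids.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.lm_head(out)
```

The image-only variant described first in the abstract corresponds to dropping the cui_emb branch from the encoder input.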

Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Dalton, Dr Jeff and Dalla Serra, Francesco and Deligianni, Dr Fani
Authors: Dalla Serra, F., Deligianni, F., Dalton, J., and O'Neil, A. Q.
College/School: College of Science and Engineering > School of Computing Science
ISSN: 1613-0073
Copyright Holders: Copyright © 2022 for this paper by its authors
First Published: First published in Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum 3180: 1381-1390
Publisher Policy: Reproduced under a Creative Commons License
