Dalla Serra, F., Deligianni, F., Dalton, J. and O'Neil, A. Q. (2022) CMRE-UoG team at ImageCLEFmedical Caption 2022: Concept Detection and Image Captioning. In: CLEF 2022: Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5–8, 2022, pp. 1381-1390.
Publisher's URL: http://ceur-ws.org/Vol-3180/
Abstract
This work presents the proposed solutions of our team for the ImageCLEFmedical Caption 2022 task [1]. The task is structured as two subtasks: (1) the Concept Detection subtask, which consists of detecting the Concept Unique Identifiers (CUIs) from the Unified Medical Language System (UMLS) [2] attributed to each image; and (2) the Caption Prediction subtask, which involves generating an accurate description of the content of the image, based on the concepts detected in the first subtask. For both subtasks, the dataset corresponds to a subset of the Radiology Objects in COntext (ROCO) dataset [3]. In the Concept Detection subtask, we experiment with two different strategies: a) supervised learning, in which we train a Convolutional Neural Network (CNN) [4, 5] to classify the full set of CUIs; and b) image retrieval, in which we retrieve the top K most "similar" images from the training set based on the cosine similarity between image representations (extracted from the last average pooling layer) and combine the associated CUIs using a soft majority voting approach, similar to the ImageCLEFmed Caption 2021 winning approach [6]. Our best submission uses the second, image retrieval approach with an ensemble of five different CNNs. This approach ranked 2nd with an F1 score of 0.451, a margin of approximately 5 × 10^-4 from the 1st position. In the Caption Prediction subtask, we adopt an image encoder-decoder Transformer model [7], which takes as input the image representation, generated by a CNN image encoder, and generates a text caption describing the image. Furthermore, we considered a multimodal encoder-decoder Transformer model, which differs from the previous model by additionally taking as input the CUIs extracted in the first subtask, alongside the image representation.
Our multimodal approach ranked 6th with a BLEU score [8] of 0.291, and ranked 1st in terms of ROUGE [9] (the secondary metric for this subtask), with a score of 0.201.
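The retrieval-based strategy described above (top-K neighbours by cosine similarity on CNN features, then soft majority voting over the neighbours' CUIs) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the similarity-weighted scoring, and the acceptance threshold are assumptions made for the sketch.

```python
import numpy as np

def retrieve_and_vote(query_feat, train_feats, train_cuis, k=5, threshold=0.5):
    """Predict CUIs for a query image by soft majority voting over its
    top-k most similar training images (cosine similarity on CNN features).

    query_feat:  (d,) feature vector, e.g. from the last average-pooling layer
    train_feats: (n, d) feature matrix for the training images
    train_cuis:  list of n sets of CUI strings attached to each training image
    """
    # Cosine similarity between the query and every training image.
    q = query_feat / np.linalg.norm(query_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = t @ q

    # Indices of the k most similar training images.
    topk = np.argsort(sims)[::-1][:k]

    # Soft majority voting (illustrative weighting): each neighbour votes for
    # its CUIs with weight equal to its similarity; keep CUIs whose
    # normalised vote mass reaches the threshold.
    scores = {}
    total = sims[topk].sum()
    for i in topk:
        for cui in train_cuis[i]:
            scores[cui] = scores.get(cui, 0.0) + sims[i]
    return {cui for cui, s in scores.items() if s / total >= threshold}
```

An ensemble, as in the best submission, could average the similarity scores (or merge the retrieved neighbour lists) from several CNN backbones before voting.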
Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Dalton, Dr Jeff and Dalla Serra, Francesco and Deligianni, Dr Fani
Authors: Dalla Serra, F., Deligianni, F., Dalton, J., and O'Neil, A. Q.
College/School: College of Science and Engineering > School of Computing Science
ISSN: 1613-0073
Copyright Holders: Copyright © 2022 for this paper by its authors
First Published: First published in Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum 3180: 1381-1390
Publisher Policy: Reproduced under a Creative Commons License