An Inspection of the Reproducibility and Replicability of TCT-ColBERT

Wang, X., MacAvaney, S., Macdonald, C. and Ounis, I. (2022) An Inspection of the Reproducibility and Replicability of TCT-ColBERT. In: SIGIR 2022: 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11-15 Jul 2022, pp. 2790-2800. ISBN 9781450387323 (doi: 10.1145/3477495.3531721)

Text: 268399.pdf - Accepted Version (990kB)

Abstract

Dense retrieval approaches are of increasing interest because they can better capture contextualised similarity compared to sparse retrieval models such as BM25. Among the most prominent of these approaches is TCT-ColBERT, which trains a light-weight "student" model from a more expensive "teacher" model. In this work, we take a closer look into TCT-ColBERT concerning its reproducibility and replicability. To structure our study, we propose a three-stage perspective on reproducing the training, inference, and evaluation of model-focused papers, each using artefacts produced from different stages in the pipeline. We find that, perhaps as expected, precise reproduction is more challenging when the complete training process is conducted, rather than just inference from a released trained model. Each stage provides the opportunity to perform replication and ablation experiments. We are able to replicate (i.e., produce an effective independent implementation of) model inference and dense indexing/retrieval, but are unable to replicate the training process. We conduct several ablations to cover gaps in the original paper, and make the following observations: (1) the model can function as an inexpensive re-ranker, establishing a new Pareto-optimal result; (2) the index size can be reduced by using lower-precision floating point values, but only if ties in scores are handled appropriately; (3) training needs to be conducted for the entire suggested duration to achieve optimal performance; and (4) student initialisation from the teacher is not necessary.

Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: MacAvaney, Dr Sean and Macdonald, Professor Craig and Ounis, Professor Iadh and Wang, Ms Xiao
Authors: Wang, X., MacAvaney, S., Macdonald, C., and Ounis, I.
College/School: College of Science and Engineering > School of Computing Science
ISBN: 9781450387323
Copyright Holders: Copyright © 2022 Association for Computing Machinery
First Published: First published in SIGIR 2022: 45th International ACM SIGIR Conference on Research and Development in Information Retrieval: 2790-2800
Publisher Policy: Reproduced in accordance with the publisher copyright policy

Project Code: 300982
Award No:
Project Name: Exploiting Closed-Loop Aspects in Computationally and Data Intensive Analytics
Principal Investigator: Roderick Murray-Smith
Funder's Name: Engineering and Physical Sciences Research Council (EPSRC)
Funder Ref: EP/R018634/1
Lead Dept: Computing Science