On Approximate Nearest Neighbour Selection for Multi-Stage Dense Retrieval

Macdonald, C. and Tonellotto, N. (2021) On Approximate Nearest Neighbour Selection for Multi-Stage Dense Retrieval. In: 30th ACM International Conference on Information and Knowledge Management, 01-05 Nov 2021, pp. 3318-3322. ISBN 9781450384469 (doi: 10.1145/3459637.3482156)

Text: 249263.pdf - Accepted Version (1MB)

Abstract

Dense retrieval, which describes the use of contextualised language models such as BERT to identify documents from a collection by leveraging approximate nearest neighbour (ANN) techniques, has been increasing in popularity. Two families of approaches have emerged, depending on whether documents and queries are represented by single or multiple embeddings. ColBERT, the exemplar of the latter, uses an ANN index and approximate scores to identify a set of candidate documents for each query embedding, which are then re-ranked using accurate document representations. In this manner, a large number of documents can be retrieved for each query, hindering the efficiency of the approach. In this work, we investigate the use of ANN scores for ranking the candidate documents, in order to decrease the number of candidate documents being fully scored. Experiments conducted on the MSMARCO passage ranking corpus demonstrate that, by using the approximate scores to cut off the candidate set at only 200 documents, we can still obtain an effective ranking without statistically significant differences in effectiveness, while achieving a 2x speedup.
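
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the ANN index has already returned (doc_id, approximate_score) pairs for each query embedding, and it assumes that a document's approximate score is the sum, over query embeddings, of the best approximate score it received; the abstract does not state the aggregation, so that choice is an assumption. The function names, the doc_store mapping and the default k=200 (taken from the cut-off reported in the abstract) are hypothetical names introduced for illustration only.

```python
from collections import defaultdict


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def approximate_scores(ann_hits):
    """Aggregate ANN scores per document (assumed aggregation, see lead-in).

    ann_hits[i] is the list of (doc_id, approx_score) pairs returned by the
    ANN index for the i-th query embedding.
    """
    totals = defaultdict(float)
    for hits in ann_hits:
        best = {}
        for doc_id, score in hits:
            # Keep only the best hit of each document for this query embedding.
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
        for doc_id, score in best.items():
            totals[doc_id] += score
    return dict(totals)


def prune_candidates(approx, k):
    """Keep the k documents with the highest approximate score."""
    ranked = sorted(approx.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


def exact_maxsim(query_embs, doc_embs):
    """Late-interaction score: for each query embedding take the maximum
    dot product over the document's embeddings, then sum."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


def two_stage_rank(query_embs, ann_hits, doc_store, k=200):
    """Rank by exact score, but only for the top-k approximate candidates."""
    candidates = prune_candidates(approximate_scores(ann_hits), k)
    scored = [(doc_id, exact_maxsim(query_embs, doc_store[doc_id]))
              for doc_id in candidates]
    return sorted(scored, key=lambda kv: (-kv[1], kv[0]))
```

Only the documents that survive prune_candidates have their full embeddings fetched and scored exactly, which is where a saving of the kind the abstract reports would come from in this sketch.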

Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Tonellotto, Dr Nicola and Macdonald, Professor Craig
Authors: Macdonald, C., and Tonellotto, N.
College/School: College of Science and Engineering > School of Computing Science
ISBN: 9781450384469
Published Online: 30 October 2021
Copyright Holders: Copyright © 2021 Association for Computing Machinery
First Published: First published in CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management: 3318-3322
Publisher Policy: Reproduced in accordance with the publisher copyright policy

Project Code: 300982
Project Name: Exploiting Closed-Loop Aspects in Computationally and Data Intensive Analytics
Principal Investigator: Roderick Murray-Smith
Funder's Name: Engineering and Physical Sciences Research Council (EPSRC)
Funder Ref: EP/R018634/1
Lead Dept: Computing Science