Doc2Query--: When Less is More

Gospodinov, M., MacAvaney, S. and Macdonald, C. (2023) Doc2Query--: When Less is More. In: 45th European Conference on Information Retrieval (ECIR'23), Dublin, Ireland, 02-06 Apr 2023, pp. 414-422. ISBN 9783031282379 (doi: 10.1007/978-3-031-28238-6_31)

Text: 287683.pdf - Accepted Version (846kB)

Abstract

Doc2Query—the process of expanding the content of a document before indexing using a sequence-to-sequence model—has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to “hallucinating” content that is not present in the source text. We argue that Doc2Query is indeed prone to hallucination, which ultimately harms retrieval effectiveness and inflates the index size. In this work, we explore techniques for filtering out these harmful queries prior to indexing. We find that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 30% and cutting the index size by 48%. We release the code, data, and a live demonstration to facilitate reproduction and further exploration (https://github.com/terrierteam/pyterrier_doc2query).
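
The filtering idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the relevance model, so `overlap_score` is a toy term-overlap stand-in for it, and the function names and the `threshold` value of 0.5 are hypothetical. The released code at the URL above is the authoritative reference.

```python
from typing import Callable, List


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a relevance model (an assumption, not the paper's model):
    the fraction of query terms that also appear in the document."""
    query_terms = query.lower().split()
    if not query_terms:
        return 0.0
    doc_terms = set(document.lower().split())
    return sum(term in doc_terms for term in query_terms) / len(query_terms)


def filter_queries(
    document: str,
    queries: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.5,  # hypothetical cut-off chosen for illustration
) -> List[str]:
    """Keep only the generated queries whose score meets the threshold."""
    return [q for q in queries if score_fn(q, document) >= threshold]


def expand_document(document: str, queries: List[str], **kwargs) -> str:
    """Append the surviving queries to the document text before indexing."""
    kept = filter_queries(document, queries, **kwargs)
    return " ".join([document] + kept)


doc = "the cat sat on the mat"
generated = ["where did the cat sit", "cat on mat", "dog breeds list"]

print(filter_queries(doc, generated))
print(expand_document(doc, generated))
```

In this toy run only the query fully supported by the document text survives, so the expanded document grows by one query instead of three. Swapping `score_fn` for a learned relevance model would keep the same structure.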

Item Type: Conference Proceedings
Additional Information: Sean MacAvaney and Craig Macdonald acknowledge EPSRC grant EP/R018634/1: Closed-Loop Data Science for Complex, Computationally- & Data-Intensive Analytics.
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: MacAvaney, Dr Sean and Macdonald, Professor Craig
Authors: Gospodinov, M., MacAvaney, S., and Macdonald, C.
College/School: College of Science and Engineering > School of Computing Science
Research Centre: College of Science and Engineering > School of Computing Science > IDA Section
ISSN: 0302-9743
ISBN: 9783031282379
Copyright Holders: Copyright © 2023 The Authors, under exclusive license to Springer Nature Switzerland AG
First Published: First published in Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13981. Springer, Cham., pp 414-422
Publisher Policy: Reproduced in accordance with the copyright policy of the publisher

Project Code: 300982
Award No:
Project Name: Exploiting Closed-Loop Aspects in Computationally and Data Intensive Analytics
Principal Investigator: Roderick Murray-Smith
Funder's Name: Engineering and Physical Sciences Research Council (EPSRC)
Funder Ref: EP/R018634/1
Lead Dept: Computing Science