Gospodinov, M., MacAvaney, S. and Macdonald, C. (2023) Doc2Query--: When Less is More. In: 45th European Conference on Information Retrieval (ECIR'23), Dublin, Ireland, 02-06 Apr 2023, pp. 414-422. ISBN 9783031282379 (doi: 10.1007/978-3-031-28238-6_31)
Text: 287683.pdf - Accepted Version. Restricted to Repository staff only until 17 March 2024. 766kB
Abstract
Doc2Query—the process of expanding the content of a document before indexing using a sequence-to-sequence model—has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to “hallucinating” content that is not present in the source text. We argue that Doc2Query is indeed prone to hallucination, which ultimately harms retrieval effectiveness and inflates the index size. In this work, we explore techniques for filtering out these harmful queries prior to indexing. We find that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 30% and cutting the index size by 48%. We release the code, data, and a live demonstration to facilitate reproduction and further exploration (https://github.com/terrierteam/pyterrier_doc2query).
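The filtering step described in the abstract can be illustrated with a minimal sketch: score each generated query against its source document and keep only those above a threshold before appending them to the indexed text. Everything named here is an assumption made for illustration. `overlap_score` is a toy term-overlap stand-in for the relevance model, and `filter_queries`, `expand_document`, and the `threshold` value are hypothetical and are not taken from the paper or from the `pyterrier_doc2query` code.

```python
from typing import Callable, List


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a relevance model (assumption, not the paper's model).

    Returns the fraction of query terms that also occur in the document.
    """
    query_terms = query.lower().split()
    if not query_terms:
        return 0.0
    doc_terms = set(document.lower().split())
    return sum(term in doc_terms for term in query_terms) / len(query_terms)


def filter_queries(
    document: str,
    queries: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.5,  # hypothetical cut-off, chosen only for this example
) -> List[str]:
    """Keep generated queries whose score against the document meets the threshold."""
    return [q for q in queries if score_fn(q, document) >= threshold]


def expand_document(
    document: str,
    queries: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.5,
) -> str:
    """Append only the retained queries to the document text before indexing."""
    kept = filter_queries(document, queries, score_fn, threshold)
    return " ".join([document] + kept)


if __name__ == "__main__":
    doc = "doc2query expands documents with generated queries before indexing"
    generated = ["doc2query generated queries", "capital of france"]
    # The second query shares no terms with the document, so it is dropped.
    print(filter_queries(doc, generated))
    print(expand_document(doc, generated))
```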
Item Type: | Conference Proceedings |
---|---|
Additional Information: | Sean MacAvaney and Craig Macdonald acknowledge EPSRC grant EP/R018634/1: Closed-Loop Data Science for Complex, Computationally- & Data-Intensive Analytics. |
Status: | Published |
Refereed: | Yes |
Glasgow Author(s) Enlighten ID: | MacAvaney, Dr Sean and Macdonald, Professor Craig |
Authors: | Gospodinov, M., MacAvaney, S., and Macdonald, C. |
College/School: | College of Science and Engineering > School of Computing Science |
Research Centre: | College of Science and Engineering > School of Computing Science > IDA Section |
ISSN: | 0302-9743 |
ISBN: | 9783031282379 |
Copyright Holders: | Copyright © 2023 The Authors, under exclusive license to Springer Nature Switzerland AG |
First Published: | First published in Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13981. Springer, Cham., pp 414-422 |
Publisher Policy: | Reproduced in accordance with the copyright policy of the publisher |