Light syntactically-based index pruning for information retrieval

Lioma, C. and Ounis, I. (2007) Light syntactically-based index pruning for information retrieval. Lecture Notes in Computer Science, 4425, pp. 88-100. (doi: 10.1007/978-3-540-71496-5_11)





Most index pruning techniques eliminate terms from an index on the basis of the contribution of those terms to the content of the documents. We present a novel syntactically-based index pruning technique, which uses exclusively shallow syntactic evidence to decide which terms to prune. This type of evidence is document-independent, and is based on the assumption that, in a general collection of documents, there exists an approximately proportional relation between the frequency and content of ‘blocks of parts of speech’ (POS blocks) [5]. POS blocks are fixed-length sequences of nouns, verbs, and other parts of speech, extracted from a corpus. We remove from the index terms that correspond to low-frequency POS blocks, using two different strategies: (i) considering that low-frequency POS blocks correspond to sequences of content-poor words, and (ii) considering that low-frequency POS blocks that also contain ‘non content-bearing parts of speech’, such as prepositions, correspond to sequences of content-poor words. We experiment with two TREC test collections and two statistically different weighting models. Using full indices as our baseline, we show that, at light pruning levels, syntactically-based index pruning overall enhances retrieval performance in terms of both average and early precision, while also reducing the size of the index. Our novel low-cost technique performs at least similarly to other related work, even though it does not consider document-specific information, and as such it is more general.
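
The two pruning strategies described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it rests on several assumptions that the abstract does not state: POS blocks are taken as sliding windows of n consecutive tags, a block counts as low-frequency when it occurs fewer than min_freq times in the corpus, a term is pruned only if every block covering it is flagged, and the set NON_CONTENT_TAGS (Penn Treebank style tags) is a hypothetical choice of 'non content-bearing parts of speech'. All function and parameter names are illustrative.

```python
from collections import Counter

# Hypothetical set of "non content-bearing" tags (assumption, Penn Treebank style).
NON_CONTENT_TAGS = {"IN", "DT", "CC"}


def pos_blocks(tagged, n):
    """Yield (tags, words) for every window of n consecutive tagged tokens."""
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        yield tuple(t for _, t in window), tuple(w for w, _ in window)


def terms_to_prune(corpus, n=3, min_freq=2, require_non_content=False):
    """Return terms covered only by flagged (low-frequency) POS blocks.

    require_non_content=False sketches strategy (i);
    require_non_content=True sketches strategy (ii).
    """
    # Block frequencies are counted over the whole corpus, not per document.
    freq = Counter(tags for sent in corpus for tags, _ in pos_blocks(sent, n))
    flagged, kept = set(), set()
    for sent in corpus:
        for tags, words in pos_blocks(sent, n):
            low = freq[tags] < min_freq
            if low and require_non_content:
                # Strategy (ii): the block must also contain a non-content tag.
                low = any(t in NON_CONTENT_TAGS for t in tags)
            (flagged if low else kept).update(words)
    # Assumption: a term that also appears in a kept block is not pruned.
    return flagged - kept


# Toy, hand-tagged corpus (illustrative data only).
corpus = [
    [("fast", "JJ"), ("index", "NN"), ("pruning", "NN")],
    [("light", "JJ"), ("query", "NN"), ("model", "NN")],
    [("terms", "NNS"), ("of", "IN"), ("art", "NN")],
    [("runs", "VBZ"), ("very", "RB"), ("quickly", "RB")],
]

strategy_i = terms_to_prune(corpus)
strategy_ii = terms_to_prune(corpus, require_non_content=True)
```

In this toy corpus the block (JJ, NN, NN) occurs twice and is kept, while the other two blocks occur once each. Strategy (i) therefore flags the terms of both rare blocks, whereas strategy (ii) flags only the rare block that contains a preposition. In practice the block length and frequency threshold would be tuned to reach a target pruning level.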

Item Type: Articles
Glasgow Author(s) Enlighten ID: Ounis, Professor Iadh
Authors: Lioma, C., and Ounis, I.
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
College/School: College of Science and Engineering > School of Computing Science
Journal Name: Lecture Notes in Computer Science
Copyright Holders: Copyright © 2007 Springer
First Published: First published in Lecture Notes in Computer Science 4425:88-100
Publisher Policy: Reproduced in accordance with the copyright policy of the publisher.
