Lioma, C. and Ounis, I. (2007) Light syntactically-based index pruning for information retrieval. Lecture Notes in Computer Science, 4425, pp. 88-100. (doi: 10.1007/978-3-540-71496-5_11)
Publisher's URL: http://dx.doi.org/10.1007/978-3-540-71496-5_11
Abstract
Most index pruning techniques eliminate terms from an index on the basis of the contribution of those terms to the content of the documents. We present a novel syntactically-based index pruning technique, which uses exclusively shallow syntactic evidence to decide which terms to prune. This type of evidence is document-independent, and is based on the assumption that, in a general collection of documents, there exists an approximately proportional relation between the frequency and content of ‘blocks of parts of speech’ (POS blocks) [5]. POS blocks are fixed-length sequences of nouns, verbs, and other parts of speech, extracted from a corpus. We remove from the index the terms that correspond to low-frequency POS blocks, using two different strategies: (i) considering that low-frequency POS blocks correspond to sequences of content-poor words, and (ii) considering that low-frequency POS blocks that also contain ‘non content-bearing parts of speech’, such as prepositions, correspond to sequences of content-poor words. We experiment with two TREC test collections and two statistically different weighting models. Using full indices as our baseline, we show that syntactically-based index pruning overall enhances retrieval performance, in terms of both average and early precision, for light pruning levels, while also reducing the size of the index. Our novel low-cost technique performs at least similarly to other related work, even though it does not consider document-specific information, and as such it is more general.
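The two pruning strategies described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the tag set in `NON_CONTENT_POS`, the frequency threshold, and the rule that a token is pruned if any block covering it is low-frequency are all assumptions made for illustration. The abstract specifies only fixed-length POS blocks, low-frequency blocks as the pruning signal, and prepositions as one example of a non content-bearing part of speech.

```python
from collections import Counter

# Assumed set of "non content-bearing" POS tags (Penn Treebank style).
# The abstract names only prepositions as an example; the rest is hypothetical.
NON_CONTENT_POS = {"IN", "DT", "CC", "TO"}


def pos_blocks(tagged, n):
    """Yield (start_index, block) for each fixed-length window of POS tags.

    `tagged` is a list of (term, pos_tag) pairs for one sentence.
    """
    for i in range(len(tagged) - n + 1):
        yield i, tuple(tag for _, tag in tagged[i:i + n])


def block_frequencies(corpus, n):
    """Count how often each POS block of length n occurs in the corpus."""
    counts = Counter()
    for sentence in corpus:
        for _, block in pos_blocks(sentence, n):
            counts[block] += 1
    return counts


def positions_to_prune(tagged, counts, n, threshold, require_non_content=False):
    """Return token positions covered by at least one low-frequency block.

    require_non_content=False sketches strategy (i): every block whose
    frequency is below `threshold` marks its tokens for pruning.
    require_non_content=True sketches strategy (ii): the block must also
    contain at least one tag from NON_CONTENT_POS.
    """
    pruned = set()
    for i, block in pos_blocks(tagged, n):
        low = counts[block] < threshold
        if require_non_content:
            low = low and any(tag in NON_CONTENT_POS for tag in block)
        if low:
            pruned.update(range(i, i + n))
    return pruned


# Toy, hand-tagged corpus (hypothetical data, for illustration only).
toy_corpus = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VB")],
    [("a", "DT"), ("dog", "NN"), ("ran", "VB")],
    [("of", "IN"), ("in", "IN"), ("retrieval", "NN")],
    [("fast", "JJ"), ("quick", "JJ"), ("index", "NN")],
]
toy_counts = block_frequencies(toy_corpus, n=2)
```

In this toy corpus the blocks `("DT", "NN")` and `("NN", "VB")` each occur twice, so with `threshold=2` the first two sentences are left intact. Strategy (i) marks every token of the last two sentences, while strategy (ii) marks only the third sentence, whose rare blocks contain a preposition.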
Item Type: | Articles |
---|---|
Status: | Published |
Refereed: | Yes |
Glasgow Author(s) Enlighten ID: | Ounis, Professor Iadh |
Authors: | Lioma, C., and Ounis, I. |
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
College/School: | College of Science and Engineering > School of Computing Science |
Journal Name: | Lecture Notes in Computer Science |
Publisher: | Springer |
ISSN: | 1611-3349 |
Copyright Holders: | Copyright © 2007 Springer |
First Published: | First published in Lecture Notes in Computer Science 4425:88-100 |
Publisher Policy: | Reproduced in accordance with the copyright policy of the publisher. |