Amati, G. and Van Rijsbergen, C.J. (2002) Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4), pp. 357-389. (doi: 10.1145/582415.582416)
Publisher's URL: http://doi.acm.org/10.1145/582415.582416
Abstract
We introduce and create a framework for deriving probabilistic models of Information Retrieval. The models are nonparametric models of IR obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes we study the binomial distribution and Bose-Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document-query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model.
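The abstract describes three ingredients: a basic randomness model, a first normalization (information gain) and a second normalization (document length). The sketch below shows one way they might combine into a single term weight. It is illustrative only: the record gives no formulae, so the specific choices here are assumptions, namely a geometric approximation to Bose-Einstein statistics as the randomness model, Laplace's law of succession for the gain, and a logarithmic length rescaling with a free parameter `c`. All function and parameter names are hypothetical.

```python
import math


def normalize_tf(tf, doc_len, avg_doc_len, c=1.0):
    """Second normalization (assumed form): rescale the raw term frequency
    as if the document had the average length. `c` is a free parameter."""
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


def inf_geometric(tfn, coll_freq, num_docs):
    """Basic model (assumed form): -log2 of the probability of seeing tfn
    occurrences under a geometric approximation to Bose-Einstein statistics.
    lam is the mean number of occurrences of the term per document."""
    lam = coll_freq / num_docs
    return -math.log2(1.0 / (1.0 + lam)) - tfn * math.log2(lam / (1.0 + lam))


def laplace_gain(tfn):
    """First normalization (assumed form): information gain from Laplace's
    law of succession, which shrinks as the term is seen more often."""
    return 1.0 / (tfn + 1.0)


def dfr_weight(tf, doc_len, avg_doc_len, coll_freq, num_docs, c=1.0):
    """Apply the two normalizations to the basic model in succession."""
    tfn = normalize_tf(tf, doc_len, avg_doc_len, c)
    return laplace_gain(tfn) * inf_geometric(tfn, coll_freq, num_docs)


if __name__ == "__main__":
    # Hypothetical collection statistics: 10,000 documents, average length 100.
    rare = dfr_weight(tf=3, doc_len=100, avg_doc_len=100,
                      coll_freq=100, num_docs=10_000)
    common = dfr_weight(tf=3, doc_len=100, avg_doc_len=100,
                        coll_freq=5_000, num_docs=10_000)
    print(f"rare term weight:   {rare:.3f}")
    print(f"common term weight: {common:.3f}")
```

Under these assumptions, a term that is rare in the collection receives a larger weight than a common one at the same within-document frequency, which is the idf-like behaviour the framework is meant to recover without parameters fitted to relevance data.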
Item Type: | Articles |
---|---|
Additional Information: | © ACM, 2002. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Information Systems 20(4):357-389 http://doi.acm.org/10.1145/582415.582416 |
Keywords: | Aftereffect model; BM25; Bose-Einstein statistics; Laplace; Poisson; binomial law; document length normalization; eliteness; idf; information retrieval; probabilistic models; randomness; succession law; term frequency normalization; term weighting |
Status: | Published |
Refereed: | Yes |
Glasgow Author(s) Enlighten ID: | Van Rijsbergen, Professor Cornelis |
Authors: | Amati, G., and Van Rijsbergen, C.J. |
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
College/School: | College of Science and Engineering > School of Computing Science |
Journal Name: | ACM Transactions on Information Systems |
Publisher: | ACM Press |
ISSN: | 1046-8188 |
Copyright Holders: | Copyright © 2002 ACM Press |
First Published: | First published in ACM Transactions on Information Systems 20(4):357-389 |
Publisher Policy: | Reproduced in accordance with the copyright policy of the publisher. |