Assessing multivariate Bernoulli models for information retrieval

Losada, D. and Azzopardi, L. (2008) Assessing multivariate Bernoulli models for information retrieval. ACM Transactions on Information Systems, 26(3), (doi: 10.1145/1361684.1361690)

Full text not currently available from Enlighten.

Publisher's URL: http://portal.acm.org/citation.cfm?id=1361684.1361690

Abstract

Although the seminal proposal to introduce language modeling in information retrieval was based on a multivariate Bernoulli model, the predominant modeling approach is now centered on multinomial models. Language modeling for retrieval based on multivariate Bernoulli distributions is seen as inefficient and is believed to be less effective than the multinomial model. In this article, we examine the multivariate Bernoulli model with respect to its successor and examine its role in future retrieval systems. In the context of Bayesian learning, these two modeling approaches are described, contrasted, and compared both theoretically and computationally. We show that the query likelihood following a multivariate Bernoulli distribution introduces interesting retrieval features which may be useful for specific retrieval tasks such as sentence retrieval. Then, we address the efficiency aspect and show that algorithms can be designed to perform retrieval efficiently for multivariate Bernoulli models, before performing an empirical comparison to study the behavioral aspects of the models. A series of comparisons is then conducted on a number of test collections and retrieval tasks to determine the empirical and practical differences between the different models. Our results indicate that for sentence retrieval the multivariate Bernoulli model can significantly outperform the multinomial model. However, for the other tasks the multinomial model provides consistently better performance (and in most cases significantly so). An analysis of the various retrieval characteristics reveals that the multivariate Bernoulli model tends to promote long documents whose nonquery terms are informative. While this is detrimental to the task of document retrieval (documents tend to contain considerable nonquery content), it is valuable for other tasks such as sentence retrieval, where the retrieved elements are very short and focused.
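
As a rough illustration of the contrast drawn in the abstract, the sketch below scores a query under a multinomial and a multivariate Bernoulli query likelihood. It is not the estimator used in the article: the Jelinek-Mercer smoothing, the weight `lam`, the reuse of the smoothed relative frequency as the Bernoulli parameter, and the toy documents are all assumptions made for illustration. The point it shows is structural: the multinomial score depends only on query terms, whereas the Bernoulli score also includes a factor (1 - p) for every vocabulary term absent from the query, so nonquery content affects the ranking.

```python
import math
from collections import Counter


def term_probs(doc_tokens, collection_counts, collection_len, lam=0.5):
    """Smoothed p(t|d) for every vocabulary term.

    Assumption: Jelinek-Mercer smoothing with weight lam; the article
    works in a Bayesian learning setting that is not reproduced here.
    """
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    return {
        t: (1 - lam) * tf[t] / dl + lam * cf / collection_len
        for t, cf in collection_counts.items()
    }


def multinomial_log_likelihood(query_tokens, probs):
    # Sum over query term occurrences only; nonquery terms play no role.
    # Query terms are assumed to be in the vocabulary.
    return sum(math.log(probs[t]) for t in query_tokens)


def bernoulli_log_likelihood(query_tokens, probs):
    # Sum over the whole vocabulary: a query term contributes log p,
    # a nonquery term contributes log(1 - p). Repeated query terms are
    # collapsed because each term is a binary present/absent event.
    q = set(query_tokens)
    return sum(
        math.log(p) if t in q else math.log(1 - p)
        for t, p in probs.items()
    )


# Hypothetical toy collection, for illustration only.
docs = {
    "d1": ["bernoulli", "model"],
    "d2": ["bernoulli", "model", "retrieval", "ranking"],
}
collection_counts = Counter(t for d in docs.values() for t in d)
collection_len = sum(collection_counts.values())
query = ["bernoulli"]

for name, tokens in docs.items():
    probs = term_probs(tokens, collection_counts, collection_len)
    print(
        name,
        multinomial_log_likelihood(query, probs),
        bernoulli_log_likelihood(query, probs),
    )
```

Because the Bernoulli score visits every vocabulary term, a naive implementation costs time proportional to the vocabulary size per document; the abstract's efficiency claim concerns algorithms that avoid this, which this sketch does not attempt.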

Item Type: Articles
Keywords: multivariate Bernoulli, information retrieval, multinomial, language models
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Azzopardi, Dr Leif
Authors: Losada, D., and Azzopardi, L.
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
College/School: College of Science and Engineering > School of Computing Science
Journal Name: ACM Transactions on Information Systems
Publisher: ACM Press
ISSN: 1046-8188
ISSN (Online): 1558-2868
