MapReduce indexing strategies: studying scalability and efficiency

Mccreadie, R., Macdonald, C. and Ounis, I. (2012) MapReduce indexing strategies: studying scalability and efficiency. Information Processing and Management, 48(5), pp. 873-888. (doi: 10.1016/j.ipm.2010.12.003)

In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora is still a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies and, for the most efficient strategy, how it scales with respect to corpus size and processing power. Our results attest to both the importance of minimising data transfer between machines for I/O-intensive tasks like indexing, and the suitability of the per-posting list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
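
The abstract names the per-posting list strategy and the importance of minimising data transfer, but does not describe how the strategies work. The following is a minimal illustrative sketch in plain Python, not the authors' implementation and not Hadoop code. It assumes that a simple per-token strategy emits one pair per token occurrence, while a per-posting list strategy builds partial posting lists inside the map task and emits one pair per distinct term. The function names, the whitespace tokeniser and the two sample documents are all hypothetical.

```python
from collections import Counter, defaultdict


def map_per_token(doc_id, text):
    """Assumed per-token map: emit one (term, doc_id) pair per token occurrence."""
    return [(term, doc_id) for term in text.lower().split()]


def map_per_posting_list(docs):
    """Assumed per-posting list map: build partial posting lists in memory for a
    whole input split, then emit one (term, [(doc_id, tf), ...]) pair per term."""
    partial = defaultdict(list)
    for doc_id, text in docs:
        for term, tf in Counter(text.lower().split()).items():
            partial[term].append((doc_id, tf))
    return sorted(partial.items())


def reduce_posting_lists(emitted):
    """Reduce: merge the partial posting lists of each term, ordered by doc_id."""
    index = defaultdict(list)
    for term, postings in emitted:
        index[term].extend(postings)
    return {term: sorted(postings) for term, postings in index.items()}


# Hypothetical two-document split, used only to compare the number of emits.
docs = [(1, "to be or not to be"), (2, "to index or not")]

per_token = [pair for doc_id, text in docs for pair in map_per_token(doc_id, text)]
per_list = map_per_posting_list(docs)
index = reduce_posting_lists(per_list)

print(len(per_token), "pairs emitted per token")
print(len(per_list), "pairs emitted per posting list")
print(index["to"])
```

In this toy setting the per-posting list map emits fewer key-value pairs than the per-token map for the same input, which is the kind of reduction in map-to-reduce data transfer the abstract refers to. The real saving depends on corpus statistics and on how often a map task has to flush its in-memory lists.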

Item Type: Articles
Glasgow Author(s) Enlighten ID: Mccreadie, Dr Richard and Macdonald, Professor Craig and Ounis, Professor Iadh
Authors: Mccreadie, R., Macdonald, C., and Ounis, I.
College/School: College of Science and Engineering > School of Computing Science
Journal Name: Information Processing and Management
ISSN (Online): 0306-4573
Published Online: 01 February 2011
