Combining Terrier with Apache Spark to Create Agile Experimental Information Retrieval Pipelines

Macdonald, C. (2018) Combining Terrier with Apache Spark to Create Agile Experimental Information Retrieval Pipelines. In: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, Ann Arbor, MI, USA, 8-12 Jul 2018, pp. 1309-1312. ISBN 9781450356572 (doi:10.1145/3209978.3210174)

Abstract

Experimentation using IR systems has traditionally been a procedural and laborious process. Queries must be run on an index, with any parameters of the retrieval models suitably tuned. With the advent of learning-to-rank, such experimental processes (including the appropriate folding of queries to achieve cross-fold validation) have resulted in complicated experimental designs and hence scripting. At the same time, machine learning platforms such as Scikit Learn and Apache Spark have pioneered the notion of an experimental pipeline, which naturally allows a supervised classification experiment to be expressed as a series of stages, each of which can be learned or transformed. In this demonstration, we detail Terrier-Spark, a recent adaptation to the Terrier Information Retrieval platform which permits it to be used within the experimental pipelines of Spark. We argue that this (1) provides an agile experimental platform for information retrieval, comparable to that enjoyed by other branches of data science; (2) aids research reproducibility in information retrieval by facilitating easily distributable notebooks containing conducted experiments; and (3) facilitates the teaching of information retrieval experiments in educational environments.
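
The pipeline notion mentioned in the abstract can be illustrated with a minimal sketch in plain Python. It is not the Terrier-Spark API and does not use Spark or Scikit Learn; every class name here (Transformer, Estimator, TermOverlap, ThresholdLearner, Pipeline) and the toy data are assumptions made purely for illustration. The sketch shows only the general idea: a transformer stage adds a column, an estimator stage is fitted on training data and yields a transformer, and a pipeline chains the two.

```python
class Transformer:
    """A stage that maps a list of rows (dicts) to a new list of rows."""
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """A stage that is fitted on rows and returns a Transformer."""
    def fit(self, rows):
        raise NotImplementedError


class TermOverlap(Transformer):
    """Hypothetical feature stage: counts query terms found in the document."""
    def transform(self, rows):
        out = []
        for r in rows:
            q = set(r["query"].lower().split())
            d = set(r["doc"].lower().split())
            out.append({**r, "overlap": len(q & d)})
        return out


class ThresholdModel(Transformer):
    """Fitted model: predicts relevant (1) when overlap reaches the threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [{**r, "predicted": int(r["overlap"] >= self.threshold)}
                for r in rows]


class ThresholdLearner(Estimator):
    """Hypothetical learned stage: picks the threshold with most correct labels."""
    def fit(self, rows):
        candidates = sorted({r["overlap"] for r in rows})

        def correct(t):
            return sum(int(r["overlap"] >= t) == r["label"] for r in rows)

        return ThresholdModel(max(candidates, key=correct))


class PipelineModel(Transformer):
    """A fitted pipeline: applies each fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Chains stages; estimators are fitted on the output of earlier stages."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if isinstance(stage, Estimator) else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


# Toy training data (invented for this sketch): label 1 means relevant.
train = [
    {"query": "spark pipeline", "doc": "apache spark pipeline stages", "label": 1},
    {"query": "spark pipeline", "doc": "cooking pasta at home", "label": 0},
    {"query": "terrier retrieval", "doc": "terrier is a retrieval platform", "label": 1},
    {"query": "terrier retrieval", "doc": "a terrier is a small dog", "label": 0},
]
test = [
    {"query": "learning to rank", "doc": "learning to rank with spark"},
    {"query": "learning to rank", "doc": "rank of a matrix"},
]

model = Pipeline([TermOverlap(), ThresholdLearner()]).fit(train)
predictions = model.transform(test)
for row in predictions:
    print(row["doc"], "->", row["predicted"])
```

Because the fitted pipeline is itself a transformer, the same object can be applied to held-out queries, which is the property that makes cross-fold experiments straightforward to express in this style.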

Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Macdonald, Dr Craig
Authors: Macdonald, C.
College/School: College of Science and Engineering > School of Computing Science
ISBN: 9781450356572
Copyright Holders: Copyright © 2018 The Author
Publisher Policy: Reproduced in accordance with the copyright policy of the publisher