Scaling k-Nearest Neighbors Queries (The Right Way)

Cahsai, A., Anagnostopoulos, C., Ntarmos, N. and Triantafillou, P. (2017) Scaling k-Nearest Neighbors Queries (The Right Way). In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), Atlanta, GA, USA, 5-8 June 2017, pp. 1419-1430. ISBN 9781538617939 (doi:10.1109/ICDCS.2017.267)

Text: 138493.pdf - Accepted Version (465kB)

Abstract

Recently, parallel/distributed processing approaches have been proposed for processing k-Nearest Neighbours (kNN) queries over very large (multidimensional) datasets, aiming to ensure scalability. However, this is typically achieved at the expense of efficiency. With this paper we offer a novel approach that alleviates the performance problems associated with state-of-the-art methods. The essence of our approach, which differentiates it from related research, rests on (i) adopting a coordinator-based distributed processing algorithm, instead of those employed over data-parallel execution engines (such as Hadoop/MapReduce or Spark), and (ii) a way to organize data, to structure computation, and to index the stored datasets that ensures that only a very small number of data items are retrieved from the underlying data store, communicated over the network, and processed by the coordinator for every kNN query. Our approach also pays special attention to ensuring scalability in addition to low query processing times. Overall, kNN queries can be processed in just tens of milliseconds (as opposed to the tens of seconds required by the state of the art). We have implemented our approach, using a NoSQL DB (HBase) as the data store, and we compare it against the state of the art: the Hadoop-based Spatial Hadoop (SHadoop) and the Spark-based Simba methods. We employ different datasets of various sizes, showcasing the contributed performance advantages. Our approach outperforms the state of the art by 2-3 orders of magnitude, consistently, for dataset sizes ranging from hundreds of millions to hundreds of billions of data points. We also show that the key constituent performance overheads incurred during query processing (such as the number of data items retrieved from the data store, the required network bandwidth, and the processing time at the coordinator) scale very well, ensuring the overall scalability of the approach.

Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Cahsai, Mr Atoshum and Anagnostopoulos, Dr Christos and Triantafillou, Professor Peter and Ntarmos, Dr Nikolaos
Authors: Cahsai, A., Anagnostopoulos, C., Ntarmos, N., and Triantafillou, P.
College/School: College of Science and Engineering > School of Computing Science
ISSN: 1063-6927
ISBN: 9781538617939
Copyright Holders: Copyright © 2017 IEEE
First Published: First published in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS): 1419-1430
Publisher Policy: Reproduced in accordance with the publisher copyright policy