Scalable aggregation predictive analytics: a query-driven machine learning approach

Anagnostopoulos, C., Savva, F. and Triantafillou, P. (2018) Scalable aggregation predictive analytics: a query-driven machine learning approach. Applied Intelligence, 48(9), pp. 2546-2567. (doi: 10.1007/s10489-017-1093-y)

150834.pdf - Published Version
Available under License Creative Commons Attribution.

1MB

Abstract

We introduce a predictive modeling solution that provides high-quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology is based on historical queries and their answers to accurately predict ad-hoc queries’ answers. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression and associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms for ensuring high-quality prediction results. The significance of our contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model, assessing its sensitivity, scalability and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark’s COUNT method.
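
The four goals listed in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' algorithm, only a toy built on assumptions that are not in the abstract: one-dimensional data, range queries encoded as (center, radius) vectors, Euclidean distance as the query similarity, an online prototype update standing in for the incremental learning step, and one ordinary least-squares model per prototype standing in for the local linear regression estimators. All function names and parameters (true_count, train, predict, k, rate, min_size) are hypothetical.

```python
import random

def true_count(data, center, radius):
    """Exact COUNT of the 1-D points inside [center - radius, center + radius]."""
    return sum(1 for x in data if abs(x - center) <= radius)

def sq_dist(a, b):
    """Squared Euclidean distance, used here as the (assumed) query dissimilarity."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(prototypes, q):
    """Index of the prototype closest to query vector q."""
    return min(range(len(prototypes)), key=lambda j: sq_dist(prototypes[j], q))

def fit_linear(queries, answers, ridge=1e-6):
    """Least-squares fit of answer ~ w0 + w1*center + w2*radius (tiny ridge for stability)."""
    rows = [[1.0, center, radius] for center, radius in queries]
    n = 3
    # Augmented normal equations [X^T X + ridge*I | X^T y].
    M = [[sum(row[i] * row[j] for row in rows) + (ridge if i == j else 0.0)
          for j in range(n)] + [sum(row[i] * y for row, y in zip(rows, answers))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting, then back substitution.
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][c] * w[c] for c in range(i + 1, n))) / M[i][i]
    return w

def train(queries, answers, k=4, rate=0.05, min_size=20):
    """Learn k query prototypes incrementally, then one local linear model per prototype."""
    prototypes = [list(q) for q in queries[:k]]
    for q in queries[k:]:
        j = nearest(prototypes, q)
        # Move the winning prototype a small step toward the observed query.
        prototypes[j] = [p + rate * (x - p) for p, x in zip(prototypes[j], q)]
    global_w = fit_linear(queries, answers)
    groups = [[] for _ in range(k)]
    for q, y in zip(queries, answers):
        groups[nearest(prototypes, q)].append((q, y))
    models = []
    for g in groups:
        if len(g) >= min_size:
            models.append(fit_linear([q for q, _ in g], [y for _, y in g]))
        else:
            models.append(global_w)  # too few past queries: fall back to the global fit
    return prototypes, models

def predict(model, q):
    """Predict COUNT for an unseen query using the model of its nearest prototype."""
    prototypes, models = model
    w = models[nearest(prototypes, q)]
    return w[0] + w[1] * q[0] + w[2] * q[1]

def random_query():
    """Hypothetical workload: random (center, radius) range queries."""
    return (random.uniform(0.2, 0.8), random.uniform(0.01, 0.1))

random.seed(0)
data = [random.random() for _ in range(2000)]             # data held by the owner
train_q = [random_query() for _ in range(300)]            # past issued queries
train_y = [true_count(data, c, r) for c, r in train_q]    # their COUNT answers
model = train(train_q, train_y)                           # learning sees only (query, answer) pairs

test_q = [random_query() for _ in range(50)]              # unseen queries
errors = [abs(predict(model, q) - true_count(data, *q)) for q in test_q]
mae = sum(errors) / len(errors)
print(f"mean absolute error over {len(test_q)} unseen queries: {mae:.1f}")
```

Note that train and predict never touch data directly; only true_count does, playing the role of the data owner answering COUNT queries. That separation is what makes the sketch query-driven rather than data-centric.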

Item Type: Articles
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Anagnostopoulos, Dr Christos and Triantafillou, Professor Peter and Savva, Mr Fotis
Authors: Anagnostopoulos, C., Savva, F., and Triantafillou, P.
College/School: College of Science and Engineering > School of Computing Science
Journal Name: Applied Intelligence
Publisher: Springer
ISSN: 0924-669X
ISSN (Online): 1573-7497
Published Online: 15 December 2017
Copyright Holders: Copyright © 2017 The Authors
First Published: First published in Applied Intelligence 48(9):2546-2567
Publisher Policy: Reproduced under a Creative Commons License
