Anagnostopoulos, C., Savva, F. and Triantafillou, P. (2018) Scalable aggregation predictive analytics: a query-driven machine learning approach. Applied Intelligence, 48(9), pp. 2546-2567. (doi: 10.1007/s10489-017-1093-y)
Text: 150834.pdf - Published Version. Available under License Creative Commons Attribution. 1MB
Abstract
We introduce a predictive modeling solution that provides high-quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology uses historical queries and their answers to accurately predict the answers of ad-hoc queries. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from previously issued queries, (ii) associate the query space with local linear regression and associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms to ensure high-quality prediction results. The significance of our contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model, assessing its sensitivity, scalability and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark's COUNT method.
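The query-driven idea in the abstract (predict a COUNT answer from past queries and their answers, without touching the data) can be illustrated with a small sketch. This is not the authors' model: it swaps the paper's local linear regression and associative function estimators for a plain distance-weighted nearest-neighbour average. The class name `QueryDrivenCountPredictor`, the parameter `k`, and the encoding of a query as a numeric vector (for example, a centre point plus a radius) are all assumptions made for illustration.

```python
import math


class QueryDrivenCountPredictor:
    """Toy predictor: estimates COUNT answers from past (query, count) pairs only.

    Assumption: each query is encoded as a fixed-length numeric vector.
    This is a nearest-neighbour stand-in, not the model from the paper.
    """

    def __init__(self, k=3):
        self.k = k          # number of similar past queries to consult
        self.history = []   # list of (query_vector, observed_count)

    def observe(self, query, count):
        """Incrementally record one executed query and its true COUNT."""
        self.history.append((tuple(query), float(count)))

    @staticmethod
    def distance(q1, q2):
        """Query dissimilarity, here the Euclidean distance between vectors."""
        return math.dist(q1, q2)

    def predict(self, query):
        """Predict the COUNT of an unseen query from its k most similar past queries."""
        if not self.history:
            raise ValueError("no past queries observed yet")
        nearest = sorted(
            self.history, key=lambda h: self.distance(query, h[0])
        )[: self.k]
        # Inverse-distance weights; the small constant avoids division by zero.
        weights = [1.0 / (self.distance(query, q) + 1e-9) for q, _ in nearest]
        total = sum(weights)
        return sum(w * c for w, (_, c) in zip(weights, nearest)) / total
```

In use, each executed query and its answer would be passed to `observe`, and `predict` would then be called for an incoming query in place of running COUNT on the data. Storing every past query keeps the sketch short; it does not reflect the memory or scalability properties reported for the paper's model.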
Item Type: | Articles |
---|---|
Status: | Published |
Refereed: | Yes |
Glasgow Author(s) Enlighten ID: | Anagnostopoulos, Dr Christos and Triantafillou, Professor Peter and Savva, Mr Fotis |
Authors: | Anagnostopoulos, C., Savva, F., and Triantafillou, P. |
College/School: | College of Science and Engineering > School of Computing Science |
Journal Name: | Applied Intelligence |
Publisher: | Springer |
ISSN: | 0924-669X |
ISSN (Online): | 1573-7497 |
Published Online: | 15 December 2017 |
Copyright Holders: | Copyright © 2017 The Authors |
First Published: | First published in Applied Intelligence 48(9):2546-2567 |
Publisher Policy: | Reproduced under a Creative Commons License |