Learning to Accurately COUNT with Query-Driven Predictive Analytics

Anagnostopoulos, C. and Triantafillou, P. (2015) Learning to Accurately COUNT with Query-Driven Predictive Analytics. In: IEEE International Conference on Big Data (IEEE BigData 2015), Santa Clara, CA, USA, 29 Oct - 1 Nov 2015,

110123.pdf - Accepted Version



We study a novel solution to executing aggregation (and specifically COUNT) queries over large-scale data. The proposed solution is generally applicable, in the sense that it can be deployed in environments in which data owners may or may not restrict access to their data and allow only ‘aggregation operators’ to be executed over their data. For this, it is based on predictive analytics, driven by queries and their results. We propose a machine learning (ML) framework for the task (which can be adapted for different aggregates as well). We focus on the widely used set-cardinality (i.e., COUNT) aggregation operator, as it is a fundamental operator for both internal data system optimisations and for aggregation-query analytics. We contribute a novel, query-driven ML model whose goals are to: (i) learn the query space (access patterns), (ii) associate (complex) aggregation queries with the cardinality of their results, (iii) define query similarity and use it to predict the cardinality of the answer set of an ad-hoc incoming query. Our ML model incorporates incremental learning algorithms for ensuring high prediction accuracy even when both the querying patterns and the underlying data change. The significance of contribution lies in that it (i) is the only query-driven solution applicable over general environments which include restrictedaccess data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for big data analytics, and (iii) offers a performance (in terms of prediction accuracy and time, and memory requirements) that is superior to datacentric approaches. We provide a comprehensive performance evaluation of our model, evaluating its sensitivity and comparative advantages versus acclaimed data-centric methods (self-tuning histograms, sampling, and multidimensional histograms).

Item Type:Conference Proceedings
Glasgow Author(s) Enlighten ID:Anagnostopoulos, Dr Christos and Triantafillou, Professor Peter
Authors: Anagnostopoulos, C., and Triantafillou, P.
College/School:College of Science and Engineering > School of Computing Science
Related URLs:

University Staff: Request a correction | Enlighten Editors: Update this record