Scalable data quality for big data: the Pythia framework for handling missing values

Cahsai, A., Anagnostopoulos, C. and Triantafillou, P. (2015) Scalable data quality for big data: the Pythia framework for handling missing values. Big Data, 3(3), pp. 159-172. (doi: 10.1089/big.2015.0002)




Solving the missing-value (MV) problem with small estimation errors in large-scale data environments is a notoriously resource-demanding task. The most widely used MV imputation approaches are computationally expensive because they depend explicitly on the volume and the dimensionality of the data. Moreover, as datasets and their user communities continue to grow, the problem can only get worse. To address this problem, our previous work [1] introduced a novel framework, coined Pythia, which employs a number of distributed data nodes (cohorts), each of which holds a partition of the original dataset. To perform MV imputation, Pythia uses specific machine and statistical learning structures (signatures) to select the most appropriate subset of cohorts, each of which locally runs a missing-value substitution algorithm (MVA). This selection relies on the principle that the chosen subset of cohorts holds the partitions of the dataset most relevant to the request. In addition, because Pythia uses only part of the dataset for imputation and accesses different cohorts in parallel, it improves efficiency, scalability, and accuracy compared with a single machine (coined Godzilla) that uses the entire massive dataset to serve imputation requests. This paper extends our previous work: we investigate the robustness of the Pythia framework and show that Pythia is independent of any particular MVA or signature construction algorithm. To this end, we consider two well-known MVAs (the K-nearest neighbor and expectation-maximization imputation algorithms) as well as two machine and neural computational learning signature construction algorithms, based on adaptive vector quantization and competitive learning. We provide comprehensive experiments to assess the performance of Pythia against Godzilla and showcase the benefits stemming from this framework.
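
The select-then-impute-locally idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, a plain per-dimension centroid stands in for the signatures that the paper builds with adaptive vector quantization or competitive learning, a simple K-nearest-neighbor mean stands in for the MVA, and averaging the local estimates of the selected cohorts is an assumed aggregation rule that the abstract does not specify.

```python
import math

def centroid(rows):
    """Assumed signature stand-in: the per-dimension mean of a cohort's rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def partial_distance(query, point):
    """Euclidean distance over the dimensions that are observed in the query."""
    return math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point) if q is not None))

def select_cohorts(query, signatures, m):
    """Indices of the m cohorts whose signatures lie closest to the query."""
    ranked = sorted(range(len(signatures)),
                    key=lambda i: partial_distance(query, signatures[i]))
    return ranked[:m]

def knn_impute(query, rows, k):
    """Local MVA stand-in: fill each missing entry with the mean of the k nearest rows."""
    nearest = sorted(rows, key=lambda r: partial_distance(query, r))[:k]
    return [q if q is not None else sum(r[j] for r in nearest) / len(nearest)
            for j, q in enumerate(query)]

def pythia_impute(query, cohorts, m=1, k=2):
    """Select m cohorts by signature, impute locally in each, then average (assumed rule)."""
    signatures = [centroid(c) for c in cohorts]
    chosen = select_cohorts(query, signatures, m)
    local = [knn_impute(query, cohorts[i], k) for i in chosen]
    return [sum(est[j] for est in local) / len(local) for j in range(len(query))]

# Two toy cohorts, each holding one partition of the data.
cohorts = [
    [[1.0, 1.0, 10.0], [1.2, 0.8, 12.0], [0.9, 1.1, 11.0]],
    [[5.0, 5.0, 50.0], [5.2, 4.8, 52.0], [4.9, 5.1, 51.0]],
]
query = [5.1, 4.9, None]  # the third attribute is missing
result = pythia_impute(query, cohorts, m=1, k=2)
print(result)
```

In this toy run only the second cohort is consulted, so the first partition is never scanned; that is the source of the efficiency gain the abstract claims over a single node that processes the whole dataset.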

Item Type: Articles
Glasgow Author(s) Enlighten ID: Anagnostopoulos, Dr Christos and Triantafillou, Professor Peter
Authors: Cahsai, A., Anagnostopoulos, C., and Triantafillou, P.
College/School: College of Science and Engineering > School of Computing Science
Journal Name: Big Data
Publisher: Mary Ann Liebert Inc.
ISSN (Online): 2167-647X
Copyright Holders: Copyright © 2015 Mary Ann Liebert
First Published: First published in Big Data 3(3):159-172
Publisher Policy: Reproduced in accordance with the copyright policy of the publisher
