Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

Will, J., Arslan, O., Bader, J., Scheinert, D. and Thamsen, L. (2021) Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud. In: 2021 IEEE International Conference on Big Data (Big Data), 15-18 Dec 2021, pp. 3141-3146. ISBN 9781665439022 (doi: 10.1109/BigData52589.2021.9671742)


Abstract

Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing a suitable type and number of cluster resources for distributed dataflow jobs is difficult, especially for users who do not have access to previous performance metrics. One approach to overcoming this issue is to have users share runtime metrics to train context-aware performance models that help find a suitable configuration for the job at hand. A problem with sharing runtime data instead of trained models or model parameters is that the data size can grow substantially over time. This paper examines several clustering techniques for minimizing training data size while keeping the associated performance models accurate. Our results indicate that training data reduction yields efficiency gains in data transfer, storage, and model training. In an evaluation of our solution on a dataset of runtime data from 930 unique distributed dataflow jobs, we observed that a 75% data reduction increases prediction errors by only one percentage point on average.
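
The paper itself includes no code, but the abstract's core idea, shrinking a shared pool of runtime metrics by clustering it and training the performance model on representative points only, can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn; the synthetic features, the medoid-style representative selection, and the gradient-boosting runtime model are assumptions for illustration, not the authors' exact pipeline.

```python
# Sketch: reduce runtime-metric training data by clustering and keeping
# one representative execution per cluster. The feature columns and the
# downstream model are hypothetical, not the paper's exact method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical runtime dataset: rows = job executions,
# columns = e.g. input size, node count, machine-type features.
X = rng.random((1000, 4))
y = X @ np.array([3.0, 1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 1000)  # runtimes

# Target a 75% reduction: keep 25% of the points as cluster representatives.
k = len(X) // 4
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# For each cluster, keep the member closest to the centroid (a medoid-like pick).
reduced_idx = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    reduced_idx.append(members[np.argmin(dists)])
reduced_idx = np.array(reduced_idx)

# Train the performance model on the reduced set only.
model = GradientBoostingRegressor().fit(X[reduced_idx], y[reduced_idx])
```

Keeping one medoid-like point per cluster matches the 75% reduction figure from the abstract; other selections, such as the centroids themselves or several points per cluster, are equally plausible variants.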

Item Type: Conference Proceedings
Additional Information: The 5th Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD).
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Thamsen, Dr Lauritz
Authors: Will, J., Arslan, O., Bader, J., Scheinert, D., and Thamsen, L.
College/School: College of Science and Engineering > School of Computing Science
Publisher: IEEE
ISBN: 9781665439022
Published Online: 13 January 2022
Copyright Holders: Copyright © 2021 IEEE
Publisher Policy: Reproduced in accordance with the publisher copyright policy