C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds

Will, J., Thamsen, L., Scheinert, D., Bader, J. and Kao, O. (2021) C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds. In: 2021 IEEE International Conference on Cloud Engineering (IC2E), 04-08 Oct 2021, pp. 43-52. ISBN 9781665449700 (doi: 10.1109/IC2E52221.2021.00018)

Text: 268161.pdf - Accepted Version (430kB)

Abstract

Distributed dataflow systems enable data-parallel processing of large datasets on clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. Yet selecting cloud resources for dataflow jobs that lead neither to bottlenecks nor to low resource utilization is often challenging, even for expert users such as data engineers. We present C3O, a collaborative system for optimizing data processing cluster configurations in public clouds based on shared historical runtime data. The shared data is used to predict the runtimes of data processing jobs on different possible cluster configurations, using specialized regression models. These models take the diverse execution contexts of different users into account and exhibit mean absolute errors below 3% in our experimental evaluation with 930 unique Spark jobs.

Item Type: Conference Proceedings
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Thamsen, Dr Lauritz
Authors: Will, J., Thamsen, L., Scheinert, D., Bader, J., and Kao, O.
College/School: College of Science and Engineering > School of Computing Science
Publisher: IEEE
ISBN: 9781665449700
Published Online: 22 November 2021
Copyright Holders: Copyright © 2021 IEEE
First Published: First published in 2021 IEEE International Conference on Cloud Engineering (IC2E): 43-52
Publisher Policy: Reproduced in accordance with the publisher copyright policy