Lotaru: Locally predicting workflow task runtimes for resource management on heterogeneous infrastructures

Bader, J., Lehmann, F., Thamsen, L., Leser, U. and Kao, O. (2024) Lotaru: Locally predicting workflow task runtimes for resource management on heterogeneous infrastructures. Future Generation Computer Systems, 150, pp. 171-185. (doi: 10.1016/j.future.2023.08.022)

Text: 305425.pdf - Accepted Version (535kB)
Restricted to Repository staff only until 1 January 2026.
Available under License Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Many resource management techniques for task scheduling, energy and carbon efficiency, and cost optimization in workflows rely on a-priori knowledge of task runtimes. Building runtime prediction models on historical data is often not feasible in practice because workflows, their input data, and the cluster infrastructure change. Online methods, on the other hand, which estimate task runtimes on specific machines while the workflow is running, have to cope with a lack of measurements during start-up. Scientific workflows are frequently executed on heterogeneous infrastructures consisting of machines with different CPU, I/O, and memory configurations, which further complicates runtime prediction because the same task runs for different durations on different machine types. This paper presents Lotaru, a method for locally predicting the runtimes of scientific workflow tasks before they are executed on heterogeneous compute clusters. Crucially, our approach does not rely on historical data and copes with a lack of training data during start-up. To this end, we use microbenchmarks, reduce the input data to quickly profile the workflow locally, and predict a task’s runtime with a Bayesian linear regression based on the data points gathered from the local workflow execution and the microbenchmarks. Due to its Bayesian approach, Lotaru provides uncertainty estimates that can be used for advanced scheduling methods on distributed cluster infrastructures. In our evaluation with five real-world scientific workflows, our method outperforms two state-of-the-art runtime prediction baselines and decreases the absolute prediction error by more than 12.5%. In a second set of experiments, using the runtimes predicted by our method for state-of-the-art scheduling, carbon reduction, and cost prediction yields results close to those achieved with perfect prior knowledge of runtimes.
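
The regression step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single feature (input size), a zero-mean Gaussian prior on the weights, and a known noise variance, and the sample measurements are invented for illustration. The helper `scale_to_target` and its benchmark-score ratio are a hypothetical stand-in for how microbenchmark results might be used to transfer a local prediction to another machine; the abstract does not specify that step.

```python
import numpy as np


def fit_bayesian_linear(x, y, alpha=1e-6, noise_var=1.0):
    """Posterior over weights of y = w0 + w1 * x.

    Assumes a prior w ~ N(0, I / alpha) and Gaussian noise with known
    variance noise_var (both are illustrative assumptions).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])  # design matrix with bias column
    precision = alpha * np.eye(2) + X.T @ X / noise_var
    cov = np.linalg.inv(precision)             # posterior covariance
    mean = cov @ X.T @ y / noise_var           # posterior mean
    return mean, cov


def predict(x_new, mean, cov, noise_var=1.0):
    """Predictive mean and standard deviation for one input size."""
    phi = np.array([1.0, float(x_new)])
    mu = phi @ mean
    var = noise_var + phi @ cov @ phi          # noise plus weight uncertainty
    return mu, np.sqrt(var)


def scale_to_target(runtime, local_score, target_score):
    """Hypothetical transfer step: scale by the ratio of benchmark scores
    (higher score = faster machine). An assumption, not the paper's formula."""
    return runtime * local_score / target_score


# Invented local profiling data: downsampled input sizes (GB) and runtimes (s).
sizes = [0.1, 0.2, 0.4, 0.8]
runtimes = [7.0, 12.0, 22.0, 42.0]

mean, cov = fit_bayesian_linear(sizes, runtimes)
mu, std = predict(10.0, mean, cov)             # full-size input, local machine
print(f"local prediction: {mu:.1f} s (+/- {std:.1f} s)")
print(f"target machine:   {scale_to_target(mu, 100.0, 200.0):.1f} s")
```

The predictive standard deviation grows as the queried input size moves away from the profiled sizes, which is the kind of uncertainty estimate a scheduler could take into account.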

Item Type: Articles
Additional Information: Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) as FONDA (Project 414984028, SFB 1404).
Status: Published
Refereed: Yes
Glasgow Author(s) Enlighten ID: Thamsen, Dr Lauritz
Creator Roles:
Thamsen, L.: Conceptualization, Writing – review and editing
Authors: Bader, J., Lehmann, F., Thamsen, L., Leser, U., and Kao, O.
College/School: College of Science and Engineering > School of Computing Science
Journal Name: Future Generation Computer Systems
Publisher: Elsevier
ISSN: 0167-739X
ISSN (Online): 1872-7115
Published Online: 13 September 2023
Copyright Holders: Copyright © 2023 Elsevier B.V.
First Published: First published in Future Generation Computer Systems 150: 171-185
Publisher Policy: Reproduced in accordance with the publisher copyright policy
