Matrix Multiplication Beyond Auto-Tuning: Rewrite-Based GPU Code Generation

Steuwer, M. , Remmelg, T. and Dubach, C. (2016) Matrix Multiplication Beyond Auto-Tuning: Rewrite-Based GPU Code Generation. In: CASES '16 Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, Pittsburgh, PA, USA, 01-07 Oct 2016, p. 15. (doi:10.1145/2968455.2968521)

[img]
Preview
Text
146599.pdf - Published Version

496kB

Abstract

Graphics Processing Units (GPUs) are used as general purpose parallel accelerators in a wide range of applications. They are found in most computing systems, and mobile devices are no exception. The recent availability of programming APIs such as OpenCL for mobile GPUs promises to open up new types of applications on these devices. However, producing high performance GPU code is extremely difficult. Subtle differences in device characteristics can lead to large performance variations when different optimizations are applied. As we will see, this is especially true for a mobile GPU such as the ARM Mali GPU which has a very different architecture than desktop-class GPUs. Code optimized and tuned for one type of GPUs is unlikely to achieve the performance potential on another type of GPUs. Auto-tuners have traditionally been an answer to this performance portability challenge. For instance, they have been successful on CPUs for matrix operations, which are used as building blocks in many high-performance applications. However, they are much harder to design for different classes of GPUs, given the wide variety of hardware characteristics. In this paper, we take a different perspective and show how performance portability for matrix multiplication is achieved using a compiler approach. This approach is based on a recently developed generic technique that combines a high-level programming model with a system of rewrite rules. Programs are automatically rewritten in successive steps, where optimizations decision are made.This approach is truly performance portable, resulting in high-performance code for very different types of architectures such as desktop and mobile GPUs. In particular, we achieve a speedup of 1.7x over a state-of-the-art auto-tuner on the ARM Mali GPU.

Item Type:Conference Proceedings
Additional Information:This work was partially funded by Google.
Status:Published
Refereed:Yes
Glasgow Author(s) Enlighten ID:Steuwer, Dr Michel
Authors: Steuwer, M., Remmelg, T., and Dubach, C.
Subjects:Q Science > QA Mathematics > QA75 Electronic computers. Computer science
College/School:College of Science and Engineering > School of Computing Science
Copyright Holders:Copyright © 2016 The Authors
First Published:First published in CASES '16 Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems: 15
Publisher Policy:Reproduced in accordance with the publisher copyright policy

University Staff: Request a correction | Enlighten Editors: Update this record