Performance Portable GPU Code Generation for Matrix Multiplication

Remmelg, T., Lutz, T., Steuwer, M. and Dubach, C. (2016) Performance Portable GPU Code Generation for Matrix Multiplication. GPGPU '16 Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, Barcelona, Spain, 12 Mar 2016. pp. 22-31. ISBN 9781450341950 (doi:10.1145/2884045.2884046)

146600.pdf - Accepted Version



Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to attempt to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance resulting in non-portable solutions that need to be re-optimized for every new device. Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem even for well studied applications like matrix multiplication. We argue that what is needed is a way to describe applications at a high-level without committing to particular implementations. To this end, we developed in a previous paper a functional data-parallel language which allows applications to be expressed in a device neutral way. We use a set of well-defined rewrite rules to automatically transform programs into semantically equivalent device- specific forms, from which OpenCL code is generated. In this paper, we demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication. Starting from a single high-level program, our compiler automatically generate highly optimized and specialized implementations. We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization like tiling and register blocking in a composable way. Using an exploration strategy our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized – but provably correct – implementation of matrix multiplication. The automatically generated code offers competitive performance compared to the manually tuned MAGMA library implementations of matrix multiplication on Nvidia and even outperforms AMD’s clBLAS library.

Item Type:Conference or Workshop Item
Additional Information:This work was partially funded by Google and Oracle Labs.
Glasgow Author(s) Enlighten ID:Steuwer, Dr Michel
Authors: Remmelg, T., Lutz, T., Steuwer, M., and Dubach, C.
Subjects:Q Science > QA Mathematics > QA75 Electronic computers. Computer science
College/School:College of Science and Engineering > School of Computing Science
Copyright Holders:Copyright © 2016 ACM
First Published:First published in GPGPU '16 Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit: 22-31
Publisher Policy:Reproduced in accordance with the publisher copyright policy

University Staff: Request a correction | Enlighten Editors: Update this record