Assessing the Scalability Issues on Multi-Core NUMA Machines

Aljabri, M., Trinder, P. and Loidl, H.-W. (2016) Assessing the Scalability Issues on Multi-Core NUMA Machines. In: Alford, N. and Fréchet, J. (eds.) Proceedings of the Eighth Saudi Students Conference in the UK. Imperial College Press: London, pp. 267-278. ISBN 9781783269143 (doi:10.1142/9781783269150_0023)



Non-uniform memory access (NUMA) architectures are modern shared-memory, multi-core machines in which access times and latencies vary between different memory banks. Cores are organised into regions, or nodes, with the cores in a node sharing a local memory bank; this organisation makes efficient shared-memory access challenging and thus limits the scalability of parallel applications. This paper studies the effect of state-of-the-art physical shared-memory NUMA architectures on the performance scalability of parallel applications, using a range of programs and language technologies. In particular, parallel programs with different communication libraries and patterns are run in two sets of experiments. The first examines the performance of the mainstream, widely used parallel technologies MPI and OpenMP, which use message-passing and shared-memory communication respectively; in addition, the performance implications of message passing versus shared-memory access on NUMA are compared using a concordance application. The second assesses the performance of two parallel Haskell implementations as examples of a high-level language with automatic memory management.

The results show that OpenMP scaled well up to six threads, the number of cores in a single NUMA node. As the number of threads increased beyond that, performance decreased dramatically, confirming the cost of inefficient remote memory access. MPI demonstrated similar behaviour, with optimal speedup at six cores; however, unlike OpenMP, its performance did not fall sharply beyond that point, illustrating the benefit of message passing over shared-memory access. The standard, shared-memory parallel Haskell implementation scaled only to between 10 and 25 of the 48 cores across three parallel programs, with high memory-management overheads.
In contrast, our parallel Haskell implementation, GUMSMP, which combines distributed- and shared-heap abstractions, scaled consistently, achieving a speedup of up to 24 on 48 cores and an overall performance improvement of up to 57% compared with the shared-memory implementation.

Item Type: Book Section
Glasgow Author(s) Enlighten ID: Trinder, Professor Phil
Authors: Aljabri, M., Trinder, P., and Loidl, H.-W.
College/School: College of Science and Engineering > School of Computing Science
Publisher: Imperial College Press
