PPoPP 2016
Sat 12 - Wed 16 March 2016 Barcelona, Spain
Sun 13 Mar 2016 11:30 - 11:50 at Menorca - Session 2

From High-Performance Computing (HPC) systems, where combinations of CPUs, GPUs, and FPGAs are increasingly common, to the mobile market, with highly integrated SoCs that pack multiple accelerators onto a single die, heterogeneous architectures are becoming the de facto standard. These architectures present new opportunities to improve overall application performance and reduce energy consumption by assigning some tasks to the CPU while offloading others to accelerators. However, while these highly parallel architectures offer higher raw performance, they require more complex libraries and programming models. Software developers therefore need additional knowledge of the low-level details of these architectures to develop and maintain their code.

In this paper, we present a high-level parallel extension of the current C++ STL: an implementation of the C++ Extensions for Parallelism Technical Specification (TS) proposed for inclusion in ISO C++17. We aim to provide an easy-to-use interface that lets developers focus on their applications rather than on the underlying processor architecture. To allow C++ developers to use our implementation on different accelerators, we rely on SYCL [2], which lets us build an abstraction layer over different accelerator-specific programming interfaces. Using our Parallel STL implementation, applications can thus be transparently accelerated on devices such as the Intel Xeon Phi, GPUs, or FPGAs. Our implementation captures standard STL iterators, automatically allocates and frees the required device memory, and performs transparent data movement between the host and devices, so developers do not have to worry about the memory layout of each device. We evaluate the performance of our Parallel STL implementation and show performance gains over the sequential implementation when targeting CPUs and GPUs.