PGAS and Hybrid MPI+PGAS Programming Models on Modern HPC Clusters with Accelerators
Multi-core processors, accelerators (GPGPUs), co-processors (Xeon Phis) and high-performance interconnects (InfiniBand, 10-40 GigE/iWARP and RoCE) with RDMA support are shaping the architectures of next-generation clusters. Efficient programming models for designing applications on these clusters, as well as on future exascale systems, are still evolving. The new MPI-3 standard enhances the Remote Memory Access (RMA) model and introduces non-blocking collectives. Partitioned Global Address Space (PGAS) models provide an attractive alternative to the MPI model owing to their easy-to-use global shared memory abstractions and light-weight one-sided communication. At the same time, hybrid MPI+PGAS programming models are gaining attention as a possible solution for programming exascale systems. These hybrid models let codes designed using MPI take advantage of PGAS models without paying the prohibitive cost of redesigning complete applications. They also enable hierarchical design of applications, using the different models to suit modern architectures.
In this tutorial, we provide an overview of the research and development taking place on programming models (MPI, PGAS and hybrid MPI+PGAS) and discuss the opportunities and challenges in designing the associated runtimes as we head toward exascale computing with accelerator-based systems. We start with an in-depth overview of modern system architectures with multi-core processors, GPU accelerators, Xeon Phi co-processors and high-performance interconnects. We present an overview of the new MPI-3 RMA model and of language-based (UPC and CAF) and library-based (OpenSHMEM) PGAS models. We introduce MPI+PGAS hybrid programming models and the associated unified runtime concept. We examine and contrast the challenges in designing high-performance MPI-3 compliant, OpenSHMEM and hybrid MPI+OpenSHMEM runtimes for both host-based and accelerator-based (GPU and MIC) systems. We present case studies using application kernels to demonstrate how one can exploit hybrid MPI+PGAS programming models to achieve better performance without rewriting the complete code. Using the publicly available MVAPICH2-X, MVAPICH2-GDR and MVAPICH2-MIC libraries, we present the challenges and opportunities in designing efficient MPI, PGAS and hybrid MPI+PGAS runtimes for next-generation systems. We introduce the concept of 'CUDA-Aware MPI/PGAS' to combine high productivity and high performance. We show how to take advantage of GPU features such as Unified Virtual Addressing, CUDA-IPC and GPUDirect RDMA to design efficient MPI, OpenSHMEM and hybrid MPI+OpenSHMEM runtimes. Similarly, using the MVAPICH2-MIC runtime, we present optimized data movement schemes for different system configurations, including multiple MICs per node in same-socket and/or different-socket configurations.
There is no fixed prerequisite. Attendees with a general knowledge of high performance computing, networking, programming models, parallel applications and related issues will be able to understand and appreciate the tutorial. It is designed so that attendees are exposed to the topics in a smooth and progressive manner.
HPC systems are marked by the use of multi-core processors, accelerators (GPGPUs), co-processors (Xeon Phis) and high-performance interconnects (InfiniBand, 10-40 GigE/iWARP and RoCE) with RDMA support. Efficient programming models for designing applications on these clusters, as well as on future exascale systems, are still evolving. However, programming models, runtimes and associated application designs are not yet taking full advantage of these trends. To highlight these emerging trends and the associated challenges, this tutorial is proposed with the following goals:
- Teach designers, developers and users how to efficiently design and use parallel programming models (MPI and PGAS) and accelerators (GPU and MIC)
- Guide scientists, engineers, researchers and students engaged in designing next-generation HPC systems and applications
- Help newcomers to the field of HPC and exascale computing to understand the concepts and designs of parallel programming models, accelerators, networking, and RDMA
- Demonstrate the impact that advanced optimization and tuning of middleware can have on application performance, through case studies with representative benchmarks and applications.
The tutorial is organized along the following topics:
- Overview of Modern HPC System Architectures
  - Multi-core Processors
  - High Performance Interconnects (InfiniBand, 10-40 GigE/iWARP and RDMA over Converged Enhanced Ethernet)
  - Heterogeneity with Accelerators (GPUs) and Coprocessors (Xeon Phis)
- MPI-3 Features including RMA and Non-blocking collectives
- Library-based Models: Case Study with OpenSHMEM
- Language-based Models: Case Study with UPC
- Overview of MPI+PGAS Hybrid Programming Models and Benefits
Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. His research interests include parallel computer architecture, high performance networking, InfiniBand, exascale computing, programming models, GPUs and accelerators, high performance file systems and storage, virtualization, cloud computing and Big Data. He has published over 350 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, High-Speed Ethernet and RDMA over Converged Enhanced Ethernet (RoCE). The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X software libraries, developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,450 organizations worldwide (in 76 countries). This software has enabled several InfiniBand clusters to get into the latest TOP500 ranking during the last decade. More than 285,000 downloads of this software have taken place from the project's website alone. This software package is also available with the software stacks of many network and server vendors, and Linux distributors. The new RDMA-enabled Apache Hadoop package and RDMA-enabled Memcached package are publicly available from http://hibd.cse.ohio-state.edu. Dr. Panda's research has been supported by funding from the US National Science Foundation, the US Department of Energy, and several companies including Intel, Cisco, Cray, SUN, Mellanox, QLogic, NVIDIA and NetApp. He is an IEEE Fellow and a member of ACM. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.
Khaled Hamidouche is a Senior Research Associate in the Department of Computer Science and Engineering at The Ohio State University. He is a member of the Network-Based Computing Laboratory led by Dr. D. K. Panda. His research interests include high-performance interconnects, parallel programming models, accelerator computing and high-end computing applications. His current focus is on designing high performance unified MPI, PGAS and hybrid MPI+PGAS runtimes for InfiniBand clusters and their support for accelerators. Dr. Hamidouche is involved in the design and development of the popular MVAPICH2 library and its derivatives MVAPICH2-MIC, MVAPICH2-GDR and MVAPICH2-X. He has published over 35 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. He is a member of ACM. More details about Dr. Hamidouche are available at http://www.cse.ohio-state.edu/~hamidouc.
Please visit http://web.cse.ohio-state.edu/~panda/ppopp16_hybrid_tutorial.html for more detailed information about this Tutorial.
Sun 13 Mar: 14:00 - 15:30 and 16:00 - 17:30