3rd Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2016)
SIMD processing is currently a main driver of performance in general-purpose processor architectures, alongside multi-core technology. Both technologies can increase the potential performance several-fold, but software has to utilise them explicitly. Exposing these different levels of parallelism in a productive and manageable way is still an active area of research. NVIDIA stirred up the programming interface scene with the development of a simple yet efficient performance-oriented application programming interface. OpenACC, OpenMP 4.0, OpenCL, Cilk+ and ispc are just a few of the many choices available. Additionally, established optimising compilers are still improving significantly at unleashing the SIMD potential. Notable developments on the hardware side include the relaxation of alignment requirements and more powerful scatter/gather and shuffle instructions.
Call for Papers
The purpose of this workshop is to bring together practitioners and researchers from academia and industry to discuss issues, solutions, and opportunities in enabling application developers to effectively exploit SIMD/vector processing in modern processors. We seek submissions that cover all aspects of SIMD/vector processing. Topics of interest include, but are not restricted to:
- Programming models for SIMD/vector processing
- C/C++/Fortran extensions for SIMD (e.g., OpenMP, OpenACC, OpenCL, SIMD intrinsics)
- New data parallel or streaming programming models for SIMD
- Exploitation of SIMD/vector in Java, scripting languages, and domain-specific languages
- Compilers & tools to discover and optimize SIMD parallelism
- Case studies, experience reports, and performance analyses of SIMD/vector applications
Submitted papers must be no more than 8 pages in length. Authors are encouraged to use the ACM two-column format. Papers should be submitted in PDF format and should be legible when printed on a black-and-white printer. Each submission will receive at least three reviews from the technical program committee. Authors of selected submissions will be invited to present at the workshop, and their papers will be published in the workshop proceedings. Accepted papers will be published in the ACM Digital Library after the workshop. We also maintain a Google+ group to act as a community hub for the workshop.
We especially encourage students to submit papers. There will be a special PPoPP 2016 student travel grant award which you can apply for.
Authors must register and submit their paper through the online submission system. If you have problems accessing the system, e-mail your submission to email@example.com.
Sun 13 Mar
09:00 - 10:30
Session 1, WPMVP at Mallorca
Chair(s): Jan Eitzinger University of Erlangen-Nuremberg, Germany
Jan Eitzinger University of Erlangen-Nuremberg, Germany
|Keynote - AnyDSL: Building Domain-Specific Languages for Productivity and Performance|
|A new SIMD iterative connected component labeling algorithm|
Lionel Lacassagne University Paris 6
11:00 - 12:30
Session 2 - Programming Models, WPMVP at Mallorca
Chair(s): Joel Falcou LRI, Université Paris-Sud
|Support for Data Parallelism in the CAL Actor Language|
Essayas Gebrewahid Halmstad University
|An Evaluation of Current SIMD Programming Models for C++|
Angela Pohl TU Berlin
|Compilers, Hands-Off My Hands-On Optimizations|
Richard Veras Carnegie Mellon University
14:00 - 15:30
|Keynote - SIMD Vectorization Essentials: Learnings, Successes and Advances|
Xinmin Tian Intel
16:00 - 17:30
Session 4, WPMVP at Mallorca
Chair(s): Roland Leißa Saarland University
|Auto-Vectorizing a Large-scale Production Unstructured-mesh CFD Application|
|Code Vectorization using Intel Array Notation|
Olaf Krzikalla TU Dresden, Germany
AnyDSL: Building Domain-Specific Languages for Productivity and Performance
Sebastian Hack (Compiler Design Lab, Saarland University)
Abstract: To achieve good performance, programmers have to carefully tune their application for the target architecture. Optimizing compilers fail to produce the “optimal” code because their hardware models are too coarse-grained. Moreover, many important compiler optimizations are computationally hard even for simple cost models. It is unlikely that compilers will ever be able to produce high-performance code automatically for today’s and future machines.
Therefore, programmers often optimize their code manually. While manual optimization is often successful in achieving good performance, it is cumbersome, error-prone, and unportable. Creating and debugging dozens of variants of the same original code for different target platforms is an engineering nightmare.
Domain-specific languages (DSLs) are an appealing solution to this problem. A DSL offers language constructs that can express the abstractions used in a particular application domain. This way, programmers can write their code productively, at a high level of abstraction. Very often, DSL programs look similar to textbook algorithms. Domain and machine experts then provide efficient implementations of these abstractions. DSLs thus enable the programmer to productively write portable and maintainable code that can be compiled to efficient implementations. However, writing a compiler for a DSL is a huge effort that people are often not willing to make. Therefore, DSLs are often embedded into existing languages to save some of the effort of writing a compiler.
In this talk, I will present the AnyDSL framework we have developed over the last three years. AnyDSL provides the core language Impala, which can serve as a starting point for almost “any” DSL. New DSL constructs can be embedded into Impala in a shallow way, that is, just by implementing the functionality as a (potentially higher-order) function. AnyDSL uses online partial evaluation to remove the overhead of the embedding entirely.
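As a rough illustration of what a shallow embedding through a higher-order function can look like, here is a minimal sketch in Python rather than Impala. The names `iterate` and `box_blur_1d` are hypothetical and do not come from AnyDSL. Python performs no partial evaluation, so the call overhead that the abstract says AnyDSL removes is still present here.

```python
# Hypothetical sketch of a shallowly embedded DSL construct.
# Nothing here is AnyDSL or Impala API; all names are made up for illustration.

def iterate(width, height, body):
    """DSL construct: apply body(x, y) to every pixel coordinate, row by row."""
    return [[body(x, y) for x in range(width)] for y in range(height)]


def box_blur_1d(image):
    """Textbook-style 3-tap horizontal blur written against the construct."""
    height, width = len(image), len(image[0])

    def pixel(x, y):
        # Clamp at the borders so every pixel has three neighbours.
        left = image[y][max(x - 1, 0)]
        right = image[y][min(x + 1, width - 1)]
        return (left + image[y][x] + right) / 3

    return iterate(width, height, pixel)


image = [[0, 3, 6], [9, 9, 9]]
blurred = box_blur_1d(image)
print(blurred)
```

The algorithm only states what happens per pixel, while the traversal order lives inside `iterate`. A machine expert could swap in a different `iterate` (for example a tiled or vectorized one) without touching `box_blur_1d`.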
To demonstrate the effectiveness of our approach, we generated code from generic, high-level textbook image-processing algorithms. On each and every hardware platform tested (Nvidia/AMD/Intel GPUs, SIMD CPUs), this code has beaten the industry-standard benchmark (OpenCV), which has been carefully hand-optimized for each architecture over many years, by 10-35% (!). Furthermore, the implementation in Impala has an order of magnitude fewer lines of code than corresponding hand-tuned expert code. We also obtained similar first results in other domains.
SIMD Vectorization Essentials: Learnings, Successes and Advances
Xinmin Tian (Intel Corporation)
Abstract: SIMD vectorization has received significant attention in the past decade as one of the most important methods to accelerate scientific, media, and embedded applications on SIMD architectures such as Intel SSE, AVX, IBM AltiVec and ARM Neon. However, the recent proliferation of modern SIMD architectures poses new constraints, such as control flow divergence, memory access divergence, data alignment, mixed data types, and the wider, fixed-length nature of SIMD vectors, that demand advanced SIMD vectorization compiler technologies and SIMD-vectorization-friendly language extensions. In this talk, we look back at what we have learned in the past decades, and at what we have achieved at Intel in the past few years on the path to successful SIMD vectorization that exploits effective SIMD parallelism in real, large applications. We share Intel’s vision for the explicit SIMD programming model and the evolution of compiler technology for SIMD vectorization.