- Workshop on Parallel Programming for Resilience and Energy Efficiency (PP4REE 2016)
The number of components in High Performance Computing (HPC) systems is increasing at the pace dictated by Moore’s Law, while the mean time between failures (MTBF) of the complete system is shrinking significantly. For example, when accounting for the instruction and data caches and register files, the mean time between soft errors for the Sequoia supercomputer at Lawrence Livermore National Laboratory is estimated to be 1.5 days. As HPC systems move into the Exascale era, the number of system components will increase by up to three orders of magnitude and MTBF will deteriorate further, making resilience a fundamental challenge. This scenario renders current system-level solutions to resilience, such as coordinated checkpointing, infeasible, and motivates the use of algorithmic, programming model, or runtime system approaches to improve the resilience of parallel applications at scale.
While a resilience crisis is looming in the HPC domain, the end of Dennard scaling (i.e., the ability to shrink the feature size of integrated circuits while maintaining a constant power density) has made energy consumption a primary design constraint, on par with performance, one that calls for holistic solutions spanning the hardware to the application software. The Green500 ranking, based on the LINPACK benchmark, shows remarkable improvements in the MFLOPS/W (millions of floating-point arithmetic operations per joule) of recent HPC facilities. However, with the cost of 1 MW being close to $1 million, any further improvement in this metric will have a substantial positive impact on the deployment of future Exascale systems. Despite a flurry of research in recent years on techniques that improve the energy efficiency of HPC systems via software intervention, energy remains transparent to existing parallel programming models used in production settings.
The quest for higher energy-efficiency in future HPC systems is inherently connected to the quest for enhanced resilience for two reasons: First, resilience techniques have a non-trivial energy cost. Second, ongoing efforts to further improve the energy-efficiency of hardware at the device level (such as operating hardware below its nominal margins or replacing DDR technology with non-volatile memory technologies) may compromise hardware reliability.
Call for Papers
The purpose of this workshop is to explore the space of techniques for improving the resilience and energy efficiency (REE) of parallel programs at the algorithmic and language levels. We are particularly interested in papers that present cross-cutting techniques that trade off energy efficiency against resilience. We solicit original papers on topics that include, but are not limited to, the following:
- Programming languages, interfaces, and general software techniques for REE.
- Scheduling and mapping for REE.
- Runtime systems for REE.
- Algorithmic techniques for REE.
- Programming models for computing paradigms that improve REE, such as near-threshold computing, approximate computing, or neuromorphic computing.
- Applications and case studies of success.
Papers should not exceed ten single-spaced, double-column pages (including figures, tables and references) using a 10-point font on 8.5x11-inch pages. We suggest using the IEEE two-column template for conference proceedings. Submissions will be judged on correctness, originality, technical strength, significance, presentation, quality and appropriateness. Submitted papers must not have appeared in, or be under consideration for, another venue. A full peer-review process will be followed, with each paper reviewed by at least 3 members of the program committee. Submissions will be made through EasyChair.
Extended versions of the best papers will appear, after an additional review process, in a special issue of the Elsevier journal Parallel Computing.
Sat 12 Mar (displayed time zone: Belfast)
09:00 - 10:30
Keynote - Reliability and Energy-efficiency optimizations using Significance-Based Computing
Nikolaos Bellas, University of Thessaly, Greece
Distributed Coordinated Checkpoints with Replication for Automatic Recovery