PPoPP 2016
Sat 12 - Wed 16 March 2016 Barcelona, Spain

The number of components in High Performance Computing (HPC) systems is increasing at the pace dictated by Moore’s Law, while the mean time between failures (MTBF) of the complete system is shrinking significantly. For example, when accounting for the instruction and data caches and register files, the mean time between soft errors for the Sequoia supercomputer at Lawrence Livermore National Laboratory is estimated to be 1.5 days. As HPC systems move into the Exascale era, the number of system components will increase by up to three orders of magnitude and MTBF will deteriorate further, making resilience a fundamental challenge. This scenario renders current system-level approaches to resilience, such as coordinated checkpointing, infeasible, and motivates the use of algorithmic, programming model, or runtime system approaches to improve the resilience of parallel applications at scale.

While a resilience crisis is looming in the HPC domain, the end of Dennard scaling (i.e., the ability to shrink the feature size of integrated circuits while maintaining a constant power density) has made energy consumption a primary design concern, on par with performance, and one that requires holistic solutions spanning from the hardware to the application software. The Green500 ranking, based on the LINPACK benchmark, shows remarkable improvements in the MFLOPS/W (millions of floating-point operations per Joule) of recent HPC facilities. However, with the cost of 1 MW being close to $1 million, any further improvement in this metric will have an enormous positive impact on the deployment of future Exascale systems. Despite a flurry of research in recent years on techniques that improve the energy-efficiency of HPC systems via software intervention, energy remains transparent to the parallel programming models used in production settings.

The quest for higher energy-efficiency in future HPC systems is inherently connected to the quest for enhanced resilience for two reasons: First, resilience techniques have a non-trivial energy cost. Second, ongoing efforts to further improve the energy-efficiency of hardware at the device level (such as operating hardware below its nominal margins or replacing DDR technology with non-volatile memory technologies) may compromise hardware reliability.

Accepted Papers

  • AcHEe: A Benchmark Suite at the Meeting Point of Heterogeneous and Approximate Computing
  • Distributed Coordinated Checkpoints with Replication for Automatic Recovery
  • Exploring the Interplay of Resilience and Energy Consumption for a Task-Based Partial Differential Equations Preconditioner
  • On the Energy Costs of Fault Tolerance for Matrix Multiplication on Low-Power Multicore Architectures

Call for Papers

The purpose of this workshop is to explore the space of techniques for improving the resilience and energy-efficiency (REE) of parallel programs at the algorithmic and language levels. We are particularly interested in papers that present cross-cutting techniques that trade off energy-efficiency against resilience. We solicit original papers on topics that include, but are not limited to, the following:

  • Programming languages, interfaces, and general software techniques for REE.
  • Scheduling and mapping for REE.
  • Run-times for REE.
  • Algorithmic techniques for REE.
  • Programming models for computing paradigms that improve REE, such as near-threshold computing, approximate computing, or neuromorphic computing.
  • Applications and successful case studies.

Submissions

Papers should not exceed ten single-spaced, double-column pages (including figures, tables, and references) using a 10-point font on 8.5x11-inch pages. We suggest using the IEEE two-column template for conference proceedings. Submissions will be judged based on correctness, originality, technical strength, significance, presentation, quality, and appropriateness. Submitted papers should not have appeared in, or be under consideration for, another venue. A full peer-review process will be followed, with each paper being reviewed by at least 3 members of the program committee. Submissions will be made through EasyChair.

Special Issue

Extended versions of the best papers will appear, after an additional review process, in a special issue of the Elsevier journal Parallel Computing.

Sat 12 Mar

Displayed time zone: Belfast

09:00 - 10:30
Session 1 (PP4REE) at Ibiza
Chair(s): Christos Antonopoulos Department of Electrical and Computer Engineering, University of Thessaly, Greece
09:00
60m
Talk
Keynote - Reliability and Energy-efficiency optimizations using Significance-Based Computing
PP4REE
Nikolaos Bellas University of Thessaly, Greece
10:00
30m
Talk
Distributed Coordinated Checkpoints with Replication for Automatic Recovery
PP4REE
Jorge Villamayor Universidad Autónoma de Barcelona, Dolores Rexachs, Emilio Luque