An overview of fault-tolerant techniques for HPC (PPoPP 2016 - Tutorials)

Who

Thomas Herault, Yves Robert

Track

PPoPP 2016 Tutorials

When

Sat 12 Mar 2016 14:00 - 15:30 at Ibiza - Fault-tolerant techniques for HPC
Sat 12 Mar 2016 16:00 - 17:30 at Ibiza - Fault-tolerant techniques for HPC

Abstract

Resilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance computing. It is organized along four main topics: (i) an overview of failure types (software/hardware, transient/fail-stop) and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) general-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction, and silent error detection; (iii) application-specific techniques, such as ABFT for grid-based algorithms, fixed-point convergence for iterative applications, and user-level in-memory checkpointing; and (iv) quantitative models (from Young’s approximation to Daly’s formulas and recent work), used to evaluate and compare relevant execution scenarios.

Prerequisites

The tutorial is open to all PPoPP’16 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models.

Goals and outline

This tutorial will present a comprehensive survey of the techniques proposed to deal with failures in high performance computing systems. The main goal is to provide the attendees with a clear picture of this important topic by expounding on the techniques and how they work, and providing an overview of the basic tools and knowledge necessary for quantitative and qualitative evaluation.

The tutorial is organized in four parts:

An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal).
Application-specific techniques, such as ABFT for grid-based algorithms or fixed-point convergence for iterative applications.
General-purpose techniques, which include several checkpoint and rollback recovery protocols, possibly combined with replication.
Quantitative models (from Young’s approximation to Daly’s formulas and recent work), used to evaluate and compare relevant execution scenarios.
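
As a rough illustration of the quantitative models named in the last part, the sketch below computes a checkpoint interval from Young's first-order approximation and from Daly's higher-order estimate, as those formulas are commonly stated in the literature. It is not taken from the tutorial material. The function names, the parameter names, and the convention that the interval counts compute time between two checkpoints are assumptions made for this example; both formulas also assume fail-stop failures with exponentially distributed inter-arrival times.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's approximation: compute time between checkpoints.

    checkpoint_cost: time to write one checkpoint (hypothetical name).
    mtbf: platform mean time between failures, in the same unit.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def daly_interval(checkpoint_cost, mtbf):
    """Daly's higher-order estimate of the same interval.

    Falls back to the MTBF itself when checkpoints are very
    expensive relative to the MTBF (checkpoint_cost >= 2 * mtbf).
    """
    if checkpoint_cost >= 2.0 * mtbf:
        return mtbf
    x = checkpoint_cost / (2.0 * mtbf)
    base = math.sqrt(2.0 * checkpoint_cost * mtbf)
    return base * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - checkpoint_cost


if __name__ == "__main__":
    # Illustrative values only: 10-minute checkpoint, 1-day platform MTBF.
    cost, mtbf = 600.0, 86400.0
    print("Young:", young_interval(cost, mtbf), "seconds")
    print("Daly: ", daly_interval(cost, mtbf), "seconds")
```

When the checkpoint cost is small compared with the MTBF, the two estimates stay close; they diverge as the checkpoint cost grows, which is the regime the higher-order correction targets.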

Who

Thomas Herault

Assistant Professor at the University Paris-Sud Orsay, September 2004
Delegated at INRIA (National Institute for Research in Informatics and Automatics), Saclay, September 2008 - August 2010
Visiting Scholar, University of Tennessee, Innovative Computing Laboratory, November 2008 - August 2010
Detached from the Assistant Professor position at the University Paris-Sud Orsay, since September 2010
Research Scientist II, University of Tennessee, Innovative Computing Laboratory, since September 2010

Yves Robert

Permanent Researcher at CNRS, Computer Science department, October 1983
Visiting Scientist, IBM ECSEC Rome, 1986-1987 (contributed to the ESSL scientific library)
Professor, Ecole Normale Superieure de Lyon (1988-present)
Head of CNRS–INRIA project ReMaP, 1995-2000
Vice-head, then head of LIP, the computer science lab, of ENS Lyon, 1993-1997 and 1997-2001
Professor of exceptional class since 2003 – highest position of the French system
Visiting Professor, Univ. Tennessee Knoxville, 1996-1997 (contributed to ScaLAPACK)
Adjunct Professor at Ecole Polytechnique Paris, 1999-2001
Responsible for Computer Science postgraduate studies at ENS Lyon, 1989-1995 and 2004-2007
Head of Computer Science Department at ENS Lyon, 2005-2008
Visiting Professor, Univ. Tennessee Knoxville, 2011-2012 (algorithms for petascale platforms)
Institut Universitaire de France (IUF): junior member, 1993–1998 and senior member, 2007-2012, renewed, 2012-2017.

Thomas Herault

Laboratoire de Recherche en Informatique