Resilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance computing. It is organized along four main topics: (i) an overview of failure types (software/hardware, transient/fail-stop) and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) general-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction, and silent error detection; (iii) application-specific techniques, such as ABFT for grid-based algorithms, fixed-point convergence for iterative applications, and user-level in-memory checkpointing; and (iv) a quantitative evaluation and comparison of relevant execution scenarios through performance models (from Young’s approximation to Daly’s formulas and recent work).
The tutorial is open to all PPoPP’15 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models.
Goals and outline
This tutorial will present a comprehensive survey of the techniques proposed to deal with failures in high-performance computing systems. The main goal is to give attendees a clear picture of this important topic by explaining the techniques and how they work, and by providing an overview of the basic tools and knowledge needed for quantitative and qualitative evaluation.
The tutorial is organized in four parts:
- An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal).
- Application-specific techniques, such as ABFT for grid-based algorithms or fixed-point convergence for iterative applications.
- General-purpose techniques, which include several checkpoint and rollback recovery protocols, possibly combined with replication.
- A quantitative evaluation and comparison of relevant execution scenarios through performance models (from Young’s approximation to Daly’s formulas and recent work).
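As a rough illustration of the quantitative models named in the last item, the sketch below computes a checkpointing period with Young's first-order approximation and with the higher-order estimate commonly attributed to Daly. The function names, the symbols `checkpoint_cost` (C) and `mtbf` (mu), and the sample values are assumptions chosen for this example; they are not taken from the tutorial material.

```python
import math


def young_period(checkpoint_cost, mtbf):
    """Young's first-order approximation: sqrt(2 * C * mu).

    checkpoint_cost: time to write one checkpoint (assumed symbol C), in seconds
    mtbf: platform mean time between failures (assumed symbol mu), in seconds
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def daly_period(checkpoint_cost, mtbf):
    """Daly's higher-order estimate of the compute time between checkpoints.

    Falls back to the MTBF itself when the checkpoint cost is at least
    twice the MTBF, as in the usual statement of the formula.
    """
    if checkpoint_cost >= 2.0 * mtbf:
        return mtbf
    x = checkpoint_cost / (2.0 * mtbf)
    base = math.sqrt(2.0 * checkpoint_cost * mtbf)
    return base * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - checkpoint_cost


if __name__ == "__main__":
    # Illustrative values only: 10-minute checkpoints, one failure per day.
    C, mu = 600.0, 86400.0
    print("Young:", young_period(C, mu))
    print("Daly: ", daly_period(C, mu))
```

Both functions return a duration in the same unit as their inputs. When the checkpoint cost is small relative to the MTBF, the two estimates stay close to each other, which is the regime where a first-order model is usually considered adequate.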
- Assistant Professor at the University Paris-Sud Orsay, September 2004
- On secondment at INRIA (French National Institute for Research in Computer Science and Automation), Saclay, September 2008 - August 2010
- Visiting Scholar, University of Tennessee, Innovative Computing Laboratory, November 2008 - August 2010
- On leave from the Assistant Professor position at the University Paris-Sud Orsay, since September 2010
- Research Scientist II, University of Tennessee, Innovative Computing Laboratory, since September 2010
- Permanent Researcher at CNRS, Computer Science department, October 1983
- Visiting Scientist, IBM ECSEC Rome, 1986-1987 (contributed to the ESSL scientific library)
- Professor, Ecole Normale Superieure de Lyon (1988-present)
- Head of CNRS–INRIA project ReMaP, 1995-2000
- Vice-head, then head, of LIP, the computer science laboratory of ENS Lyon, 1993-1997 and 1997-2001
- Professor of exceptional class since 2003, the highest rank in the French system
- Visiting Professor, Univ. Tennessee Knoxville, 1996-1997 (contributed to ScaLAPACK)
- Adjunct Professor at Ecole Polytechnique Paris, 1999-2001
- Responsible for Computer Science postgraduate studies at ENS Lyon, 1989-1995 and 2004-2007
- Head of Computer Science Department at ENS Lyon, 2005-2008
- Visiting Professor, Univ. Tennessee Knoxville, 2011-2012 (algorithms for petascale platforms)
- Institut Universitaire de France (IUF): junior member, 1993-1998; senior member, 2007-2012, renewed for 2012-2017.
Sat 12 Mar, 14:00 - 15:30 (Greenwich Mean Time): An overview of fault-tolerant techniques for HPC