Algorithmic Based Fault Tolerance Applied to High Performance Computing

Bosilca, George and Delmas, Remi and Dongarra, Jack and Langou, Julien (2009) Algorithmic Based Fault Tolerance Applied to High Performance Computing. [MIMS Preprint]


Abstract

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication achieves 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even when one process failure occurs during the run. This represents 65% of machine peak and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
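
As background for the abstract, the checksum idea behind Algorithmic Based Fault Tolerance can be sketched in a few lines. The following is a minimal serial illustration of the classic Huang and Abraham scheme for C = A * B, not the parallel implementation described in the paper. All function names are hypothetical, and integer matrices are assumed so that the checksum comparisons are exact (floating-point data would need a tolerance).

```python
# Illustrative sketch of checksum-based ABFT (Huang and Abraham, 1984).
# Function names are hypothetical; integer entries keep the checks exact.

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]

def abft_matmul(a, b):
    """Return the full-checksum product: C plus a checksum row and column."""
    return matmul(with_column_checksum(a), with_row_checksum(b))

def detect_and_correct(cf):
    """Locate and repair at most one corrupted data entry of cf in place.

    Returns the (row, column) repaired, or None if all checksums agree.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single corrupted data entry")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the entry from the row checksum and the other entries in the row.
    cf[i][j] = cf[i][m] - sum(cf[i][k] for k in range(m) if k != j)
    return (i, j)
```

The checksum row and column are carried through the multiplication itself, so a single corrupted entry shows up as exactly one inconsistent row and one inconsistent column, which together give its position. This sketch assumes the checksum entries themselves are intact.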

Item Type: MIMS Preprint
Additional Information: Appears also as Technical Report UT-CS-08-620, Department of Computer Science, University of Tennessee, Knoxville, TN, USA, June 2008 and as LAPACK Working Note 205
Subjects: MSC 2010, the AMS's Mathematics Subject Classification > 65 Numerical analysis
MSC 2010, the AMS's Mathematics Subject Classification > 68 Computer science
Depositing User: Ms Lucy van Russelt
Date Deposited: 13 Jan 2009
Last Modified: 20 Oct 2017 14:12
URI: https://eprints.maths.manchester.ac.uk/id/eprint/1209
