Recovery Patterns for Iterative Methods in a Parallel Unstable Environment

Bosilca, G., Chen, Z., Dongarra, J. and Langou, J. (2007) Recovery Patterns for Iterative Methods in a Parallel Unstable Environment. [MIMS Preprint]

Abstract

Several recovery techniques for parallel iterative methods are presented. First, the implementation of checkpoints in parallel iterative methods is described and analyzed. Then a simple checkpoint-free fault-tolerant scheme for parallel iterative methods, the lossy approach, is presented. When one processor fails and all its data is lost, the system is recovered by computing a new approximate solution from the data of the non-failed processors. The iterative method is then restarted with this new vector. The main advantage of the lossy approach over standard checkpoint algorithms is that it does not increase the computational cost of the iterative solver when no failure occurs. Experiments comparing the different techniques are presented, using the fault-tolerant FT-MPI library. Both iterative linear solvers and eigensolvers are considered.
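
The recovery step described in the abstract can be illustrated with a small serial sketch. The abstract does not give the recovery formula, so the following assumes one plausible choice for a linear system A x = b: the lost entries of the iterate are rebuilt by solving the local block system A[lost, lost] x_lost = b[lost] - A[lost, kept] x[kept], a block-Jacobi-style step. It further assumes that the failed processor's rows of A and b can be reloaded from stable storage, so that only its part of the iterate is gone. The function names (solve_dense, lossy_recover, jacobi) are hypothetical, index lists stand in for processors, and no FT-MPI calls are made.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        # Bring the largest remaining entry of this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y


def lossy_recover(A, b, x, lost):
    """Rebuild the entries of x listed in `lost` from the surviving entries.

    Assumed recovery rule (not taken from the abstract):
        A[lost, lost] * x_lost = b[lost] - A[lost, kept] * x[kept]
    The current values of x at the lost indices are never read.
    """
    lost_set = set(lost)
    kept = [j for j in range(len(x)) if j not in lost_set]
    A_ll = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    x_new = list(x)
    for i, value in zip(lost, solve_dense(A_ll, rhs)):
        x_new[i] = value
    return x_new


def jacobi(A, b, x, sweeps):
    """Plain Jacobi iteration, standing in for the parallel iterative solver."""
    n = len(b)
    for _ in range(sweeps):
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    return x
```

In this sketch a failure would be handled by calling lossy_recover on the current iterate with the failed processor's index list and then passing the returned vector back to jacobi as the new starting point. Nothing extra is computed or stored during failure-free sweeps, which is the property the abstract highlights.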

Item Type: MIMS Preprint
Additional Information: Accepted in SIAM SISC, May 2007
Subjects: MSC 2010, the AMS's Mathematics Subject Classification > 65 Numerical analysis
MSC 2010, the AMS's Mathematics Subject Classification > 68 Computer science
Depositing User: Ms Lucy van Russelt
Date Deposited: 10 Oct 2007
Last Modified: 08 Nov 2017 18:18
URI: https://eprints.maths.manchester.ac.uk/id/eprint/859
