Fault Tolerance in Message Passing Interface Programs
- 1 August 2004
- journal article
- research article
- Published by SAGE Publications in The International Journal of High Performance Computing Applications
- Vol. 18 (3), 363-372
- https://doi.org/10.1177/1094342004046045
Abstract
In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI) applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
This publication has 5 references indexed in Scilit:
- HARNESS and fault tolerant MPI. Parallel Computing, 2001
- Components and interfaces of a process management system for parallel programs. Parallel Computing, 2001
- MPI-FT: Portable Fault Tolerance Scheme for MPI. Parallel Processing Letters, 2000
- FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World. Lecture Notes in Computer Science, 2000
- Low-latency, concurrent checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 1994