CUMULVS: Providing Fault Tolerance, Visualization, and Steering of Parallel Applications

Abstract
Visualization and computational steering can often help scientists analyze large-scale scientific applications, and fault tolerance is of great importance when running on a distributed system. However, implementing these features is complex and tedious, leaving many scientists with inadequate development tools. CUMULVS is a library that enables programmers to easily incorporate interactive visualization and computational steering into existing parallel programs. Built on the PVM virtual machine framework, CUMULVS is portable and interoperable across all the computer architectures that PVM supports, a growing list that now stands at about 60 architectures. The CUMULVS library is divided into two pieces: one for the application program and one for the visualization and steering front end, which may be a commercial product. Together, these two libraries encompass all the connection and data protocols needed to dynamically attach multiple, independent viewer front ends to a running parallel application. Viewer programs can also steer one or more user-defined parameters to "close the loop" for computational experiments and analyses. CUMULVS allows the programmer to specify user-directed checkpoints that save important program state in case of failures, and it also provides a mechanism to migrate tasks across heterogeneous machine architectures to achieve improved performance. Details of the CUMULVS design goals and compromises, as well as future directions, are given.
