Abstract
The authors present a measurement-based study of software failures and recovery in the Tandem GUARDIAN90 operating system, based on a collection of memory dump analyses of field software failures. They identify the effects of software faults on the processor state and trace the propagation of those effects to other areas of the system. They also evaluate the role of defensive programming techniques and the software fault tolerance provided by the process pair mechanism implemented in the Tandem system. Results show that the Tandem system tolerates nearly 82% of reported field software faults, demonstrating its effectiveness against software faults. Consistency checks made by the operating system detect 52% of software problems and prevent any error propagation in 31% of software problems. Results also show that 72% of reported field software failures are recurrences of known software faults and that 70% of the recurrence groups have identical characteristics.
