Analysis of a composite performance reliability measure for fault-tolerant systems

Abstract
Today's concomitant needs for higher computing power and reliability have increased the relevance of multiple-processor fault-tolerant systems. Multiple functional units improve the raw performance (throughput, response time, etc.) of the system, and, as units fail, the system may continue to function, albeit with degraded performance. Such systems, and other fault-tolerant systems, are not adequately characterized by separate performance and reliability measures. A composite measure for the performance and reliability of a fault-tolerant system observed over a finite mission time is analyzed. A Markov chain model is used for system state-space representation, and transient analysis is performed to obtain closed-form solutions for the density and moments of the composite measure. Only failures that cannot be repaired until the end of the mission are modeled. The time spent in a specific system configuration is assumed to be large enough to permit the use of a hierarchical model and static measures to quantify the performance of the system in individual configurations. For a multiple-processor system, where performance measures are usually associated with and aggregated over many jobs, this is tantamount to assuming that the time to process a job is much smaller than the time between failures. An extension of the results to general acyclic Markov chain models is included.
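As a rough illustration of the setting, and not a reproduction of the paper's derivation, the sketch below reads the composite measure as the total work a system accumulates over the mission. That reading is an assumption, as are the specifics chosen here: n identical processors, each failing independently at a constant rate lam with no repair, and each delivering a fixed static throughput `rate` while operational. All function and parameter names are hypothetical. Under these assumptions the mean has a simple closed form, which the sketch checks against a Monte Carlo simulation of the underlying pure-death Markov chain.

```python
import math
import random


def expected_work(n, lam, rate, mission):
    """Mean work accumulated over [0, mission] (assumes lam > 0).

    Each processor contributes `rate` per unit time until it fails or the
    mission ends, and E[min(X, mission)] = (1 - exp(-lam * mission)) / lam
    for an exponential lifetime X with rate lam.
    """
    return n * rate * (1.0 - math.exp(-lam * mission)) / lam


def simulate_work(n, lam, rate, mission, rng):
    """One sample path of the pure-death chain n -> n-1 -> ... -> 0."""
    remaining = mission  # mission time still left
    work = 0.0
    up = n               # number of operational processors
    while up > 0 and remaining > 0.0:
        # With `up` processors alive, the next failure occurs at rate up * lam.
        sojourn = rng.expovariate(up * lam)
        dwell = min(sojourn, remaining)
        # Static performance in this configuration: up * rate per unit time.
        work += up * rate * dwell
        remaining -= dwell
        up -= 1
    return work


def simulate_mean(n, lam, rate, mission, runs=20000, seed=1):
    """Monte Carlo estimate of the mean accumulated work."""
    rng = random.Random(seed)
    total = sum(simulate_work(n, lam, rate, mission, rng) for _ in range(runs))
    return total / runs
```

For example, `expected_work(4, 0.1, 1.0, 10.0)` evaluates 40 (1 - e^{-1}), roughly 25.28, and `simulate_mean(4, 0.1, 1.0, 10.0)` should land close to that value. The simulation only yields sample moments; the closed-form density discussed in the abstract is not attempted here.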