Failure analysis and modeling of a VAXcluster system
- 4 December 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 244-251
- https://doi.org/10.1109/ftcs.1990.89372
Abstract
The authors discuss the results of a measurement-based analysis of real error data collected from a DEC VAXcluster multicomputer system. In addition to evaluating basic system dependability characteristics, such as error and failure distributions and hazard rates for both individual machines and the VAXcluster, they develop reward models to analyze the impact of failures on the system as a whole. The results show that more than 46% of all failures were due to errors in shared resources. This is despite the fact that these errors have a recovery probability greater than 0.99. The hazard rate calculations show that not only errors but also failures occur in bursts. Approximately 40% of all failures occur in bursts and involve multiple machines. This result indicates that correlated failures are significant. Analysis of rewards shows that software errors have the lowest reward (0.05 versus 0.74 for disk errors). The expected reward rate (reliability measure) of the VAXcluster drops to 0.5 in 18 hours for the 7-out-of-7 model and in 80 days for the 3-out-of-7 model. The VAXcluster system availability is evaluated to be 0.993 over 250 days of operation.
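As a rough illustration of two quantities named in the abstract, the sketch below shows one common way to express a k-out-of-n reward and the downtime implied by an availability figure. The binary reward definition (1 when at least k machines are up, 0 otherwise) and the function names are assumptions made for this example; the abstract does not state the paper's exact reward function.

```python
def k_out_of_n_reward(machines_up: int, k: int) -> float:
    """Assumed binary reward: 1.0 if at least k machines are operational, else 0.0."""
    return 1.0 if machines_up >= k else 0.0


def expected_downtime_days(availability: float, period_days: float) -> float:
    """Downtime implied by a steady availability figure over a given period."""
    return (1.0 - availability) * period_days


if __name__ == "__main__":
    # With 5 of 7 machines up, the 7-out-of-7 model earns no reward,
    # while the 3-out-of-7 model still earns full reward.
    print(k_out_of_n_reward(5, 7), k_out_of_n_reward(5, 3))
    # Availability of 0.993 over 250 days corresponds to about 1.75 days of downtime.
    print(round(expected_downtime_days(0.993, 250), 2))
```

Under this assumed definition, the stricter 7-out-of-7 model loses its reward as soon as any single machine fails, which is consistent with its expected reward rate falling far faster than that of the 3-out-of-7 model.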
This publication has 12 references indexed in Scilit:
- A Statistical Load Dependency Model for CPU Errors at SLAC. Institute of Electrical and Electronics Engineers (IEEE), 2005
- Analysis of workload influence on dependability. Institute of Electrical and Electronics Engineers (IEEE), 2003
- Markov and Markov reward model transient analysis: An overview of numerical approaches. European Journal of Operational Research, 1989
- Approximate availability analysis of VAXcluster systems. IEEE Transactions on Reliability, 1989
- Probabilistic modeling of computer system availability. Annals of Operations Research, 1987
- Measurement and modeling of computer reliability as affected by system activity. ACM Transactions on Computer Systems, 1986
- VAXcluster. ACM Transactions on Computer Systems, 1986
- Effect of System Workload on Operating System Reliability: A Study on IBM 3081. IEEE Transactions on Software Engineering, 1985
- Closed-Form Solutions of Performability. IEEE Transactions on Computers, 1982
- A Unified Reliability Model for Fault-Tolerant Computers. IEEE Transactions on Computers, 1980