Error/failure analysis using event logs from fault tolerant systems
- 10 December 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
A methodology for the analysis of automatically generated event logs from fault tolerant systems is presented. The methodology is illustrated using event log data from three Tandem systems. Two are experimental systems, with nonstandard hardware and software components causing accelerated stresses and failures. Errors are identified on the basis of knowledge of the architectural and operational characteristics of the measured systems. The methodology takes a raw event log and reduces the data by event filtering and time-domain clustering. Probability distributions to characterize the error detection and recovery processes are obtained, and the corresponding hazards are calculated. Multivariate statistical techniques (factor analysis and cluster analysis) are used to investigate error and failure dependency among different system components. The dependency analysis is illustrated using processor halt data from one of the measured systems. It is found that the number of errors is small, even though the measurement period is relatively long. This reflects the high dependability of the measured systems.
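The abstract does not say how the time-domain clustering step is carried out. As a minimal sketch of one common approach, assuming a fixed gap threshold, the Python below merges events into the same cluster whenever consecutive timestamps are no more than `max_gap` seconds apart. The `(timestamp, component)` record layout, the function name `cluster_events`, the threshold value, and the sample data are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical record layout: (timestamp in seconds, component name).
Event = Tuple[float, str]


def cluster_events(events: List[Event], max_gap: float) -> List[List[Event]]:
    """Group events so that consecutive events within a cluster are
    at most max_gap seconds apart (assumed fixed-gap rule)."""
    clusters: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        # Compare against the last event of the most recent cluster.
        if clusters and event[0] - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append(event)
        else:
            clusters.append([event])
    return clusters


# Made-up sample log: a burst of three events, then a later burst of two.
sample = [
    (0.0, "cpu0"),
    (2.0, "cpu0"),
    (3.5, "disk1"),
    (600.0, "cpu1"),
    (601.0, "cpu1"),
]
clusters = cluster_events(sample, max_gap=5.0)
for c in clusters:
    print(len(c), "events from", c[0][0], "to", c[-1][0])
```

With a rule of this kind, each cluster can be treated as a single error episode, so the choice of `max_gap` directly affects how many errors are counted.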