Study of failure detection and recovery in manufacturing grid resource service scheduling

Abstract
A Manufacturing Grid (MGrid) system differs from a conventional distributed computing system in its focus on sharing larger-scale, distributed, and heterogeneous manufacturing resources and services, including equipment, application system, material, technical, and public service resources. In such a system, remote resource service scheduling (e.g., remote access and information communication) strongly influences the MGrid quality of service (QoS). Failures are more likely in MGrid resource service scheduling, and they can seriously affect task execution and the QoS of the MGrid. This paper therefore focuses on failures in the MGrid resource service scheduling process. The potential failures that can occur during scheduling are investigated; thirteen failures are defined in detail and classified into four categories: (a) virtual-link-related failures, (b) resource-service-related failures, (c) task-related failures, and (d) application-related failures. A failure management system for the MGrid is presented together with its architecture. Detection mechanisms and methods for each defined failure are presented in detail, as are the corresponding failure recovery methods. The implementation and simulation results indicate that the proposed approaches are effective in improving the successful scheduling rate and shortening the total execution time of MGrid resource services.
