Study of failure detection and recovery in manufacturing grid resource service scheduling

31 October 2008

journal article
research article
Published by Taylor & Francis in International Journal of Production Research

Vol. 48 (1), 69-94
https://doi.org/10.1080/00207540802275871

Abstract

A Manufacturing Grid (MGrid) system differs from a conventional distributed computing system in its focus on sharing larger-scale, distributed and heterogeneous manufacturing resources and services, such as equipment resources, application system resources, material resources, technical resources and public service resources. In such a system, remote resource service scheduling (e.g., remote access and information communication) has a large influence on the MGrid quality of service (QoS). Failures are more likely in MGrid resource service scheduling, and when they occur they can critically affect task execution and the QoS of the MGrid. This paper therefore focuses on failures in the MGrid resource service scheduling process and investigates the potential failures that can occur during it. Thirteen failures are defined in detail and classified into four categories: (a) virtual-link-related failures, (b) resource-service-related failures, (c) task-related failures, and (d) application-related failures. A failure management system for the MGrid is presented together with its architecture. Detection mechanisms and methods for each defined failure are described in detail, as are the corresponding failure recovery methods. Implementation and simulation results indicate that the proposed approaches raise the rate of successful scheduling and shorten the total execution time of MGrid resource services.
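The abstract names the four failure categories and pairs each failure with a detection mechanism and a recovery method, but it does not spell those mechanisms out. The Python sketch below is only an illustration under stated assumptions: it uses two generic techniques, heartbeat-timeout detection and rescheduling to an alternative resource service, which are not necessarily the methods the paper proposes. The functions `detect_silent_services` and `reschedule`, the timeout value, and the task and service names (`t1`, `cnc-A`, and so on) are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Set


class FailureCategory(Enum):
    # The four categories named in the abstract.
    VIRTUAL_LINK = "virtual-link-related"
    RESOURCE_SERVICE = "resource-service-related"
    TASK = "task-related"
    APPLICATION = "application-related"


@dataclass
class Failure:
    category: FailureCategory
    task_id: str
    service_id: str


def detect_silent_services(assignment: Dict[str, str],
                           last_heartbeat: Dict[str, float],
                           now: float, timeout: float) -> List[Failure]:
    """Hypothetical heartbeat check: a task whose assigned resource service
    has not reported for more than `timeout` seconds yields a
    resource-service-related failure. Services never heard from count as silent."""
    failures = []
    for task_id, service_id in assignment.items():
        last = last_heartbeat.get(service_id, float("-inf"))
        if now - last > timeout:
            failures.append(Failure(FailureCategory.RESOURCE_SERVICE, task_id, service_id))
    return failures


def reschedule(failure: Failure, assignment: Dict[str, str],
               candidates: Dict[str, List[str]], failed: Set[str]) -> Optional[str]:
    """Hypothetical recovery: move the task to the first alternative candidate
    service that is not known to have failed. Returns the new service id, or
    None when no alternative exists (the failure would then be escalated)."""
    for service_id in candidates.get(failure.task_id, []):
        if service_id != failure.service_id and service_id not in failed:
            assignment[failure.task_id] = service_id
            return service_id
    return None


# Example run with made-up task and service names.
assignment = {"t1": "cnc-A", "t2": "cad-B"}
heartbeats = {"cnc-A": 100.0, "cad-B": 118.0}
failures = detect_silent_services(assignment, heartbeats, now=120.0, timeout=10.0)
failed = {f.service_id for f in failures}
for f in failures:
    reschedule(f, assignment, {"t1": ["cnc-A", "cnc-C"]}, failed)
print(assignment)  # {'t1': 'cnc-C', 't2': 'cad-B'}
```

Here `cnc-A` has been silent for 20 s, which exceeds the 10 s timeout, so task `t1` is moved to the alternative `cnc-C`. `cad-B` reported 2 s earlier and keeps its task. Returning `None` when no alternative remains leaves room for a higher-level handler, such as one for application-related failures, to take over.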

Cited by 21 articles