Study of failure detection and recovery in manufacturing grid resource service scheduling
- 31 October 2008
- journal article
- research article
- Published by Taylor & Francis in International Journal of Production Research
- Vol. 48 (1), 69-94
- https://doi.org/10.1080/00207540802275871
Abstract
A Manufacturing Grid (MGrid) system is different from a conventional distributed computing system due to its focus on larger-scale, distributed and heterogeneous manufacturing resource and service sharing, including equipment resources, application system resources, material resources, technical resources, public service resources, etc., where remote resource service scheduling (e.g., remote accessing, information communication) has a large influence on the MGrid quality of service (QoS). The probability of failure is higher in MGrid resource service scheduling and failures affect task execution and the quality of service of the MGrid fatally. Therefore, this paper focuses on failures in the MGrid resource service scheduling process. The potential failures that can occur during MGrid resource service scheduling are investigated. Thirteen failures are first defined in detail and are classified into four categories: (a) virtual-link-related failures, (b) resource-service-related failures, (c) task-related failures, and (d) application-related failures. A failure management system of the MGrid system is presented associated with its architecture. Corresponding detection mechanisms and methods for each defined failure are presented in detail, as well as the corresponding failure recovery methods. The implementation and simulation results indicate that our approaches are sound for promoting a successful scheduling rate and shortening the total execution time of the MGrid resource service.