Debugging Parallel Programs with Instant Replay
- 1 April 1987
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. C-36 (4), 471-482
- https://doi.org/10.1109/tc.1987.1676929
Abstract
The debugging cycle is the most common methodology for finding and correcting errors in sequential programs. Cyclic debugging is effective because sequential programs are usually deterministic. Debugging parallel programs is considerably more difficult because successive executions of the same program often do not produce the same results. In this paper we present a general solution for reproducing the execution behavior of parallel programs, termed Instant Replay. During program execution we save the relative order of significant events as they occur, not the data associated with such events. As a result, our approach requires less time and space to save the information needed for program replay than other methods. Our technique is not dependent on any particular form of interprocess communication. It provides for replay of an entire program, rather than individual processes in isolation. No centralized bottlenecks are introduced and there is no need for synchronized clocks or a globally consistent logical time. We describe a prototype implementation of Instant Replay on the BBN Butterfly™ Parallel Processor, and discuss how it can be incorporated into the debugging cycle for parallel programs.
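The idea of saving only the relative order of events, not their data, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-object version counter, the class name `ReplayableObject`, and the function `run` are assumptions made for illustration, since the abstract does not describe the mechanism. Each worker keeps its own log of the version numbers it observed, so there is no central log and no clock. During replay a worker blocks until the shared object reaches the version it recorded.

```python
import threading


class ReplayableObject:
    """Hypothetical shared object whose updates are ordered by a version counter."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0
        self._cond = threading.Condition()

    def update(self, fn, log, mode):
        with self._cond:
            if mode == "record":
                # Save only the order of the event (the version seen), not the data.
                log.append(self.version)
            else:
                # Replay: wait until the object reaches the recorded version.
                expected = log.pop(0)
                self._cond.wait_for(lambda: self.version == expected)
            self.value = fn(self.value)
            self.version += 1
            self._cond.notify_all()


def run(mode, logs=None, workers=3, steps=4):
    """Run the workers once; return the final value and the per-worker logs."""
    obj = ReplayableObject()
    if mode == "record":
        logs = [[] for _ in range(workers)]
    else:
        logs = [list(log) for log in logs]  # copy, so the recording is reusable

    def worker(i):
        for _ in range(steps):
            # Non-commutative update: the result depends on the interleaving.
            obj.update(lambda v: v * 2 + i, logs[i], mode)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return obj.value, logs


if __name__ == "__main__":
    value, logs = run("record")
    replayed, _ = run("replay", logs)
    print(value, replayed, logs)  # the two values match on every replay
```

The sketch treats every access as an exclusive update and omits concurrent readers. The logs hold one small integer per access regardless of how much data the access touches, which is the time and space saving the abstract refers to.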