Debugging Parallel Programs with Instant Replay

1 April 1987

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers

Vol. C-36 (4), 471-482
https://doi.org/10.1109/tc.1987.1676929

Abstract

The debugging cycle is the most common methodology for finding and correcting errors in sequential programs. Cyclic debugging is effective because sequential programs are usually deterministic. Debugging parallel programs is considerably more difficult because successive executions of the same program often do not produce the same results. In this paper we present a general solution for reproducing the execution behavior of parallel programs, termed Instant Replay. During program execution we save the relative order of significant events as they occur, not the data associated with such events. As a result, our approach requires less time and space to save the information needed for program replay than other methods. Our technique is not dependent on any particular form of interprocess communication. It provides for replay of an entire program, rather than individual processes in isolation. No centralized bottlenecks are introduced and there is no need for synchronized clocks or a globally consistent logical time. We describe a prototype implementation of Instant Replay on the BBN Butterfly™ Parallel Processor, and discuss how it can be incorporated into the debugging cycle for parallel programs.
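The recording idea described in the abstract (save the relative order of significant events, not their data) can be illustrated with a minimal sketch. It assumes, purely for illustration, that every shared object carries a version counter incremented on each access and that all accesses are exclusive. The names SharedObject, record_run, and replay_run are hypothetical and are not taken from the paper or from the Butterfly prototype.

```python
import threading


class SharedObject:
    """A shared value plus a version counter bumped on every access (assumed scheme)."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0
        self.cond = threading.Condition()


def run_threads(target, count):
    """Start `count` threads running target(i) and wait for all of them."""
    threads = [threading.Thread(target=target, args=(i,)) for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def update(obj, worker_id):
    # Order-dependent update: the final value depends on the interleaving.
    obj.value = obj.value * 2 + worker_id
    obj.version += 1


def record_run(num_workers, steps):
    """Run freely; each worker logs only the version it saw, never the data."""
    obj = SharedObject()
    logs = [[] for _ in range(num_workers)]  # one private log per worker

    def worker(i):
        for _ in range(steps):
            with obj.cond:
                logs[i].append(obj.version)
                update(obj, i)

    run_threads(worker, num_workers)
    return obj.value, logs


def replay_run(logs):
    """Re-execute; each worker waits until the object reaches its logged version."""
    obj = SharedObject()

    def worker(i):
        for expected in logs[i]:
            with obj.cond:
                obj.cond.wait_for(lambda: obj.version == expected)
                update(obj, i)
                obj.cond.notify_all()  # wake workers waiting for the next version

    run_threads(worker, len(logs))
    return obj.value


if __name__ == "__main__":
    final, logs = record_run(num_workers=3, steps=5)
    # The replayed run follows the recorded order, so it yields the same value.
    print(final == replay_run(logs))
```

Each worker appends only to its own log, and the logs contain version numbers rather than values, which mirrors the abstract's points about saving order instead of data and avoiding a central log. In this toy version the single lock on the object still serializes all accesses.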
