DMTCP: Checkpoint-Restart on the Desktop

$376,870 | FY2010 | CSE | NSF

Northeastern University, Boston MA

Investigators

Abstract

Title: DMTCP: Checkpoint-Restart on the Desktop
PI: Gene Cooperman
NSF Proposal Number: 0960978

ABSTRACT: This work builds upon the existing open source, user-space DMTCP package for transparent, distributed checkpointing. Three goals will be accomplished: (i) checkpoint-restart of long-running computations on the desktop; (ii) save-restore of interactive software packages; and (iii) a universal reversible debugger. The first two goals will allow software development teams to add to their package a reliable "save workspace" feature, with no requirement for a kernel module or other privileged operations. The third goal is to enhance any debugger with reversibility (e.g., a back-step command), and with a reverse expression watchpoint command to move backwards from a software error to the original software fault.

INTELLECTUAL MERIT: While checkpointing has existed for over 20 years, earlier packages were difficult to maintain. The unprivileged, user-space design of DMTCP has a five-year track record. It is ideal for integration into other software, where any end-user requirement for installation of a kernel module or other administrative privilege is incompatible with widespread distribution. Finally, DMTCP is the first package able to directly checkpoint a gdb session (the gdb process and its target process), a key feature for the envisioned new type of reversible debugger.

BROADER IMPACT: Checkpointing and process migration have long been of interest for science and engineering, but too often suffered from software fragility or special requirements. The DMTCP approach removes these obstacles. Further, the wider use of "time-traveling (reversible) debuggers" will greatly accelerate software development due to the greater ease of finding bugs. A NIST report estimates the cost of software bugs to the economy at $59.5 billion per year. Finally, the excitement factor of checkpoint-restart on the desktop helps attract and motivate students toward the learning of sometimes arcane systems issues in this critical technology.
