XPS: FULL: CCA: Production-Run Failure Recovery Based Approach to Reliable Parallel Software
University of Chicago, Chicago, IL
Investigators
Abstract
Concurrency bugs are a severe threat to system reliability in the multi-core era. Approaches to handling concurrency bugs and improving the reliability of production-run parallel software are sorely needed. This project aims to create a new parallel computing paradigm. Its intellectual merit is that it will pioneer treating run-time failure recovery as the default for parallel programs and reshape every aspect of parallel-program development and maintenance around that default. Its broader significance is that it will help lower the costs of software development, in-house testing, failure diagnosis, and bug repair, broadly benefiting society through better-performing parallel software. Specifically, the proposed framework will include five components: (1) a feather-weight run-time recovery framework that utilizes natural program idempotence to recover from concurrency-bug failures; (2) a new code-development system that guides developers to write software with improved recoverability; (3) a new in-house testing system, where the testing focus is shifted towards hard-to-recover code; (4) a new run-time monitoring system that applies on-demand monitoring to support run-time recovery; (5) a new off-line failure diagnosis system that leverages the feedback from recovery for failure diagnosis and fixing. These five components will work together to significantly improve the reliability and lower the development cost of parallel software.
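The idea behind component (1) can be illustrated with a small sketch. The abstract does not describe the framework's actual mechanism, so everything below is an assumption made for illustration: the names `run_idempotent_region`, `RecoveryFailed`, `read_worker_count`, and `writer_step` are hypothetical, and the concurrency bug is simulated without real threads so that the example is deterministic. The point it shows is that a code region which only reads shared state before failing can be recovered by re-executing it from the start, with no checkpoint.

```python
class RecoveryFailed(Exception):
    """Raised when re-execution does not make the failure go away."""


def run_idempotent_region(region, max_retries=3, before_retry=None):
    """Run `region`; if it fails, recover by re-executing it from the start.

    Assumption: `region` is idempotent, i.e. it has not written shared state
    or performed I/O before the failure point, so re-running it is safe.
    `before_retry` is a hypothetical hook that stands in for other threads
    making progress between attempts.
    """
    for attempt in range(max_retries + 1):
        try:
            return region(), attempt
        except AssertionError:
            # A failure was detected; give the rest of the program a chance
            # to advance, then re-execute the region.
            if attempt < max_retries and before_retry is not None:
                before_retry()
    raise RecoveryFailed(f"region still failing after {max_retries} retries")


# Simulated order violation: the reader runs before the writer has
# initialized the shared configuration.
shared = {"config": None}


def writer_step():
    """Stand-in for the writer thread finishing its initialization."""
    shared["config"] = {"workers": 4}


def read_worker_count():
    """Idempotent region: reads shared state, writes nothing."""
    config = shared["config"]
    assert config is not None, "read before initialization"
    return config["workers"]


result, retries = run_idempotent_region(read_worker_count, before_retry=writer_step)
print(result, retries)  # prints "4 1": recovered after one re-execution
```

In this sketch the first attempt fails because the shared value is not yet initialized, and the second attempt succeeds once the simulated writer has run. A region that had already modified shared state before failing would not qualify for this kind of recovery, which is presumably why the proposal pairs recovery with tooling that steers developers and testers toward recoverable code.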