CSR-PSCE,SM: Recovery Aware Parallel Computing
Illinois Institute of Technology, Chicago, IL
Investigators
Abstract
As parallel systems grow in scale and complexity, failures become inevitable. For years, research has focused on pre-failure prediction and tolerance: predicting failures and taking precautionary actions before they occur. Despite progress in failure prediction, unexpected failures still occur in practice, especially in modern systems of unprecedented size and complexity. Because failures are inevitable, relying on pre-failure prediction and tolerance alone is insufficient for fault management. Post-failure diagnosis and recovery are as important as failure avoidance and have a profound impact on almost every aspect of parallel computing. The goal of this research project is to develop RAPS, a Recovery Aware Parallel computing System for post-failure diagnosis and recovery. The research focuses on how to resume parallel computing quickly and effectively after a failure has occurred. The ultimate goal is to seamlessly integrate post-failure diagnosis and recovery with pre-failure prediction and tolerance as a compound fault management solution for parallel computing. The approach consists of (1) development of new diagnosis mechanisms for fast failure detection and root cause analysis, (2) development of system-wide orchestration for recovery coordination, (3) design of new recovery techniques for quick restoration of parallel applications, and (4) a comprehensive evaluation. The results of this project can significantly improve the productivity of parallel systems. The project also enhances the CS curriculum at IIT and broadens participation by underrepresented groups.