GGrantIndex
← Search

SHF: Small: Program Analysis-based Makeover for HPC Application Resilience

$420,000FY2016CSENSF

University Of Southern California, Los Angeles CA

Investigators

Abstract

HPC resilience in the presence of increased system failures is a major technical hurdle for realizing the vision of the US National Research Council for conducting exascale science. Existing techniques, based primarily on checkpoint and replay, are no longer effective for emerging systems with orders-of-magnitude more hardware and software components. This project aims to overcome the main limitation of existing techniques: the detection and mitigation of silent errors by developing and leveraging automated software analysis and synthesis techniques. The new methods under development can compile a tunable degree of resilience into the application software code, and have potential to transform the development of future generations of HPC applications. By treating the software code as white-boxes, as opposed to black-boxes, these new methods can provide significantly more economical solutions to the HPC resilience problem compared to existing techniques. The project will help realize the US NRC's vision of conducting exascale science, which is crucial for addressing the nation?s urgent needs in frontiers such as new energy, health care, and national security. This project develops automated program analysis techniques for identifying invariants from software code, and leveraging these invariants to detect and mitigate silent errors at run time. By treating the application software code as white-boxes, it seeks to generate invariants that capture the expected program behavior. By leveraging the invariants as correctness conditions, it overcomes the major hurdle in detecting silent errors, which is the lack of visible symptoms. In addition to detecting errors, the invariants are also used by runtime monitors to intelligently perturb the execution order or memory state to proactively avoid failures at run time. When the rollback recovery becomes inevitable, the invariants are used as guidance to minimize the re-execution overhead. The proposed methods and software tools are evaluated on real applications from the research community as well as sources such as SciDAC.

View original record on NSF Award Search →