OAC Core: Interpretable Resilience Analysis Platform for Scientific Workflow Applications

$584,640 | FY2022 | CSE | NSF

Kent State University, Kent OH

Abstract

For years, scientists have improved the performance of their simulations while neglecting resilience. This approach was driven by a lack of understanding of the "cause and effect" in resilience analysis. While current resilience analysis tools continue to lack transparency and interpretability, it is critical that resilience analysis be promoted and that scientists be educated on its importance. This project's novelty lies in redefining resilience analysis in terms of interpretability and explainability. The approach differs significantly from existing endeavors: it can explain or identify the logic behind resilience predictions and differentiate the functions and usages of existing tools built on different theories. The project's impacts include designing a new resilience assessment system that uses visualization and DevOps to enable transparent resilience analysis, vulnerability positioning, and automation of resilience continuous integration. The project will work with NSF- and DoE-sponsored supercomputing centers to adopt the system with proven success. Graduate and undergraduate students, especially from underrepresented groups, will be trained in multiple disciplines, enabling them to have successful careers in computing and scientific research areas that are becoming increasingly interdisciplinary.

This project builds upon existing knowledge to create a new, insightful approach that enables the resilience property of scientific applications to be assessed under the inevitable surge of soft errors in next-generation high-performance computing systems. The project will bring further clarity, insight, and understanding into how systems behave while running high-performance computing scientific workloads composed of parallel simulations for data generation, big data analytics, and machine learning to extract data insights in scientific research. The project proposes (1) the design and implementation of an error propagation analysis platform that creates interpretable visualizations of the critical paths and critical sections of the codes; (2) analytics that allow domain scientists to compare and contrast different resilience models on the simulation codes; (3) a continuous resilience assessment (Resilience CI) that can be integrated into a standard continuous integration pipeline to automate the procedure, delivering the change in the resilience property between committed versions to developers as a standard report and supporting the DevOps of exascale scientific applications; and (4) an evaluation that uses quantum chemistry workflows as the driver applications. The project's outcomes, such as tutorials, collected data, and the visualization software system, can encourage application developers to incorporate cost-effective fault tolerance strategies. In addition, the investigators will incorporate research outcomes into new courses and tutorials for workforce training. The project will engage and advance partnerships with industry for commercialization.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
