SHF: Small: Collaborative Research: ALETHEIA: A Framework for Automatic Detection/Correction of Corruptions in Extreme Scale Scientific Executions
Northwestern University, Evanston IL
Investigators
Abstract
Trusting scientific applications requires guaranteeing the validity of computed results. Unfortunately, there are many examples of scientific computations that produced incorrect results, sometimes with catastrophic consequences. Currently known validation techniques cover only a fraction of the possible corruptions that numerical simulation and data analytics applications may suffer during execution. As scientific processes grow in size and complexity, the reliability and validity of their constituent steps are increasingly difficult to ascertain. Assessing validity in the presence of potential data corruptions is a serious and insufficiently recognized problem. Corruption may occur at all levels of computing, from the hardware to the application. An important aspect of these corruptions is that, until they are discovered, all executions are at risk of being corrupted silently. In some documented cases, months have elapsed between the discovery of a corruption and notification to users. In the meantime, a potentially large number of executions may be corrupted, and incorrect conclusions may result. It may be difficult to check after the fact whether executions have actually been corrupted, so even corruptions that do not lead to mistakes may cause significant productivity losses. Virtually all simulations producing very large results need to reduce their data volume in some way before saving it; one such technique is lossy compression. This project strives to validate the end result of the simulation coupled with lossy compression. This approach is useful for scientific simulations in such diverse areas as climate, cosmology, fluid dynamics, weather, and astrophysics, which are the drivers of this project. This collaborative project applies the principle of an external algorithmic observer (EAO), where the product of a scientific application is compared with that of a surrogate function of much lower complexity.
Corruptions are corrected using a variation of triple modular redundancy: if a corruption is detected, a second surrogate function is executed, and the correct value is chosen from the two results that are most in agreement. This new online detection/correction approach involves approximate comparison of the lossy compressed results of the scientific application and the surrogate function. The project explores the detection performance of surrogate functions, lossy compressors, and approximate comparison techniques. The project also explores how to select the surrogate, lossy compression, and approximate comparison functions to optimize objectives and constraints set by the users. The evaluation considers a set of five applications spanning different computational methods, producing large datasets with I/O bottlenecks, and covering a variety of science problem domains relevant to the NSF. In addition to serving the needs of scientists working in the fields listed above, this project will enhance the research experience of undergraduate students. A summer school focused on resilience is planned for summer 2016, and corruption detection/correction will be a major topic. The project is also organizing tutorials at major science conferences that cover online detection/correction of corruptions in numerical simulations.
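The detection/correction scheme described above can be illustrated with a minimal sketch. This is not the project's implementation: every name here (`lossy_compress`, `distance`, `detect_and_correct`, `step`, `tol`) is hypothetical, uniform quantization stands in for a real lossy compressor, and maximum pointwise difference stands in for the approximate comparison. Which member of the best-agreeing pair is returned is also an assumption, since the abstract does not specify it.

```python
from typing import Callable, List, Sequence, Tuple


def lossy_compress(values: Sequence[float], step: float) -> List[float]:
    """Toy stand-in for a lossy compressor: quantize to multiples of `step`."""
    return [round(v / step) * step for v in values]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Approximate comparison (assumed here): maximum pointwise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def detect_and_correct(
    app_out: Sequence[float],
    inputs: Sequence[float],
    surrogate_a: Callable[[Sequence[float]], List[float]],
    surrogate_b: Callable[[Sequence[float]], List[float]],
    step: float,
    tol: float,
) -> Tuple[List[float], bool]:
    """Return (accepted result, whether a correction was applied)."""
    primary = lossy_compress(app_out, step)
    first = lossy_compress(surrogate_a(inputs), step)

    # Detection: compare the compressed application output with the
    # compressed output of the first, cheaper surrogate.
    if distance(primary, first) <= tol:
        return primary, False

    # Correction: run the second surrogate only after a mismatch, then keep
    # a value from the pair of results that agree most closely.
    second = lossy_compress(surrogate_b(inputs), step)
    pairs = [
        (distance(primary, first), primary),
        (distance(primary, second), primary),
        # Assumption: when the two surrogates agree best, return the first one.
        (distance(first, second), first),
    ]
    _, chosen = min(pairs, key=lambda pair: pair[0])
    return chosen, True


# Hypothetical example: the "application" squares its inputs, and the two
# surrogates approximate that result with small, different biases.
xs = [0.0, 0.5, 1.0, 1.5]
surrogate_a = lambda v: [x * x + 0.001 for x in v]
surrogate_b = lambda v: [x * x - 0.001 for x in v]

clean = [x * x for x in xs]
corrupted = [0.0, 0.25, 7.0, 2.25]  # third value silently corrupted

clean_result, clean_fixed = detect_and_correct(clean, xs, surrogate_a, surrogate_b, 0.01, 0.02)
fixed_result, was_fixed = detect_and_correct(corrupted, xs, surrogate_a, surrogate_b, 0.01, 0.02)
```

In this sketch the second surrogate costs nothing on clean runs because it executes only after a mismatch, which is the sense in which the scheme is a variation of triple modular redundancy rather than three full executions. The tolerance `tol` would in practice have to account for both the surrogate's approximation error and the compressor's error bound.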