XPS: EXPL: CCA: Collaborative Research: Nixing Scale Bugs in HPC Applications
University of Utah, Salt Lake City, UT
Abstract
Large-scale simulation is a fundamental component of modern science and engineering. Unfortunately, programs written to perform simulations on large-scale parallel computers frequently suffer from software defects that result from the sheer scale and the variety of parallelization approaches employed. Especially egregious are software bugs that occur when large resource allocations (e.g., memory requests) are made. Formally based active-testing techniques are essential to locate such defects. However, these testing tools are themselves seldom run on parallel machines, let alone at large scale, making it difficult and very time-consuming to find scale bugs with high assurance. Efforts to parallelize verification tools should reuse existing technology for easy parallelization, result collection, and fault handling. Key innovations of this project include the insight that large-scale verification runs can be described through workflows, which makes it possible to take advantage of already available distributed computing platforms, in particular Swift/T from Argonne. The complementary backgrounds of the PIs are well matched with the need to push both formal aspects and distributed verification in the context of three widely used concurrency models, namely MPI, OpenMP, and CUDA.
This work will help create a public distributed formal active-testing framework. The tools and case-study software driving this research will be maintained by the PIs and released freely under open-source licenses through websites and repositories. They will facilitate large-scale debugging of scientific simulation codes by researchers and software developers in academia, government labs, and industry. The project will also generate pedagogical material and best practices, helping educate students in the use of existing workflow-based problem-solving approaches. It will help train present and future scientists, engineers, and programmers, thus assisting in maintaining our nation's leadership in computing, homeland and energy security, and STEM education.