GGrantIndex
← Search

SHF: Medium: Collaborative Research: Next-Generation Message Passing for Parallel Programming: Resiliency, Time-to-Solution, Performance-Portability, Scalability, and QoS

$659,277FY2017CSENSF

University Of Tennessee Chattanooga, Chattanooga TN

Investigators

Abstract

Parallel programming based on MPI is being used with increased frequency in academia, government (defense and non-defense uses), as well as emerging uses in scalable machine learning and big data analytics. Emerging supercomputer systems will have more faults and MPI needs to be able to workaround such faults to be appropriate to these emerging situations, rather than causing an entire application to fail. Collaborative, transformative message passing research for High Performance Computing (HPC) critical to performance-portable parallel programming in new and forthcoming scalable systems (with a strategy of "best practice-first, standardization-later") is being reduced to practice. A substantial subset of the Message Passing Interface (MPI-3/4) application programmer interface is being made fault tolerant through extensions with weak collective transactions that synchronize between parallel tasks. This research studies the novel model that localizes faults, provides tunable fault-free overhead, allows for multiple kinds of faults, enables hierarchical recovery, and is data-parallel relevant. Fault modeling of underlying networks is being studied. Application developers control the granularity and fault-free overhead in this effort. Performance and scalability results of the middleware prototype are being demonstrated principally through compact applications that relate to real use cases of practical and academic interest. The impact of this work ranges from users of the largest supercomputers in government labs to practical clusters that have long-running, time-critical applications, and to space-based and other parallel processing in "hostile" environments where faults occur more frequently than in past years. The project is producing usable free software that will be widely shared in the community as well as guidance on how better parallel programs can be written in academia, industry, and government. The project also provides guidelines for how to update existing or legacy programs to use the new capabilities that are being reduced to practice.

View original record on NSF Award Search →