EAGER: A Transactional, Fault-Aware MPI Prototype to Enable MPI-3 Standardization
University of Alabama at Birmingham, Birmingham, AL
Investigators
Abstract
Parallelism is everywhere in computing systems, from multicore laptops to high-end supercomputers. To program these machines, the Message Passing Interface (MPI), a de facto standard for portable application programmer interfaces, has been developed by industry, academia, and government participants since 1993. However, MPI-based parallel programs have traditionally been unable to continue running when hardware faults occur, and such faults are becoming a bigger problem as parallel systems grow in size and speed. This work introduces a new approach to fault tolerance at scale for MPI-based programs. The research advances the current state of fault-tolerant MPI by exploring the possibility of applying transactional parallel programming techniques. The main idea is to explore how MPI-based parallel applications can benefit from transactions to achieve fault tolerance. The success of the proposed approach will change the design of middleware, application programmer interfaces, and the scalable subsystems that support such parallel programs (such as input-output systems).