CAREER: Rethinking HPC Resilience in the Exascale Era
Virginia Polytechnic Institute and State University, Blacksburg, VA
Abstract
Resilience is one of the key exascale research challenges in high-performance computing (HPC). Because of much higher error rates, exascale supercomputers could make little forward progress in computations, or might generate incorrect results due to failures, rendering exascale performance useless. The challenge is how to achieve complete HPC resilience at exascale without increasing performance overhead, power consumption, or the complexity of the underlying hardware. To this end, this research project designs and develops low-cost hardware/software cooperative techniques for HPC resilience in the exascale era. The project involves four research goals: (1) low-cost soft error resilience for CPUs, in which intelligent compiler-architecture interaction validates the absence of errors and performs fine-grained recovery, thus eliminating silent data corruption (SDC); (2) compiler-directed soft error resilience for commodity GPUs, which can remove the power-hungry error-correcting code (ECC) logic from GPU register files without compromising their resilience; (3) lightweight nonvolatile memory (NVM) persistence, which can mitigate the overhead of traditional heavyweight HPC checkpointing and support whole-system persistence for applications without irrevocable operations; and (4) low-cost timing error resilience for aggressive voltage scaling, to maximize energy efficiency while guaranteeing program correctness. The resulting artifacts and technologies are expected to contribute to the nation's competitiveness by addressing the challenge of building reliable HPC systems. The research outcome impacts a broad range of disciplines that need correct computation results and therefore require reliable computing systems, from embedded systems to the HPC cloud. Consequently, use of the proposed techniques will make the execution of current and emerging applications much more reliable, and therefore directly affect our way of life.
There will be four types of data generated from this research project: (1) algorithms and models, (2) software prototypes, (3) testing infrastructure, including simulators, evaluation benchmarks, and their traces, and (4) educational materials. All of our software tools will be open source and made available to the public, laboratories, and industry.