SHF: Medium: Collaborative Research: Toward Extreme Scale Fault-Tolerance: Exploration Methods, Comparative Studies and Decision Processes

$342,000 | FY2017 | CSE | NSF

Emory University, Atlanta GA

Abstract

Current high-performance computing (HPC) research targets computer systems with exaflop (10^18, or a quintillion, floating-point operations per second) capabilities. Such computational power will enable new, important discoveries across all basic science domains. Application resilience to computer faults and failures is a major challenge to the realization of extreme scale computing systems. This project, Simulation and Modeling for Understanding Resilience and Faults at Scale (SMURFS), addresses this challenge by developing methods to improve our predictive understanding of the complex interactions among a given application, a given real or hypothetical hardware and software system environment, and a given fault-tolerance strategy at extreme scale. Specifically, SMURFS develops (1) new simulation and modeling capabilities for studying application resilience at scale; (2) capabilities to execute a comprehensive set of comparative fault-tolerance studies; and (3) effective prescriptions to guide application developers, hardware architects and system designers in realizing efficient, resilient extreme scale capabilities. SMURFS explores the impact of faults and failures, fault mitigation strategies and emerging technologies by providing new analytical and component models for predicting fault-tolerant application behavior at scale. The Iron simulation framework integrates these models for validation and comprehensive performance studies over a wide range of representative applications, application proxies, fault-tolerance protocols and hardware configurations. These studies inform a rule-based system for prescribing best fault-tolerance practices and configurations for new candidate applications and scenarios.
SMURFS delivers (1) new simulation and analytical models that predict application performance at scale; (2) a detailed understanding of how application features interplay with different fault-tolerance strategies and hardware technologies; (3) new knowledge about application behavior at scale; and (4) valuable insight and prescriptions for designing, developing and deploying future extreme scale HPC systems. More broadly, artifacts like the Iron framework and the public suite of application traces will be valuable to the HPC research, engineering, development, procurement and administrative communities. Researchers can use these artifacts in their own research, which can in turn impact the HPC exploration and design space. For example, this framework can be instrumental in the co-design of cohesive extreme scale applications, software environments and hardware platforms. Additionally, Iron-based research can inform and improve scientific computing practices, accelerating the rate of scientific discovery. Finally, Iron will be useful as an instructional tool for teaching HPC issues in classrooms, in tutorials and in other programs that engage diverse populations of middle school, high school and college students in New Mexico and Tennessee.