SHF: Small: Collaborative Research: Uncovering Vulnerabilities in Parallel File Systems for Reliable High Performance Computing

$233,000 · FY2017 · CSE · NSF

Texas Tech University, Lubbock TX

Abstract

Many scientific problems (e.g., in computational biology, high-energy physics, and climate science) rely on high performance computing (HPC) systems to manage and process massive amounts of data. However, with the rapid increase in scale and complexity, even specially designed and well-maintained HPC platforms may fail. This research aims to design innovative methodologies that scrutinize parallel file systems, the major storage software that powers HPC platforms, and uncover issues in parallel file systems that can lead to data loss under various failure scenarios. Such an effort is a fundamental step towards building highly reliable HPC systems and meeting the demands of data-driven scientific discovery. In addition, this project integrates the research activities with education and outreach efforts to train a broadly inclusive and globally competitive science workforce.

More specifically, this project includes two synergistic research tasks, which enable automatic testing of parallel file systems as well as diagnosis of the issues found. The first task focuses on testing parallel file systems through a single-fault injection framework, which automatically interrupts the normal workloads of the target parallel file system and examines whether the interruption leads to any issues that cannot be fixed by the parallel file system's own checker. Building on the first task, the second task focuses on diagnosing the uncovered issues through a two-level provenance-based analysis. The first-level analysis builds the coarse-grained, inter-node provenance, which provides a high-level picture of the entire system's behavior. The second-level analysis creates the fine-grained, intra-node provenance, which contains causal paths within each individual node. In addition, multiple provenance traces are aligned and compared automatically to help locate the problematic code region with minimal human effort.
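The single-fault injection idea in the first task can be illustrated with a toy model. The sketch below is not the project's framework: `ToyStore`, `workload`, `run_with_fault`, `toy_checker`, and `find_unrecoverable_points` are hypothetical names, and the in-memory "store" and its checker are assumptions made purely for illustration. It interrupts a workload at every possible point, runs a checker on the resulting state, and reports the interruption points the checker cannot repair.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical stand-in for one storage target (not a real file system)."""
    metadata: dict = field(default_factory=dict)  # file name -> block id
    blocks: dict = field(default_factory=dict)    # block id -> payload


def workload():
    """Toy workload: each step is one persistent update, applied in order."""
    return [
        ("meta", "a.dat", "b1"),      # metadata written before its data
        ("data", "b1", "payload-a"),
        ("data", "b2", "payload-b"),  # data written before its metadata
        ("meta", "b.dat", "b2"),
    ]


def run_with_fault(steps, crash_after):
    """Apply only the first `crash_after` steps, simulating a single crash."""
    store = ToyStore()
    for kind, key, value in steps[:crash_after]:
        if kind == "meta":
            store.metadata[key] = value
        else:
            store.blocks[key] = value
    return store


def toy_checker(store):
    """Assumed checker: it can drop orphan blocks, but it cannot recreate
    a block that metadata points to. Returns True if the store is
    consistent after the repair attempt."""
    referenced = set(store.metadata.values())
    for block_id in list(store.blocks):
        if block_id not in referenced:
            del store.blocks[block_id]  # repairable: remove the orphan
    return all(b in store.blocks for b in store.metadata.values())


def find_unrecoverable_points(steps):
    """Try every single-fault injection point and collect those where
    the checker fails to restore consistency."""
    return [
        k for k in range(len(steps) + 1)
        if not toy_checker(run_with_fault(steps, k))
    ]


if __name__ == "__main__":
    print(find_unrecoverable_points(workload()))  # prints [1]
```

In this toy model, a crash after the first step leaves metadata pointing at a block that was never written, which the assumed checker cannot fix. A crash after the third step leaves only an orphan block, which it can remove. A real framework would inject faults into a running parallel file system and invoke that system's actual checker instead.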
