CRII: CSR: Towards Pinpointing the Root Causes of Failures in Flash-based Storage Systems
New Mexico State University, Las Cruces NM
Investigators
Abstract
Storage systems that hold financial transactions, scientific computation results, family photos, and more are undergoing dramatic changes. Driven by advances in flash memory technology, almost every layer of the storage stack is being aggressively optimized to make full use of flash. This trend increases system complexity and, with it, the chance of obscure failures. Failures in flash-based storage systems are challenging to diagnose even for data center professionals, and can lead to severe downtime or even data loss. Moreover, failure diagnosis will become more difficult in the foreseeable future as flash and other non-volatile memory technologies grow rapidly. Thus, a framework for diagnosing the whole system and pinpointing the root causes of failures is urgently needed.
The goal of this research project is to enable precise, end-to-end diagnosis of failures in flash-based storage systems. One key observation is that more and more kernel components are being aggressively optimized for flash, so a diagnosis framework should minimize its dependence on the kernel. On the other hand, although many components are changing, some fundamental abstractions, such as files and logical blocks, and interfaces, such as system calls and command sets, are relatively stable. A diagnosis framework can focus on these invariants to analyze the behavior of rapidly evolving systems.
Based on these observations, this research project investigates a novel failure diagnosis framework that separates failure domains based on the interfaces and automatically correlates the fundamental operations across layers. More specifically, the framework consists of two synergistic components: (1) an on-drive device agent that records and reproduces the host-device interactions without any dependence on the host, and (2) an end-to-end analyzer that provides in-depth diagnosis support along the whole data path from the application to the device.
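To make the idea of correlating fundamental operations across layers more concrete, the following is a minimal sketch and not the project's actual design. It assumes that operations observed at the host side and at the device side have both been reduced to logical block ranges, and it pairs them by range overlap within a time window. The names `BlockOp`, `correlate`, `unmatched`, and the `window` parameter are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockOp:
    """One operation on a logical block range (hypothetical record format)."""
    layer: str    # "host" or "device"
    kind: str     # "read" or "write"
    lba: int      # first logical block address
    nblocks: int  # number of blocks touched
    t: float      # time the operation was observed, in seconds


def overlaps(a: BlockOp, b: BlockOp) -> bool:
    """True if the two logical block ranges share at least one block."""
    return a.lba < b.lba + b.nblocks and b.lba < a.lba + a.nblocks


def correlate(host_ops, device_ops, window=1.0):
    """Map each host op to the device ops of the same kind that overlap
    its block range and were observed within `window` seconds after it."""
    return {
        h: [d for d in device_ops
            if d.kind == h.kind
            and overlaps(h, d)
            and 0 <= d.t - h.t <= window]
        for h in host_ops
    }


def unmatched(pairs):
    """Host ops with no corresponding device op: candidates for closer
    inspection, since the operation may have been lost between layers."""
    return [h for h, ds in pairs.items() if not ds]


# Example: two host writes, but only the first one reaches the device
# (split into two smaller commands).
host = [
    BlockOp("host", "write", lba=0, nblocks=8, t=0.00),
    BlockOp("host", "write", lba=100, nblocks=8, t=0.10),
]
device = [
    BlockOp("device", "write", lba=0, nblocks=4, t=0.01),
    BlockOp("device", "write", lba=4, nblocks=4, t=0.02),
]
suspects = unmatched(correlate(host, device))
```

In this toy trace, `suspects` contains only the second host write, which never appears at the device interface. A real analyzer would also have to account for caching, merging, and reordering in the intermediate layers, none of which this sketch models.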
This research project will advance the dependability of the storage systems that hold important data in modern society. The improvements to data storage will in turn benefit society in many other ways, such as reducing maintenance costs and human effort and avoiding financial loss due to service downtime or data corruption. This research project will also contribute to the curriculum of operating systems and other related courses, and will engage undergraduate and graduate students in systems research. In addition, the project will be integrated into the Alliance for Minority Participation Program and the Young Women in Computing Program to engage students from traditionally underrepresented groups, serving our national, regional, and local interest in increasing diversity in computing research.