CI: NEW: Collaborative Research: Computer System Failure Data Repository to Enable Data-Driven Dependability

$370,905 | FY2015 | CSE | NSF

University of Illinois at Urbana-Champaign, Urbana, IL

Investigators

Ravishankar K Iyer (contact), Zbigniew Kalbarczyk

Abstract

Dependability has become a necessary property of many of the computer systems that surround us or work behind the scenes to support our personal and professional lives. Heroic progress has been made by computer systems researchers and practitioners working together to build and deploy dependable systems. However, the overwhelming majority of this work is not based on real, publicly available failure data. No open failure data repository exists for any recent computing infrastructure that is large enough, diverse enough, and accompanied by enough information about the infrastructure and the applications that run on it. This project will address this pressing need. The research team appreciates that this effort is challenging on many levels. Failure data are considered sensitive and are usually shared with only a small, trusted subset of the people at an organization. As part of a current one-year planning grant, the team has collected specific requirements for the repository from a wide audience, collected failure and usage data from the largest centrally managed computing cluster at Purdue, and performed a preliminary analysis to reveal workload usage patterns. The goal of this full-scale project is to collect data from a variety of computational infrastructures at the two participating universities and from several of the NSF-funded large cyberinfrastructure projects. The project will collect, curate, and present public failure data of large-scale computing systems in a repository called FRESCO. The data sets will include static information, dynamic information about the workloads, and failure information for both planned and unplanned outages. Data collection from production machines will have to obey several practical constraints: no changes to the workload, little performance perturbation, and minimal changes to the operating system.
Further, the data have to be sanitized to remove sensitive information and processed to make them interpretable by a broad group of researchers. This project will also provide analysis tools to answer certain commonly occurring questions, such as the correlation between workload and failure and the performance implications of using one library over another, as well as an intuitive graphical front end that will allow people to explore the data sets and download the relevant ones. Widespread use of the data and the associated analysis tools will give computer systems researchers an unprecedented ability to do data-driven research and offer computing infrastructure providers an analytics-driven capability to run more efficient and reliable infrastructures.
