Collaborative Research: CSR-PDOS: Designing Large-Scale Distributed Systems for Realistic Failure Models

$235,233 | FY2005 | CSE | NSF

University of Hawaii, Honolulu

Investigators

Abstract

Large-scale distributed systems are now being regularly built and deployed. Being large-scale, they are composed of many components - computers, communications infrastructure, storage devices, and so on - all of which are prone to failure (that is, exceptional nondeterministic behavior). Designing large-scale systems to cope efficiently with such failures is a difficult and ongoing problem; doing so effectively requires a much deeper understanding of how these systems actually fail. To gain such understanding, the PIs are collecting information on the failures of three actual large-scale systems: a data grid, a desktop grid, and a peer-to-peer cooperative backup system. This data is being collected either through cooperation with other funded projects (e.g., the BIRN project) or by the PIs deploying the systems themselves. The collected failure data is being used to develop more abstract failure models that can serve as the basis of algorithm and system development. The PIs are using these failure models to understand how the systems being studied can be improved (for example, by achieving higher availability, lower overhead, or better performance than the original systems, which are based on much less precise failure models). The PIs are making all the failure information that they collect available via the web after anonymizing it. Failure models and protocols are being made available through papers, which are also distributed via the project web site when ready. This information is of interest to those designing, constructing, and deploying new large-scale systems.
