DC: Small: Collaborative Research: DARE: Declarative and Scalable Recovery
University of Chicago, Chicago, IL
Abstract
One dominant characteristic of today's large-scale computing systems is the prevalence of large storage clusters. Storage clusters of hundreds or thousands of commodity machines are increasingly being deployed. At companies such as Amazon, Google, and Yahoo, thousands of nodes are managed as a single system. While large clusters have brought many benefits, they also bring a new challenge: a growing number and frequency of failures that must be managed. Bits, sectors, disks, machines, racks, and many other components fail. With millions of servers and hundreds of data centers, there are millions of opportunities for these components to fail. Failing to handle these failures directly affects the reliability and availability of data and jobs. Unfortunately, data-loss incidents are still being reported. For example, in March 2009, Facebook lost millions of photos due to simultaneous disk failures that "should" rarely happen at the same time (but did); in July 2009, a large bank was fined a record total of 3 million pounds after losing data on thousands of its customers; and in October 2009, T-Mobile Sidekick, which uses Microsoft's cloud service, also lost its customer data. These incidents show that existing large-scale storage systems remain fragile in the face of failures. To address the challenges of large-scale recovery, the goals of this project are to: (1) identify the fundamental problems of recovery in today's scalable world of computing, (2) improve the reliability, performance, and scalability of existing large-scale recovery, and (3) explore formally grounded languages that enable rigorous specification of recovery properties and behaviors. Our vision is to build systems that "DARE to fail": systems that deliberately fail themselves, exercise recovery routinely, and enable easy and correct deployment of new recovery policies. For more information, please visit http://boom.cs.berkeley.edu/dare/