
CAREER: Rethinking Replication in Highly Available and Reliable Data Stores

$396,699 | FY2023 | CSE | NSF

SUNY at Stony Brook, Stony Brook, NY

Investigators

Abstract

This project targets long-lasting and increasingly pervasive challenges in building highly reliable and available distributed storage systems, which are key components of large-scale Internet services. The proposal aims to study and address insufficiencies in current distributed storage system designs in several important respects, including their abstraction, performance, communication, and failure models. These insufficiencies typically arise from conventional distributed system designs that were not prepared for today’s Internet scale. This research may have a major impact on industry and society because distributed systems are the cornerstones of modern computing infrastructures such as cloud computing, serverless computing, and high-performance computing. In particular, this work will be done in collaboration with widely used distributed storage systems built at Microsoft, Google, MongoDB, and Cockroach. The PI is working with the department to broaden the course offerings with multidisciplinary courses in the general area of cloud computing, distributed systems, reliable systems, and software engineering. The PI will incorporate the topics in this proposal into the courses he is teaching.
Traditionally, systems often use a method called state machine replication (SMR) to achieve reliability. SMR is often treated as an independent module, with a clear boundary, such as a key-value interface, isolating it from the rest of the system. However, this creates problems in today’s systems. For example, using SMR as a black box in geo-replicated systems can incur a large performance penalty, because each application request may translate into multiple SMR operations and each operation takes at least one wide-area round trip to complete.
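The cost described above can be illustrated with a minimal sketch. It is not taken from the project: the `Replica` class, the `replicate` helper, the four-operation transfer, and the 100 ms wide-area round-trip time are all hypothetical, and the latency bound assumes the operations are issued one after another.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A deterministic key-value state machine (hypothetical, for illustration)."""
    store: dict = field(default_factory=dict)

    def apply(self, op):
        # Each operation is a (kind, key, value) tuple; "get" ignores value.
        kind, key, value = op
        if kind == "put":
            self.store[key] = value
        return self.store.get(key)


def replicate(log, replicas):
    """Apply the same ordered log to every replica, as SMR requires."""
    for op in log:
        for replica in replicas:
            replica.apply(op)


def blackbox_latency_ms(num_smr_ops, wan_rtt_ms):
    """Lower bound on latency, assuming sequential operations that each
    cost at least one wide-area round trip."""
    return num_smr_ops * wan_rtt_ms


# One application request (a transfer) expressed as four key-value operations.
transfer = [
    ("get", "alice", None),
    ("get", "bob", None),
    ("put", "alice", 50),
    ("put", "bob", 150),
]

replicas = [Replica(), Replica(), Replica()]
replicate(transfer, replicas)

# Identical logs applied in the same order leave all replicas in the same state.
consistent = all(r.store == replicas[0].store for r in replicas)

# Assumed 100 ms wide-area round trip, chosen only for illustration.
latency = blackbox_latency_ms(len(transfer), wan_rtt_ms=100)
print(consistent, latency)
```

Under these assumptions, one logical request pays for four wide-area round trips, which is the kind of penalty a richer abstraction than a plain key-value interface could reduce.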
This proposal summarizes four impediments in existing systems, each concerning a different aspect in practice (abstraction, performance, communication model, and failure model), and proposes four research thrusts targeting these impediments: 1) use transactional data structures to enrich the abstraction layer between applications and the data store, 2) investigate leveraging multicore resources to improve performance for geo-replication, 3) study advanced SMR techniques under a more generalized communication model than the status quo, and 4) study how to tolerate a new failure model, silent data corruption. The proposed work will develop a set of novel, transformative technologies, including a new replicated transactional data structure library that supports building general types of applications, and other new system designs and implementations for providing fast and consistent replication. These technologies will be encapsulated in state-of-the-art research products and will dramatically improve the performance and fault tolerance of modern distributed systems.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
