CIF: Small: Collaborative Research:Synchronization and Deduplication of Distributed Coded Data: Fundamental Limits and Algorithms
Rutgers University New Brunswick, New Brunswick NJ
Investigators
Abstract
Part 1: Coding for distributed storage systems has garnered significant attention in the past few years due to the rapid development of information technologies and the emergence of Big Data formats that need to be stored and disseminated across large-scale networks. As typical distributed systems need to ensure low-latency data access and store a large number of files over a set of nodes connected through a communication network, it is imperative to develop new distributed coding schemes that protect the systems from undesired component failures. The two key functionalities of codes used in distributed systems, namely the reconstruction of files via access to a subset of the nodes and repair of failed nodes, need to be retained when the files are accessed and processed by the users via symbol/block insertion, deletion, or substitution edits. Deletions frequently arise due to system-level data deduplication: when parts of files are deduplicated or edited, the changes in the information content need to be communicated to the redundant storage nodes with minimum communication cost. Current solutions for synchronizing data that underwent edits assume that data is uncoded and they do not fully exploit the distributed nature of information. Furthermore, they mostly ignore the presence of deduplication protocols. This makes distributed storage architectures inefficient in terms of storage, user access times, and error protection. Hence, the goals of the proposed research program are to develop a new set of protocols and coding schemes that will support a new generation of versatile and updatable coded distributed storage systems. Part 2: Building on the preliminary work of the investigators, this proposal aims to set the foundations of the new field of coded synchronization and deduplication, with the goal of deriving fundamental performance limits, developing efficient algorithmic solutions for the two families of problems, and constructing new distributed storage codes that enable synchronization of coded data and coded deduplication. In particular, the proposal addresses the following comprehensive issues: 1) Characterizing the communication rate limits of known and new (un)coded synchronization schemes, trade-offs between deduplication and data repair performance for different structured or encoded data formats and different types of communication channels. 2) Introducing and analyzing the communication rate-distortion (CRD) function for approximate synchronization and deduplication of structured/encoded data, with a special focus on delay-sensitive applications. 3) Developing dynamically updatable synchronization and deduplication algorithms cognizant of the network topology and of different prioritization needs of the users, as encountered in image and video data coding.
View original record on NSF Award Search →