A Scalable On-Line Associative Deep Store

$205,001 | FY2003 | CSE | NSF

University Of California-Santa Cruz, Santa Cruz CA

Investigators

Abstract

The deep store architecture addresses the problem of storing data on inexpensive magnetic hard disks at high compression rates: it groups similar data by content and compresses files using differential compression. Deep store file retrieval performance obviates the need for traditional archival storage media such as magnetic tape. The deep store computes data fingerprints from files, and then summaries from those fingerprints. Files are organized into data clusters using a content-based file similarity metric; file content, which is immutable, is addressed by content and stored using hashing. The research comprises developing a scalable system architecture, developing new algorithms, integrating existing technologies, and identifying and addressing new problems: searching for similar files in a very large corpus to improve compression, maximizing storage throughput, distributing a large system for throughput and reliability, and managing file similarity data for billions of files. Improved archival data storage density and retrieval performance (latency and data throughput) will impact all computing environments, including scientific experimentation and simulation and commercial and government document management, and will expand knowledge of content-based information retrieval.
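The pipeline outlined in the abstract (fingerprints, summaries, content addressing, similarity search, differential compression) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual design: the class `DeepStoreSketch` and all function names are hypothetical, SHA-256 and BLAKE2b are assumed hash choices, the summary is assumed to be the k smallest shingle fingerprints compared by Jaccard overlap, and zlib's preset dictionary stands in for a real differential compressor.

```python
import hashlib
import zlib


def content_address(data):
    """Content address of immutable file content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def fingerprints(data, window=8):
    """Hash each overlapping `window`-byte shingle to a 64-bit fingerprint."""
    last_start = max(len(data) - window, 0)
    return {
        int.from_bytes(
            hashlib.blake2b(data[i:i + window], digest_size=8).digest(), "big"
        )
        for i in range(last_start + 1)
    }


def summary(data, k=16, window=8):
    """Compact summary: the k smallest fingerprints (assumed min-wise sketch)."""
    return frozenset(sorted(fingerprints(data, window))[:k])


def similarity(a, b):
    """Jaccard overlap of two summaries, a rough content-similarity estimate."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def delta_encode(reference, target):
    """Compress `target` against `reference` (zlib preset dictionary as a stand-in)."""
    comp = zlib.compressobj(zdict=reference)
    return comp.compress(target) + comp.flush()


def delta_decode(reference, delta):
    """Invert delta_encode given the same reference bytes."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(delta) + decomp.flush()


class DeepStoreSketch:
    """Hypothetical in-memory store: content-addressed blobs plus summaries."""

    def __init__(self):
        self.blobs = {}      # content address -> file bytes
        self.summaries = {}  # content address -> summary

    def put(self, data):
        addr = content_address(data)
        if addr not in self.blobs:  # identical content is stored only once
            self.blobs[addr] = data
            self.summaries[addr] = summary(data)
        return addr

    def get(self, addr):
        return self.blobs[addr]

    def most_similar(self, data):
        """Linear scan for the stored file whose summary best matches `data`."""
        query = summary(data)
        best = None
        for addr, other in self.summaries.items():
            score = similarity(query, other)
            if best is None or score > best[1]:
                best = (addr, score)
        return best
```

In this sketch a new file would be matched to its nearest stored neighbor with `most_similar` and then encoded against that neighbor with `delta_encode`. The linear scan is only for illustration; searching billions of summaries, which the abstract names as an open problem, would need an index rather than a scan.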
