III: Small: Low-Cost Deduplication and Search for Versioned Datasets

$515,998 | FY2015 | CSE | NSF

University of California-Santa Barbara, Santa Barbara, CA

Investigators

Abstract

Organizations and companies often archive high volumes of versioned digital datasets. There are research challenges and opportunities in developing the integrated archival and search support needed for data preservation, electronic discovery, and regulatory compliance. Since versioned datasets contain highly repetitive content, deduplication can reduce the storage demand by an order of magnitude or more; however, such an optimization is resource-intensive. After deduplication, the structure of an inverted index for versioned data becomes complex, and searching for relevant results becomes expensive. This project will study low-cost solutions for compact archiving and indexing and develop efficient algorithms and systems techniques for searching versioned datasets. It will also consider that the archived data can be stored in an untrusted server environment and investigate tradeoffs between efficiency and privacy preservation for search. The developed solutions will bring significant computing and storage cost advantages to applications that manage and search large-scale versioned data. The developed software will be made public for research communities. The research effort will be integrated with an educational plan containing research mentoring, instruction improvement, and outreach activities.

This project will focus on key challenges and cost-sensitive technical aspects of integrated archival and search support for managing large versioned datasets. The main tasks include an efficient software architecture and optimization for detecting duplicated content on a cloud cluster architecture, fast multi-phase search with a hybrid index structure that exploits content similarity and query characteristics, and an efficient privacy-preserving framework with top result ranking.
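The storage saving that deduplication offers for versioned data can be illustrated with a minimal sketch. The abstract does not say how the project detects duplicated content, so everything here is an assumption made for illustration: fixed-size chunking, SHA-256 fingerprints, and the hypothetical `DedupStore` class with its `put`, `get`, and `stored_bytes` methods.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical, not the project's design).

    Each version is split into fixed-size chunks. A chunk is stored once,
    keyed by its SHA-256 fingerprint, and every version keeps only the
    ordered list of fingerprints needed to rebuild it.
    """

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed value, chosen arbitrarily
        self.chunks = {}              # fingerprint -> chunk bytes
        self.versions = {}            # version id -> list of fingerprints

    def put(self, version_id, data):
        """Store one version, reusing any chunk already present."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # setdefault keeps the existing copy if the chunk was seen before
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        self.versions[version_id] = recipe

    def get(self, version_id):
        """Rebuild a version from its list of chunk fingerprints."""
        return b"".join(self.chunks[fp] for fp in self.versions[version_id])

    def stored_bytes(self):
        """Total bytes of unique chunk data held by the store."""
        return sum(len(chunk) for chunk in self.chunks.values())


if __name__ == "__main__":
    store = DedupStore(chunk_size=4)
    store.put("v1", b"aaaabbbbcccc")
    store.put("v2", b"aaaabbbbdddd")  # shares its first two chunks with v1
    print(store.stored_bytes(), "bytes stored for 24 bytes of input")
```

Fixed-size chunking is the simplest choice but loses its benefit when an insertion shifts later content; content-defined chunking is a common alternative and is not shown here.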
