III: Small: Rethinking the Data Organization and Lifecycle in LSM Storage Systems

$600,000FY2023CSENSF

University Of California-Riverside, Riverside CA

Investigators

Abstract

To support the efficient storing of large amounts of data, many modern database systems use the Log Structured Merge tree (LSM) technology. This technology allows grouping many data updates together, before applying them to the database. This project has identified several limitations of LSM storage, which cause reduced rates of reads and writes to the database system. Specifically, current LSM systems do not consider the hotness of a data record when deciding how to store it, and may also suffer from periodic stalls, where the system may become unresponsive while large maintenance operations, called merges, are performed. Further, LSM systems are inefficient at exploiting larger computer memories. The developed techniques create novel data organization and flow patterns in the LSM storage, which leverage modern hardware capabilities to boost the read and write capabilities of the storage system. Improving the performance of database systems will allow storing larger data at lower costs, thus making storage systems more accessible to scientists and general users. This project will also strengthen and extend the ongoing undergraduate research and high school outreach activities of the investigators. The project has several research aims. First, algorithms will be developed to store frequently accessed records in more accessible locations for faster retrieval. This will facilitate a bi-directional LSM tree architecture, where records flow both top-down and bottom-up. This will allow naturally maintaining hot records together, for faster querying and more effective caching. Second, new algorithms will be created to improve the speed of data merges. Periodic merges are used to maintain the stored data organized and consistent. This aim will study how to universally partition LSM runs to facilitate splitting a large merge into multiple disjoint sub-merges, thus reducing stall periods. The third aim will create algorithms to better utilize large memory sizes and multithreading parallelism, and develop a mixed memory-disk LSM tree. The key idea is that, instead of directly enlarging the MemTable, where recent writes are buffered, some components can be pinned in memory, with a more efficient organization, and enable parallel execution on queries. The project includes theoretical analysis, experimental study and software development. The developed algorithms are tested on real-world data and integrated in real database systems. This integration may increase the impact of the project. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →