
III: Small: RIOT: Statistical Computing with Efficient, Transparent I/O

$516,000 · FY2009 · CSE · NSF

Duke University, Durham NC

Abstract

Recent technological advances enable the collection of massive amounts of data in science, commerce, and society. These datasets bring us closer than ever before to solving important problems such as decoding human genomes and coping with climate change. Meanwhile, the exponential growth in data volume creates an urgent challenge. Many existing analysis tools assume that datasets fit in memory; when applied to massive datasets, they become unacceptably slow because of excessive disk input/output (I/O) operations. Across application domains, much advanced data analysis is done by statisticians through custom programming. Progress has been hindered by the lack of easy-to-use statistical computing environments that support I/O-efficient processing of large datasets. There have been many approaches to I/O-efficiency, but none has gained traction with statisticians because of issues ranging from efficiency to usability. Disk-based storage engines and I/O-efficient function libraries are only a partial solution, because many sources of I/O-inefficiency in programs remain at a higher, inter-operation level. Database systems seem to be a natural solution, with efficient I/O and a declarative language (SQL) enabling high-level optimizations. However, much work on integrating databases and statistical computing remains database-centric, forcing statisticians to learn unfamiliar languages and deal with the impedance mismatch between those languages and their host languages.
To make a practical impact on statistical computing, this project postulates that a better approach is to achieve I/O-efficiency in a way that is transparent to users. Transparency means no SQL and no other new language to learn. Transparency also means that existing code should run without modification and automatically gain I/O-efficiency. The project, nicknamed RIOT, aims to extend R, a widely popular open-source statistical computing environment, to transparently provide efficient I/O.
Achieving transparency is challenging; RIOT does so with an end-to-end solution addressing issues on all fronts: I/O-efficient algorithms, pipelined execution, deferred evaluation, I/O-cost-driven expression optimization, smart storage and materialization, and seamless integration with an interpreted host language. RIOT integrates research and education, and continues the tradition of involving undergraduates through REU and independent studies. As a database researcher, the PI is committed to learning from and drawing on work in programming languages and high-performance computing. Findings from RIOT help create synergy and seed further collaboration with these communities. To ensure practical impact on statistical computing, RIOT has enlisted collaboration from statisticians and the R core development team on developing, evaluating, and disseminating RIOT. Further information can be found at: http://www.cs.duke.edu/dbgroup/Main/RIOT
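
As a rough illustration of two of the techniques named in the abstract, deferred evaluation and pipelined execution, the following minimal sketch may help. It is not RIOT's implementation and is written in Python rather than R purely for illustration; the `Lazy` class, the `CHUNK` size, and the in-memory lists standing in for disk-resident vectors are all assumptions made for this example. Arithmetic on `Lazy` objects only records what to compute; the work happens chunk by chunk when a result is requested, so no full-size intermediate vector is ever materialized.

```python
from typing import Callable, Iterator, List

CHUNK = 4  # hypothetical chunk size; a real system would match it to disk blocks


class Lazy:
    """A deferred vector expression (illustrative only, not RIOT's design)."""

    def __init__(self, length: int, chunk_fn: Callable[[], Iterator[List[float]]]):
        self.length = length
        self._chunk_fn = chunk_fn  # called only when the result is needed

    @staticmethod
    def source(data: List[float]) -> "Lazy":
        # Stand-in for a disk-resident vector that is read one chunk at a time.
        def gen() -> Iterator[List[float]]:
            for i in range(0, len(data), CHUNK):
                yield data[i:i + CHUNK]
        return Lazy(len(data), gen)

    def chunks(self) -> Iterator[List[float]]:
        return self._chunk_fn()

    def _binary(self, other, op) -> "Lazy":
        if isinstance(other, Lazy):
            if other.length != self.length:
                raise ValueError("vector lengths differ")

            def gen() -> Iterator[List[float]]:
                # Pipelining: pull one chunk from each input, combine, pass it on.
                for a, b in zip(self.chunks(), other.chunks()):
                    yield [op(x, y) for x, y in zip(a, b)]
        else:
            def gen() -> Iterator[List[float]]:
                # Scalar operand: apply it to every element of each chunk.
                for a in self.chunks():
                    yield [op(x, other) for x in a]
        return Lazy(self.length, gen)

    def __add__(self, other) -> "Lazy":
        return self._binary(other, lambda x, y: x + y)

    def __mul__(self, other) -> "Lazy":
        return self._binary(other, lambda x, y: x * y)

    def sum(self) -> float:
        # Forces evaluation; only one chunk per operator is live at a time.
        return sum(sum(c) for c in self.chunks())


x = Lazy.source([1.0, 2.0, 3.0, 4.0, 5.0])
y = Lazy.source([10.0, 20.0, 30.0, 40.0, 50.0])
z = (x + y) * 2.0   # deferred: builds an expression, performs no arithmetic yet
total = z.sum()     # evaluation streams through the chunks here
print(total)
```

The user-facing line `z = (x + y) * 2.0` looks like ordinary vector code, which is the sense of transparency the abstract describes; a real system would additionally optimize the recorded expression by I/O cost before running it.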
