Performance Evaluation of On-Demand Provisioning of Data Intensive Applications

$450,000 | FY2009 | CSE | NSF

University of California-San Diego, La Jolla, CA

Abstract

This project is studying the effectiveness of dynamic strategies for provisioning data intensive applications by conducting a thorough performance evaluation of alternative provisioning strategies using the NSF CluE facility and the Apache Hadoop programming environment. Current systems adopt a "one size fits all" and relatively static solution approach even as they serve applications with a wide range of access and processing needs. The reference data sets in the study are high-resolution topographic data sets from airborne LiDAR surveys. The performance of alternative solutions using parallel database technology versus the Hadoop environment is being evaluated. Hybrid strategies, which blend the parallel database approach with the Hadoop-based approach (based, for example, on user directives and/or workload analysis), are also being tested and evaluated. The research will contribute to an understanding of the performance tradeoffs in dynamic provisioning strategies for data intensive applications. The potential impact is a reassessment of how data archives are implemented and data sets served to a broad user community based on on-demand and dynamic approaches to provisioning data sets, as opposed to the current static approaches. The results from this study will include a thorough performance evaluation and recommendations on the best use of large cluster computing environments for supporting data intensive applications, and an evaluation of a dynamic, blended approach to data management. Results will be disseminated via professional conferences and journals. The recommended approaches will be implemented in real data intensive environments, such as the OpenTopography.org portal.
