EAGER: Autonomous Data Partitioning Using Data Mining for High End Computing

$150,000FY2009CSENSF

University Of Oklahoma Norman Campus, Norman OK

Investigators

Abstract

Query response time and system throughput are the most important metrics when it comes to database and file access performance. Because of data proliferation, efficient access methods and data storage techniques have become increasingly critical to maintain an acceptable query response time and system throughput. One of the common ways to reduce disk I/Os and therefore improve query response time is database clustering, which is a process that partitions the database/file vertically (attribute clustering) and/or horizontally (record clustering). To take advantage of parallelism to improve system throughput, clusters can be placed on different nodes in a cluster machine. This project develops a novel algorithm, AutoClust, for database/file clustering that dynamically and automatically generates attribute and record clusters based on closed item sets mined from the attributes and records sets found in the queries running against the database/files. The algorithm is capable of re-clustering the database/file in order to continue achieving good system performance despite changes in the data and/or query sets. The project then develops innovative ways to implement AutoClust using the cluster computing paradigm to reduce query response time and system throughput even further through parallelism and data redundancy. The algorithms are prototyped on a Dell Linux Cluster computer with 486 compute nodes available at the University of Oklahoma. For broader impacts, performance studies are conducted using not only the decision support system database benchmark (TPC-H) but also real data recorded in database and file formats collected from science and healthcare applications in collaboration with domain experts, including scientists at the Center for Analysis and Prediction of Storms (CAPS) at the University of Oklahoma. The project also makes important impacts on education as it provides training for graduate and undergraduate students working on this project in the areas of national critical needs: database and file management systems, and high-end computing and applications. The developed algorithm and prototype, real datasets and performance evaluation results are made available to the public at the Website: http://www.cs.ou.edu/~database/AutoClust.html.

View original record on NSF Award Search →