EAGER: Efficient Algorithms for Dimensionality Reduction and Clustering Using Disk-Based Matrices

$100,000 | FY2009 | CSE | NSF

University of Houston, Houston, TX

Investigators

Abstract

Carlos Ordonez

1. Research and Education Proposal

Computing linear Gaussian models on large data sets is characterized by heavy matrix manipulation and iterative methods with slow convergence. Efficiency suffers further because large data sets are stored on and retrieved from disk, and because in a database system the models themselves are manipulated on disk as well. Despite the importance of this large family of Gaussian models, little research has attempted to adapt it using database systems techniques. This proposal studies how to improve algorithms for linear Gaussian models so that they can analyze large, high-dimensional data sets by manipulating matrices on secondary storage (i.e., disk) while using a small amount of primary storage (i.e., RAM). The models studied include maximum likelihood factor analysis for dimensionality reduction and mixtures of Gaussian distributions for clustering.

The educational component of this proposal involves two main activities. The first is to develop a plan to expose disadvantaged and minority high school students to data mining research and practice, in order to encourage them to study computer science. The second is to enhance current research and teaching of data mining at the University of Houston.

2. Intellectual Merit

This research project requires discovering common algorithmic principles for performing incremental matrix computations across a family of statistical models; understanding how to summarize large data sets while preserving the statistical properties required by multiple models; and proposing new database techniques, tailored for such models, that can perform efficient matrix manipulation on secondary storage. Incremental computations are difficult to attain because methods for linear Gaussian models require iterations over the entire data set. Summarization requires transforming complex matrix equations while accounting for high dimensionality, large data set size, and numerical stability, and while preserving model accuracy. Developing matrix optimizations that combine primary and secondary storage is quite different from optimizing a matrix algorithm that works only in primary storage. This research requires mathematical knowledge to generalize, optimize, and transform the computation of linear Gaussian models. It also requires database systems expertise in how to organize and index diverse matrices on secondary storage for efficient reading and writing.

3. Broader Impact

This proposal will have a broad impact on the analysis of large, complex, high-dimensional scientific data sets and on enhancing database systems with incremental model computation capabilities. We plan to apply and test the proposed algorithms and techniques on scientific data sets, including geographical, medical, and biological data sets, among others.
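
As an illustration of what summarizing a data set while preserving its statistical properties can look like, here is a minimal Python sketch. It assumes one common choice of summary for Gaussian models: the point count n, the linear sum L, and the sum of outer products Q, accumulated in a single pass over blocks of rows. The award text does not specify this summary. The names `summarize`, `mean_and_covariance`, `n`, `L`, and `Q` are hypothetical, and the in-memory lists stand in for blocks read from disk.

```python
from typing import Iterable, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def summarize(chunks: Iterable[List[Vector]], d: int) -> Tuple[int, Vector, Matrix]:
    """Single pass over the data: n = count, L = sum of x, Q = sum of x * x^T."""
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for chunk in chunks:  # each chunk stands in for one block read from disk
        for x in chunk:
            n += 1
            for i in range(d):
                L[i] += x[i]
                for j in range(d):
                    Q[i][j] += x[i] * x[j]
    return n, L, Q


def mean_and_covariance(n: int, L: Vector, Q: Matrix) -> Tuple[Vector, Matrix]:
    """Recover the mean and the maximum likelihood covariance from the summary alone."""
    d = len(L)
    mean = [L[i] / n for i in range(d)]
    cov = [[Q[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]
    return mean, cov


# Hypothetical toy data: three 2-dimensional points split across two blocks.
chunks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 9.0]]]
n, L, Q = summarize(chunks, d=2)
mean, cov = mean_and_covariance(n, L, Q)
```

Only d + d*d + 1 numbers stay in memory regardless of how many rows are scanned, and the mean and covariance are then available without revisiting the data. The subtraction `Q / n - mean * mean^T` can lose precision when the mean is large relative to the spread, which is one instance of the numerical stability concern the abstract raises.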
