General Purpose Methods for Unsupervised Exploration of Large Datasets
University of Minnesota-Twin Cities, Minneapolis, MN
Investigators
Abstract
Electronic data is being generated continually and in enormous volume, so unsupervised methods are essential for organizing and exploring it; the sheer quantity of data precludes the use of supervised methods. The goal of this project is to develop unsupervised methods capable of organizing and annotating large datasets of unknown structure, facilitating further exploration and analysis of the data. The recent advent of general purpose, scalable, unsupervised clustering methods such as Principal Direction Divisive Partitioning (PDDP) opens up new uses and applications for unsupervised methods. PDDP yields two additional by-products at no extra cost: a hierarchical structure and an identification of the most distinctive attributes. These by-products are exactly what is needed to impose a structure on a dataset and to annotate the computed structure at various levels of detail. This motivates the three aims of the project: to develop scalable general purpose clustering methods, to extract the information needed to annotate the datasets so that users can navigate the data effectively, and to perform the associated statistical and theoretical analyses. The methods will be validated on a wide variety of domains, including the WWW, specialized legal and/or medical databases, astronomical catalogs, and genomics.
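As a rough illustration of the kind of method the abstract refers to, the following is a minimal sketch of principal direction divisive partitioning, not the investigators' implementation. It assumes dense data rows, approximates the leading principal direction by power iteration (a production version would more likely use a sparse truncated SVD), always splits the cluster with the largest scatter, and ranks attributes by the magnitude of their component in the splitting direction. All function names (`pddp`, `top_attributes`, and the helpers) and the toy data are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the centroid from every row."""
    mu = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(r, mu)] for r in rows]


def scatter(rows):
    """Total squared distance of the rows from their centroid."""
    return sum(dot(r, r) for r in center(rows))


def principal_direction(centered, iters=200, seed=0):
    """Approximate the leading principal direction by power iteration."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in centered[0]]
    for _ in range(iters):
        proj = [dot(r, v) for r in centered]
        # w = C^T (C v), where C is the centered data matrix
        w = [sum(p * r[j] for p, r in zip(proj, centered)) for j in range(len(v))]
        norm = dot(w, w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pddp(rows, n_clusters):
    """Divisively split rows; return leaf clusters and the recorded splits."""
    clusters = [list(range(len(rows)))]
    splits = []  # one record per split; together they form the hierarchy
    while len(clusters) < n_clusters:
        # Assumption: split the cluster with the largest scatter next.
        k = max(range(len(clusters)),
                key=lambda c: scatter([rows[i] for i in clusters[c]]))
        members = clusters[k]
        centered = center([rows[i] for i in members])
        v = principal_direction(centered)
        # Partition by the sign of the projection onto the principal direction.
        left = [i for i, r in zip(members, centered) if dot(r, v) <= 0.0]
        right = [i for i, r in zip(members, centered) if dot(r, v) > 0.0]
        if not left or not right:
            break  # nothing left to separate
        splits.append({"parent": members, "left": left,
                       "right": right, "direction": v})
        clusters[k:k + 1] = [left, right]
    return clusters, splits


def top_attributes(direction, names, k=3):
    """Attributes with the largest weight in a splitting direction."""
    order = sorted(range(len(names)), key=lambda j: abs(direction[j]),
                   reverse=True)
    return [names[j] for j in order[:k]]


# Toy data: two well-separated groups of points.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
clusters, splits = pddp(data, 2)
print(sorted(sorted(c) for c in clusters))
```

Each entry in `splits` records a parent cluster, its two children, and the direction used to separate them, so the list doubles as a binary hierarchy, and `top_attributes` applied to a stored direction gives one plausible way to label a node.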