Reproducing Kernel Hilbert Space Methods in Statistical Model Building and Data Analysis

$54,000 · FY2005 · MPS · NSF

University Of Wisconsin-Madison, Madison WI

Investigators

Abstract

Wahba, Grace G.
Proposal ID: DMS-0505636
Original Title: Reproducing Kernel Hilbert Space Methods in Statistical Model Building and Data Analysis
New Title: Positive Definite Kernel Methods in Statistical Model Building and Data Analysis

Positive definite functions (a.k.a. "kernels") play a key role in statistical model building, classification, clustering and data mining. Such kernels provide a distance metric for elements in their domain, which may be functions (as in Reproducing Kernel Hilbert Spaces) or, more recently, trees, graphs, images, sounds, DNA and protein sequences, microarray gene expression data, text messages and other objects. A reasonable distance metric is a prerequisite for prediction, classification, and clustering, and methods for obtaining such metrics are an area of active research. The proposer will introduce, develop and study the properties of a new class of nonparametric methods for obtaining kernels in situations where noisy, crude, incomplete information related to dissimilarity between pairs of objects in arbitrary sets is available. The methods are called regularized kernel estimation, since they involve a tradeoff between fitting the crude information available and a penalty or complexity functional on the kernel, analogous to, but not the same as, classical regularization and the bias-variance tradeoff. Optimal tuning and dimensionality reduction procedures will be proposed and their properties studied. The proposed methods are believed to have new and important computational and theoretical advantages, which will be demonstrated through the development of efficient computational algorithms, simulation, development of the theory, and application to a variety of scientific problems. This work is motivated by the goal of obtaining better methods for clustering and classifying the objects mentioned above, by obtaining improved ways to describe the "distance" between objects.
For example, microarray gene chips may contain information about the type of tumor whose DNA is being studied. It is anticipated that this research will provide improved methods for extracting this information in cases where the type of tumor is difficult to identify, ultimately resulting in better diagnostic and treatment outcomes. Similarly, it is of interest to cluster protein sequences into functional classes, with the goal of identifying function by associating sequences that are "nearby". It is anticipated that the present research will provide a more efficient way of extracting information from crude or incomplete dissimilarity data and contribute to the long-term technology of understanding protein function. Other potential applications include improved classification of weather states (with the goal of clustering and classifying local situations that have similar outcomes), signal detection in large neutrino detectors, classification of astronomical bodies, and classification and clustering problems in a variety of other scientific fields.
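
The abstract's statement that a positive definite kernel provides a distance metric can be illustrated with the standard identity d(x, y)^2 = K(x, x) + K(y, y) - 2 K(x, y). The sketch below is only an illustration of that identity, not the regularized kernel estimation method the proposal describes; the Gaussian kernel, the helper names, and the sample points are all assumptions chosen for the example.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; an assumed example of a positive definite kernel."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def gram_matrix(points, kernel):
    """Matrix of kernel values K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]

def kernel_distances(K):
    """Distances induced by a kernel matrix: d_ij^2 = K_ii + K_jj - 2 K_ij."""
    n = len(K)
    return [
        # max(..., 0.0) guards against tiny negative values from rounding
        [math.sqrt(max(K[i][i] + K[j][j] - 2.0 * K[i][j], 0.0)) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical sample objects: three points in the plane
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
K = gram_matrix(points, gaussian_kernel)
D = kernel_distances(K)
```

Here `D[i][j]` is small for objects the kernel treats as similar and grows as they become dissimilar, which is the property that clustering and classification procedures rely on.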
