CompBio: Collaborative Research: Development of Effective Gene Selection Algorithms for Microarray Data Analysis
University of Texas at Dallas, Richardson, TX
Investigators
Abstract
With the success of the Human Genome Project, a single microarray can now potentially cover the genes of an entire genome. A typical microarray data set therefore involves a massive number of genes. A dramatic dimension reduction to a much smaller number of significant genes, responsible for specific conditions, makes further biological study more feasible and can advance knowledge of the roles of specific genes. Any methodology that improves our ability to recognize significant genes among a large number of genes, often with only a limited set of experimental results available, could have a significant impact on our understanding of diseased and normal states, and eventually on diagnosis, prognosis, and drug design. The method that we propose to investigate is intended to provide critical information on the roles of genes. Its key component is subspace-based methods, which have demonstrated great success in numerous pattern recognition tasks, including efficient classification, clustering, and fast search. The development of effective computer-based algorithms for gene selection is indispensable, since the enormous complexity of the problem makes it virtually impossible to rely solely on biological testing. What is novel in our proposed research is that we seek a mathematically rigorous framework that models gene selection problems, with careful consideration of the biological characteristics of the problem. Utilizing our knowledge and previous results on feature extraction, and by discovering their mathematical relationship to feature selection, we will design efficient and effective nonparametric methods for gene selection. Nonnegative matrix factorization will play an important role in building a mathematically rigorous bridge between feature extraction and feature selection.
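To make the link between feature extraction and feature selection concrete, the following is a minimal sketch only, not the algorithm proposed under this award. It assumes a nonnegative genes-by-samples expression matrix, factors it with the standard multiplicative update rules for nonnegative matrix factorization, and then ranks genes by their largest factor loading. The function names `nmf` and `select_genes`, the scoring rule, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def nmf(A, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix A (genes x samples) as W @ H.

    Uses the standard multiplicative updates for the Frobenius-norm
    objective. This is an illustrative choice, not the award's method.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)
    # Move the scale of each factor into W so loadings are comparable
    # across factors: every row of H ends up with unit Euclidean norm.
    norms = np.linalg.norm(H, axis=1) + eps
    W *= norms
    H /= norms[:, None]
    return W, H


def select_genes(W, n_select):
    """Rank genes by their largest loading (hypothetical scoring rule)."""
    score = W.max(axis=1)
    return np.argsort(score)[::-1][:n_select]


# Synthetic data (assumption): 50 genes, 20 samples in two conditions.
# Genes 0-4 are active in samples 0-9, genes 5-9 in samples 10-19;
# the remaining genes carry only low-level noise.
rng = np.random.default_rng(1)
A = rng.random((50, 20)) * 0.1
A[:5, :10] += 5.0
A[5:10, 10:] += 5.0

W, H = nmf(A, k=2)
selected = select_genes(W, n_select=10)
print(sorted(selected.tolist()))
```

Here the extracted factors (rows of `H`) play the role of feature extraction, while the loadings in `W` are reused to pick individual genes. A real analysis would need a biologically motivated scoring rule and a principled choice of `k`.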
In the process, we will also explore novel methods for estimating missing values as a preprocessing stage of gene selection, based on the alternating least squares and the structured total least norm formulations. All results obtained, the new algorithms and software developed, and the new data sets generated and compiled will be made available to the research community, to teaching faculty, and to both graduate and undergraduate students, using existing Web servers at the Georgia Institute of Technology and the University of Texas at Dallas.
Intellectual Merit: This research will produce methods that will have a great impact on computational microarray analysis. The gene selection and missing value estimation methods developed in this research allow a significant reduction in the complexity of biological testing through the initial reduction of the problem dimension, thus substantially improving the detailed study of significant genes. The feature selection and feature extraction algorithms developed in this research will be applicable to many other problems where data sets in high-dimensional spaces need to be handled efficiently and effectively, such as text processing, facial recognition, fingerprint classification, and iris recognition. The missing value estimation methods designed in this research can also be used to recover missing data in other settings, such as collaborative filtering.
Broader Impact: The research will advance the theory of computational biology and bioinformatics. The techniques developed will also have potential applications in database management, medical examination and diagnosis, biochemical selection, and biological networks. Graduate student involvement in this research will have lasting benefits: the discovery and research experience will prepare students for productive careers in academia, research labs, and industry in important current areas of bioinformatics research.