ITR Collaborative Research: Combinatorial Algorithms for Biological Data Clustering

$1,295,000 | FY2003 | CSE | NSF

University Of Georgia Research Foundation Inc, Athens GA

Abstract

Project Summary

The Human Genome Project has opened the floodgates of biological data, generating enormous amounts of sequence, structure, expression, and interaction data at rates that far exceed our current capability to analyze and interpret them. New ideas and approaches are urgently needed to establish greatly improved capabilities for biological data analysis. Data clustering is fundamental to mining large quantities of biological data. The goals of this project are (a) to develop a highly effective and general framework for biological data clustering that is applicable to a large class of biological data analysis problems; (b) to demonstrate the effectiveness of this framework as a general-purpose clustering tool through application to four challenging biological data analysis problems; (c) to implement this clustering framework as a set of library functions, in a fashion similar to LINPACK/LAPACK, with which other researchers can build their own clustering capabilities more efficiently; (d) to provide insight into several biological problems through clustering analysis; and (e) to train students and postdocs to build biological data analysis tools, using our clustering framework as a training ground.

The foundation of our framework is a minimum spanning tree (MST) representation of a data set and its relationship to clustering. Our preliminary studies have revealed that (i) there is a natural connection between MSTs and the concept of clustering, which can help to reduce a multi-dimensional data clustering problem to a tree-partitioning problem; (ii) clustering problems with general objective functions, defined on (minimum spanning) trees, can be solved optimally and efficiently; and (iii) MSTs provide a natural framework for solving a more general class of clustering problems, i.e., extracting data clusters from a noisy background.
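To make observation (i) concrete, the following is a minimal sketch of one classic way to turn clustering into tree partitioning: build an MST over the data points, then delete the k-1 longest tree edges so that the remaining subtrees form k clusters. This simple rule is an assumption chosen for illustration (it yields the same partition as single-linkage clustering) and is not necessarily the objective function or algorithm used in the project. The function names `minimum_spanning_tree` and `mst_clusters`, and the use of Euclidean distance, are likewise hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]
Edge = Tuple[float, int, int]  # (length, endpoint a, endpoint b)


def minimum_spanning_tree(points: Sequence[Point]) -> List[Edge]:
    """Prim's algorithm on the complete Euclidean graph, O(n^2) time."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link from the tree to each point
    best_from = [-1] * n         # tree vertex that provides that link
    best_dist[0] = 0.0
    edges: List[Edge] = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_dist[u], best_from[u], u))
        # Update the cheapest links using the newly added point.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def mst_clusters(points: Sequence[Point], k: int) -> List[int]:
    """Partition points into k clusters by cutting the k-1 longest MST edges."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of points")
    # Keep the n-k shortest of the n-1 tree edges, i.e. drop the k-1 longest.
    kept = sorted(minimum_spanning_tree(points))[: n - k]

    # Union-find to recover the connected subtrees left after the cuts.
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, a, b in kept:
        parent[find(a)] = find(b)

    # Relabel subtree roots as consecutive cluster ids in order of appearance.
    ids: Dict[int, int] = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

For example, calling `mst_clusters` with `k=2` on two well-separated groups of points cuts the single long edge that bridges the groups. Other objective functions would replace the "cut the longest edges" rule with a different choice of which tree edges to remove.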
Additional preliminary studies have also revealed that MSTs have such rich properties related to clustering that further investigation could lead to significantly more effective ways of clustering and analyzing biological data.

Our research will be organized and carried out in five tasks.

o Investigation of fundamental properties of MSTs versus clustering: We will investigate fundamental relationships between MSTs and clustering. New insights and discoveries about these relationships will lay the foundation for developing more effective ways of clustering.

o Investigation and development of MST-based clustering algorithms and statistical analysis methods: We will investigate and develop a large class of MST-based algorithms for several clustering-related problems. In addition, we will investigate and develop effective statistical analysis tools for assessing the statistical significance and robustness of clustering results.

o Development of improved analysis capabilities for four selected application problems: We will apply our clustering framework to four biological data analysis problems: (1) gene expression data analysis, (2) regulatory binding site identification, (3) two-hybrid data analysis, and (4) phylogenetic tree clustering analysis.

o Implementation of our MST-based clustering framework as library functions: We will implement our MST-based clustering-related algorithms as APIs (Application Programming Interfaces), which other researchers can easily use in their own data analysis software. In addition, we will implement our clustering tools as a Web server for community service.

o Training and education: Because MSTs provide such a rich set of attractive properties relevant to clustering, we will use our MST-based clustering framework as a training platform to teach students and postdocs how to develop biological data analysis tools.
Our proposed study and development directly address the research challenges of the ITR program in the following areas:

o providing new computational, simulation and data-analysis methods and tools to model physical, biological, social, behavioral and mathematical phenomena, and

o improving our ability to understand, model and control the behavior of complex systems.
