III: Small: Fast and Efficient Algorithms for Matrix Decompositions and Applications to Human Genetics

$204,360 | FY2016 | CSE | NSF

Purdue University, West Lafayette IN

Investigators

Abstract

Linear algebraic algorithms, and in particular matrix decompositions, have proven extremely successful in the analysis of datasets in the form of matrices. Tools such as the Singular Value Decomposition (SVD) and the related Principal Components Analysis (PCA) have had a profound impact in diverse areas, ranging from web search engines to the physical sciences. Over the last decade, the introduction of randomization has provided a new paradigm for the design and analysis of such algorithms. Meanwhile, human genetics researchers are now finding out how truly different we are from one another. Large datasets describing the common patterns of human genetic variation can naturally be viewed as matrices, with the rows representing individuals and the columns representing loci in the genome that correspond to common polymorphisms. The broader impact of such datasets cannot be overstated: they are expected to be a key resource for researchers seeking genes that affect health, disease, and responses to drugs and environmental factors, as well as for understanding the evolutionary and biological history of our species. The main objective of this proposal is to bridge the gap between state-of-the-art algorithms for data analysis developed in the theoretical computer science and applied mathematics communities and the application of such algorithms to the increasingly large datasets of the human genetics community. The particular focus of our proposal is, from an algorithmic perspective, the design and analysis of (supervised and unsupervised) randomized algorithms for the so-called CX matrix factorization, and, from a population genetics perspective, the selection of ancestry-informative and disorder-associated markers, as well as ancestry and affection status prediction. This work will have immediate impact on the analysis of population genetics data.
The results will be disseminated to a broad community of applied mathematicians, theoretical computer scientists, and population geneticists.
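
To make the CX factorization mentioned above concrete, here is a minimal sketch of one common unsupervised variant: columns (loci) are sampled with probability proportional to their rank-k leverage scores, and the matrix is then approximated as C times X, where C consists of actual columns of the data. This is an illustration under stated assumptions, not the algorithm developed under this award. The function name `cx_decomposition`, its parameters, and the choice to sample without replacement are all assumptions made for the example.

```python
import numpy as np


def cx_decomposition(A, k, c, seed=0):
    """Approximate A (individuals x loci) as C @ X, with C made of c columns of A.

    Hypothetical helper for illustration; k is the target rank used to
    compute leverage scores and c is the number of columns (loci) to keep.
    """
    rng = np.random.default_rng(seed)

    # Right singular vectors of A; the first k rows of Vt span the
    # dominant k-dimensional row space.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)

    # Rank-k leverage score of each column: its squared norm in Vt[:k].
    # Normalizing turns the scores into a sampling distribution.
    scores = np.sum(Vt[:k] ** 2, axis=0)
    probs = scores / scores.sum()

    # Simplifying assumption: draw c distinct columns, biased toward
    # high-leverage ones (analyses often sample with replacement instead).
    cols = np.sort(rng.choice(A.shape[1], size=c, replace=False, p=probs))
    C = A[:, cols]

    # X is the least-squares solution of C @ X = A. The explicit rcond
    # discards numerically zero singular values when C is rank deficient.
    X = np.linalg.lstsq(C, A, rcond=1e-10)[0]
    return cols, C, X
```

In the genetics setting described in the abstract, the returned `cols` would index the selected loci, which is what makes CX attractive for marker selection: unlike singular vectors, the columns of C are actual markers that can be genotyped.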
