BSF:2012304:Methods for Preprocessing Population Sequence Data
University of California-Los Angeles, Los Angeles, CA
Abstract
This project is funded as part of the United States-Israel Collaboration in Computer Science (USICCS) program. Through this program, NSF and the United States-Israel Binational Science Foundation (BSF) jointly support collaborations between US-based and Israel-based researchers. In recent years, genetic studies have revealed many new associations between human genetic variation and complex diseases. These studies, referred to as genome-wide association studies, are limited to common genetic variants because the technology used to measure genetic variation captured only common variants. There is evidence suggesting that rare variants play an important role in disease architecture. Recently, sequencing technologies have been introduced that can capture both common and rare genetic variation. Sequencing technologies generate enormous amounts of data, raising new computational challenges. In this project, the PIs will develop methods for addressing these computational challenges, including the design of efficient algorithms and the modeling of the sequencing process. In addition, the PIs will develop methods for incorporating rare variants into the analysis of genetic studies. The immediate broader impact of the project is the availability of these tools for general use by geneticists, leading to an improved understanding of disease genetics. In particular, the PIs will apply their methods to studies of non-Hodgkin's lymphoma, bipolar disorder, dyslipidemia, neurodegenerative dementia, and Tourette syndrome, which will have a direct impact on our understanding of these conditions.
Computational methods for the analysis of sequencing data already exist; however, they are limited to the analysis of a single sample. In this project, the PIs will design efficient computational methods for the analysis of sequence data across a population. For population samples, the tremendous size of the data requires algorithms that are highly efficient in both memory and runtime. Specifically, the PIs propose to design algorithms for the compression of sequencing data, for the search of regions identical by descent across multiple samples, and for high-resolution haplotype inference from sequence data. The PIs will explicitly model rare variants and the sequencing process, and will use machine learning techniques and convex optimization to estimate the model parameters efficiently. These methods will allow for a fine-scale analysis of population data, resulting in an improved understanding of complex diseases and human history. The collaborative nature of the project will expose the students involved to the medical and genetics worlds, both in Israel and in the US, and will improve their ability to design and implement solutions to complex algorithmic problems. The methods developed in this project will become part of the teaching material for courses at UCLA and in Tel-Aviv, and these materials will be made publicly available.