BIGDATA: DA: Interpreting massive genomic data sets via summarization

$207,764R01FY2014CANIH

University Of Washington, Seattle WA

Investigators

Linked publications & trials

Paper 29993658 Paper 29345009 Paper 27846892 Paper 25948244

Abstract

Genomic data is big and getting ever bigger, but current analysis methods will not scale to the analysis of thousands or millions of genomes. Consequently, a critical technical challenge is to develop new methods that can analyze these enormous data sets. In this proposal, we describe a new computational framework for drawing inferences from massive genomic data sets. Our approach leverages submodular summarization methods that have been developed for analyzing text corpora. We will apply these methods to five big data problems in genomics: 1) identifying functional elements characteristic o f a given human cell type; 2) identifying genomic features associated with a particular subclass of cancer; 3-4) identifying genomic variants representative of ancestrally or phenotypically defined human populations; and 5) finding a set of microbial genes that characterize a given site on the human body. This project will advance discovery and understanding on two fronts. First, we will develop novel methods for summarizing genomic, epigenomic and metagenomic data sets. Indeed, to our knowledge, this grant proposes the first application of summarization methods to genomic data of any kind. The proposed research will significantly advance our ability to apply submodularity to these summarization tasks, particularly with respect to identifying and creating a library of distance functions that have bee validated with respect to the five tasks outlined in the proposal. Second, we will apply our novel methods to problems of profound importance. Indeed, significant progress toward any one of our five tasks would represent an important advance in our scientific understanding of human history, biology or disease. The impact of this project will grow as the big data problem grows, even after the project is complete. The results of this project, both the software that we develop and the summaries that we produce, will be useful for answering a wide array of questions in any field that must cope with big data.

View original record on NIH RePORTER →