Analytical Approaches to Massive Data Computation with Applications to Genomics

$69,189 | R01 | FY2014 | CA | NIH

Brown University, Providence RI

Linked publications, trials & patents

Paper 29950014 Paper 29700472 Paper 29112732 Paper 28882002 Paper 28810144 Paper 28751478 Paper 27587696 Paper 27489002 Paper 27467246 Paper 26645471 Paper 26253137 Paper 26072510 Paper 25950620 Paper 25785493 Paper 25501392 Paper 24932008 Paper 24479672

Abstract

DESCRIPTION (provided by applicant): We propose to design and test mathematically well-founded algorithmic and statistical techniques for analyzing large-scale, heterogeneous, and noisy data. We focus on fully analytical evaluation of algorithms' performance and on rigorous statistical guarantees for the analysis results. This project will leverage the PIs' recent work on cancer genomics data analysis and rigorous data mining techniques. That work was driven by specific applications, whereas in the current project we aim to develop general principles and techniques that will apply to a broad set of applications. The proposed research is transformative in its emphasis on rigorous analytical evaluation of algorithms' performance and on statistical measures of output uncertainty, in contrast to the primarily heuristic approaches currently used in data mining and machine learning. While we cannot expect a full mathematical analysis of all data mining and machine learning techniques, any progress in that direction will contribute significantly to the reliability and scientific impact of the discipline. While our work is motivated by molecular biology data, we expect the techniques to be useful for other scientific communities facing massive multivariate data analysis challenges. Molecular biology provides an excellent source of data for testing advanced data analysis techniques: specifically, DNA/RNA sequence data repositories are growing at a super-exponential rate. The data is typically large and noisy, and it includes both genotype and phenotype features that permit experimental validation of the analysis. One such data repository is The Cancer Genome Atlas (TCGA), which we will use for initial testing of the proposed approaches.
