A Cloud-based Application for the Joint Analysis of Multiple Big Data Types
Mathnimbus, Inc., Southlake TX
Abstract
Technology advances now enable the cost-effective acquisition of K > 1 distinct data types from a common set of N bio-samples, where: i) the kth data type is represented by a data matrix with columns containing Pk measurements in N samples for k = 1, 2, …, K; and ii) at least one of the data types is "big", i.e., Pk is much larger than N for some k. The rapid accumulation of such multi-modal data sets (MMDS) in private and public databases has outpaced the development of a more predictive, precise, and personalized approach to detecting and treating cancer and other complex diseases. This problem is due in large part to the lack of easily accessible, computationally efficient software that can identify small sets of biologically informative variables (i.e., signatures) in MMDS that are also predictive of clinical outcomes. The primary aim of this project is to develop a cloud-based application built on a novel algorithm called the Joint Analysis of Many Matrices via ITeration (JAMMIT), which exploits a key property of genomic signatures, called sparsity, to enhance their detection in big data matrices. The sparsity assumption asserts that the number of variables needed to explain a key biological and/or clinical attribute of the samples constitutes only a very small fraction of the tens of thousands measured. JAMMIT computes sparse, rank-1 matrix approximations that automatically "zoom in" on sparse signatures that are shared by the data matrices of an MMDS. The false discovery rate (FDR) is used to select the "best" signatures for further downstream data reduction and modeling. The JAMMIT algorithm has been validated in data simulations and on real experimental data for ovarian and liver cancer. A novel cloud-based platform called AlgorithmHub will be used to implement JAMMIT as a secure, computationally efficient Software-as-a-Service (SaaS) on Amazon Web Services (AWS).
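To make the idea of a sparse, rank-1 approximation shared across data matrices more concrete, the following is a minimal sketch and not the JAMMIT algorithm itself, whose details are not given here. It assumes each data matrix is stored with variables as rows and the N shared samples as columns, stacks the matrices row-wise, and alternates between a soft-thresholded variable vector u and a sample vector v. The function names, the toy data, and the threshold `lam` (which depends on the scale of the data) are all illustrative assumptions.

```python
import math

def soft_threshold(values, lam):
    """Shrink each entry toward zero by lam; entries within lam of zero become 0."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in values]

def normalize(vec):
    """Scale to unit Euclidean length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def sparse_rank1(matrices, lam, n_iter=50):
    """Return (u, v): sparse variable loadings and a sample vector for stacked data."""
    # Stack the K matrices row-wise: each is P_k x N (rows = variables,
    # columns = the N shared samples), giving a (P_1 + ... + P_K) x N matrix.
    stacked = [row for matrix in matrices for row in matrix]
    n_samples = len(stacked[0])
    # Start from the row with the largest norm (an assumed, deterministic choice).
    v = normalize(max(stacked, key=lambda r: sum(x * x for x in r)))
    u = [0.0] * len(stacked)
    for _ in range(n_iter):
        # Project every variable onto the current sample vector, then sparsify.
        projections = [sum(r[j] * v[j] for j in range(n_samples)) for r in stacked]
        u = normalize(soft_threshold(projections, lam))
        if not any(u):  # threshold too large: every variable was zeroed out
            break
        # Update the sample vector from the surviving variables only.
        v = normalize([sum(stacked[i][j] * u[i] for i in range(len(stacked)))
                       for j in range(n_samples)])
    return u, v

def toy_matrix(strengths, pattern, offset):
    """Hypothetical data: row i is strengths[i] * pattern plus small fixed noise."""
    return [[s * p + 0.1 * math.sin(7 * (i + offset) + 3 * j)
             for j, p in enumerate(pattern)]
            for i, s in enumerate(strengths)]

# Two toy "data types" measured on the same six samples (two groups of three).
pattern = [1, 1, 1, -1, -1, -1]
data_a = toy_matrix([3, -2, 0, 0, 0], pattern, 0)   # 5 variables, 2 informative
data_b = toy_matrix([0, 0, 2.5, 0], pattern, 5)     # 4 variables, 1 informative

u, v = sparse_rank1([data_a, data_b], lam=1.0)
support = [i for i, x in enumerate(u) if x != 0.0]  # indices into the stacked rows
print(support)
```

In this toy setup the nonzero entries of u mark the variables retained as a candidate signature, and larger values of `lam` yield sparser signatures. How the sparsity level is actually chosen in JAMMIT (the abstract mentions the false discovery rate) is outside the scope of this sketch.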
Researchers will be able to access the application 24/7 from any device with internet access to upload, pre-process, and analyze up to three big data types in a timely manner. Post-processing tools implemented on AWS will facilitate the training of neural network (NN) predictors on "eigen-wavelet" (EW) features extracted from JAMMIT-derived signatures, using genetic programming and backpropagation to optimize network topology and connection weights, respectively. Signatures derived by JAMMIT as a SaaS will be compared with published results generated by a version of JAMMIT implemented in Matlab on local servers. NNs trained on raw signature features and on EW features will be assessed and compared using ROC curves, confusion matrices, and cross-validation. The Phase I implementation of JAMMIT as a cloud-based application will set the stage for a Phase II effort to: 1) extend JAMMIT to handle an arbitrary number of data matrices; 2) automate the selection of the best sparsity parameter based on FDR; 3) enhance ease of use based on user feedback; and 4) utilize genetic programming to optimize both EW features and network topology along with network connection weights.
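The assessment tools named above (ROC curves, confusion matrices, cross-validation) can be sketched in a few lines. The example below is an illustration under stated assumptions, not the project's evaluation code: it assumes binary labels coded 0/1 and predictor scores in [0, 1], summarizes the ROC curve by its area (computed from pairwise score comparisons), and uses a simple interleaved fold assignment without stratification. All function names and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive sample scores higher; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def confusion_matrix(labels, scores, threshold=0.5):
    """Counts of true/false positives/negatives when scores >= threshold
    are predicted as class 1."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; sample i is held out in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test

# Hypothetical labels and predictor scores for six samples.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

auc = roc_auc(labels, scores)
cm = confusion_matrix(labels, scores)
folds = list(k_fold_indices(len(labels), 3))
print(auc, cm, folds)
```

Comparing predictors trained on raw signature features with those trained on EW features would then amount to computing these quantities on the held-out fold of each split and comparing them across the two feature sets.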