CAREER: A Framework for Mining Multimode, Non-Homogeneous Tensor Data Sets With Linear and Non-Linear Degrees of Freedom
Rensselaer Polytechnic Institute, Troy NY
Investigators
Abstract
Abstract 0545538 Drineas, Petros Rensselaer Polytech Inst Intellectual Merit. Massive data sets are ubiquitous and making effective use of such data sets is essential for the advancement of todays science, engineering, and business worlds. This proposal concerns the development and analysis of novel algorithms for structuring and interrogating multimode, heterogeneous data sets modelled by multimode arrays (tensors), which arise in many applications. The proposed algorithms use random sampling to extract representative data elements from the data sets and, subsequently, they operate on the sample. In order to yield improved results for problems with heterogeneities and non-uniformities, importance sampling and other sophisticated sampling techniques are employed. This proposal contains three specific research directions (the first three years) and an early career plan (the last two years). First, it is proposed to address the wellstudied and broadly applicable 1 and 2 regression problems from a sampling perspective in order to prove that a small, judiciously chosen sample contains the information necessary to approximately solve the problem. Second, the above results will be extended to derive simple, intuitive matrix decompositions, of the form CUR, where C contains a few columns of the matrix, R contains a few rows of the matrix, and U is a carefully chosen matrix. CUR-type decompositions provide summaries of the data that are very appealing from a data interpretation perspective, since they express all columns of A as linear combinations of a small number of basis columns (the columns in C), which come directly from the data, and similarly for the rows. Third, CUR-type decompositions will be extended to tensors and a basis for dealing with complex, multimode, heterogeneous, tensor data sets exhibiting multilinear structure will be developed. In the last two years, it is proposed to focus on alleviating the assumption of linear degrees of freedom across the various modes of tensor data sets, and seek sampling techniques that extract non-linear degrees of freedom. This proposal will extend in significant ways the previous work of the PI on fast Monte Carlo algorithms for performing computations on large matrices. Broader impact. This work will have broader impact in fields as diverse as the Social Sciences, Bioinformatics and Computer Science. The PI has initiated collaborations with three labs and will apply the proposed techniques on their data. The PI in collaboration with Prof. Alter and her lab at UT Austin will apply and interpret CUR-type decompositions on cell-cycle gene expression data in an effort to identify physiological pathways that are active during the different phases of the cell cycle. The PI will also collaborate with Prof. Kidds lab at the Genetics Department of Yale University to apply and interpret CUR-type decompositions on genotyping data from samples from 42 populations from around the world. Finally, the PI will collaborate with M. Maggioni at the Yale Math Department and faculty at the Yale Medical School to apply the proposed tensor decompositions on colon cancer datacubes. These three collaborations will serve as proofs of concept for the practical applicability of these techniques, and will lead the PI towards the broader goal of developing a toolbox for tensor analysis for use by the multitude of researchers in fields where multimode data naturally arise. Finally, the PIs work will be of interest to motivated undergraduates with a wide range of interests. It is proposed to disseminate this knowledge to undergraduates by continuing our previously proven commitment to involving undergraduates, even in their sophomore year, in cutting edge research. Extra effort will be made to involve underrepresented groups, in collaboration with the aforementioned labs.
View original record on NSF Award Search →