Optimization Techniques for Geometrizing Real-World Data

$50,615FY2019MPSNSF

New York University, New York NY

Investigators

Abstract

Data is a common denominator to scientific fields, governments, and private enterprises. Being able to exploit data to find patterns has produced scientific breakthroughs and shifted business paradigms in the last several decades. This project focuses on mathematical and algorithmic techniques for specific data science problems, tailored to currently relevant domain problems, technologies, and volumes of data. The theoretical problems we consider are (i) clustering (which essentially consists on grouping data according to similarity in an unsupervised way), (ii) dimensionality reduction (reducing the volume of the data while preserving its relevant features), and (iii) quadratic assignment (finding correspondences between different datasets). The main underlying application we consider in this project is computational biology, in particular the processing of single-cell sequencing data. The technology for single-cell sequencing has been very recently developed and it is improving quickly, producing new datasets, problems and challenges that are interesting from a mathematical point of view and have potentially enormous impact. The project will have mathematicians working closely to computational biologists with the goal of identifying data science problems occurring in the scientific domain and to develop appropriate algorithms and mathematical tools. Given single-cell genetic expression data indicating how many times each gene is expressed in each cell, one objective is to select a few genes that can be used to identify different classes of cells. This problem is known in the computational biology literature as genetic marker selection. In a first approach we assume the class of each cell is known and the problem can be posed as supervised dimensionality reduction. We model it as a projection factor recovery problem, and we approach it using optimization tools such as semidefinite and linear programming. The objective is two-fold, we aim to study mathematical properties of the model we devise, and to develop an efficient tool to be used by practitioners. A second stage of the project is to make the problem unsupervised, therefore clustering will be a fundamental step. We will study stability properties of clustering methods and we will provide an efficient algorithm to evaluate the quality of clusters, based on statistical and optimization techniques. The potential use of this tool is general to data science and not just gene expression datasets. Finally, a third objective is to align datasets coming from different experiments. This problem is ubiquitous in data science, with graph matching and shape matching as some particular cases. In the context of computational biology the alignment problem is known as batch correction and it can be modeled with optimal transport or as a quadratic assignment problem. We will develop alignment algorithms and study their convergence and recovery properties under different data models. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →