CAREER: accelerating machine learning with low dimensional structure

$167,229 · FY2020 · CSE · NSF

Cornell University, Ithaca NY

Abstract

Big datasets are everywhere: in science, in health, in commerce, and in government, data is becoming easier and cheaper to collect. Yet extracting value from this data is a challenge, because every step requires human intervention: cleaning the data, identifying useful features, and choosing a machine learning model. The goal of this project is to develop new methods to accelerate and automate the basic machine learning (ML) workflow. Automation frees data scientists from data cleaning and parameter twiddling to concentrate on the important questions: are we solving the right problems, and do we have the right data? This project will help democratize machine learning and promote data-driven decision making by developing automated methods to clean data and to choose ML models, along with open source software packages that make these methods widely available and easy to use. The project also advances these goals by training data scientists in how to use these models and understand their potential risks.

Low dimensional structure provides the key to meeting the diverse challenges involved in automating machine learning. This project relies on the central insight that measurements of a complex object, such as a patient in a hospital, a respondent to a survey, or even an ML dataset, can be well described as simple functions (or even linear functions) of an underlying low dimensional latent vector. The project develops new algorithms and software to identify low dimensional latent vectors and to use them to a) clean the data by denoising observations or imputing missing entries, b) reduce the dimensionality of feature vectors, and c) recommend better algorithms. This project will develop new techniques to identify low dimensional latent vectors from sparse observations via nonlinear (even discontinuous) functions, with efficient algorithms and with theoretical guarantees. To enable more efficient automated machine learning, the project will develop methods to localize similar datasets near each other in a low dimensional space, so that nearness in this space predicts similar performance of machine learning methods.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
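
As an illustration of the latent vector idea described in the abstract, the following is a minimal sketch of imputing missing entries under the simplest (linear) version of the model, where each row of a data table is a linear function of a low dimensional latent vector. It uses a generic iterated truncated SVD heuristic, not the project's own algorithms; the function name `impute_low_rank`, its parameters, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def impute_low_rank(X, mask, rank=2, n_iters=200):
    """Fill unobserved entries of X by iterated truncated SVD.

    Hypothetical helper for illustration only.
    X:    2-D array; entries where mask is False are ignored (may be NaN).
    mask: boolean array, True where an entry was observed.
    Returns the completed matrix and one latent vector per row.
    """
    # Start by filling missing entries with the observed column means.
    observed = np.where(mask, X, 0.0)
    col_means = observed.sum(axis=0) / np.maximum(mask.sum(axis=0), 1)
    filled = np.where(mask, X, col_means)

    for _ in range(n_iters):
        # Best rank-`rank` approximation of the current completed matrix.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        latent = U[:, :rank] * s[:rank]
        low_rank = latent @ Vt[:rank]
        # Keep observed entries; overwrite only the missing ones.
        filled = np.where(mask, X, low_rank)

    return filled, latent

# Synthetic example: 60 "objects" with 40 measurements each,
# generated from 2-dimensional latent vectors, with ~30% of entries hidden.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 40))
mask = rng.random(X_true.shape) < 0.7
X_obs = np.where(mask, X_true, np.nan)

X_hat, latent = impute_low_rank(X_obs, mask, rank=2)
rel_err = np.linalg.norm((X_hat - X_true)[~mask]) / np.linalg.norm(X_true[~mask])
print(f"relative error on hidden entries: {rel_err:.2e}")
```

The recovered `latent` array gives one low dimensional vector per row, which could also serve as a reduced feature representation. The nonlinear and discontinuous settings mentioned in the abstract are beyond what this linear sketch covers.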
