CAREER: Learning and Selecting Low-Dimensional Models from Incomplete Data

$478,608 | FY2023 | CSE | NSF

University Of Wisconsin-Madison, Madison WI

Abstract

Big datasets often have an underlying structure. Identifying such a structure makes it possible to predict outcomes of interest from a few variables, for example, predicting the effectiveness of a drug or vaccine based on the drug’s molecular structure. There exists a wide variety of methods to learn the underlying structure of a dataset and make accurate predictions. However, when data is severely incomplete, as is the case in many modern datasets, existing methods consistently fail to identify the correct structure of the data. More alarmingly, the existing methodology has no means to verify whether the structure found is correct. In other words, whenever data is incomplete, the structure learned by any existing method cannot be trusted and may result in undetectable, arbitrarily wrong predictions. This project will (i) develop methods to learn structures specifically tailored to handle missing data and (ii) develop a theory to verify whether the structure learned by any method (including existing ones) is correct. In turn, this research will enable scientists to learn the structures governing their incomplete datasets in a wide range of applications that benefit society, including drug discovery, metagenomics, and opportunistic screening. Furthermore, this project will support outreach activities to engage more students in machine learning, both locally and nationally, through hands-on activities, social media campaigns, symposia, courses, and mentoring.

The technical aims of the project are divided into three main thrusts. The first thrust will investigate a new approach that maps incomplete data to the Grassmann manifold of subspaces, wherein the data’s underlying structure can be revealed by solving a constrained optimization over the Schubert varieties defined by the observed data. The second thrust will develop model-selection criteria to determine the structure that best fits an incomplete dataset among a collection of candidate structures. These criteria will be generalizations of the Akaike and Bayes information criteria and the minimum effective dimension, adapted to account for missing data. They will be complemented with a goodness-of-fit test to determine whether the winning structure is, indeed, a good fit for the data. These are non-trivial tasks that require special considerations in light of missing data, which can consistently cause spurious structures to fit arbitrarily large datasets with the same degree of error as the correct structures. Ultimately, the results from this thrust will make it possible to determine whether the predictions stemming from a specific structure can be trusted. The third thrust will implement our methodology in open-source, easy-to-use software to benefit the broader scientific community and test it on datasets related to our ongoing interdisciplinary collaborations in metagenomics, single-cell sequencing, sonotype classification, bacteria classification, drug discovery, and clinical opportunistic screening.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
