CAREER: Deep representation learning for exploration and inference in biomedical data

$586,187 | FY2021 | CSE | NSF

Yale University, New Haven, CT

Abstract

Biological systems are inherently complex. Increasingly sophisticated technologies are being used in biomedical science to make sense of this complexity and to understand the underlying factors that cause disease. These technologies generate vast amounts of data in many different forms, from changes in how genes and proteins are expressed in individual cells over time, to detailed clinical imaging data on large patient populations and whole-genome sequencing studies across hundreds of thousands of people. These newly developed data types could help uncover important mechanisms and pathways that underpin health and disease. However, there is a large gap between the information contained in these datasets and the ability to extract meaningful insights from them. Here the PI proposes to address this gap by developing new machine learning approaches, grounded in mathematical foundations, for making sense of these complex datasets. The PI will develop deep representation learning techniques that focus on gaining overall insight into the structures, dynamics, interactions, and predictive features of the data, and that will allow specific hypotheses to be derived regarding the underlying regulatory mechanisms that drive disease in different contexts. The project will also involve training a postdoc and a graduate student and mentoring local high school students. In addition, it will enable the development of an online workshop to widely disseminate knowledge of unsupervised data analysis to a diverse array of participants from across the country.

This project proposes to advance biomedical data analysis via three main thrusts. The first thrust is focused on forming deep multiscale representations of the data based on data geometry, graph signal processing, and topological concepts, in combination with powerful deep learning systems. Such representations will allow for exploration of structure and for meaningful, predictive abstractions of the data in a scalable fashion. The second thrust is focused on integrating multiple modalities of data and organizing multitudes of related datasets using optimal transport and generative models to gain insight into entire cohorts of patients or perturbation conditions. The third thrust is focused on learning high-dimensional stochastic dynamics of the data using neural SDE (stochastic differential equation) and graph ODE (ordinary differential equation) networks to gain insight into underlying gene regulatory networks. These approaches will be applied in the context of several specific biomedical challenges. Achieving these aims will enable integration and exploration of a large volume of data for explaining underlying regulatory mechanisms and dynamic phenotypic changes.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
