Computational approaches for elucidating proteomes and interactomes

$1,968,575 | R35 | FY2025 | GM | NIH

Princeton University, Princeton NJ

Investigators

Abstract

Proteins and their interaction networks play crucial roles in virtually all cellular processes. As such, knowledge of their functioning is essential for understanding the molecular basis of life. While high-throughput experimental methods are routinely used to characterize proteins and catalog their interactions, the overwhelming amount of genomic variation, both within and across organisms and in healthy and disease states, necessitates the development and use of efficient computational methods. Our overarching goal is to devise algorithms and machine learning methods that yield a predictive understanding of proteins and their specificities, interactions, and networks, and of how these attributes are altered by genetic variation. This is an especially exciting time to develop methods for analyzing protein sequences, as in recent years the field has been transformed by new artificial intelligence technologies, including protein language models that, akin to the progress in natural language processing, learn the "language" of proteins. In the next five years, we will combine these groundbreaking technologies with our years of domain expertise on proteins and their networks to tackle fundamental problems in three important areas. First, we will develop powerful new machine learning methods to predict DNA-binding specificities for broad classes of transcription factor proteins; such methods will newly enable the inference of regulatory interactions mediated by uncharacterized transcription factors or those mutated in disease, significantly advancing upon current work that is focused just on specific transcription factor families. Second, we will develop highly scalable, sequence-based approaches to predict the specific functional effects of variants within proteins; these approaches will be a great aid in interpreting the millions of coding variants observed across human populations and will be a step towards obtaining a mechanistic understanding of disease mutations. Third, we will develop novel algorithmic and machine learning approaches to predict the targets of kinase proteins and the pathways they regulate; these methods will elucidate kinase signaling networks and help decipher the growing body of complex phosphoproteomics datasets. A final, cross-cutting goal of our research is to rigorously evaluate current protein language models in order to uncover their strengths and limitations, and to develop innovative strategies to improve their capacity to capture the syntax, grammar, and semantics of protein sequences. We will release open source software for all developed methods.
