DMS/NIGMS 2: Deep learning for repository-scale analysis of tandem mass spectrometry proteomics data

$905,320 | FY2023 | MPS | NSF

University Of Washington, Seattle WA

Investigators

Abstract

The field of proteomics studies the primary functional molecules in the cell, identifying and quantifying proteins in complex biological samples with the goal of understanding their roles in health and disease. Proteomics is also fundamental to studies of microorganisms in diverse environments, ranging from soil samples to ocean water samples. The primary technology driving the rapid growth of this field is tandem mass spectrometry. In addition to technological advances in mass spectrometry hardware, accurate and efficient analysis of the complex data produced by a tandem mass spectrometer requires increasingly sophisticated algorithmic tools. The project will develop these tools. In particular, the project team will develop machine learning software that aims to improve scientists' ability to infer the identities and quantities of thousands of proteins in a complex sample. Successful adoption by the proteomics research community of the tools developed by this project will impact a wide range of studies, including model organism proteomics to understand basic molecular function, human disease cohort studies, and environmental proteomics analyses. The tools produced by this project will allow scientists to detect more proteins and to more accurately quantify how their abundances change in health and disease and across different environmental conditions. The central hypothesis driving this project is that statistical power in interpreting bottom-up tandem mass spectrometry data can be increased by using deep neural networks to leverage data in public repositories. The project addresses a series of tasks, each of which uses deep neural networks to solve a different core problem in mass spectrometry analysis, and each of which can be improved by making use of massive and rapidly growing repositories of public mass spectrometry data, such as PRIDE and MassIVE.
The four tasks address large-scale clustering of spectra, assigning peptides to observed spectra in a de novo fashion, imputing missing values in cohorts of quantitative mass spectrometry data, and de-noising mass spectrometry measurements. These tasks are important because (1) each one represents a fundamental analysis challenge, a solution for which has the potential to impact a wide variety of downstream applications in mass spectrometry proteomics, (2) each task allows for innovative applications of machine learning from repository-scale data, and (3) the project team has existing mass spectrometry collaborations that will directly benefit from solutions to these problems. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
