Robust Accurate Identification of peptides from tandem mass spectrometry data
National Library Of Medicine
Abstract
Assigning statistical significance accurately has become increasingly important as metadata of many types, often assembled in hierarchies, are constructed and combined for further biological analyses. Statistical inaccuracy of metadata at any level may propagate to downstream analyses, undermining the validity of the scientific conclusions drawn from them. For MS-based proteomics, this implies that accurate statistics must first be obtained in peptide identification and then built upon if one hopes to have protein identification methods with accurate statistical significance assignment. However, although heavily studied, the statistical accuracy of peptide/protein identification remains challenging. Many peptide identification methods use database searches and assign E-values to peptide hits; however, the E-values reported by different methods do not agree with each other, and few of them, if any, agree with the textbook definition of the E-value. This hinders combining search results from different methods, particularly if one wishes to combine methods with user-assigned weights. In our earlier publications, we developed peptide identification methods with accurate statistical significance assignment, founded on an extension of the central limit theorem and on all possible peptide statistics, and we provided a way to combine search results democratically. When different weights are present, an instability issue occurs if some of the weights are nearly degenerate; we have devised a mathematical framework to eliminate this instability. We also developed a protein identification method that combines weighted P-values of evidence peptides, solving the long-standing problem of precise type-I error control in protein identification; it also correctly reports the proportion of false discoveries, indicating accurate type-II error control.
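To make the two statistical notions above concrete, the following is a minimal Python sketch, not the method from our publications. It assumes independent P-values and equal ("democratic") weights, for which the textbook Fisher combination has a closed form, and it uses the textbook reading of the E-value as the expected number of random hits scoring at least as well, which equals the per-candidate P-value times the number of candidates scored. The function names `fisher_combined_p` and `e_value` are hypothetical and chosen only for illustration.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P-values with equal weights (Fisher's method).

    Illustrative stand-in only: the weighted combination described in the
    text is a different, more general formula.
    """
    if not p_values:
        raise ValueError("need at least one P-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    k = len(p_values)
    # t = -ln(product of P-values); under the null, 2t is chi-square
    # distributed with 2k degrees of freedom.
    t = -sum(math.log(p) for p in p_values)
    # Closed-form survival function for an even number of degrees of freedom:
    # exp(-t) * sum_{j=0}^{k-1} t^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= t / j
        total += term
    return math.exp(-t) * total


def e_value(p_value, n_candidates):
    """Expected number of random hits scoring at least as well.

    n_candidates is the number of candidate peptides scored against the
    spectrum (an assumed input for this sketch).
    """
    return n_candidates * p_value
```

With a single P-value the combination returns that P-value unchanged, which is a quick sanity check that the combined statistic is itself a valid P-value under the stated independence assumption.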
We also developed a new peptide significance assignment method based on the extreme value statistics to provide accurate peptide identification confidence for methods that use scoring functions that cannot be expressed as a sum of independent contributions. Over the past several years, we have worked on a large cross-institute project involving scientists from NHLBI, NCI, and the Clinical Center, focusing on pathogen identification using MS-based proteomics. The central concept is to use each pathogen's peptidome as a unique identifier, enabling species or genus ranking based on peptidome similarity to the peptides identified, relying on accurate statistical significance assignment. Between 2017 and 2020, we completed the first three phases of the study: identification of a single microbe, simultaneous identification of multiple microbes, and identification of multiple microbes along with protein biomass estimation. These results were published in the Journal of the American Society for Mass Spectrometry. In 2021 and 2022, we focused on improving tool accessibility and fostering international collaboration. We developed and maintained a graphical user interface (GUI) for our microbial identification and biomass estimation methods and collaborated with the CCUG group at the University of Gothenburg in Sweden to address antibiotic resistance identification in pathogens, a crucial step toward avoiding ineffective antibiotic treatments. These findings were also published in the Journal of the American Society for Mass Spectrometry. Building on our previous work, from 2023 through 2024 we focused on significantly enhancing the graphical user interface (GUI) of our microorganism classification and identification software, MiCId, and accelerating the identification of pathogens and microorganisms from MS proteomics data.
Guided by feedback from users and collaborators, we introduced new tools to support workflow management, reproducibility, record keeping, data analysis, and visualization. Additional features added to the GUI include modules for analyzing peptide fragmentation patterns and peptide/protein isotopic distributions, as well as an implementation of the lowest common ancestor (LCA) algorithm for proteotyping biomarker design. In parallel, we collaborated with colleagues at NCI and the Clinical Center to improve the speed and accuracy of pathogen identification by introducing tandem mass tags (TMTs) for multiplexed analysis of multiple TMT-labeled samples in a single MS/MS experiment. A major challenge in TMT-based analysis is the interference observed in the intensities of TMT reporter ions. To address this, we developed and applied a modified expectation-maximization (EM) algorithm that redistributes interfering signals to their correct TMT-labeled samples. Both the GUI enhancements and the new TMT-based identification method have been fully integrated into MiCId. The GUI development work was published in the Journal of Computational Biology in 2024, and the TMT-based method was published in the Journal of the American Society for Mass Spectrometry in the same year. Since 2023, members of my group have served as committee members of the Metaproteomics Initiative, an international effort focused on standardizing and evaluating the performance of existing metaproteomics workflows. Results from this effort are to be published soon. In 2023, I also established a collaborative project with the University of Gothenburg (Sweden) to identify pathogens and antibiotic resistance in patients with urinary tract infections using mass spectrometry-based proteomics. In addition, our group is actively collaborating with the Robert Koch Institute (Berlin) on a comprehensive evaluation of computational workflows for pathogen identification.
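The lowest common ancestor (LCA) step mentioned above can be sketched generically as follows. This is an illustrative Python sketch, not the MiCId implementation: the taxonomy is assumed to be a child-to-parent dictionary with a single root, and the toy taxonomy, the `lineage` helper, and the `lowest_common_ancestor` function are all hypothetical names introduced for this example.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root, taxon first."""
    path = [taxon]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxa, parent):
    """Return the deepest node shared by the lineages of all given taxa."""
    taxa = list(taxa)
    if not taxa:
        raise ValueError("need at least one taxon")
    # Start from the first lineage (ordered deepest to root) and keep only
    # the nodes that also occur in every other lineage.
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        other = set(lineage(taxon, parent))
        common = [node for node in common if node in other]
    # The first surviving node is the deepest common ancestor.
    return common[0] if common else None


# Toy child -> parent taxonomy (assumed for illustration only).
TOY_PARENT = {
    "root": None,
    "Enterobacteriaceae": "root",
    "Escherichia": "Enterobacteriaceae",
    "Escherichia coli": "Escherichia",
    "Klebsiella": "Enterobacteriaceae",
    "Klebsiella pneumoniae": "Klebsiella",
    "Staphylococcaceae": "root",
    "Staphylococcus": "Staphylococcaceae",
    "Staphylococcus aureus": "Staphylococcus",
}
```

In this toy setting, a peptide found only in Escherichia coli is assigned to that species, while a peptide shared by Escherichia coli and Klebsiella pneumoniae is assigned to their family. This also shows why LCA-based assignment loses resolution as peptides become more widely shared.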
We have also completed a project in partnership with the National Security Directorate at Pacific Northwest National Laboratory (PNNL), where we investigated computational algorithms for microbial forensics. The results of this study are currently under review in Nature Scientific Reports. In 2025, we addressed a critical challenge in MS-based metaproteomics: accurately identifying and quantifying proteins and biological functions across the full taxonomic lineage of microorganisms. This challenge arises from the so-called "shared confidently identified peptide problem". Most existing metaproteomics tools rely on the lowest common ancestor (LCA) algorithm to assign biological functions, which often results in incomplete function assignments across the entire taxonomic hierarchy. To overcome this limitation, we implemented a constrained expectation-maximization (EM) algorithm combined with a biological function database within the MiCId workflow. Using synthetic datasets, our study demonstrates that the enhanced MiCId workflow offers improved accuracy and better control of false discoveries in biological function identification, along with reliable computation of function abundances across the full taxonomic lineage of identified microorganisms. The first part of this work was published in the Journal of Proteome Research in 2025, and a manuscript detailing the second part, focused on protein quantification across taxonomic levels, is currently in preparation for submission. The recent emergence of advanced deep learning-based peptide spectrum prediction methods has inspired us to revisit peptide identification. If the fragmentation intensity profiles of candidate peptides were known, the sensitivity and specificity of peptide identification, and consequently of protein and microorganism identification, could potentially be improved. To this end, several deep learning-based methods have recently been proposed.
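The shared-peptide problem described above can be illustrated with a plain, unconstrained EM iteration. This is a generic Python sketch and not the constrained EM algorithm implemented in MiCId: it assumes each identified peptide comes from exactly one of the taxa it maps to, estimates relative taxon abundances as mixture proportions, and uses a hypothetical function name `em_abundances` and toy input.

```python
def em_abundances(peptide_to_taxa, n_iter=200, tol=1e-10):
    """Estimate relative taxon abundances from peptides shared among taxa.

    peptide_to_taxa maps each identified peptide to the set of taxa whose
    proteomes contain it (assumed input format for this sketch).
    """
    if not peptide_to_taxa:
        raise ValueError("need at least one peptide")
    if any(len(taxa) == 0 for taxa in peptide_to_taxa.values()):
        raise ValueError("every peptide must map to at least one taxon")
    taxa = sorted({t for ts in peptide_to_taxa.values() for t in ts})
    # Start from uniform abundances.
    theta = {t: 1.0 / len(taxa) for t in taxa}
    n_peptides = len(peptide_to_taxa)
    for _ in range(n_iter):
        # E-step: split each peptide among its candidate taxa in proportion
        # to the current abundance estimates.
        counts = dict.fromkeys(taxa, 0.0)
        for candidates in peptide_to_taxa.values():
            z = sum(theta[t] for t in candidates)
            for t in candidates:
                counts[t] += theta[t] / z
        # M-step: new abundances are the normalized fractional counts.
        new_theta = {t: counts[t] / n_peptides for t in taxa}
        delta = max(abs(new_theta[t] - theta[t]) for t in taxa)
        theta = new_theta
        if delta < tol:
            break
    return theta


# Toy input: two peptides unique to taxon_A, one unique to taxon_B,
# and one shared by both.
toy = {
    "pep1": {"taxon_A"},
    "pep2": {"taxon_A"},
    "pep3": {"taxon_A", "taxon_B"},
    "pep4": {"taxon_B"},
}
abundances = em_abundances(toy)
```

For this toy input the likelihood is maximized when taxon_A has abundance 2/3, so the shared peptide is credited mostly to the taxon with more unique evidence instead of being pushed up to a common ancestor.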
Last year we systematically and comprehensively assessed six existing deep learning-based methods. This year we proposed our own spectrum prediction method, called FastSpel (fast spectral library). In terms of improving peptide identification, FastSpel performs comparably with the state-of-the-art methods while incurring only 1% of their computational cost. Another advantage of FastSpel is that its parameters are interpretable, unlike those of deep learning-based methods, which are notoriously difficult to interpret. In fact, analysis of the FastSpel parameters corroborated known fragmentation rules, such as the "proline effect", namely the promotion of certain fragments by proline via cleavage at its N-terminal side. Moreover, examining the model parameters suggested novel fragmentation patterns that could be experimentally and/or theoretically verified. This work was published in the Journal of Proteome Research in 2025.