Robust Accurate Identification of peptides from tandem mass spectrometry data
National Library Of Medicine
Abstract
Assigning statistical significance accurately has become increasingly important as metadata of many types, often organized in hierarchies, are constructed and combined for further biological analyses. Statistical inaccuracy in metadata at any level may propagate to downstream analyses, undermining the validity of the scientific conclusions drawn from them. For mass spectrometry-based proteomics, this implies that accurate statistics must first be obtained in peptide identification and then built upon if protein identification methods are to assign statistical significance accurately. However, although it has been studied intensively, the statistical accuracy of peptide and protein identification remains challenging. Many peptide identification methods search a database and assign an E-value to each peptide hit; however, the E-values reported by different methods do not agree with one another, and few of them, if any, agree with the textbook definition of the E-value. This hinders combining search results from different methods, particularly when one wishes to combine methods with user-assigned weights. When prior knowledge is available, it is often desirable to weight search methods differently before combining their search results. In earlier publications, we developed peptide identification methods with accurate statistical significance assignment, founded on an extension of the central limit theorem and on all possible peptide statistics; we also provided a way to combine search results democratically. When different weights are present, an instability arises if some of the weights are nearly degenerate; we have devised a mathematical framework that completely eliminates this instability. In 2015, we published a protein identification method that combines weighted P-values of evidence peptides. 
This new method solves the long-standing problem of precise type-I error control in protein identification. It also correctly reports the proportion of false discoveries, an indication of accurate type-II error control. In 2016, we published a new peptide significance assignment method based on extreme value statistics. The motivation for this work is to provide accurate peptide identification confidence for methods whose scoring functions cannot be expressed as a sum of independent contributions. In past years, we also worked on a large cross-institute project, involving scientists at NHLBI and the Clinical Center, on pathogen identification using mass spectrometry. The fundamental idea is to use each pathogen's peptidome to represent that pathogen. If the statistical significance assignment is accurate, mass spectrometry analysis makes it possible to correctly rank species and genera according to how similar their peptidomes are to the set of identified peptides. Again, we have to weight the evidence peptides associated with a given species or genus, because one peptide often maps to multiple species or genera. Over the past few years, we finished the first two phases of the study: identification of a single microbe and simultaneous identification of multiple microbes. Both results were published in the Journal of the American Society for Mass Spectrometry. In addition, we designed an analysis pipeline that requires minimal human intervention. In 2019 and 2020, we made substantial progress in the simultaneous identification of multiple microbes and the estimation of their protein biomass. This was made possible by introducing several new ideas into the analysis pipeline: taxon priors, ownership, participation ratios, and degree of independence. 
With these quantities properly computed, one can estimate each taxon's protein biomass contribution, decide how many taxa to keep in a taxa cluster, and split a sufficiently independent taxon off from a cluster. The last capability is important in alleviating the effects of aggressive clustering. The results were published in the Journal of the American Society for Mass Spectrometry in 2020. This year we focus on making our tools more accessible and on establishing international collaborations. In addition to writing and maintaining the graphical user interface program for our microbial identification and biomass estimation method, we have collaborated with the Camodro group at the University of Paris and the CUUG group at the University of Gothenburg in Sweden.
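
The instability with nearly degenerate weights mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as one plausible reading, that the combined statistic is a weighted product of independent, uniformly distributed P-values, and it uses the classical closed-form expression for that case, which is valid only when all weights are pairwise distinct. The function names `weighted_product_pvalue` and `fisher_pvalue` are hypothetical, and the code is not the framework the abstract refers to.

```python
import math


def weighted_product_pvalue(pvalues, weights):
    """Combined P-value for Q = prod(p_i ** w_i), assuming independent uniform p_i.

    Uses the closed form  P(Q <= q) = sum_i L_i * q ** (1 / w_i)  with
    L_i = w_i ** (n - 1) / prod_{j != i} (w_i - w_j).
    Assumption: all weights are positive and pairwise distinct; equal
    weights make a denominator zero and raise ZeroDivisionError.
    """
    n = len(weights)
    # Work with log(q) to avoid underflow when many small P-values are combined.
    log_q = sum(w * math.log(p) for p, w in zip(pvalues, weights))
    total = 0.0
    for i, w_i in enumerate(weights):
        denom = 1.0
        for j, w_j in enumerate(weights):
            if j != i:
                denom *= w_i - w_j  # tiny when two weights nearly coincide
        total += w_i ** (n - 1) / denom * math.exp(log_q / w_i)
    return total


def fisher_pvalue(pvalues):
    """Equal-weight limit (Fisher's method): tail of a Gamma(n, 1) variable."""
    n = len(pvalues)
    t = -sum(math.log(p) for p in pvalues)
    return math.exp(-t) * sum(t ** k / math.factorial(k) for k in range(n))
```

For clearly separated weights, such as `weighted_product_pvalue([0.04, 0.25], [1.0, 0.5])`, the closed form is well behaved. As two weights approach each other, the coefficients grow like the inverse of the gap and nearly cancel, so the result loses precision in floating-point arithmetic; in the exact equal-weight limit the expression must be replaced by a different one, such as `fisher_pvalue`. A stable method has to bridge these two regimes, and how the authors do so is not described in the abstract.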