Reproducibility statistics and machine learning methods for systematic phenotyping and model integration across animals, organs, and technologies

$3,339,048 · R01 · FY2025 · OD · NIH

Weill Medical Coll Of Cornell Univ, New York NY

Investigators

Abstract

Modern high-throughput biomedical research is collecting an ever-increasing number of variables spanning different physical scales, systems, and species. In this era of "multi-omics", multiple data sets are often collected on the same animal, spanning modalities that may include genomics, transcriptomics, proteomics, epigenomics, metabolomics, connectomics, and other "omics" domains, with each individual omics data set consisting of hundreds to hundreds of thousands of variables. These exciting developments in data acquisition have led animal model researchers into a new data paradigm, with many labs now relying on machine learning methods for dimensionality reduction, clustering, phenotyping, data integration, and similar tasks, assembled into pipelines as part of standard practice. Frequently in this scenario, the number of variables collected p (e.g., metabolites) becomes as large as (or larger than) the number of observations n (e.g., mice). In statistics, this is referred to as the Large Dimensional Limit (LDL) regime [2]. This dramatic increase in the number of variables relative to observations changes the statistical behavior of the data, leading to significant problems with machine learning model fitting that are not widely appreciated outside certain domains of mathematical statistics and physics. Estimates that should fundamentally describe the structure of data, such as the principal components, no longer converge to their true underlying values (as they do in the classical "small data" case when p is much smaller than n). Similar failures extend far beyond principal components to many other widely used machine learning approaches. Multi-stage analysis pipelines and complex animal data distributions further compound these problems, collectively resulting in a modern reproducibility crisis in animal research.
Here we show that these issues also broadly impact the test-retest reproducibility of machine learning methods on held-out data, with reproducibility on new data following a "universal reproducibility curve" in which performance changes rapidly from non-reproducible to reproducible as a function of sample size and number of variables. A direct consequence of this phase transition is that modern animal model research studies are subject to a reproducibility transition: above a certain value of p/n (too few observations for the number of variables), machine learning algorithms fail to yield reproducible results. Furthermore, for ratios of p/n below this value, collecting additional data yields rapidly diminishing marginal returns in reproducibility that may not be justified by data acquisition costs or trade-offs. Together these phenomena define what we call "universal reproducibility curves", which depend on the "aspect ratio" p/n of the data (as well as the strength of the multivariate biological signal that the data contain). Gaining a mechanistic understanding of such curves will allow us to systematically solve important open problems in machine learning, such as how to measure reproducibility across machine learning methods and pipelines, how many data samples to collect and how many variables to measure to ensure reproducibility, when to stop collecting data, and how to design better embedding algorithms for data integration and systematic phenotyping in real-world animal research models.
