Curation and analysis of publicly available, molecular profiles from people with Down Syndrome

$155,316 | R03 | FY2025 | HL | NIH

Brigham Young University, Provo UT

Investigators

Abstract

PROJECT SUMMARY Due to mandates from funding agencies and publishers, high-throughput molecular data from individuals with Down syndrome and controls (mostly humans and mice) are available in public repositories. Researchers can use such data to corroborate their own findings and pose new research questions. Doing so would help to leverage prior investments and complement efforts by the INCLUDE Data Coordinating Center (DCC) to generate data for new cohorts. Our proposal focuses specifically on mRNA expression and DNA methylation data. These data types shed light on how genes are regulated, how molecular aberrations lead to medical conditions, and how medical outcomes can be predicted, potentially leading to improved diagnostics, treatments, and insights into human health and disease.

However, many data-generation platforms are used for these data types, and researchers use a wide range of techniques for normalizing the data, checking data quality (if they check at all), and mapping to gene annotations. To reuse the data most effectively, the data must be reprocessed from their original form, normalized and quality-checked consistently, and mapped to current annotations. Agencies that manage public repositories lack the resources and expertise to perform these steps.

In our first aim, we will address this problem using a data-curation approach. We have identified 148 datasets specific to Down syndrome that we believe should be prioritized for reuse. Using our expertise in molecular-data processing and bioinformatics, we will re-normalize, quality-check, summarize, and annotate the data using an approach that maximizes consistency across all of the datasets. Additionally, we will map the metadata to biomedical-ontology terms in collaboration with the INCLUDE DCC. We expect that these efforts will reduce barriers for researchers in the Down syndrome community to reuse the data and accelerate progress in the field.

Our second aim focuses on interoperability. For many research questions, a single dataset is insufficient. Sample sizes may be small, or a single dataset may not represent the range of phenotypes or other factors necessary to answer a given question. Therefore, it is often crucial to integrate datasets from multiple sources. However, systematic differences between datasets are inevitable due to differences in populations, laboratory conditions, and environmental factors. Failing to adjust for these differences will likely lead to biased conclusions. We will evaluate the feasibility of making this adjustment with generative neural networks, a type of algorithm that is highly configurable and is behind many of the most influential artificial-intelligence advances of the past decade. We will apply these algorithms in the context of studying medical conditions that co-occur with Down syndrome, such as autoimmune conditions, dementia-related disease, congenital heart defects, and leukemias. Our algorithms will search for systematic patterns that differ between datasets and generate a modified version of the data in which those differences have been minimized yet the biologically relevant patterns have been retained.
