Novel methods for large-scale genomic interval comparison

$314,711 | R01 | FY2023 | HG | NIH

University of Virginia, Charlottesville, VA

Investigators

Linked publications & trials

Paper 39180401 Paper 39131817 Paper 38991851 Paper 38974799 Paper 38950178 Paper 38534537 Paper 37645717 Paper 37395102

Abstract

This administrative supplement creates AI/ML-ready resources for epigenome genomic interval data. Epigenome data summarized as sets of genomic intervals are now available for thousands of variations of cell type, disease, condition, etc. This data holds tremendous promise for understanding gene regulation and disease, because many health outcomes are affected by genetic variation or epigenetic perturbation in regulatory DNA. The parent R01 develops novel, scalable algorithms and measures of similarity between genomic interval datasets. These advances will improve both the efficiency and accuracy of existing biomedical research approaches that rely on analyzing genomic region data, and they will open the door to new ways of exploring the vast and growing corpus of genome interval data. In this administrative supplement, we seek to take this rich data source and produce AI/ML-ready resources for the community. While there has been some effort to create uniformly processed databases of genomic interval data, few high-quality genomic interval datasets designed for machine learning applications are currently available. One of the first steps in integrating epigenome data across data sources is defining consensus regions that fit the original data well. Many downstream analyses, particularly learning tasks, rely on such a consensus region set. However, choosing a good consensus can be a time-consuming and confusing process, and it can lose substantial information and introduce errors into results. To help alleviate this challenge, this proposal will take several datasets through a principled approach to generate AI/ML-ready resources. This process will include 1) defining consensus regions; 2) projecting raw data into the consensus to standardize it; and 3) standardizing annotation. Finally, we will make these resources available to the community through user-friendly and well-documented interfaces. The outcome will be a series of datasets that the community can use directly to build ML models.
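
To make steps 1 and 2 of the process above concrete, the following is a minimal Python sketch, not the project's actual method. It assumes the simplest possible consensus (the merged union of all intervals across region sets) and a binary projection (1 if a consensus region overlaps any interval in a sample). The abstract does not say how the consensus is chosen, so the function names `consensus_regions` and `project`, the BED-style half-open coordinates, and the toy data are all illustrative assumptions. Step 3, standardizing annotation, is not covered.

```python
from typing import List, Tuple

# Assumed representation: (chromosome, start, end), half-open as in BED files.
Interval = Tuple[str, int, int]


def consensus_regions(region_sets: List[List[Interval]]) -> List[Interval]:
    """Step 1 (assumed strategy): merge all intervals into a non-overlapping union.

    Intervals that overlap or touch end-to-start are merged into one region.
    """
    all_intervals = sorted(iv for rs in region_sets for iv in rs)
    merged: List[Interval] = []
    for chrom, start, end in all_intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def project(region_set: List[Interval], consensus: List[Interval]) -> List[int]:
    """Step 2 (assumed strategy): binary vector over the consensus regions.

    Entry i is 1 if any interval in region_set overlaps consensus region i.
    This quadratic scan is for clarity only; it would not scale to real data.
    """
    vector = []
    for chrom, c_start, c_end in consensus:
        hit = any(
            chrom == r_chrom and r_start < c_end and c_start < r_end
            for r_chrom, r_start, r_end in region_set
        )
        vector.append(int(hit))
    return vector


# Toy region sets (hypothetical data, not from the project).
sample_a = [("chr1", 100, 200), ("chr1", 500, 600)]
sample_b = [("chr1", 150, 250), ("chr2", 10, 50)]

consensus = consensus_regions([sample_a, sample_b])
vec_a = project(sample_a, consensus)
vec_b = project(sample_b, consensus)
```

After projection, every sample is a fixed-length vector over the same consensus regions, which is the form most learning tasks expect. The sketch also shows the information loss the abstract warns about: once `chr1:100-200` and `chr1:150-250` are merged into one consensus region, the two samples can no longer be told apart at that locus.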
