Sequence-to-function models for mechanistic investigations of personal genomes

$2,217,249 · R01 · FY2025 · HG · NIH

University of Washington, Seattle, WA

Investigators

Abstract

Recent advances in functional genomics, computational modeling, and hardware have ushered in a new era of sequence-to-function (S2F) deep learning models, allowing the direct prediction of functional outcomes from DNA sequences. We are poised to build comprehensive models of gene regulation and precisely predict the gene expression consequences of arbitrary genotypes. However, two significant challenges limit the potential of current models. First, current models cannot correctly predict the subtle gene expression consequences that arise from natural genetic variation. Second, current models make accurate predictions only for cellular contexts with abundant functional genomic training data and are not applicable to contexts important for many diseases, such as early developmental time points and difficult-to-isolate cell types. Our proposal aims to address these challenges and advance the field.

Aim 1 develops S2F models that are performant on the full spectrum of natural genetic variation. Current S2F models generalize across genomic regions but struggle to generalize across the diverse genotypes represented in personal genomes, because the resulting gene expression changes are subtle. We propose a novel and general learning framework that integrates allele-resolved functional genomic data, enhancing the model's ability to predict variant effects. This framework employs custom loss functions and learning algorithms for efficient utilization of personal genomes and allele-resolved datasets.

Aim 2 introduces modularized model architectures that improve generalization and adaptability, particularly in scenarios with limited data. Gene expression regulation involves complex processes, each with a distinct relationship to DNA sequence. We propose factorized models that use biological prior knowledge to constrain the model and decouple its parameters so that they correspond to distinct biological processes. Regularizing the model with known biological principles enhances generalization, yields interpretable predictions, and enables context-specific fine-tuning in data-limited settings.

Aim 3 applies these enhanced models to study the regulatory mechanisms underlying neurodevelopmental disease. Collaborating with domain experts, we will investigate how sequence variation affects gene expression in autism spectrum disorder (ASD) and schizophrenia (SCZ). ASD and SCZ involve abnormalities in brain development that may occur even before birth. Hence, functional data from the relevant cell types and developmental time points are limited. Here, we will apply our models to whole-genome sequencing data collected from SCZ and ASD patients to predict gene expression values across the genome, across brain cell types, and across early developmental time points. By associating the resulting imputed gene expression values with disease outcome, our approach will drastically reduce the multiple testing burden that currently hampers the investigation of rare non-coding variants in complex disease.

In summary, our methods will offer powerful computational tools to study and interpret personal genomes, enabling mechanistic investigation of the full spectrum of genetic variation in complex disease.
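
The multiple-testing argument in Aim 3 can be made concrete with a small calculation. The sketch below is illustrative only. It assumes a Bonferroni correction, which the abstract does not specify, and the counts `n_variants`, `n_genes`, and `n_contexts` are hypothetical placeholders, not figures from the proposal. It compares the per-test significance threshold when every rare variant is tested individually with the threshold when tests are run on imputed expression per gene and cellular context.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


alpha = 0.05

# Hypothetical counts, chosen only for illustration (not from the proposal).
n_variants = 10_000_000  # rare non-coding variants tested one at a time
n_genes = 20_000         # genes with imputed expression values
n_contexts = 10          # cell type x developmental time point combinations

# Variant-level testing: one test per variant.
per_variant = bonferroni_threshold(alpha, n_variants)

# Expression-level testing: one test per gene per cellular context.
n_expression_tests = n_genes * n_contexts
per_gene = bonferroni_threshold(alpha, n_expression_tests)

print(f"variant-level threshold:    {per_variant:.1e}")
print(f"expression-level threshold: {per_gene:.1e}")
print(f"fold reduction in tests:    {n_variants / n_expression_tests:.0f}x")
```

Under these assumed counts, aggregating variant effects into gene-level expression predictions leaves far fewer hypotheses to test, so each test faces a less stringent threshold. The real reduction depends on the cohort, the variant filters, and the correction method actually used.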

View original record on NIH RePORTER →