Interpretable machine learning methods for the analysis of Alzheimer's disease genetics

$762,636 · R01 · FY2025 · AG · NIH

Stanford University, Stanford CA

Abstract

Alzheimer's disease (AD) is the most common cause of dementia among people over the age of 65, affecting an estimated 5.5 million Americans. It is now well established that drug trials supported by genetic evidence are more likely to succeed. However, the genetic markers identified by current genome-wide association studies (GWAS) explain only a small fraction of the heritability of late-onset AD (8-17% out of 56-79%). Among them, the causal variants and their effects are known for only a minority of cases. To date, over 400 million genetic variants in the human genome have been sequenced. The causal variants tend to be sparsely spread across the genome, with multiplex nonlinear effects on AD. This search space is far larger than current analytical and experimental approaches can analyze with adequate power. Machine learning (ML) approaches, including deep learning/neural networks, can efficiently learn linear and nonlinear relationships and have been used successfully in many scientific problems. Meanwhile, recent advances in genome sequencing provide an exciting opportunity to apply ML methods to the genetic data analysis of AD. However, when ML is applied to genetic studies, it is generally difficult to quantify how changes to the genetic variants influence the disease outcome. Although explainable artificial intelligence (XAI) methods have been developed to improve the interpretability of ML and to quantify the relative importance of input features (e.g., genetic variants), there has been little to no development of rigorous control of the error rate of selected features, a property critical for reproducible science but less studied in existing XAI methods. The objective of this proposal is to develop rigorous feature selection in ML methods and pair them with causal inference to discover causal genetic variants of AD that could lead to novel targets for the development of new AD therapies.
At the Stanford Alzheimer's Disease Research Center, we have curated a database that combines large-scale genetic and multi-omics datasets. The proposed methods will be applied to genetic data from roughly 500,000 samples harmonized across ADGC, ADSP and UKBB. The findings will be validated using functional experiments, single-cell RNA-seq data and proteomics data. We expect that applying the proposed methods will significantly improve our understanding of the multiplex nature of AD and, critically, provide a credible set of well-defined, novel targets for the development of genomics-driven therapies.
