CRII: III: Self-Supervised Graph Neural Network Meta-Learning for Cancer Multi-Omics and Driver Discovery
Oakland University, Rochester MI
Investigators
Abstract
Cancer is the second leading cause of death worldwide, killing more than 8 million people every year. Cancer is caused by driver mutations, which are changes in the DNA sequence of genes that lead to the development of abnormal cells that divide uncontrollably and can infiltrate and destroy normal body tissue. One main goal of cancer research has been the discovery of all the driver mutations across cancer types. However, cancer driver discovery is very challenging as each person’s cancer has a unique combination of mutations, among which only a few are driver mutations while the vast majority are passenger mutations that do not promote abnormal cell growth. Existing cancer driver discovery methods can detect drivers that are common among many tumors but fail to detect rare drivers that only occur in very few tumor cases. This project will apply state-of-the-art machine learning approaches to identify the remaining elusive rare drivers. The identified driver mutations may provide critical information for planning treatment to stop cancer cells from growing, including drugs that target a specific driver mutation. The acquired knowledge and framework from this project will contribute to saving numerous lives by providing the capability to better manage cancer treatments.
One major barrier for applying advanced machine learning approaches, such as deep neural network models, to cancer driver discovery is the lack of large-scale high-quality labeled training data suitable for supervised learning. To address this challenge, this project will combine Graph Neural Network (GNN), Self-Supervised Learning, and Meta Learning techniques to integrate biological domain knowledge with large-scale heterogeneous data (specifically known as multi-omics data in this community) for cancer driver discovery. First, a GNN model will be constructed based on a unified knowledge graph representing biological domain knowledge. Incorporating domain knowledge into the model as a form of inductive bias will help train the model effectively with less labeled data. Second, self-supervised learning will be employed to pre-train the GNN model on multi-omics data, also reducing the need for labeled data. Meanwhile, the learned node and edge embeddings for the GNN model can be treated as high-level transferrable features, removing heterogeneity and noise while facilitating cross-dataset integration and meta learning. Third, meta learning will be applied to a pan-cancer dataset comprising dozens of cancer types to increase model generalizability and detect novel drivers across cancer types. The project will further improve the results of existing pan-cancer integrative analysis of dozens of cancer types and may lead to a repeatable process for tackling other difficult biological problems through integrating heterogeneous data and knowledge sources with machine learning.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
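As a rough illustration of the first two steps named in the abstract (a GNN built on a knowledge graph, then self-supervised pre-training without labels), the following is a minimal NumPy sketch, not the project's actual method. It assumes a toy four-gene graph, random node features standing in for multi-omics measurements, a single basic graph-convolution layer, and a masked-feature reconstruction objective; all function names, shapes, and the choice of objective are hypothetical and chosen only for illustration.

```python
import numpy as np

def normalize_adjacency(adj):
    """Add self-loops and apply symmetric degree normalization."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, features, weights):
    """One message-passing step: aggregate neighbors, linear map, ReLU."""
    return np.maximum(a_norm @ features @ weights, 0.0)

def masked_reconstruction_loss(a_norm, features, w_enc, w_dec, mask):
    """Hide the features of masked nodes and reconstruct them from neighbors.

    No labels are used: the hidden features themselves are the training target.
    """
    corrupted = features.copy()
    corrupted[mask] = 0.0
    embeddings = gcn_layer(a_norm, corrupted, w_enc)
    reconstructed = a_norm @ embeddings @ w_dec
    diff = reconstructed[mask] - features[mask]
    return float(np.mean(diff ** 2)), embeddings

# Hypothetical toy knowledge graph: four genes connected in a chain.
adj = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))   # stand-in for per-gene multi-omics features
w_enc = rng.normal(size=(3, 2))      # untrained encoder weights
w_dec = rng.normal(size=(2, 3))      # untrained decoder weights
mask = np.array([False, True, False, False])

a_norm = normalize_adjacency(adj)
loss, embeddings = masked_reconstruction_loss(a_norm, features, w_enc, w_dec, mask)
print("embedding shape:", embeddings.shape)
print("reconstruction loss:", loss)
```

In a real setting the weights would be optimized to lower this loss across many masked nodes, and the resulting embeddings could then serve as the transferable features that a later meta-learning stage adapts to individual cancer types.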