CRII: III: Learning to Integrate Heterogeneous Data from Disparate Sources for Disease Subtyping

$174,987 | FY2020 | CSE | NSF

Wayne State University, Detroit MI

Abstract

Patients suffering from the same cancer often not only experience a high degree of symptomatic variability but also respond significantly differently to the same treatment. As a result, many cancers are over-diagnosed, causing patients to receive unnecessary and costly cancer treatments, while some patients do not receive the treatment they need. Both problems can be greatly reduced with targeted treatments that result in greater efficacy and fewer debilitating or dose-limiting side effects. Hence, the discovery of patient subgroups and/or disease subtypes that differ in cancer progression will help better tailor treatments with reduced lethality and improve the quality of life by avoiding over-treatment. Studies have shown that, in addition to genetic factors, clinical history, hormonal exposure, lifestyle, and epidemiologic factors may play an important role in the onset and progression of cancers. Hence, clinically relevant disease subtypes cannot be identified by analyzing data from a single source alone, i.e., only the genetic composition. This project will build innovative machine learning technologies to integrate knowledge collected from multiple repositories, covering different cohorts and multiple data types (e.g., genomic, clinical, and epidemiologic), for a more accurate and robust discovery of disease subgroups and/or patient subtypes. Its activities will thus advance the field of multi-type data integration from disparate sources with different cohorts. Anticipated results will greatly benefit society by reducing health care costs and improving patient care by distinguishing patients who are at higher risk and need the most aggressive treatments from those who will never progress, recur, or develop resistance to treatments.
The interdisciplinary nature of the project will further the education of undergraduate and graduate students through cross-disciplinary training that bridges the fields of computer science, biology, and engineering. To meet these goals, the project will investigate a novel, clinically relevant disease subtyping system that flexibly integrates the collective information available through multiple studies with different cohorts and heterogeneous data types, using innovative deep learning models and algorithms. Here, the investigator hypothesizes that if the information in partially coupled data from disparate sources is integrated in a meaningful way that is statistically sound and robust, then analysis of the integrated data will lead to a more unified picture and global view of the system, and thus to a more accurate and robust discovery of disease subtypes. Existing data fusion approaches face challenges such as uniqueness, interpretability, high dimensionality of the feature space, and linearity assumptions. This project will develop the theory, algorithms, and implementation of a deep machine learning technique capable of discovering the knowledge salient to the learning task during the integration of disparate datasets. It is therefore expected to overcome challenges such as the high dimensionality of genomic datasets. The project would result in a valuable tool with broad implications and utility in cancer research. The findings will not only provide useful models to identify patient subgroups and/or disease subtypes, but will also result in a valuable precision medicine resource for the wider scientific community working on other diseases. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
