Leveraging large language models and knowledge graphs on clinical, pathological, and sequencing data to inform precision cancer therapy

$299,999 · P30 · FY2023 · CA · NIH

Sloan-Kettering Inst Can Research, New York NY

Abstract

Project Summary: Precision medicine and targeted therapy are emerging domains in cancer biology that aim to incorporate individual-level clinical, pathological, and genomic profiles to tailor treatment strategies for cancer patients. Several precision oncology knowledge bases, such as OncoKB and My Cancer Genome, have been established to democratize clinical decision-making through expert curation of the biological and clinical significance of alterations using publicly available resources. These knowledge bases, while extremely powerful, have limitations, including the restricted scope of annotated genes and alterations and the difficulty of identifying precise therapies for specific combinations of a patient's genomic and clinical profiles.

In this proposal, we plan to develop new computational methodologies that integrate (i) the broad range of implicit cancer knowledge accrued by Large Language Models (LLMs) with (ii) the explicit structured clinical, pathological, and genomic knowledge derived from cancer patients in the Memorial Sloan Kettering Cancer Center's (MSKCC) Clinical Sequencing cohort and the AACR Project GENIE cohort. This integration will be further reinforced by expert curation, with the aim of predicting combinations of genomic alterations and clinical or pathological profiles that can be matched to a specific cancer therapy. The goal of this research is to develop computational models fundamentally anchored around knowledge graphs and LLMs to bridge the gap between clinical and functional risk factors of cancer and cancer therapeutics, and to inform and enhance personalized therapies.

The first aim of this proposal is to develop a knowledge graph, MSK-CancerKG, based on patient-specific clinical, pathological, and genomic alteration information from more than 100,000 patients in the MSKCC Clinical Sequencing cohort and the AACR Project GENIE cohort. This multi-relational knowledge graph will integrate a wide spectrum of clinical features associated with each patient and features abstracted from the pathology reports of patient-derived tumor samples, along with a comprehensive characterization of genomic alterations and the implicated genes.

The second aim is to fine-tune pre-trained LLMs using the structured, detailed, and more reliable cancer-specific knowledge from MSK-CancerKG. We will benchmark these fine-tuned models against four state-of-the-art pre-trained language models, ultimately deriving an optimized combined predictive model, named MSK-CancerLLM. Benchmarking will measure the accuracy of clinical, alteration, and treatment predictions on held-out patient data.

The third aim is to further fine-tune MSK-CancerLLM using clinical practice guidelines and feedback on model output from cancer domain experts. The resulting model will be integrated into an AI chatbot, called MSK-Assistant, which will connect the backend model to a frontend chat interface. As with the ChatGPT application, this will allow the research community to ask questions about cancer biology, personalized drug recommendations, and therapeutic interventions.

View original record on NIH RePORTER →