Leveraging large language models and knowledge graphs on clinical, pathological, and sequencing data to inform precision cancer therapy

$299,999 · P30 · FY2023 · CA · NIH

Sloan-Kettering Inst Can Research, New York NY

Investigators


Abstract

Project Summary: Precision medicine and targeted therapy are emerging domains in cancer biology that aim to incorporate individual-level clinical, pathological, and genomic profiles to tailor treatment strategies for cancer patients. Several precision oncology knowledge bases, such as OncoKB and My Cancer Genome, have been established to democratize clinical decision-making through expert curation of the biological and clinical significance of alterations using publicly available resources. These knowledge bases, while extremely powerful, have limitations, including the scope of annotated genes and alterations and the difficulty of identifying precise therapies for specific combinations of a patient's genomic and clinical profiles. In this proposal, we plan to develop new computational methodologies that will integrate (i) the broad range of implicit cancer knowledge accrued by Large Language Models (LLMs) with (ii) the explicit structured clinical, pathological, and genomic knowledge derived from cancer patients in the Memorial Sloan Kettering Cancer Center's (MSKCC) Clinical Sequencing cohort and the AACR Project GENIE cohort. This will be further reinforced by expert curation, with the aim of predicting combinations of genomic alterations and clinical or pathological profiles that can be matched to a specific cancer therapy. The goal of this research is to develop computational models fundamentally anchored around knowledge graphs and LLMs to bridge the gap between clinical and functional risk factors of cancer and cancer therapeutics, and to inform and enhance personalized therapies.

The first aim of this proposal is to develop a knowledge graph, MSK-CancerKG, based on patient-specific clinical, pathological, and genomic alteration information from more than 100,000 patients in the MSKCC Clinical Sequencing cohort and the AACR Project GENIE cohort. This multi-relational knowledge graph will integrate a wide spectrum of clinical features associated with each patient and abstracted features from the pathology reports for the patient-derived tumor samples, along with a comprehensive characterization of genomic alterations and the implicated genes.

The second aim is to fine-tune pre-trained LLMs using the structured, detailed, and more reliable cancer-specific knowledge from MSK-CancerKG. We will benchmark these fine-tuned models against four state-of-the-art pre-trained language models, ultimately deriving an optimized combined predictive model, MSK-CancerLLM. The benchmarking step will assess clinical, alteration, and treatment prediction accuracy on held-out patient data.

The third aim is to further fine-tune MSK-CancerLLM using clinical practice guidelines and feedback on model output from cancer domain experts. The resulting model will be integrated into an AI chatbot, called MSK-Assistant, to facilitate seamless interaction between the backend model and a frontend chatbot interface. Like the ChatGPT application, this will allow the research community to ask questions about cancer biology, personalized drug recommendations, and therapeutic interventions.
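To make the idea of a multi-relational knowledge graph from the first aim concrete, the following minimal Python sketch flattens one patient record into (head, relation, tail) triples and renders each triple as a plain-text statement of the kind that could feed language-model fine-tuning. It is an illustration under stated assumptions, not the MSK-CancerKG design: the record layout, the relation names (`has_sample`, `has_alteration`, and so on), the `KnowledgeGraph` class, and the helper functions are all hypothetical, and the record itself is synthetic placeholder data.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One typed edge of the graph: head --relation--> tail."""
    head: str
    relation: str
    tail: str


def patient_to_triples(patient: dict) -> list[Triple]:
    """Flatten one patient record (hypothetical schema) into typed edges."""
    pid = f"patient:{patient['id']}"
    triples = [Triple(pid, "has_cancer_type", f"cancer:{patient['cancer_type']}")]
    for sample in patient["samples"]:
        sid = f"sample:{sample['id']}"
        triples.append(Triple(pid, "has_sample", sid))
        # Features abstracted from the pathology report of this sample.
        for feature in sample["pathology"]:
            triples.append(Triple(sid, "has_pathology_feature", f"path:{feature}"))
        # Each alteration links both to its sample and to the implicated gene.
        for gene, alteration in sample["alterations"]:
            aid = f"alteration:{gene}_{alteration}"
            triples.append(Triple(sid, "has_alteration", aid))
            triples.append(Triple(aid, "in_gene", f"gene:{gene}"))
    for drug in patient["treatments"]:
        triples.append(Triple(pid, "received_treatment", f"drug:{drug}"))
    return triples


class KnowledgeGraph:
    """Tiny in-memory multi-relational graph indexed by (head, relation)."""

    def __init__(self) -> None:
        self._out: dict[tuple[str, str], set[str]] = defaultdict(set)

    def add(self, triples: list[Triple]) -> None:
        for t in triples:
            self._out[(t.head, t.relation)].add(t.tail)

    def neighbors(self, head: str, relation: str) -> list[str]:
        # .get avoids creating empty entries for unknown keys.
        return sorted(self._out.get((head, relation), set()))


def triple_to_text(t: Triple) -> str:
    """Render a triple as a simple statement (one possible serialization)."""
    return f"{t.head} {t.relation.replace('_', ' ')} {t.tail}"


# Synthetic placeholder record; no real patient, gene, or drug is implied.
record = {
    "id": "P001",
    "cancer_type": "example_type",
    "samples": [
        {
            "id": "S001",
            "pathology": ["example_feature"],
            "alterations": [("GENE_A", "variant_1")],
        }
    ],
    "treatments": ["drug_x"],
}

triples = patient_to_triples(record)
kg = KnowledgeGraph()
kg.add(triples)

if __name__ == "__main__":
    for t in triples:
        print(triple_to_text(t))
```

Storing edges keyed by (head, relation) keeps typed lookups such as "which alterations does this sample carry" to a single dictionary access. A production graph at the scale described would more plausibly live in a dedicated graph store.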
