ClinGen AI Data Delivery Supplement
Broad Institute, Inc., Cambridge MA
Abstract
PROJECT SUMMARY The National Library of Medicine's ClinVar repository is the world's most robust source of curated clinical variant assessments. It contains classifications of over 2 million variants, 96% of which come from clinical laboratories and ~5,000 of which were curated by the Clinical Genome Resource's (ClinGen) expert panels. In addition, ClinGen maintains a database of expert-curated gene-disease relationships, which are accessible through ClinGen's website and are also deposited in the Gene Curation Coalition (GenCC) database, which accepts submissions from any source and contains nearly 17,000 claims on ~4,700 genes. Further, ClinGen has developed open-source interfaces for both variant and gene curation that collect the detailed information needed for classification. While the data from these sources are highly valuable for the accurate identification of disease-causing variation, in their current state they have some important limitations. Because these sources function as repositories, many of the laboratory submissions to ClinVar are outdated, and some of the gene-disease submissions to GenCC and variant submissions to ClinVar are missing evidence, used outdated classification frameworks, or misapplied current ones. This can lead to discrepancies in classifications across laboratories, decreasing the utility of the resources. While efforts are underway to resolve discrepancies by having laboratories reclassify discordant interpretations, this work is time-consuming and therefore slow to scale.
The parent ClinGen award's specific aims are: (1) develop and implement standards to support clinical annotation and classification of genes and variants, (2) share genomic and phenotypic data between clinicians, researchers, and patients through enhanced knowledge bases for clinical and research use, (3) enhance and accelerate expert review of the clinical relevance of genes and variants, and (4) disseminate and integrate ClinGen knowledge and resources to the broader community. The work proposed here is within the scope of the parent grant, though it was not part of the original plan, and will support all of these aims by developing methods and software to transform ClinVar, ClinGen, and GenCC data extracts into formats that machine learning algorithms can readily consume. Our aims are threefold. We will: (1) develop formats that are easily digestible by traditional machine learning approaches, (2) make it easier to use these data in generative AI approaches, and (3) evaluate the ability of large language models (LLMs) to extract data used in variant classification and to write accurate, robust variant evidence summaries. We will also implement the GA4GH Variant Representation Specification (VRS) to lay the groundwork for integration with other genomic data sources.