Transforming dbGaP genetic and genomic data to FAIR-ready by artificial intelligence and machine learning algorithms
University Of Texas Hlth Sci Ctr Houston, Houston TX
Abstract
dbGaP is a repository for NIH-funded projects and contains a large amount of genetic and genomic data. However, these data are not ready for AI and machine learning applications. This application proposes methods to address this issue. We have two aims: 1) Develop and standardize procedures to transform genetic and genomic data into image-like objects and tokenized custom vocabularies so that the data can be used by advanced AI algorithms such as CNNs, autoencoders, and transformers. To transform genetic data into an image, we recode allele dosage values as pixel intensities and arrange a collection of genetic markers, such as SNPs and CNVs, into an artificial image object that can be analyzed by CNN algorithms. Genetic markers can also be used to define haplotypes, which can be tokenized into custom vocabularies for use in NLP models. 2) Use Alzheimer's disease (AD) and schizophrenia as case studies to demonstrate the utility of the transformed data for discovering and identifying risk variants and genes for both conditions. We plan to impute genetically controlled gene expression using brain-specific eQTLs and individual genotypes for an AD dataset, and to transform the expression data into image objects for analysis by a CNN model with a self-attention mechanism. For schizophrenia, we plan to use a k-mer tokenizer to break haplotypes into collections of small haplotype blocks and treat them as tokens for analysis by NLP models. We use both the CNN and NLP models as screening tools to select promising candidates based on their attention weights, and then directly test these candidates for association with AD or schizophrenia using logistic regression. Because this selection step dramatically reduces the number of tests, it significantly increases our statistical power to detect risk variants and genes for AD and schizophrenia.
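The two transformations described in Aim 1 can be illustrated with a minimal Python sketch. It is not the project's implementation. The function names (`dosages_to_image`, `kmer_tokens`, `build_vocab`), the row-major marker layout, the zero padding, and the encoding of a haplotype as a string of 0/1 alleles are all assumptions made for illustration; the abstract does not specify them.

```python
from math import ceil


def dosages_to_image(dosages, width, pad=0.0):
    """Arrange per-marker allele dosages (0..2) into a 2-D grid of pixel
    intensities in [0, 1]. Row-major ordering is an assumption."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Scale dosage (0, 1, 2, or a fractional imputed value) to [0, 1].
    pixels = [d / 2.0 for d in dosages]
    height = ceil(len(pixels) / width)
    # Pad the last row so the image is rectangular.
    pixels += [pad] * (height * width - len(pixels))
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


def kmer_tokens(haplotype, k, stride=1):
    """Split a haplotype string into k-mer blocks (overlapping when
    stride < k). The string encoding of alleles is hypothetical."""
    return [haplotype[i:i + k]
            for i in range(0, len(haplotype) - k + 1, stride)]


def build_vocab(token_lists):
    """Assign an integer id to each distinct token, in order of first use."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


# Toy inputs, made up for illustration only.
image = dosages_to_image([0, 1, 2, 2, 1], width=3)
tokens = kmer_tokens("01101", k=3)
vocab = build_vocab([tokens, kmer_tokens("1101", k=3)])
```

In practice the image would be passed to a CNN as a tensor, and the token ids to an embedding layer. Both the marker ordering and the choice of k are design decisions that would need tuning.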