Genomic Technology Development for Diverse Populations
Morehouse School of Medicine, Atlanta, GA
Abstract
The human reference genome was derived from 20 people, with one individual contributing most of the genetic information. This linear reference does not represent the structural variation of the human population. Pangenome projects have attempted to address this missing variation, but the value of pangenome references for whole-transcriptome RNA-sequencing (RNA-seq) analysis is not clear, nor is their impact on transcriptome data from populations of different racial/genetic ancestry. Here we will perform a benchmarking project in which we compare mapping statistics for a cohort of RNA-seq data aligned to various reference genome constructs: GRCh38; the telomere-to-telomere reference genome (T2T-CHM13); reference genomes from the Human Pangenome Reference Consortium (HPRC), both as individual genomes and as a combined graph reference; and de novo assembled individual reference genomes generated by long-read sequencing. The goal of this project is to understand the tradeoff between individualized and population-based references with respect to computational burden and mapping accuracy. We will specifically ask whether populations with European vs. African genetic ancestry benefit from more personalized references. We hypothesize that reference genomes with related ancestry composition will improve the accuracy of RNA-seq mapping and our estimation of gene function. To test this hypothesis, we propose the following Specific Aims:
SPECIFIC AIM ONE: Test the utility of reference genomes by comparing alignment metrics for short- and long-read RNA-seq data aligned to the GRCh38 reference genome and the T2T reference genome (CHM13).
SPECIFIC AIM TWO: Determine the utility of prior ancestry estimation for selecting, from among the HPRC reference genomes (currently 47), the pangenome reference that most improves the quality of short- and long-read RNA-seq alignments relative to the GRCh38 reference genome.
SPECIFIC AIM THREE: Measure the QC metrics of short- and long-read RNA-seq data aligned to the GRCh38 reference genome compared to individualized reference genomes.
This project will provide training opportunities for faculty, postdocs, and students at MSM to enhance workforce training in genomics. It will help determine whether assembling an individual's reference genome, or using population-based reference genomes, enhances the information available in RNA-seq datasets.
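The comparison described in the aims could be tabulated along the following lines. This is only a minimal sketch and not the investigators' pipeline: the record fields (`reference`, `ancestry`, `mapped`, `total`) and the function names are hypothetical, and mapping rate is assumed here as one representative QC metric. The per-sample read counts would have to come from each alignment's own summary statistics.

```python
from collections import defaultdict
from statistics import mean


def mapping_rate(mapped_reads, total_reads):
    """Fraction of reads that aligned; 0.0 if the sample has no reads."""
    if total_reads <= 0:
        return 0.0
    return mapped_reads / total_reads


def mean_rate_by_group(records):
    """Average mapping rate per (reference, ancestry) pair.

    `records` is an iterable of dicts with the hypothetical keys
    'reference', 'ancestry', 'mapped', and 'total', one per
    sample-by-reference alignment.
    """
    rates = defaultdict(list)
    for rec in records:
        key = (rec["reference"], rec["ancestry"])
        rates[key].append(mapping_rate(rec["mapped"], rec["total"]))
    return {key: mean(values) for key, values in rates.items()}


def gain_over_baseline(group_means, baseline="GRCh38"):
    """Difference in mean mapping rate between each reference and the
    baseline reference, computed within each ancestry group."""
    gains = {}
    for (reference, ancestry), rate in group_means.items():
        base_key = (baseline, ancestry)
        if reference == baseline or base_key not in group_means:
            continue
        gains[(reference, ancestry)] = rate - group_means[base_key]
    return gains
```

Comparing gains within each ancestry group, rather than pooling all samples, keeps the question of whether one population benefits more from a given reference separate from overall differences between references.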