GTEx engagement with the CFDE-CC and other DCCs towards building a data ecosystem spanning the Common Fund projects

$779,157 · OT2 · FY2023 · OD · NIH

Broad Institute, Inc., Cambridge MA

Abstract

Detailed Engagement Plan

The National Institutes of Health Common Fund’s Genotype-Tissue Expression (GTEx) project was launched in 2010 with the goal of providing the scientific community with a resource for studying human gene expression and regulation across multiple tissues, specifically to provide insights into the mechanisms of gene regulation and disease-related perturbations, and to further our understanding of the role that inherited genetic variation plays in susceptibility to complex diseases. The project enrolled 960 recently deceased adult donors and collected close to 49,000 tissue samples. Core data generation was completed at the end of 2017, with the primary data types including whole genome (WGS, 30X) and whole exome (WES, 100X) sequence data on all donors, and RNA-seq data from at least 25,000 samples spanning 53 human tissues/organs. This dataset constitutes the largest multi-tissue RNA-seq data resource generated to date (a previous study of genetic effects on gene expression, TwinsUK/EUROBATS, generated ~2,700 RNA-seq samples from four accessible tissue sites). The GTEx resource also includes a rich and well-annotated collection of donor, sample, and experiment metadata. Furthermore, additional molecular data types, aimed at enhancing the core data sets, are still being produced, including mass spectrometry-based proteomics, measurements of DNA methylation, histone marks (ChIP-seq), somatic DNA sequencing, and DNase I hypersensitivity sites.

The GTEx resource includes both protected-access and open-access data (Fig. 1). The protected-access data include extensive sample, subject, and technical metadata and raw sequence BAM files from RNA-seq, whole genome (WGS) and whole exome (WES) sequencing, ChIP-seq, and m6A RNA-seq, as well as protected data derived from these, such as genotype calls in VCF format.
An approved dbGaP application is required to obtain all protected-access data, including access to the raw sequence data, which are accessible on the AnVIL platform (on Google Cloud Platform; GCP). The GTEx data also include a large amount of open-access data, such as gene and transcript expression quantifications, cis- and trans-expression and splicing QTLs, histology images of every tissue, some eGTEx data summaries, the sample biobank, and a very limited set of de-identified sample and subject metadata. All of these public data are available for download, and as interactive visualizations and summary tables on the GTEx portal.

The GTEx project has developed an extensive suite of tools and analysis pipelines that have been benchmarked, optimized, and implemented in GCP for the project (such as the RNA-seq alignment, quantification, and QC pipeline, and the QTL analysis pipeline). These pipelines were also selected by the TOPMed project to produce a harmonized resource of RNA-seq data across the large number of cohorts being sequenced for that project (>20,000 samples to date); our team was involved in initial benchmarking and harmonization tests of our pipeline across TOPMed sequencing centers and remains actively involved in ongoing data production and analyses. Moreover, very similar pipelines are used by the ENCODE project, which facilitates comparisons across large datasets that would be prohibitive in cost and computational resources without harmonized pipelines. We have also developed numerous visualizations specifically for the open-access data on the GTEx portal.

The GTEx project has a very large user community: the GTEx data have the second-largest number of Data Access Requests for protected data in dbGaP (behind TCGA), and GTEx is the most frequently downloaded dbGaP project.
An even larger number of users access the data, tools, and interactive visualizations on the GTEx portal: in the 2019 calendar year, the GTEx portal had 135,000 users worldwide (~12,000-18,000/month), with usage spiking in October 2019 following the release of the V8 data. The GTEx consortium has published numerous papers describing the dataset and analyses of the data, and two additional data releases are still planned.

View original record on NIH RePORTER →