Statistical Methods For Annotating Repetitive Genomic Regions Through ENCODE-deri

$404,900 · U01 · FY2013 · HG · NIH

University Of Wisconsin-Madison, Madison WI

Investigators

Sunduz Keles (contact), Emery H. Bresnick, Colin Noel Dewey

Abstract

DESCRIPTION (provided by applicant): The ENCODE projects have generated a wealth of high-quality genomic datasets through the application of high-throughput next-generation sequencing (NGS) to create a catalog of functional elements in the human and model organism genomes. Although the NGS technologies embraced by ENCODE enable interrogation of genomes in an unbiased manner, the data analysis efforts of the ENCODE projects have thus far focused on mappable regions of the genomes and have therefore not leveraged these data to their full advantage. A major bottleneck to a comprehensive understanding of data from the ENCODE projects is the lack of statistical and computational methods that can identify functional elements in repetitive regions. We will address this critical impediment in four specific aims by building on our expertise in ChIP-seq and RNA-seq analysis. In Aim 1, we will develop probabilistic models and accompanying software for utilizing reads that map to multiple locations on the genome (multi-reads) from multiple types of *-seq datasets (ChIP-, DNase-, MeDIP-, and FAIRE-seq). This will enable cataloging of regulatory elements in repetitive regions. In Aim 2, we will improve the specificity of the discoveries in repetitive regions from our probabilistic models by utilizing multiple related *-seq datasets simultaneously. Specifically, we will devise methods to supervise analysis of ChIP- and RNA-seq datasets by external ChIP-seq datasets. This will facilitate accurate inference for repetitive elements with near-identical sequences (e.g., segmental duplications and long interspersed nuclear elements) and boost the accuracy of gene and isoform quantification with RNA-seq. In Aim 3, we will focus on identifying co-occupied/enriched regions to infer cell-specific modules of regions/genes and their regulatory profiles. We will also develop a formal differential co-enrichment framework to study cell-specific wiring and interactions of regulatory factors. This will elucidate how interactions among regulatory factors vary across cells/tissues/conditions. In Aim 4, we will apply our methods from Aims 1-3 to relevant ENCODE data to understand GATA factor functions in hematopoiesis and vascular biology. The GATA system in human and mouse will serve as a training and validation platform for our methods. Statistical and computational resources generated from the project, which will be disseminated as modular and robust software, will help to enhance and maximize the impact of ENCODE-derived data on the biomedical research community.
