Population Structure, Admixture, and Selection across the 1000 Genomes Data Set

$436,083 · U01 · FY2011 · HG · NIH

Stanford University, Stanford CA

Investigators

Carlos D. Bustamante (contact), Andrew G. Clark

Linked publications & trials

Paper 26432245, Paper 26110529, Paper 24836452, Paper 24568772, Paper 24385924, Paper 24379384, Paper 23979573, Paper 23468652, Paper 23151206, Paper 23128226, Paper 22792059, Paper 22654671, Paper 22582263, Paper 22491189, Paper 22456605, Paper 21935354, Paper 21803767, Paper 21775991, Paper 21753830

Abstract

DESCRIPTION (provided by applicant): The 1000 Genomes Project (TGP) has tremendous potential to answer fundamental questions in human population genetics and shape the future design of medical genomic studies. Key to realizing this potential is the development of efficient, robust, and powerful computational methods for analysis of the copious amounts of data generated by the project. Here, we propose novel approaches for characterizing population structure, analyzing patterns of admixture, and localizing signatures of selection across the 2,000 samples of the TGP. Our project has three primary aims. First, we will construct detailed models of human demographic history based on the TGP. To accomplish this, we will develop approaches for analyzing the joint allele frequency spectrum of rare and common SNPs, copy number variants (CNVs), and haplotypes across all the populations being surveyed. Having full sequence data will render these approaches dramatically better at making inferences about the recent past, where distortions in frequency spectra are particularly important for testing associations with rare variants. Second, we will characterize patterns of population structure and admixture in the four Hispanic/Latino and three African-American TGP samples. The TGP presents a tremendous opportunity for catalyzing population and medical genomics research for these important and understudied ethnic minority groups. We will develop novel statistical genomic approaches for reconstructing the genetic history of admixed populations and apply these methods to the TGP samples. Our methods will be tailored for short-read sequence data and will leverage the trio design of the sampling. Third, we will detect signatures of balancing, purifying, and positive selection in the full TGP data set. We will develop software tools to integrate signatures of natural selection based on a new approach that uses numerical methods to fit a diffusion approximation to the multi-dimensional site frequency spectrum. This approach allows identification of distortions caused by positive, balancing, or negative selection. The method is especially well suited to low-coverage short-read sequence data. These inferences will be integrated with maps of GWAS hits to accelerate discovery of disease-associated variants.

RELEVANCE: Medical genetics research provides a vehicle for uncovering the heritable basis of complex disease. The 1000 Genomes Project is an international effort to sequence the genomes of approximately 2,000 diverse human subjects. We propose to analyze these data in order to characterize differences among genomes and catalyze medical and population genomic research throughout the world.
