EAGER: Haplotype Phasing Algorithms and Clark Consistency Graphs

$199,999 · FY2011 · CSE · NSF

Brown University, Providence RI

Investigators

Abstract

Currently, most single-nucleotide polymorphism (SNP) assay data come from genome-wide association studies (GWAS), which identify a number of individuals carrying a disease or trait and compare them to individuals who do not carry it or are not known to carry it. Both sets of individuals are then genotyped for a large number of SNP variants, which are tested for association with the disease or trait. GWAS involving tens of thousands of individuals are becoming the norm in associating genetic variants with disease. These studies generally pool large amounts of genome-wide data from multiple studies, for a combined total of tens of thousands of individuals in a single meta-analysis. It can be expected that hundreds of thousands, if not millions, of individuals will soon be studied for association with a single disease or trait. Although SNPs are the most abundant form of variation between two individuals, other forms exist, such as copy-number variation: large-scale chromosomal deletions, insertions, and duplications. These variations, which have been shown to be influential factors in many diseases, are not probed by current SNP array technology. An emerging trend in accounting for "missing heritability" is the study of "parent-of-origin" effects, in which genetic variants confer risk only when inherited from a specific parent. Long-range haplotype phasing is key to associating a haplotype pattern with its specific parent of origin.
The premise of our research is that long shared tracts are unlikely unless the haplotypes are identical by descent (IBD), in contrast to short shared tracts, which may be merely identical by state (IBS). A difficult algorithmic challenge is tract finding in the genotype matrix of a sample of m people genotyped at n SNPs. It is estimated that 2 million individuals must be genotyped to apply such a long-range phasing algorithm to the U.S. population. The algorithmic strategies proposed here suggest that the combinatorial structure of Clark Consistency Graphs can provide the basis for powerful algorithms that will decrease this number substantially. The primary output of this project will be new long-range phasing software, documentation, and source code, all of which will be made immediately and continually available to the scientific community as open source for research and education.
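To make the tract-finding problem concrete, the following is a minimal sketch, not the project's algorithm. It assumes genotypes are coded 0/1/2 (the count of one allele at each SNP) and uses a common simplification: two individuals cannot share a haplotype at a SNP where they are opposite homozygotes (0 versus 2). The function names `compatible_tracts` and `consistency_edges`, the `min_len` threshold, and the resulting graph are hypothetical illustrations; the graph is only a simplified stand-in for a Clark Consistency Graph.

```python
from itertools import combinations


def compatible_tracts(g1, g2, min_len):
    """Return maximal half-open runs [start, end) of SNP indices in which
    g1 and g2 are never opposite homozygotes, keeping runs of at least
    min_len SNPs. Assumes 0/1/2 genotype coding and min_len >= 1."""
    tracts = []
    start = 0
    for j, (a, b) in enumerate(zip(g1, g2)):
        if {a, b} == {0, 2}:  # opposite homozygotes: no haplotype can be shared here
            if j - start >= min_len:
                tracts.append((start, j))
            start = j + 1
    if len(g1) - start >= min_len:
        tracts.append((start, len(g1)))
    return tracts


def consistency_edges(genotypes, min_len):
    """Build a graph on individuals (rows of the m x n genotype matrix):
    an edge joins two individuals that have at least one long compatible
    tract. Maps each edge (i, k) to its list of tracts."""
    edges = {}
    for i, k in combinations(range(len(genotypes)), 2):
        tracts = compatible_tracts(genotypes[i], genotypes[k], min_len)
        if tracts:
            edges[(i, k)] = tracts
    return edges


# Toy genotype matrix: m = 3 individuals, n = 6 SNPs.
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 2, 2, 1, 1, 0],
    [2, 1, 0, 2, 1, 0],
]
edges = consistency_edges(genotypes, min_len=4)
```

This pairwise scan takes time proportional to m² · n, which is impractical for the sample sizes discussed above; it only illustrates the objects involved. A compatible tract is also a necessary condition for sharing, not proof of IBD, so the length threshold would in practice need to be chosen against the expected length of IBS tracts.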
