Development of statistical genetics methodology
National Human Genome Research Institute
Abstract
A major project of this section is the development of new statistical genetics methodology, prompted by the needs of our applied studies, together with the testing and comparison of novel and existing statistical methods. We continue to develop machine learning methods for genome-wide association studies (GWAS) and for analyses of whole exome sequence data, and to explore their utility, particularly with respect to power and the detection of gene-gene and gene-environment interactions. We previously published a study that combined GWAS genotype data from the Framingham Heart Study data repository with computer-simulated trait data, which allowed us to show that these methods may be able to detect interaction effects in suitably powered studies. In the past, we evaluated the power of several of these methods in whole exome sequence data from the 1000 Genomes Project, using computer-simulated phenotypes, as part of Genetic Analysis Workshop 17 (GAW17). We published several papers on data mining in the GAW17 data in late 2011. We have published a paper showing that our novel recurrency method in Random Forests appears to distinguish variables of high importance from those of low importance better than other current methods. We have also used this recurrency approach to detect low-quality SNVs in whole exome and whole genome sequence data and have applied it to GAW19 data. Ongoing studies have also shown that this method can detect epistatic interactions in the absence of main effects in simulated genetic data; these results have been presented at several scientific meetings. We have further developed and tested a limited permutation method that allows estimation of false positive rates in conjunction with our recurrency approach.
Simulations further suggest that our new recurrency method is powerful in multiple situations, controls false positives, and detects epistatic interactions with more power than parametric methods when there are no main effects. We have developed and released a software package, r2VIM, which is available on Dr. Bailey-Wilson's website for broad access, and have published three papers describing this method. We are currently developing extensions of r2VIM. In this reporting period we have further improved control of false positive rates while retaining excellent power by adding multistep, weighted approaches that we are calling recurrent weighted replanting (RWR). These approaches increase power when the number of features is very large and there are only interaction effects on risk of disease, or when such interactions exist in the presence of large numbers of other additive polygenic risk variants. Simulations confirm this excellent power and false positive control under a wide variety of realistic genome-wide scenarios. A manuscript presenting the RWR method is under review, and results have been presented at scientific meetings. We have also been developing and testing a new approach, Entanglement Mapping (EM), for identifying specifically which selected features are interacting with each other as opposed to acting independently. A paper presenting this work has recently been accepted for publication, and the work has also just been presented at a machine learning conference. The EM software package has recently been posted on Dr. Bailey-Wilson's NHGRI website and the NHGRI software page. The RWR code will be posted soon. The massive number of simulations for this project relied heavily on the Biowulf computational resource at NIH. We also published a review paper on machine learning methods for identifying important features that contribute to good predictions from a machine learning model (1).
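The idea behind a recurrency rule for variable importance can be illustrated with a small sketch. This is not the r2VIM implementation; it assumes, purely for illustration, that importance scores from several independent Random Forest runs are already available, that each run is rescaled by the magnitude of its smallest observed importance (treated as a rough noise level), and that a feature is kept only if it passes a threshold in every run. The function names and the threshold value `factor` are hypothetical.

```python
def relative_importances(importances):
    """Rescale one run's importance scores by the magnitude of the smallest score.

    Assumption for illustration: the smallest importance in a run reflects noise,
    so scores are expressed as multiples of its absolute value.
    """
    floor = abs(min(importances.values()))
    if floor == 0:
        raise ValueError("smallest importance is zero; cannot rescale")
    return {name: score / floor for name, score in importances.items()}


def recurrent_selection(runs, factor=3.0):
    """Keep only the features whose relative importance is >= factor in every run."""
    selected = None
    for importances in runs:
        relative = relative_importances(importances)
        passing = {name for name, score in relative.items() if score >= factor}
        selected = passing if selected is None else selected & passing
    return selected or set()


# Hypothetical importance scores from two forest runs on the same data.
runs = [
    {"snp1": 9.0, "snp2": 1.0, "snp3": -2.0},
    {"snp1": 8.0, "snp2": 7.0, "snp3": -1.0},
]
selected = recurrent_selection(runs, factor=3.0)
```

Requiring a feature to pass in every run is what makes the rule "recurrent": a feature that looks important in one forest by chance, such as `snp2` in the second run above, is dropped unless it recurs.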
We have also been developing a novel method to analyze matched case-control or case-parent trio data using Random Forests. Combining results from a large number of classification trees gives a flexible solution for analyzing matched datasets. A paper presenting some of this work, along with an applied analysis of oral cleft GWAS data, has been published (Li et al., 2015). Work to implement this method efficiently for large-scale genomic data is ongoing, and additional manuscripts are in development. We have developed novel tools for analysis and interpretation of whole exome sequence (WES) and whole genome sequence (WGS) data, including strategies for combining linkage and sequence results, various schemes for collapsing rare variants in genes and gene networks to improve the power of sequence analysis, and methods for integrating sequence analyses with existing genomics databases. Development of these analysis methods and tools is ongoing, driven by our own WES and WGS data from multiple studies of complex traits. We have recently updated our sequence data quality assurance pipeline and our scripts that automate two-point linkage analysis (parametric and non-parametric) of WES and WGS data. We have worked on optimizing methods for performing multipoint analyses using extremely dense WES, WGS and exome chip data sets. Also this year, we continued to improve our pipelines for applying family-based methods to quality control of WGS data. In collaboration with Dr. Ruzong Fan (a Guest Researcher who is a Professor at Georgetown University) and Dr. Chi-Yang Chiu (a Guest Researcher who is a faculty member at University of Tennessee Health Sciences Center), we have contributed to the development of new generalized functional linear models for gene-based tests of both quantitative and qualitative traits, as well as mixed effects models.
These new methods have been shown to be more powerful than other gene-based tests while retaining good control of false positive rates. We have published multiple papers in this area in previous years. In this reporting period, we have published three papers reporting on novel gene-based association analysis methods (2-4) and have another paper under review. This work, along with the extensive simulation studies showing its good power and false positive control, relied heavily on the Biowulf computational resource at NIH. We began a new collaboration with Dr. Kathleen Vazzana and Laura Lewandowski at NIAMS, who had performed transmission disequilibrium test (TDT) analysis using whole exome sequence data on a moderately sized sample of lupus trios, each consisting of two parents and an affected child. This study used both single-variant and gene-based association tests. We developed a permutation-based approach to determine the significance of the results of these analyses, using 10,000 permuted data replicates that we then analyzed in the same manner as the original TDT analyses. We used the p-values from the permuted datasets (generated under the null hypothesis of no association) to determine new significance thresholds for the applied analyses and to evaluate the empirical significance of the most significant results. This is being done separately for the single-variant and gene-based tests. A manuscript presenting this work is in preparation.
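The permutation approach for the trio analyses can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study. It assumes each family is summarized per variant as a pair (transmitted, untransmitted) of minor-allele counts from heterozygous parents. It generates each null replicate by swapping those two counts for a whole family with probability 1/2, which keeps the correlation between variants within a family. It derives the threshold from the largest statistic in each replicate, which is equivalent to using the smallest p-value when every test has the same reference distribution. All function names are hypothetical.

```python
import random


def tdt_chi2(transmitted, untransmitted):
    """McNemar-type TDT statistic: (b - c)^2 / (b + c), or 0 with no informative parents."""
    total = transmitted + untransmitted
    return (transmitted - untransmitted) ** 2 / total if total else 0.0


def variant_stats(families):
    """Sum the (transmitted, untransmitted) counts over families, per variant."""
    stats = []
    for j in range(len(families[0])):
        t = sum(family[j][0] for family in families)
        u = sum(family[j][1] for family in families)
        stats.append(tdt_chi2(t, u))
    return stats


def permute(families, rng):
    """Null replicate: swap transmitted and untransmitted counts per family (assumed scheme)."""
    return [
        [(u, t) for (t, u) in family] if rng.random() < 0.5 else family
        for family in families
    ]


def max_stat_null(families, n_perm, seed=0):
    """Sorted list of the largest statistic observed in each permuted replicate."""
    rng = random.Random(seed)
    return sorted(max(variant_stats(permute(families, rng))) for _ in range(n_perm))


def threshold(null_max, alpha=0.05):
    """Statistic value at the (1 - alpha) quantile of the sorted null maxima."""
    index = min(len(null_max) - 1, int((1 - alpha) * len(null_max)))
    return null_max[index]


def empirical_p(observed, null_max):
    """Empirical p-value, with +1 in numerator and denominator so it is never zero."""
    hits = sum(1 for s in null_max if s >= observed)
    return (hits + 1) / (len(null_max) + 1)


# Toy data: two families, two variants each, as (transmitted, untransmitted) counts.
families = [[(1, 0), (0, 1)], [(1, 0), (1, 0)]]
observed = variant_stats(families)
null_max = max_stat_null(families, n_perm=200)
cutoff = threshold(null_max)
```

With real data, `n_perm` would be set to the number of replicates (10,000 in the text above), and the same procedure would be run separately for single-variant and gene-based statistics so that each gets its own threshold.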