Methods development for "Omics" data

$879,602 · ZIA · FY2021 · ES · NIH

National Institute Of Environmental Health Sciences

Investigators

Linked publications & trials

Paper 38872937 Paper 38344436 Paper 36909352 Paper 36638125 Paper 35786392 Paper 35192692 Paper 32966559 Paper 31236709 Paper 30779729

Abstract

There are a number of challenges in conducting proper variable selection and statistical inference when working across high-dimensional biological data. My methods development work this year has focused on improving computational and statistical approaches for genome-wide association studies (GWAS), high-throughput metabolomics data, and integrative genomics studies. With my recently graduated student Dr. Tao Jiang, we developed a new upper bound for the regularization parameter in sparse group Lasso, based on an estimated lower bound, with confidence, of the proportion of false null hypotheses. The bound is estimated from the empirical distribution of dependent or independent p-values from single-marker/variable analysis, using a second-level significance test, the higher criticism statistic. An upper bound of the tuning parameter in Lasso is then chosen to correspond to the lower bound of the proportion of false null hypotheses. Thus, the tuning range is narrower because the upper bound of the tuning parameter is lower. The final set of non-zero estimates will contain more variables, so the power of the modified GWAS is higher than or equal to that of the original sparse group Lasso. We demonstrate the performance of our method using both simulation experiments and a real data application in lipid trait genetics from the Action to Control Cardiovascular Risk in Diabetes clinical trial. An R package was developed. Current applications of knockoff methods use linear regression models and conduct variable selection only for variables that appear in the model function. In a project with Dr. Tao Jiang and Dr. Yuanyuan Li, we extended the use of knockoffs to machine learning with boosted trees, which are successful and widely used in problems where no prior knowledge of the model function is required. We developed a novel strategy for conducting variable selection without prior knowledge of model topology, using the knockoff method with boosted tree models.
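As a rough illustration of the selection step that knockoff procedures share, the sketch below applies the standard knockoff+ threshold to a vector of statistics W, where each W_j is assumed to be the importance of an original variable minus the importance of its knockoff copy (for example, taken from a boosted tree model). The function names, the example values, and the target false discovery rate are hypothetical and are not taken from the project; how W is computed for tree models is not shown here.

```python
from typing import List, Sequence


def knockoff_threshold(w: Sequence[float], q: float, offset: int = 1) -> float:
    """Return the smallest t among the nonzero |W_j| whose estimated false
    discovery proportion is at most q. offset=1 gives the knockoff+ rule."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxies for false positives
        positives = sum(1 for x in w if x >= t)   # variables that would be selected
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")  # no threshold meets the target, so select nothing


def select_variables(w: Sequence[float], q: float) -> List[int]:
    """Indices of variables whose statistic reaches the knockoff+ threshold."""
    t = knockoff_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical statistics: large positive values suggest real signal,
# values scattered around zero suggest noise.
W_EXAMPLE = [5.0, 4.2, 3.9, 3.1, 2.8, 2.5, -0.4, 0.3, -0.2, 0.1, 2.2, 1.9]

if __name__ == "__main__":
    print("threshold:", knockoff_threshold(W_EXAMPLE, q=0.2))
    print("selected indices:", select_variables(W_EXAMPLE, q=0.2))
```

Because the rule depends only on the signs and magnitudes of W, the same step applies whether the importance scores come from a linear model or from a tree ensemble.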
We extended the current knockoff method to model-free variable selection through the use of tree-based models. We tested and compared these methods with the original knockoff method regarding their ability to control type I errors and their power. In simulation tests, we compared the properties and performance of importance test statistics of tree models. Combination drug therapy has been a mainstay of cancer treatment for decades and has been shown to reduce host toxicity and prevent the development of acquired drug resistance. Therefore, it is crucial to develop computational approaches to predict drug synergy and guide experimental design for the discovery of rational combinations for therapy. With my student Jun Ma, we developed a new deep learning approach to predict synergistic drug combinations by integrating gene expression profiles from cell lines and chemical structure data. Specifically, we use principal component analysis to reduce the dimensionality of the chemical descriptor data and gene expression data. We then propagate the low-dimensional data through a neural network to predict drug synergy values. The use of dimension reduction dramatically decreases the computation time without losing accuracy. Additionally, with my recently graduated PhD student Dr. Jun Ma, we worked on developing an approach that addresses challenges in modeling nonlinear dose-response relationships. Nonlinear dose-response relationships exist extensively in the cellular, biochemical, and physiologic processes that are affected by varying levels of biological, chemical, or radiation stress. Therefore, we propose the use of an evolutionary algorithm (EA) for dose-response modeling across a range of potential response model functional forms.
This new method can not only fit the most commonly used nonlinear dose-response models (e.g., exponential models and 3-, 4-, and 5-parameter logistic models) but also select the best model if no model assumption is made, which is especially useful for high-throughput curve fitting. An R package implementing the method was developed. Ongoing projects build on methods for detecting gene-environment interactions, using variance QTLs to prioritize single nucleotide polymorphisms for detecting gene-gene interactions. Additionally, Dr. Ziyue Wang is developing new normalization approaches for microbiome data.
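To make the dose-response fitting idea above more concrete, here is a minimal sketch of a simple elitist evolutionary algorithm fitted to a 4-parameter logistic curve. It is not the project's R package: the parameterization, the mutation scheme, the parameter bounds, and the synthetic data are all assumptions chosen for illustration, and only one model form is fitted, so the model selection step described in the abstract is not covered.

```python
import math
import random
from typing import List, Sequence, Tuple

Params = Tuple[float, float, float, float]  # (a, b, c, d)


def four_pl(x: float, p: Params) -> float:
    """4-parameter logistic: a = response at zero dose, d = response at very
    high dose, c = dose at the midpoint, b = slope. Assumes x >= 0 and c > 0."""
    a, b, c, d = p
    return d + (a - d) / (1.0 + (x / c) ** b)


def sse(p: Params, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sum of squared errors between the curve and the observations."""
    return sum((four_pl(x, p) - y) ** 2 for x, y in zip(xs, ys))


def evolve_fit(xs, ys, generations=150, pop_size=40, sigma=0.2, seed=0):
    """Toy evolutionary search (assumed design): keep the best quarter,
    refill with uniform crossover plus Gaussian mutation."""
    rng = random.Random(seed)
    y_lo, y_hi = min(ys), max(ys)
    y_span = (y_hi - y_lo) or 1.0

    def clip(p: Params) -> Params:
        # Hypothetical bounds that keep the slope and midpoint in a safe range.
        a, b, c, d = p
        return (a, min(max(b, 0.1), 10.0), min(max(c, 1e-3), 1e3), d)

    def mutate(p: Params) -> Params:
        a, b, c, d = p
        return clip((a + rng.gauss(0.0, sigma) * y_span,
                     b * math.exp(rng.gauss(0.0, sigma)),
                     c * math.exp(rng.gauss(0.0, sigma)),
                     d + rng.gauss(0.0, sigma) * y_span))

    population = [clip((rng.uniform(y_lo, y_hi), rng.uniform(0.5, 3.0),
                        rng.uniform(1e-3, max(xs)), rng.uniform(y_lo, y_hi)))
                  for _ in range(pop_size)]
    best_per_generation: List[float] = []
    for _ in range(generations):
        population.sort(key=lambda p: sse(p, xs, ys))
        best_per_generation.append(sse(population[0], xs, ys))
        elite = population[: pop_size // 4]  # survivors are kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            mom, dad = rng.choice(elite), rng.choice(elite)
            child = tuple(rng.choice(genes) for genes in zip(mom, dad))
            children.append(mutate(child))
        population = elite + children
    population.sort(key=lambda p: sse(p, xs, ys))
    best_per_generation.append(sse(population[0], xs, ys))
    return population[0], best_per_generation


# Synthetic, noiseless data generated from made-up parameters.
DOSES = [0.0, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0]
TRUE_PARAMS = (100.0, 1.5, 10.0, 5.0)
RESPONSES = [four_pl(x, TRUE_PARAMS) for x in DOSES]

if __name__ == "__main__":
    best, history = evolve_fit(DOSES, RESPONSES)
    print("best parameters:", best)
    print("initial and final SSE:", history[0], history[-1])
```

Because the elite individuals survive each generation unchanged, the best error never increases. Supporting other functional forms would mean running the same search for each candidate model and comparing the fits with a criterion of the analyst's choosing.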
