Adaptation of New Statistical Ideas for Medicine

$224,250 · R37 · FY2010 · EB · NIH

Stanford University, Stanford CA

Abstract

Our MERIT award work will continue to have two main components: involvement in specific biomedical research projects such as NHLBI's FEHGAS study, and development of new statistical methods appropriate for the analysis of large, complex data sets. These efforts are complementary, with the specific projects suggesting which statistical methods are most needed, and also serving as test cases for new methodology. The FEHGAS study, for example, seeks to predict age of onset of hypertension from SNP data (and background variables such as age and gender). There are 550,000 SNPs available for prediction, most of which will turn out to be useless, making the problem an order of magnitude more challenging than in expression microarray situations. Efron plans to extend the empirical Bayes methodology from his recent paper to this context, hopefully overcoming the difficulties caused by the usually weak predictive power of individual SNPs. Olshen plans to extend CART (Classification and Regression Trees) and bootstrap methodology to the selection of groups of promising predictive SNPs.

Large-scale significance testing, for instance selecting 'significant' genes in a microarray cancer study, has become an area of intense statistical development. Nevertheless, crucial questions of appropriate implementation remain vague in the literature: the choice of an appropriate null hypothesis; the selection of a comparison set (should all 550,000 SNPs be tested together or separately by chromosome?); and the effects of correlation. We have made some headway in answering these questions, as described in the Progress Report. Our continuing efforts are a combination of methodological implementation and theoretical development.

Correlation can have particularly drastic effects on standard statistical techniques. In "Are a set of microarrays independent of each other?" it is shown that a study involving 20,000 genes has its effective sample size reduced to about 17 because of severe gene-wise correlation. We are currently developing diagnostic methods to spot correlation difficulties in massive data sets, and to assess their effects on hypothesis tests, estimates, and predictions. A 20,000 gene microarray study produces about 200,000,000 pairwise correlations, which sounds oppressively large for practical insight. But we are making progress on an empirical Bayes approximation that summarizes correlation effects in a single number, suitable for simple analysis.

Twentieth Century biostatistical applications were overwhelmingly frequentist in nature. Pure frequentism, though, becomes impractical for analyzing the large, complex data sets produced by modern biomedical devices, where the relationships of thousands of parameters and millions of data points have to be considered together. We are continuing to develop empirical Bayes methods that allow Bayesian ideas to be brought to bear on questions of multiple inference, without requiring specific prior distributions from the scientist. A long-term project is to understand how quickly empirical Bayes information accrues in a medical study. A False Discovery Rate is an estimate of the Bayes posterior probability that a gene (or a SNP, or a voxel) is 'null', given the observed data. How many subjects and how many genes do we need to observe in order to get an accurate empirical Bayes estimate of the posterior probability?

In our own version of Moore's law, biomedical data sets have increased an order of magnitude in size every few years since the 1990s. Emerging technologies (tiling arrays, bead arrays, aptamer chips, methylation arrays, exon chips, and a variety of new imaging devices) promise further increases, taxing both computational equipment and statistical methodology. Our long-term MERIT goal is to provide algorithms and theory appropriate to massive-data biomedical requirements.
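
Two quantitative statements in the abstract can be made concrete with a small Python sketch: the number of pairwise correlations among 20,000 genes, and the reading of a False Discovery Rate as an estimated posterior probability of being null. This is an illustration under stated assumptions, not the methodology proposed in the project. It uses one common tail-area form, Fdr(z) = pi0 * P0(Z >= z) / Fhat(Z >= z), takes the null distribution to be the theoretical N(0,1), defaults the null proportion pi0 to 1 (a conservative choice), and treats the test statistics as independent, which is exactly the assumption the correlation work above calls into question. All function names and the simulated z-scores are hypothetical.

```python
import math
import random


def pair_count(n_genes: int) -> int:
    """Number of distinct gene pairs, i.e. pairwise correlations."""
    return n_genes * (n_genes - 1) // 2


def normal_upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal Z (assumed theoretical null)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tail_area_fdr(z_scores, cutoff, pi0=1.0):
    """Tail-area Fdr estimate: pi0 * null tail area / empirical tail area."""
    n_exceed = sum(1 for z in z_scores if z >= cutoff)
    if n_exceed == 0:
        # Nothing exceeds the cutoff; return 1.0 by convention (an assumption).
        return 1.0
    empirical_tail = n_exceed / len(z_scores)
    return min(1.0, pi0 * normal_upper_tail(cutoff) / empirical_tail)


def simulate_z_scores(n_null=9500, n_signal=500, shift=3.0, seed=0):
    """Hypothetical data: mostly null N(0,1) scores plus a few shifted ones."""
    rng = random.Random(seed)
    nulls = [rng.gauss(0.0, 1.0) for _ in range(n_null)]
    signals = [rng.gauss(shift, 1.0) for _ in range(n_signal)]
    return nulls + signals


if __name__ == "__main__":
    print(pair_count(20_000))  # 199990000, i.e. about 200 million
    z = simulate_z_scores()
    for cutoff in (1.0, 2.0, 3.0):
        print(cutoff, round(tail_area_fdr(z, cutoff), 3))
```

In the simulated example the estimate falls as the cutoff rises, because cases far in the tail are less likely to be null. With correlated statistics the estimate can be considerably more variable than this independent-case sketch suggests.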
