Monte Carlo and Quasi-Monte Carlo Methods for Statistics

$659,759 · FY2009 · MPS · NSF

Stanford University, Stanford CA

Investigators

Abstract

This project has two major components. The first is an investigation of visualization and analysis methods for high-dimensional data sets, with a focus on categorical variables whose number of unique levels is comparable to the total sample size. Examples of such variables include search query strings, ISBNs, song titles, author names, URLs, genotypes, environments, and customer ID numbers. The visualization methods are designed to show broad trends and to highlight anomalies. The inferential methods are of the sample reuse type: the bootstrap and cross-validation. New methods are necessary here because the data sets have complicated interlocking patterns that invalidate any IID sampling assumptions. The second component aims at better statistical inference by improving the numerical methods on which it relies. This includes calibration of empirical likelihood methods to get better coverage and to extend confidence regions for the mean beyond the convex hull of the data points. It also includes the embedding of quasi-Monte Carlo sampling methods into Markov chain Monte Carlo algorithms to combine the accuracy of the former with the wide applicability of the latter.

Exploratory data analysis of categorical variables is useful for seeing broad patterns, including small groups of customers that have similar tastes for a small list of songs, books, or movies. It is also useful for identifying anomalies that may indicate abusive behavior, including cyber-attacks and what is commonly called spam in the online context. One of the original motivations for the sample reuse methods is in crop science. In some of those problems, a large number of plant varieties (genotypes) are grown under many different environmental conditions. A statistical model is used to determine which varieties to use in each environment. Earlier statistical methods were based on assumptions that do not fit this setting, and they often did not select the best model. New methods from this project may therefore be used to select better models, which then result in increased production of food and fiber. The empirical likelihood work is basic research aimed at removing unnecessary mathematical assumptions from statistical models in order to widen their applicability. The Monte Carlo sampling component of the project is basic research on a computational technique used extensively in physics as well as in Bayesian statistical inference.
