Nonparametric Location and Scatter Functionals

$173,000 · FY2005 · MPS · NSF

Massachusetts Institute Of Technology, Cambridge MA

Investigators

Abstract

Given n independent observations, each with distribution P, one can form an empirical measure by taking the average of point masses at the observations. For a functional T of P, evaluating T at the empirical measure gives an estimator of T(P), which converges to T(P) almost surely if T is suitably continuous at P. The project will continue a search for functionals continuous at as many P's as possible: possibly all P in one dimension, but in higher dimensions only an open dense set. If T is also differentiable at P in a sufficient sense, the estimators will be asymptotically normal and converge to T(P) at a rate of 1 over the square root of n.

Functionals of location include the mean mu when it is finite. In one dimension, another location functional is the median m, and scale functionals include the standard deviation sigma and the median absolute deviation. The classical mu and sigma can be undefined or infinite at laws whose tails decrease too slowly at infinity. For some laws P there is an interval of medians, whose midpoint gives a unique choice of m, but m is discontinuous at such P. Simultaneous maximum likelihood estimation of location and scale for t distributions with degrees of freedom nu larger than 1 extends to location and scale functionals defined and continuous at every probability distribution on the line, and infinitely differentiable at distributions having no atom as large as nu over nu plus one. On multidimensional spaces, the square of a scale parameter is replaced by a scatter matrix analogous to a covariance matrix, and continuity and differentiability will be sought on a dense open set of distributions, via t functionals and by other methods.

Location and scale or scatter functionals are some of the most basic in statistics. But by far the two best-known functionals of location, the mean and the median, each have serious drawbacks. Using the mean, one assumes in effect that all data are correct; the mean can be overly influenced by outlying, extreme observations such as gross errors. Using the median, on the other hand, guards against the possibility that nearly half the data may be incorrect. If a distribution has more than half its probability in an interval, the median must be in that interval and may very poorly represent the rest of the distribution. Functionals combining some of the advantages of the mean and median while avoiding the worst drawbacks of either can and should be more widely studied and used. Then one can have kinds of averages that guard against the more realistic possibility that some small fraction of the data may be incorrect.
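As an illustration only, the contrast between the mean, the median, and a t location functional can be sketched by evaluating each at the empirical measure of a small sample containing one gross error. The sketch below is not taken from the project: the function name `t_location_scale`, the choice nu = 3, the starting values, and the sample `data` are all assumptions made for the example, and the fixed-point (EM-style) iteration is a standard way to compute the t maximum likelihood estimate that may differ from the methods the project uses.

```python
import statistics


def t_location_scale(xs, nu=3.0, iters=500, tol=1e-12):
    """Hypothetical helper: location and scale of the t_nu maximum
    likelihood fit to the empirical measure of xs, computed by a
    standard EM-style fixed-point iteration (nu assumed known, nu > 1)."""
    n = len(xs)
    # Start from the median and the median absolute deviation.
    mu = statistics.median(xs)
    mad = statistics.median([abs(x - mu) for x in xs])
    sigma2 = mad * mad if mad > 0 else 1.0
    for _ in range(iters):
        # Observations far from mu (relative to sigma) get small weights.
        w = [(nu + 1.0) / (nu + (x - mu) ** 2 / sigma2) for x in xs]
        new_mu = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        new_sigma2 = sum(wi * (x - new_mu) ** 2 for wi, x in zip(w, xs)) / n
        if new_sigma2 <= 0.0:
            # Degenerate sample (for example, all points equal): stop.
            mu = new_mu
            break
        converged = abs(new_mu - mu) < tol and abs(new_sigma2 - sigma2) < tol
        mu, sigma2 = new_mu, new_sigma2
        if converged:
            break
    return mu, sigma2 ** 0.5


# Made-up sample: nine readings near 10 and one gross error.
data = [9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3, 10.0, 9.9, 1000.0]

sample_mean = statistics.mean(data)      # pulled far from 10 by the outlier
sample_median = statistics.median(data)  # unaffected by the single outlier
t_mu, t_sigma = t_location_scale(data)   # weighted average, outlier downweighted

print(sample_mean, sample_median, t_mu, t_sigma)
```

In this sketch the single bad reading drags the mean above 100, while the median and the t location both stay close to 10. Unlike the median, the t location is a smoothly weighted average of all the observations, which is the kind of compromise between mean and median that the abstract describes.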
