Nonparametric Outlyingness and Descriptive Measures in Multivariate and General Data Settings

$107,438 | FY2008 | MPS | NSF

University of Texas at Dallas, Richardson, TX

Investigators

Abstract

This project extends foundations in two closely interactive areas of core statistical science: nonparametric outlier identification and robust descriptive measures. Multivariate and more complex data types are emphasized. Data points apart from the main body ("outliers") can adversely affect statistical analyses unless identified and taken into account. This concern is arising in new contexts calling for updated and broadened formulations of current methods. Multivariate modeling with heavy-tailed data and with skewness and kurtosis descriptive measures in addition to location and dispersion involves increased concern with outliers. Diverse new data types (functional, shape, image, set, symbolic, sensor, stream, tree, graph, etc.) being treated by sophisticated but ad hoc computer science data mining approaches need more systematic treatment. Shape-fitting problems in computational geometry impose new forms of outlier issues. The study develops new general foundational approaches to outlier detection, eliminates reliance on algorithms that only handle outliers without actually identifying them in the input space, eliminates undue reliance upon elliptical outlyingness contours, and strengthens the accommodation of heavy-tailed data. The overall project goals are to establish extended conceptual statistical foundations for outlier detection and to develop new structures for robust descriptive measures of location, dispersion, skewness, kurtosis, etc., with the aim of broad application across general data settings.

With advancing computational resources, the scope of statistical data analysis and modeling is widening to accommodate pressing new arenas of application. Data in all areas of science and engineering has complex multidimensional structure, typically with large sample sizes and involving curves, images, text, and other objects, often within a stream or network structure. This is generating major new problems in detection and handling of "anomalous" data points ("outliers"). Which cases stand apart? How do the "unusual" cases impact statistical analyses on the full data set? What computational steps efficiently find the outliers when the data is massive and involves many variables? What general principles apply across diverse new situations such as fraud detection, intrusion detection, network analysis, and data mining? How should "outlier" be defined relative to a fusion of several related data sets, for example image, text, and sensor data, as might arise in Homeland Security? This study addresses these basic practical questions with the aim of developing new methodological approaches soundly based upon established statistical principles.
