Geometry Aware Exploratory Data Analysis and Inference Methods for Complex Data
University of Southern California, Los Angeles, CA
Abstract
Complex big data appear routinely in the sciences and have become standard fare in contemporary data science. Data that live in metric spaces are known to be difficult to analyze, because such spaces lack fundamental vector space operations like addition and scalar multiplication and provide no ordering between the data elements. Such data show up as samples of histograms, networks, images, phylogenetic trees, and so on, in a multitude of fields such as health monitoring, neuroscience, business and economics research, climate and environmental studies, evolutionary genetics, social sciences, and demography. Challenges are magnified when the observed complex data are dynamic, for example, when the data are time-varying or observed on other continuous domains. This project will push the frontiers of modern data analysis by creating a theoretically sound and user-friendly practical toolkit that overcomes these challenges for several important data analysis tasks. The new methods, being rooted only in pairwise distances between the data elements and tuning-free by design, will immediately cater to the needs of scientists and engineers working with diverse representations of data. Example applications include longitudinal fMRI studies, online detection of mutations in virus phylogenies, understanding microbial diversity compositions, monitoring daily blood glucose distributions in electronic health analytics, analyzing time-varying gene-regulatory networks, understanding trends in social evolution, and many more. The project will offer practitioners a bundle of off-the-shelf tools for exploratory analysis of complex data before they move on to downstream modeling tasks. The award will also support graduate students' training and offer research opportunities to undergraduates.
Model-free distance-based approaches drive the success of statistical methods for complex non-Euclidean data, because they place minimal requirements on the ambient data space or the data distribution. This research aims to expand the arsenal of methodology in object data analysis by developing new, rigorously justified algorithms for common data analysis tasks and by building inference procedures that lie at the heart of statistics and underpin the questions most scientists attempt to answer with data. To address the key challenges of the lack of a vector space structure in object data and the absence of ordering among the data elements, the new developments will be based on two concepts: depth profiles, which are the distributions of distances as dictated by the law of the data, and transport ranks, which are center-outward ordering schemes for object data constructed using optimal transport maps between the depth profiles. Specific sub-projects will focus on rank-based object data clustering and classification, outlier detection, and mode-centric data analysis procedures. Inferential frameworks with rigorous theory will be designed for novel two-sample tests, independence tests, and change point detection and localization, all of which will be distance-based and easily implementable. Finally, the new tools will be broadened to include exploratory analysis and dimension reduction for time-varying object data, both when the observations are dense in time and in the more challenging case when only sparse, irregularly spaced measurements in time are available. Theory and methodology development will involve tools from empirical process and U-process theory, M-estimation, and functional data analysis. Efficient and scalable software implementations, together with code for appealing visualizations, which are extremely challenging to produce for object data, will be made freely available to practitioners.
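As a rough illustration of how a center-outward ordering can be built from pairwise distances alone, the following Python sketch computes empirical depth profiles and a rank-like score from a distance matrix. It is a minimal sketch under stated assumptions, not the project's algorithm: the empirical depth profile of a point is taken to be the sorted distances from that point to all others, the one-dimensional optimal transport between two profiles is represented by matching their quantiles, and the score is assumed to be the logistic transform of the average signed quantile difference. The function names `depth_profiles` and `transport_rank_scores` are hypothetical.

```python
import numpy as np


def depth_profiles(D):
    """Empirical depth profiles from an n x n pairwise distance matrix D.

    Row i holds the sorted distances from point i to the other n - 1 points,
    i.e. an empirical quantile function of the distance distribution at i.
    (Assumption of this sketch: self-distances are excluded.)
    """
    n = D.shape[0]
    off_diag = D[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    return np.sort(off_diag, axis=1)


def transport_rank_scores(D):
    """Rank-like centrality scores in (0, 1); larger means more central.

    Assumed form for illustration: for point i, average over all points j
    the signed quantile difference mean_u(Q[j, u] - Q[i, u]), then apply
    the logistic function. Quantile matching is the optimal transport map
    between one-dimensional distributions.
    """
    Q = depth_profiles(D)
    # signed[i, j] = average amount by which profile j exceeds profile i
    signed = (Q[None, :, :] - Q[:, None, :]).mean(axis=2)
    avg_signed = signed.mean(axis=1)
    return 1.0 / (1.0 + np.exp(-avg_signed))


if __name__ == "__main__":
    # Toy data on the real line; any metric (e.g. a graph or tree distance)
    # could supply D instead, since only pairwise distances are used.
    pts = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
    D = np.abs(pts[:, None] - pts[None, :])
    scores = transport_rank_scores(D)
    print(np.argsort(-scores))  # indices ordered from most to least central
```

In this toy example the point at 2 receives the largest score and the point at 10 the smallest, which matches the intuition that points with small typical distances to the rest of the sample sit near the center while outliers sit at the periphery.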
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.