Statistical Models and Methods for Complex Data in Metric Spaces
University of California-Davis, Davis, CA
Abstract
Big data in the form of random objects are increasingly encountered across the sciences and society. Adequate and principled approaches for their analysis are important to extract relevant information, to make predictions, and to assess whether there are differences between samples. However, the complexity of such data poses major challenges for statistical analysis, and traditional methods cannot be applied. In this project, these challenges will be addressed through the development of advanced nonparametric statistical methodology that can handle the complexity of object data. Specific objects that will be studied include networks and distributions, with applications to brain imaging and genomics data, climate change data, and data from other areas of current interest. For situations where random objects are repeatedly observed over time, such as repeated MRI brain scans of the same subject, quantifications of the underlying time dynamics will also be developed. The research will include methods for testing and estimation, theory, efficient computational implementation, and data applications, which are expected to lead to substantial new insights. The project will provide training for the next generation of data analysts and research statisticians, and user-friendly code implementing the methods will be made available.

The project will contribute to the foundations of the emerging field of metric statistics as a comprehensive framework for statistical methodology and theory for samples of random objects, which are random variables that take values in a metric space. Random objects encompass data in the form of distributions, networks, trees, covariance matrices, and surfaces, as well as data on Riemannian manifolds such as spheres. The statistical analysis of such data is challenging because traditional statistical methods cannot be applied in the absence of a vector space structure.
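As an illustration of what a barycenter of random objects can look like when no vector space structure is available, the following minimal sketch computes the Fréchet mean of one-dimensional distributions under the 2-Wasserstein metric, where it reduces to averaging quantile functions. This is not the project's code; the function names `wasserstein_barycenter_1d` and `wasserstein2_1d`, the grid size, and the simulated samples are assumptions made only for this example.

```python
import numpy as np


def wasserstein_barycenter_1d(samples, grid_size=101):
    """Frechet mean of 1D empirical distributions under the 2-Wasserstein metric.

    For distributions on the real line, this barycenter is obtained by
    averaging the quantile functions pointwise. Hypothetical helper.
    """
    probs = np.linspace(0.0, 1.0, grid_size)
    # One row per distribution: its empirical quantile function on the grid.
    quantiles = np.stack([np.quantile(s, probs) for s in samples])
    return probs, quantiles.mean(axis=0)


def wasserstein2_1d(q1, q2, probs):
    """Approximate 2-Wasserstein distance between two 1D distributions,
    computed as the L2 distance of their quantile functions (trapezoid rule)."""
    diff2 = (q1 - q2) ** 2
    integral = np.sum(0.5 * (diff2[1:] + diff2[:-1]) * np.diff(probs))
    return float(np.sqrt(integral))


# Simulated example: three normal samples with shifted means (illustrative only).
rng = np.random.default_rng(0)
samples = [rng.normal(loc=m, scale=1.0, size=2000) for m in (-2.0, 0.0, 2.0)]
probs, bary_q = wasserstein_barycenter_1d(samples)

# Distance from each sample distribution to the barycenter.
distances = [
    wasserstein2_1d(np.quantile(s, probs), bary_q, probs) for s in samples
]
```

Averaging the densities instead would give a three-peaked mixture, whereas averaging quantile functions yields a single distribution centered between the samples, which is why the metric structure of the space matters for the choice of mean.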
Specifically, for random objects that are situated in a geodesic space, a general class of regression models for transports from the barycenter to specific objects will be developed. In these models, both predictors and responses are transports; specific examples are regression models for spherical data and distributions. The project will also include the study of regression models for multivariate distributions as responses, paired with Euclidean predictors, using slice-optimization in Wasserstein space. This approach will be illustrated with applications to climatology and life sciences data. Another goal is the development of transport processes in a geodesic space, constituting a new type of stochastic process. For such processes, anchor point representations for general types of random objects and latent Gaussian process representations for distributional objects in Wasserstein space will be obtained. Additionally, a random effects model for Fréchet regression will be developed, providing the first such model for longitudinal object data, with applications in brain imaging and distributional data analysis. Throughout, the challenges resulting from the absence of Euclidean and algebraic structures will be addressed with empirical process theory and other tools.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.