CAREER: Fast and Accurate Statistical Learning and Inference from Large-Scale Data: Theory, Methods, and Algorithms
University of Pennsylvania, Philadelphia, PA
Abstract
This project will develop statistical methods for analyzing large datasets. Such massive datasets are emerging as an important challenge in many areas of science, engineering, and business. The research will pursue a multi-pronged approach to several fundamental questions in the analysis of such datasets, focusing on three key areas. The first is sketching and random projections, a powerful randomized approach to data analysis used when the data must be analyzed on a single machine. The second is distributed statistical learning and inference, where datasets are spread across multiple locations with limited communication among them. The third is model retraining, where statistical or machine learning models must be updated efficiently after data has been added to or deleted from the original training set. In addition, the project will have a significant educational component, with the PI developing a new course on statistical machine learning. The project will also train a graduate student. The PI is committed to diversity and inclusion, and will include women and underrepresented minorities in all aspects of the project. The methods developed in the project will be made freely available as software, which will allow others to directly use and benefit from the results.
In the area of sketching, the project will leverage powerful tools from asymptotic random matrix theory and free probability to analyze fundamental problems, such as regression and clustering. In the area of distributed learning, the PI plans to develop and analyze statistical methods for distributed learning via gradient-based optimization. For model retraining, the PI aims to study the connections between retraining and conformal prediction, with the goal of developing improved and broadly applicable methods for predictive inference.
On a technical level, the work will involve advanced tools from probability theory, such as random matrix theory, as well as tools from numerical optimization. By carefully analyzing computational aspects of large-scale statistical analysis, the work will aim to bridge gaps between the statistical and computational perspectives.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.