
Collaborative Research: Randomization Based Machine Learning Methods in a Bayesian Model Setting for Data From a Complex Survey or Census

$337,271 | FY2022 | SBE | NSF

University of California-Santa Cruz, Santa Cruz, CA

Abstract

Official federal statistical system data often have complex sampling and design features that limit advanced statistical analyses. Examples of such survey and census data are the Survey of Graduate Students and Postdoctorates in Science and Engineering (GSS), the Survey of Earned Doctorates (SED), the National Survey of Recent College Graduates (NSRCG), and the National Survey of College Graduates (NSCG). This project will develop Bayesian statistical and machine learning methods that are tailored to these types of federal data to improve computational efficiency and advance these methods to allow for data integration, multiple imputation, and data privacy. Importantly, the results of this research will be of value to the work of government agencies as well as within many subject-matter disciplines that deal with complex data, including demography, econometrics, and political science, among others. Software packages will be developed and made publicly available, and the investigators will educate and train both graduate and undergraduate students.

Using a randomization-based approach, this research project will develop Bayesian statistical and machine learning methodologies for unit- and area-level data from a complex survey or census. This project has three aims. In Aim 1, the investigators will focus on several extensions to existing models using data reduction methods. Specifically, this aim will leverage random projection techniques, within a Bayesian hierarchical modeling framework, to provide useful tools for analyzing federal data. Subsequently, in Aim 2, the investigators will take advantage of the wide applicability of random-weight feed-forward neural networks as a Bayesian nonlinear regression tool for complex survey data. This approach will include mechanisms for data integration using social media, administrative data, and other structured data sources. Finally, in Aim 3, the investigators will use recurrent neural networks and their random-weight variants as a tool to model temporally correlated complex survey or census data within a Bayesian hierarchical model. Ultimately, this project will develop principled methodologies that are useful for both the scientific and federal statistical communities.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
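For readers unfamiliar with the Aim 2 idea, the following is a minimal illustrative sketch of a random-weight feed-forward network used as a Bayesian regression tool. It is not the investigators' method or software. Everything in it is an assumption made for illustration: the simulated data, the made-up sampling weights `survey_w`, the fixed noise and prior variances, and the choice to let survey weights enter through a weighted Gaussian pseudo-likelihood. The hidden layer is drawn at random once and never trained, so only the output weights need a posterior, which is available in closed form.

```python
import numpy as np

NOISE_VAR = 0.25   # assumed known noise variance (illustrative)
PRIOR_VAR = 10.0   # assumed Gaussian prior variance on output weights

def random_feature_map(n_inputs, n_hidden, rng):
    """Fixed random hidden layer: weights are drawn once and never trained."""
    W = rng.normal(size=(n_inputs, n_hidden))
    b = rng.normal(size=n_hidden)
    return lambda Z: np.tanh(Z @ W + b)

def weighted_bayes_output_layer(Phi, y, survey_w):
    """Gaussian posterior for the output weights under a weighted pseudo-likelihood."""
    w = survey_w * len(y) / survey_w.sum()  # rescale weights to sum to n
    precision = (Phi.T * w) @ Phi / NOISE_VAR + np.eye(Phi.shape[1]) / PRIOR_VAR
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi.T @ (w * y)) / NOISE_VAR
    return mean, cov

# Simulated data standing in for unit-level survey responses (hypothetical).
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=n)
survey_w = rng.uniform(1.0, 5.0, size=n)  # made-up sampling weights

phi = random_feature_map(n_inputs=1, n_hidden=50, rng=rng)
post_mean, post_cov = weighted_bayes_output_layer(phi(X), y, survey_w)

# Posterior predictive mean and standard deviation at new points.
X_new = np.linspace(-3, 3, 7).reshape(-1, 1)
Phi_new = phi(X_new)
pred_mean = Phi_new @ post_mean
pred_sd = np.sqrt(np.einsum("ij,jk,ik->i", Phi_new, post_cov, Phi_new) + NOISE_VAR)
```

Because the random hidden layer is fixed, fitting reduces to a weighted Bayesian linear regression on the hidden-layer outputs. That reduction is one reason such models can be computationally cheap relative to fully trained networks.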

View original record on NSF Award Search →