Semiparametric Efficient and Robust Inference on High-Dimensional Data

$175,000 | FY2023 | MPS | NSF

University of Washington, Seattle, WA

Investigators

Abstract

Most data collected for scientific or industrial purposes suffer from missingness, lacking information about certain features that were lost at the sampling stage due to factors such as experimental design, non-compliance, or technical problems. This issue is particularly prevalent in high-dimensional data sets, where each observation comprises many features. Failing to effectively address the issue of missing and incomplete data results in inefficient and biased inference. Moreover, when traditional statistical methods, originally designed for low-dimensional incomplete data, are applied to high-dimensional data, they can yield misleading scientific discoveries. Therefore, there is an urgent need for innovative statistical methodologies specifically tailored to inference on incomplete data in high dimensions. For instance, in cancer research, next-generation sequencing technology allows for comprehensive genomic profiling. However, due to technical limitations and tumor heterogeneity, this profiling is typically incomplete. The ability to conduct valid inference after imputing incomplete profiles holds significant implications for advancing cancer treatment. This research project involves (1) developing methodology for inference on high-dimensional incomplete and missing data; (2) disseminating the resulting techniques to the statistics community via publications, seminars, and public release of software; (3) training PhD students in high-dimensional statistics and probability theory; and (4) increasing the exposure of high school students, undergraduates, and members of underrepresented groups to statistics and probability theory via introductory reading groups, conference presentations, and other outreach activities. The project will provide a broad range of mentoring, educational and professional development opportunities to train the next generation of statisticians and data scientists at various career levels. 
The proposed research has two main objectives: (1) to develop a semiparametric efficient approach to inference after adjusting for incomplete data in high dimensions, and (2) to develop one- and two-sample bootstrap tests for high-dimensional hypotheses that retain correct size and power under incomplete data. Inference with incomplete data requires careful adjustments through inverse probability weighting or single or multiple imputation. Any estimation error in these adjustments propagates into subsequent analyses. To address this challenge and achieve the first objective, the project will develop combined inference and adjustment procedures that treat imputation or re-weighting not as a separate nuisance step, but as an integral part of the inference process. Specific solutions to several canonical incomplete data problems will be provided. For the second objective, the main challenge lies in designing a bootstrap procedure that accurately accounts for the variability of the imputation or re-weighting step. To meet this objective, the project will develop a new parametric high-dimensional bootstrap procedure that can leverage such information. Different bootstrap tests for discrete (categorical) and continuous data will be provided. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
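As a rough illustration of the adjustment step described in the abstract, the following minimal sketch applies inverse probability weighting to a simulated low-dimensional example and then bootstraps the whole pipeline so that the variability of the estimated weights is included. Everything here is an assumption made for illustration: the simulated data, the function names (estimate_propensity, ipw_mean, bootstrap_se), the stratified propensity estimate, and the use of a plain nonparametric bootstrap. It is not the semiparametric efficient procedure or the parametric high-dimensional bootstrap that the project proposes to develop.

```python
import numpy as np

def estimate_propensity(x, observed):
    """Estimate P(observed | x) by the observed fraction within each level of x.

    Assumes x is discrete and every level has at least one observed outcome.
    """
    p_hat = np.empty(len(x), dtype=float)
    for level in np.unique(x):
        mask = x == level
        p_hat[mask] = observed[mask].mean()
    return p_hat

def ipw_mean(y, observed, propensity):
    """Normalized (Hajek) inverse-probability-weighted estimate of E[Y]."""
    weights = observed / propensity          # zero weight where the outcome is missing
    y_filled = np.where(observed, y, 0.0)    # missing outcomes never contribute
    return np.sum(weights * y_filled) / np.sum(weights)

def bootstrap_se(x, y, observed, n_boot, rng):
    """Nonparametric bootstrap standard error of the IPW mean.

    The propensity is re-estimated inside every resample, so the spread of the
    bootstrap estimates reflects the estimation error of the weights as well.
    """
    n = len(y)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample rows with replacement
        xb, yb, ob = x[idx], y[idx], observed[idx]
        estimates[b] = ipw_mean(yb, ob, estimate_propensity(xb, ob))
    return estimates.std(ddof=1)

# Hypothetical simulated data: outcome missing at random given a binary covariate.
rng = np.random.default_rng(0)
n = 20_000
x = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)       # true mean of Y is 2.0
p_true = np.where(x == 1, 0.3, 0.8)          # outcomes with x == 1 are observed less often
observed = rng.random(n) < p_true

naive = y[observed].mean()                   # complete-case mean, ignores missingness
ipw = ipw_mean(y, observed, estimate_propensity(x, observed))
se = bootstrap_se(x, y, observed, n_boot=200, rng=rng)

print(f"complete-case mean: {naive:.3f}")
print(f"IPW mean:           {ipw:.3f}")
print(f"bootstrap SE (IPW): {se:.4f}")
```

In this toy setting the complete-case mean is biased because observations with x == 1 are under-represented among the observed outcomes, while the weighted estimate corrects for that. Re-estimating the propensity within each bootstrap resample is one simple way to let the weighting step's own variability enter the standard error.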
