Feature selection in several challenging directions

$225,000 | FY2023 | MPS | NSF

Carnegie Mellon University, Pittsburgh PA

Investigators

Abstract

Feature selection plays a crucial role in many statistical problems, such as cancer classification and the analysis of text and network data. This project will study feature selection in several challenging, understudied directions. First, MDAStat is a recent large-scale data set on the publications of statisticians between 1971 and 2015, which provides a rich resource for research on network analysis and text analysis. The project will expand the scope of the data set to 1971-2025 by collecting new data. Second, the project will develop a family of correlation metrics that provide a better way to measure the nonlinear relationship between the response and predictor variables. The metrics provide more accurate feature selection results in many applications in cancer and biomedical studies. Last, the project will develop new approaches to extracting features from social networks and text documents and generate better results in applications such as the analysis of health care data and author attribution (i.e., identifying the true author of a possibly ancient text document). The research will generate new ideas and methods to address many challenging problems in modern statistical research, and will substantially increase the understanding of many problems in science and engineering, such as cancer and biomedical research, network analysis, text analysis, and natural language processing.

Feature selection is an important approach in high-dimensional data analysis. The project will study feature selection in several challenging directions and will make contributions on the following topics. First, despite many studies on the rare/strong signal regime, the properties of the lasso remain largely unknown in the more challenging rare/weak signal regime. The project will develop new techniques and use them to derive sharp rates for the Hamming selection errors of the lasso, especially in the rare/weak signal regime. Second, a challenging problem in feature selection is how to measure the nonlinear relationship between the response and predictor variables. The project will develop a family of nonlinear correlation metrics and use them to derive sharp phase transitions in nonlinear rare/weak models for cancer classification and cancer clustering. Third, although more than a handful of models exist for social networks, it remains unclear which model best fits real networks, partly because network goodness-of-fit is a challenging problem. The project will develop a novel goodness-of-fit approach and use it to identify the most appropriate models for social networks. Last, feature extraction and embedding for text documents and networks is a challenging problem. The project will develop novel approaches for feature extraction and embedding and use them to predict the future citation counts of a published paper and to perform author attribution.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
