REU Site: University of North Carolina at Greensboro in Complex Data Analysis using Statistical and Machine Learning Tools
University of North Carolina at Greensboro, Greensboro, NC
Abstract
In this age of big data, technological innovations allow the collection of massive amounts of data at low cost. Data sets of hundreds of gigabytes, terabytes, and even petabytes are no longer uncommon. Training in the appropriate handling and analysis of such data is vital for the next generation of graduates. With this goal in mind, the proposed REU program aims to provide 10 weeks of advanced training in “Complex Data Analysis using Statistical and Machine Learning Tools” to eight highly motivated, nationally selected undergraduates from the mathematical sciences during the summers of 2020-2022. The faculty mentors bring rich and diverse experience to this training program. PI Gupta, Co-PI Gao, and Senior Personnel Richter and Stufken are statisticians; Senior Personnel Mohanty is a Computer Science faculty member specializing in machine learning tools; and Senior Personnel Sun is a statistical geneticist in the Department of Mathematics and Statistics. The training program will motivate the student participants, particularly those from under-represented minorities, to go on to graduate programs in the mathematical sciences and become better-trained professionals capable of meeting society's data analytics needs. As part of broader professional training, students will visit major research centers in North Carolina such as SAS, SAMSI (Statistical and Applied Mathematical Sciences Institute), and the Joint School of Nanoscience and Nanoengineering. We expect that the research completed as part of this training will be of very high quality and will lead to journal articles and conference presentations.
Complexity in data can arise in a variety of ways. High dimensionality is one such complexity, where the number of variables is large relative to the number of observations. Data contamination is another. In one of the projects, students will learn to handle dimensionality reduction and outlier detection simultaneously.
Noise-added data (data perturbed to protect confidentiality before public release) is another type of complexity. It is becoming common for researchers to have access only to scrambled data, not the real data. One of the projects will cover why and how data are scrambled and de-scrambled using randomized response models, which retain aggregate-level properties while ensuring anonymity for respondents. The machine learning component of the REU program will focus on leveraging tools such as unsupervised and supervised machine learning and deep learning to identify social media posts related to disease symptoms and to predict temporal trends in disease propagation. Violation of model assumptions is another source of complexity, one that necessitates nonparametric techniques such as resampling methods. In many matched-pairs designs, a mixture of complete and incomplete pairs of data is available. Rather than ignoring data from incomplete pairs, we will train students in methods designed for analyzing such data. In another project, students will be trained in subdata selection techniques, which are very helpful in dealing with data of enormous size. This project will address questions such as: (1) what size should the subdata have to ensure a reliable analysis; and (2) for a given size, how should the subdata be selected? The overarching goal of all these projects is to train students to recognize various complexities in data and to find the right techniques for handling them.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.