Reduction of Infinite Data Dimension via B Spline Smoothing

$221,525 | FY2007 | MPS | NSF

Michigan State University, East Lansing MI

Investigators

Abstract

This research project develops B spline smoothing methods for (1) dimension reduction in machine learning and (2) non- and semiparametric GARCH volatility models, with fast computation and explicit formulae. Asymptotically simultaneous confidence bands are provided for all nonparametric estimates. The proposal aims to develop the underlying theory as a crucial guide to practical implementation. For dimension reduction in machine learning, the focus is on the generalized additive model (GAM) and the single index model (SIM), with dimensions tending to infinity. For dimensions from low to moderately high (400-D), the spline-backfitted kernel smoothing procedure for the additive model and the direct spline smoothing procedure for the SIM are theoretically reliable and intuitively appealing, with extremely fast computation. The current project extends these procedures to GAM and SIM with dimension going to infinity, preserving the theoretical, intuitive, and computational benefits. The investigator also studies B spline smoothing algorithms for non- and semiparametric GARCH models, achieving the same asymptotics as kernel smoothing. Because typical applications of GARCH models involve sample sizes from thousands to millions and an equally large number of lagged values, B spline smoothing can compute in seconds what kernel smoothing would need days to compute. Thus the proposed methods meet the needs of both theoreticians and financial analysts.

In the age of information overload, researchers in nearly all areas of the biological, medical, physical, and social sciences are routinely confronted with large data sets. With tens of thousands of characteristics, called variables or features, these large data sets are treasure troves of valuable scientific information. The methods developed by the investigator are powerful new tools for drawing such useful information out of large data sets. Typical examples of such data include, but are not limited to, environmental and global change studies, high-frequency financial data, state and federal demographic surveys, and federal biometric databases. Code written in the free software R is made publicly available for wide dissemination. Practitioners from industry and government can analyze their own large data sets with these user-friendly modules, in real time, with confidence and precision. A distinctive feature of the project is the active integration of cutting-edge research with the education and training of graduate students, especially those from underrepresented groups. This is consistent with the education goal of NSF and fulfills NSF's commitment to the principle of fostering diversity in science.
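
The speed advantage claimed for B spline smoothing can be illustrated with a small sketch. This is not the investigator's R code; it is a generic least-squares regression spline written in Python for illustration only. The function names (`clamped_knots`, `bspline_basis`, `fit_spline`, `predict`), the cubic degree, and the choice of five equally spaced interior knots are all assumptions. The point it shows is that the entire fit reduces to one linear solve whose size is the number of basis functions (9 here), whatever the sample size, whereas a kernel smoother performs a separate local fit at every evaluation point.

```python
import math
import random

def clamped_knots(a, b, n_interior, degree):
    """Knot vector on [a, b]: equally spaced interior knots, repeated boundary knots."""
    step = (b - a) / (n_interior + 1)
    interior = [a + step * (j + 1) for j in range(n_interior)]
    return [a] * (degree + 1) + interior + [b] * (degree + 1)

def bspline_basis(x, knots, degree):
    """Values of all B spline basis functions at x (Cox-de Boor recursion)."""
    n_basis = len(knots) - degree - 1
    # Degree 0: indicators of the half-open knot intervals.
    vals = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
            for i in range(len(knots) - 1)]
    if x == knots[-1]:
        vals[n_basis - 1] = 1.0  # close the last non-empty interval on the right
    for k in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - 1 - k):
            d1 = knots[i + k] - knots[i]
            d2 = knots[i + k + 1] - knots[i + 1]
            left = (x - knots[i]) / d1 * vals[i] if d1 > 0 else 0.0
            right = (knots[i + k + 1] - x) / d2 * vals[i + 1] if d2 > 0 else 0.0
            nxt.append(left + right)
        vals = nxt
    return vals

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def fit_spline(xs, ys, knots, degree):
    """Least-squares spline coefficients from the normal equations (B'B) c = B'y."""
    rows = [bspline_basis(x, knots, degree) for x in xs]
    p = len(rows[0])
    gram = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    rhs = [sum(r[j] * y for r, y in zip(rows, ys)) for j in range(p)]
    return solve(gram, rhs)

def predict(x, coef, knots, degree):
    """Evaluate the fitted spline at x."""
    return sum(c * b for c, b in zip(coef, bspline_basis(x, knots, degree)))

# Illustrative use on simulated data (noise level and sample size are arbitrary).
random.seed(0)
degree = 3
knots = clamped_knots(0.0, 1.0, 5, degree)
xs = [i / 199 for i in range(200)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0.0, 0.1) for x in xs]
coef = fit_spline(xs, ys, knots, degree)
fitted_at_quarter = predict(0.25, coef, knots, degree)
```

Increasing the sample size only lengthens the sums that build the 9-by-9 system; the solve itself stays the same size, which is the source of the speed difference relative to pointwise kernel fits.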
