CAREER: Structure Learning and Forecasting of Large-Scale Time Series

$269,183 | FY2023 | MPS | NSF

Cornell University, Ithaca NY

Investigators

Abstract

In many areas of modern biological and social sciences, researchers and practitioners seek to gain insight into the dynamics of a complex system using large-scale time series data sets. Examples include gene regulatory network reconstruction using time-course gene expression data sets, functional connectivity analysis of brain network architecture using neurophysiological signals, and monitoring systemic risk in the financial market using historical data on many firms' stock prices. The overarching goal of this project is to develop scalable statistical methods for learning such dynamic relationships using high-dimensional time series (HDTS) data sets, and to provide a rigorous analysis of their properties. These methods, upon successful completion, are expected to aid data-driven generation of testable hypotheses in systems biology and imaging-based biomarker search in computational neuroscience, and to inform regulatory policy for financial risk management and monitoring. The research outcomes will be integrated into a number of education and outreach activities, including development of a modern data science curriculum with an accompanying online textbook as well as training of graduate and undergraduate students.

Existing algorithms for analyzing HDTS data sets rely primarily on modern regularization techniques from machine learning coupled with a squared error loss designed for independent data. This is in sharp contrast with the core modeling philosophy of classical time series analysis, where temporal dependence among observations is explicitly encoded in the likelihood or loss function to increase the accuracy of structure learning and prediction. This project will narrow the gap by designing new algorithms where temporal dependence and regularization inform each other using dependence-aware machine learning methods. In particular, the project will develop impulse response and quantile-specific graphical models in the time domain, adaptively regularized graphical models in the frequency domain, and random forests that explicitly incorporate temporal dependence in building regression trees. These methods will be validated on real data sets from genomics, neuroscience and financial economics in consultation with domain experts. Results will be disseminated to the public through peer-reviewed articles in statistics, machine learning and other scientific journals. Software implementations of algorithms developed in this project will be made publicly available in the form of R packages.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
