CRII: III: Spatio-Temporal Data Mining to Learn From Unlabeled Climate Data
Icahn School Of Medicine At Mount Sinai, New York NY
Investigators
Abstract
Global climate change and its potential societal impacts have become one of our era's greatest challenges. Climate science is experiencing tremendous data growth (an estimated hundreds of exabytes by 2020), yet our ability to learn from such data is limited because most existing data science methods are not well suited to the noisy, heterogeneous, incomplete, spatio-temporal data in climate science. This project will develop novel data mining methods to enhance our ability to learn from large and complex datasets of scientific and societal importance. Specifically, the project will focus on foundational advances to traditional pattern mining, incorporating information from multiple uncertain and incomplete datasets to discover statistically significant patterns in continuous data, with an application to oceanography. We propose the notion of "spatio-temporal redundancy," where patterns are discovered in multiple related datasets to offset the lack of target labels in the data. This approach will address two major challenges in Big Data analysis for scientific discovery. First, as the size and complexity of the data grow, the likelihood of false discoveries increases significantly. Second, the lack of ground-truth verification data makes objective validation a challenge. In addition, we rely on first principles and domain theory to constrain our data mining approaches so that they produce tractable and scientifically consistent results. These methods can generalize to other domains where no ground-truth data are available but several information-rich, yet uncertain, datasets are.