CAREER: A Framework for Ad Hoc Model Construction in Data Streaming Environments
Colorado State University, Fort Collins CO
Abstract
Over the past decade, data volumes have grown exponentially, driven in part by data streams generated by computer programs and by observational equipment such as satellites, radars, and ecological sensors. At these volumes, it can be difficult to harness the data to understand phenomena or to make forecasts. Fitting models to the observational data is one way to accomplish this. A precursor to building such models is extracting features from the data. Models constructed from such features can then be used to predict what the outcome will be and when it is likely to happen. This research will provide scientists and researchers with the tools needed to make sense of data generated in streaming environments. Domains where this research is broadly applicable include smart cities, traffic planning, homeland security, and ecological monitoring. The project includes an educational component focused on increasing female student participation in college STEM majors. Therefore, this project aligns with the NSF mission to promote the progress of science, to advance the national health, prosperity and welfare, and to secure the national defense.
The project aims to create an enabling infrastructure that supports the generation, assessment, and refinement of ad hoc models from voluminous, multidimensional, time-series observational data at scale. Challenges in ad hoc model creation stem from the combinatorially explosive number of ways in which models can be realized. The framework, Synapse, aims to support and simplify the naturally iterative and interactive model building process over voluminous streaming data. Modelers will only need to specify a basic set of bootstrap parameters; the framework will manage the complexities of (1) dispersing streams, (2) managing data accesses, (3) coping with I/O and memory contention, and (4) dispersing model generation workloads.
The research involves scalable techniques for data dispersion that employ distributed hash table data structures, MapReduce-based workflows and orchestration of model creation workloads, training data management, and interactive visual assessment of model performance. A visualization component will allow modelers to quickly and effectively assess the quality of many models, each possibly covering a different portion of the input feature space, and to use these assessments to guide decisions about selecting, updating, or replacing models. If successful, the framework will scale with increases in data volumes, the number of available data streams, model generation workloads, and live model evaluations.
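As an illustration of the kind of hash-based data dispersion mentioned above, the following is a minimal sketch of a consistent-hash ring, a placement technique commonly used in distributed hash tables. It is not Synapse's actual design, which the abstract does not describe. The names `HashRing`, `node_for`, and `disperse`, the node labels, and the choice of keying observations by sensor and hour are all assumptions made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring mapping stream keys to storage nodes (hypothetical)."""

    def __init__(self, nodes, replicas=64):
        # Each node is placed at several pseudo-random ring positions
        # ("virtual nodes") so that keys spread more evenly across nodes.
        ring = []
        for node in nodes:
            for i in range(replicas):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._positions = [pos for pos, _ in ring]

    @staticmethod
    def _hash(key):
        # First 8 bytes of SHA-1, read as an unsigned 64-bit ring position.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # The first ring position clockwise from the key's hash owns the key;
        # the modulo wraps around past the last position.
        idx = bisect.bisect_right(self._positions, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


def disperse(observations, ring):
    """Group observations by owning node, keyed on sensor id and hour (assumed scheme)."""
    placement = {}
    for obs in observations:
        key = f"{obs['sensor']}:{obs['timestamp'] // 3600}"
        placement.setdefault(ring.node_for(key), []).append(obs)
    return placement


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    stream = [
        {"sensor": f"s{i % 5}", "timestamp": 60 * i, "value": i * 0.1}
        for i in range(200)
    ]
    for node, batch in sorted(disperse(stream, ring).items()):
        print(node, len(batch))
```

In this sketch, adding a node to the ring moves only the keys that fall into the new node's segments, so most previously placed data stays where it is as the cluster grows.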