SHF: Small: Language Support for Ad Hoc Data Processing
Princeton University, Princeton NJ
Investigators
Abstract
In every business, engineering endeavour, and scientific discipline, workers are digitizing their knowledge in the hope of using computational methods to categorize, query, filter, search, diagnose, and visualize their data. While this effort is leading to remarkable industrial and scientific advances, it is also generating enormous amounts of ad hoc data (i.e., data for which standard data processing tools such as query engines, statistical packages, graphing tools, or other software are not readily available). Ad hoc data poses tremendous challenges to its users because it is often highly varied, poorly documented, filled with errors, and continuously evolving, yet it also contains much valuable information. The goal of this research is to develop general-purpose software tools and techniques capable of managing ad hoc data efficiently. This research has the potential for a broad impact on society by dramatically improving the productivity of industrial data analysts, computer systems administrators, and academics who must deal with ad hoc data on a day-to-day basis. The central technical challenge of the research is designing, implementing, and evaluating a new domain-specific programming language that facilitates the management of ad hoc data sets. This new programming language will allow data analysts to specify the structure of ad hoc data files, how those files are arranged in a file system, and what metadata is associated with them. Once a specification is complete, it can serve as documentation for the data set or as the input from which data-processing tools are generated. The research will also develop new methods that enable users to generate specifications quickly and accurately, without having to write down all of the details by hand. Finally, the research will develop new algorithms for implementing the generated data-processing tools efficiently.
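To make the idea of a specification that doubles as documentation and as a source of generated tools more concrete, the following is a minimal illustrative sketch in Python. It is not the language proposed by this project; the log format, the `Field` class, `LOG_SPEC`, `make_parser`, and `make_docs` are all hypothetical names invented for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Field:
    """One field of a record: its name, shape, conversion, and description."""
    name: str
    pattern: str                      # regular expression the raw text must match
    convert: Callable[[str], object]  # turns the raw text into a typed value
    doc: str                          # human-readable description


# Hypothetical description of one line of a made-up log format:
#   <date> <host> <bytes>
LOG_SPEC: List[Field] = [
    Field("date", r"\d{4}-\d{2}-\d{2}", str, "ISO date of the event"),
    Field("host", r"[\w.-]+", str, "reporting host name"),
    Field("bytes", r"\d+|-", lambda s: None if s == "-" else int(s),
          "payload size, '-' if unknown"),
]


def make_parser(spec: List[Field]):
    """Derive a line parser from the specification."""
    regex = re.compile(
        r"\s+".join(f"(?P<{f.name}>{f.pattern})" for f in spec) + r"\s*$"
    )

    def parse(line: str) -> Tuple[Optional[Dict[str, object]], Optional[str]]:
        m = regex.match(line)
        if m is None:
            # Ad hoc data is often malformed, so report the error
            # instead of raising and aborting the whole run.
            return None, f"line does not match spec: {line!r}"
        return {f.name: f.convert(m.group(f.name)) for f in spec}, None

    return parse


def make_docs(spec: List[Field]) -> str:
    """Derive plain-text documentation from the same specification."""
    return "\n".join(f"{f.name}: {f.doc}" for f in spec)


parse_log_line = make_parser(LOG_SPEC)
record, error = parse_log_line("2009-03-14 node7.example.org 512")
```

In this sketch a single declarative description drives both the parser and the documentation, so the two cannot drift apart. A real system of the kind described above would also need to cover directory layout, metadata, and efficient generated code, none of which this example attempts.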