III-COR-Medium: Providing Provenance through Workflows and Database Transformations

$914,494 | FY2008 | CSE | NSF

University Of Pennsylvania, Philadelphia PA

Investigators

Susan B Davidson (contact), Sanjeev Khanna, Val Tannen

Abstract

Data provenance is a fundamental issue in the processing of scientific information and beyond. Two lines of research have been pursued in recent years with direct bearing on the issues of data provenance. In one of them, provenance in workflows, the emphasis is on extracting provenance from logs of events marking the execution of different modules on various initial and derived datasets. In the other line of research, provenance in databases, the emphasis is on the propagation of provenance through the operators that make up database views, or on propagation of provenance through copy/cut-and-paste operations within and among databases. These two bodies of work employ different techniques and at first glance their results appear quite different. However, in many scientific applications, database manipulations co-exist with the execution of workflow modules, and the provenance of the resulting data should integrate both kinds of processing into a usable paradigm. By analyzing the work on data provenance in workflows and in databases, the PIs identify what they believe are the main difficulties in unifying and integrating these two different kinds of data provenance: (1) the lack of a data model that is rich enough to capture the interaction between the structure of the data and the structure of the workflow; and (2) the lack of a high-level specification framework in which database operators and workflow modules can be treated uniformly. In this project, the PIs aim to overcome these difficulties and thus provide concepts and tools that allow a truly comprehensive approach to the provenance of scientific data. The project's approach relies on a data model that supports nested collections and on a functional language approach to workflow specification. Based on this, the project aims to deliver a framework and tools for defining, managing and querying data provenance in complex scientific workflows that include database manipulations.
The project is expected to impact bioinformatics (through interdisciplinary collaborations in the Penn Center for Bioinformatics and the Penn Genome Frontiers Institute) and phyloinformatics (through contributions to the NSF AToL program), as well as ongoing standardization work on provenance in workflows and in the business process (e.g., BPEL) community. The results of this project are disseminated as publications, through direct collaborations, and through the project website: http://db.cis.upenn.edu/research/UNIPROVE.html.
