EAGER: Full Disclosure of Data Preparation and Use in Retrospective Studies
Portland State University, Portland OR
Investigators
Abstract
Although an overnight shipping company lets a customer track a package from coast to coast, and a credit card company can track purchases throughout the world, integrating, cleaning and using the various kinds of data available for engineering and planning studies in health, transportation and public infrastructure systems is extremely difficult. Such "retrospective" studies require intense scrutiny of the data and involve a myriad of decisions about the data and the definitions of the concepts involved in order to properly "clean" or prepare the data for the study. These decisions are typically written in English, if they are written down at all, and thus cannot be processed automatically by any future user of the data. Similarly, integrating data from multiple sources is difficult because of the details of the technology used to capture the data. The critical problems are that data from multiple sources can be hard to integrate, and that data carefully cleaned and scrutinized for one study can almost never be reused in another. This seriously limits the opportunity to conduct larger or broader scale studies or to "re-run" studies on new data.
One can envision a world where retrospective studies can be easily described; where decisions about data use, data cleaning, and data integration can be precisely recorded; and where operations, management, engineering, and planning activities can be informed by the results of studies as easily as we currently track our packages as they wend their way across the country. This project pursues one solution to these problems: to devise methods and tools that declaratively (i.e., at a high level, in a formal language) document the data manipulation activities in retrospective studies. Analysts will then be able to use this declarative specification to greatly simplify the specification and conduct of new studies.
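To make the idea of a declarative specification concrete, the following is a minimal illustrative sketch in Python. It is not the project's tooling or formal language; the `DropRule` type, the `apply_spec` function, the field name `speed_mph`, and the thresholds are all hypothetical, chosen only to show cleaning decisions recorded as data rather than as prose notes.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical comparison operators that a rule may name.
OPS: dict[str, Callable[[Any, Any], bool]] = {
    "<": lambda a, b: a < b,
    ">": lambda a, b: a > b,
    "==": lambda a, b: a == b,
}


@dataclass(frozen=True)
class DropRule:
    """Drop any record where `field <op> value` holds; `reason` records why."""
    field: str
    op: str
    value: Any
    reason: str


def apply_spec(records, spec):
    """Return (kept_records, log), where log counts dropped records per reason.

    A record is dropped by the first rule that matches it.
    """
    kept = []
    log = {rule.reason: 0 for rule in spec}
    for rec in records:
        for rule in spec:
            if OPS[rule.op](rec[rule.field], rule.value):
                log[rule.reason] += 1
                break
        else:
            kept.append(rec)
    return kept, log


# Hypothetical cleaning decisions for loop-detector readings (invented fields).
SPEC = [
    DropRule("speed_mph", "<", 0, "negative speed is a sensor error"),
    DropRule("speed_mph", ">", 120, "implausibly high speed"),
]

readings = [
    {"detector": "A", "speed_mph": 55},
    {"detector": "A", "speed_mph": -3},
    {"detector": "B", "speed_mph": 140},
    {"detector": "B", "speed_mph": 62},
]

kept, log = apply_spec(readings, SPEC)
```

Because the decisions live in `SPEC` rather than in free-text notes, the same specification could in principle be applied unchanged to a new batch of readings, or edited (for instance, a different threshold) and re-applied.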
For example, if an analyst wants to repeat a study with new data, or with different parameters, the analyst will be able to use the declarative specifications saved from the original study, instead of having to decipher low-level notes that may or may not have been saved from the original study. The declarative specifications will also be useful for combining the results of previous studies to create new studies. This project will focus on three kinds of retrospective research studies: 1) studies from the Clinical Outcomes Research Institute (CORI), based on data concerning patient endoscopic procedures; 2) studies based on the Portland Oregon Regional Transportation Archive Listing (PORTAL), which contains years of highway loop-detector data, aimed at determining the factors that correlate with certain kinds of congestion; and 3) studies based on data from the Portland, Oregon Water Bureau containing 8 years of water consumption data reported every 15 minutes from households across the city.
Intellectual Merit
By attempting to bring the capability of state-of-the-art schema integration, data integration, and data cleaning systems into a set of tools that analysts can use easily and that interface seamlessly with existing analysis tools, the research team will contribute to the database field by identifying separable concerns (within integration and cleaning) and by generalizing functions that are currently available only in more complex, all-encompassing tools.
Broader Impacts
The results of this project will have broad impact because they advance the science of retrospective studies. The results will be applicable beyond the three areas being studied and will enable researchers and analysts to perform their studies more efficiently and to perform more studies. Results will be disseminated broadly, not only by the PI and co-PI through the usual publication venues, but also by researchers in the three areas being studied.
Major Themes/Keywords: Computer Science/Information Technology. Engineering. Social Science. Intelligent Transportation Systems. Health Systems. Water Consumption.