III: Small: Enabling Technology for Best-Effort Data Integration Systems

$499,292 FY2010 CSE NSF

University of Wisconsin-Madison, Madison, WI

Investigators

Abstract

Over the past few decades, the problem of data integration (DI) has received significant attention. Much of the initial attention was directed at integrating business data. Such data typically requires exact integration, because anything less is clearly not usable. Today, such exact DI systems continue to play an important role. But they are ill-suited for many emerging domains, such as personal information management, building Web community portals, scientific data management, text management for business intelligence, public safety, and military intelligence analysis. First, they are typically constructed in "one shot," in that the system is substantially unusable until it is completed in all of its envisioned generality. Second, when presented with new data, these systems often incur long delays before making the data available to users. Third, they typically are not designed to benefit from user feedback, even though opportunities for such feedback often exist in today's Web 2.0 world. Fourth, exact DI systems provide little or no assistance in explaining answers to users.

In response, this project explores a paradigm shift, from precise DI systems to best-effort ones. Instead of being constructed in one shot, these systems are constructed incrementally. Their data is always queryable in some fashion. They tolerate mistakes in the data, and can leverage user feedback to improve over time. Finally, they can explain their answers to users, thereby allowing users to understand, verify, and trust query results.

To build best-effort DI systems, researchers will pursue the following technical thrusts.
(1) Increasing support for incremental development through the specification and implementation of a declarative, semantically transparent extraction/integration language, together with an effective optimization and execution framework.
(2) Leveraging the power of a user community through the design and implementation of techniques that allow users to correct errors in the extraction/integration process as they are encountered, that consistently propagate these corrections throughout the extracted and integrated data, and that use these corrections to improve the quality of extraction/integration modules.
(3) Developing and implementing techniques to capture information that will help users reason about the system's data, along with support for exploring the implications of this information.

The team will combine the technology to build a prototype end-to-end best-effort DI system and evaluate the system on three real-world applications: the DBLife portal, the GLEON limnology project, and the madison.com Web portal.

This research will be integrated with ongoing efforts in educating students on techniques for extracting and integrating structured data. Inclusion of underrepresented minorities in the projects will be continued. The results from this project will be incorporated into a textbook on data integration to be published in 2010-2011. The project will facilitate the widespread deployment of data integration systems, thus resulting in more effective information management and access for society. It will play an integral part in educating next-generation professional workers and researchers. The research will also help domain scientists in limnology in the context of the GLEON project. It also has the potential to help the developers of madison.com build a system of much greater use to the greater Madison community. Finally, data and system artifacts from the project will be disseminated broadly in the research community to significantly enhance the data management infrastructure for research and education.
