III: Medium: Enabling Technologies for 21st Century Entity Matching Applications

$1,085,906 · FY2016 · CSE · NSF

University of Wisconsin-Madison, Madison, WI

Investigators

Anhai Doan (contact), Jeffrey F Naughton, Miron Livny, Paul C Hanson

Abstract

Entity matching (EM) decides if different data instances (such as "UW-Madison" and "Univ of Wisc Madison") refer to the same real-world entity. This problem is critical for numerous Big Data and data science applications. This project will develop solutions and tools for EM which will significantly advance the state of the art. Compared to the current solutions, the proposed solutions will consider the entire EM pipeline, will be usable by lay users (such as domain scientists and journalists), will scale to large amounts of data, and will exploit crowdsourcing to maximize matching accuracy. Technologies developed in this project will be evaluated in three domains: managing scientific data for limnology (the study of lakes and other bodies of freshwater), product matching for e-commerce at WalmartLabs, and developing the Internet of Buildings at Johnson Controls Inc. As such, the project will facilitate the widespread deployment of EM tools, thus resulting in more effective information management and access for society. Through its release of open-source software, it will help educate next-generation workers and researchers. The research will help domain scientists in limnology, and can potentially impact hundreds of thousands of buildings and millions of users via collaboration with Johnson Controls and WalmartLabs. Finally, a planned textbook and open-source system artifacts from the project will be disseminated broadly in the research community, to significantly enhance the data management infrastructure for research and education.

The project will introduce both conceptual and technical novelties. Conceptually, instead of focusing on just the matching step (and studying how to match accurately and scalably, as much of current work has done), the project advocates developing solutions for the entire raw-data-to-matches EM pipeline. Further, it advocates using the matching step to guide the execution of the remaining steps in the EM pipeline. Technically, the project will develop novel solutions for non-matching steps in the EM pipeline, and do so in a matching-driven fashion. It also introduces new important problems, such as EM debugging. Finally, it develops novel solutions to scale up the entire EM pipeline and to exploit crowdsourcing. As described, the project takes the next logical step in EM research, and can significantly advance the state of the art. Further, many problems underlying this research have commonalities with other data management scenarios. Hence, the research has the potential to contribute to those areas as well. For more information, see the project's homepage at https://sites.google.com/site/anhaidgroup/projects/nsf-em-project-2016.
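To make the matching step concrete, the following is a minimal illustrative sketch in Python of a naive matcher applied to the abstract's own example strings. It is not the project's method or software: the function names (`tokens`, `jaccard`, `is_match`), the token-overlap similarity measure, and the 0.5 threshold are all assumptions chosen only for illustration.

```python
import re


def tokens(name):
    """Lowercase a name and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity: |intersection| / |union|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty names are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_match(a, b, threshold=0.5):
    """Declare a match when similarity reaches the (assumed) threshold."""
    return jaccard(a, b) >= threshold


# Hypothetical candidate pairs; the second is the example from the abstract.
pairs = [
    ("Univ of Wisc Madison", "Univ. of Wisc., Madison"),
    ("UW-Madison", "Univ of Wisc Madison"),
    ("UW-Madison", "WalmartLabs"),
]

for left, right in pairs:
    print(left, "|", right, "|", round(jaccard(left, right), 2), is_match(left, right))
```

Under these assumptions the first pair is matched, but the second pair, which does refer to the same university, shares only the token "madison" and is rejected. A simple similarity rule of this kind breaks down on abbreviations, which is one reason accurate matching, and debugging of matchers, is a non-trivial problem.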
