III-Medium: Reading the Web: Utilizing Markov Logic in Open Information Extraction
University of Washington, Seattle, WA
Investigators
Abstract
This project addresses the challenge of automatically extracting high-quality knowledge bases from text corpora. Previous work, led by Prof. Etzioni, developed KnowItAll (http://www.cs.washington.edu/research/knowitall), an unsupervised, domain-independent, scalable system that learns from the Web in an open-ended fashion. Another project, led by Prof. Domingos, has formalized and fully implemented a powerful framework called Markov Logic Networks (MLNs) (see http://www.cs.washington.edu/ai/srl.html) that enables inference and learning in large, first-order models. This project integrates KnowItAll and MLNs to build large-scale ontologies from text corpora: extracting relational tuples, using joint inference to merge and validate the tuples, mapping extracted phrases to a taxonomy, and using probabilistic inference rules to answer queries about the ontology. Consider, for example, the query "how many Nobel Laureates were born in Europe?" In response, Google merely provides documents matching the keywords in the query. KnowItAll can find only people whom the text explicitly identifies as both Nobel Laureates and Europeans. This project investigates a system that uses both information extraction and probabilistic reasoning to identify candidate answers that are not explicitly stated in the text, along with their likelihood of being correct. As a simple example, the system concludes that Einstein was born in Europe based on the sentence "Einstein was born in Ulm, Germany". The query "what foods help prevent osteoporosis?" is answered using a multi-step reasoning chain regarding the ingredients of the food and their ability to prevent the disease. The broader impact of this research includes novel methods of building knowledge bases automatically.
Such knowledge bases (after some manual tuning, perhaps) could be used to support a wide range of applications, from question-answering systems, to knowledge-based systems for medical applications, to background knowledge in support of machine reading of text. The knowledge bases created by this project will be made freely available to the research community through the project Web site (http://www.cs.washington.edu/research/knowitall/ReadingTheWeb/), both as browsable pages and as a Web-based API.
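
The Einstein example in the abstract can be illustrated with a minimal sketch of chaining extracted tuples through a single inference rule. This is a toy under stated assumptions, not KnowItAll or Markov Logic Network code: the tuple format, the relation names (`born_in`, `located_in`), the confidence values, and the function `infer_born_in` are all hypothetical, and multiplying confidences is a simple independence assumption standing in for real probabilistic inference.

```python
# Toy illustration only: this is NOT KnowItAll or MLN inference.
# Extracted tuples (subject, relation, object) with made-up confidences.
facts = {
    ("Einstein", "born_in", "Ulm"): 0.9,
    ("Ulm", "located_in", "Germany"): 0.95,
    ("Germany", "located_in", "Europe"): 0.99,  # assumed background knowledge
}


def infer_born_in(facts):
    """Forward-chain the rule: born_in(x, a) and located_in(a, b) => born_in(x, b).

    A derived fact gets the product of its premises' confidences
    (an independence assumption made for this sketch). If a fact can be
    derived in several ways, the highest confidence is kept.
    """
    derived = dict(facts)
    changed = True
    while changed:  # repeat until no new or better-scored fact appears
        changed = False
        for (x, r1, a), p1 in list(derived.items()):
            if r1 != "born_in":
                continue
            for (a2, r2, b), p2 in list(derived.items()):
                if r2 != "located_in" or a2 != a:
                    continue
                key = (x, "born_in", b)
                p = p1 * p2
                if p > derived.get(key, 0.0):
                    derived[key] = p
                    changed = True
    return derived


kb = infer_born_in(facts)
# Confidence that Einstein was born in Europe, derived in two rule steps.
print(kb[("Einstein", "born_in", "Europe")])
```

The derived tuple for Europe is never stated in the input; it follows from two applications of the rule, and its confidence is lower than that of any single premise, which mirrors the idea of returning candidate answers together with their likelihood of being correct.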