CAREER: Knowledge Extraction and Discovery from Massive Text Corpora via Extremely Weak Supervision

$550,000 | FY2023 | CSE | NSF

University Of California-San Diego, La Jolla CA

Abstract

Automated knowledge extraction and discovery methods can address the needs of a wide range of users (e.g., governments for decision making and scientists for literature summarization). A fundamental open problem is how much user effort automated methods require to obtain useful knowledge. This project aims to minimize that effort with a newly proposed paradigm, extremely weak supervision, which relies only on brief natural-language user input to define the task (e.g., a list of topics when classifying news articles; location names when classifying events). Such guidance is similar to the task-specific guidelines that might be provided to human annotators. By using brief natural-language input instead of labor-intensive annotated training samples, this new paradigm will help democratize knowledge extraction and discovery and extend its application beyond rich companies to ordinary, relatively untrained users with a broad range of needs (e.g., domain scientists and small business owners). Project outcomes will be disseminated via top conferences and scholarly publications and integrated into new courses. This project will also support a group of graduate, undergraduate, and high school students.
This project focuses on four fundamental, interconnected knowledge extraction and discovery tasks: text classification, phrase mining, named entity recognition, and relation extraction. Following the extremely weak supervision paradigm, this project will develop a series of novel methods, including (1) an unsupervised phrase tagging method for both multi-gram and unigram (emerging) phrases; (2) a text classification method that can take only the most popular (e.g., top-50%) class names as input to discover novel classes (i.e., classes not explicitly defined by the user) and build a classifier for all the classes; (3) a named entity recognition method that can take a few popular entity types and mentions of interest to recognize (emerging) entity mentions of the same or similar types; and (4) a relation extraction method that can take a few popular relation types and tuples of interest to discover relations of similar semantics and extract relevant tuples. All these methods, by design, will be agnostic to domains and languages and require only the availability of pre-trained neural language models in a particular domain and language.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
