CAREER: Achieving Quality Information Extraction from Scientific Documents with Heterogeneous Weak Supervisions

$499,948FY2023CSENSF

Iowa State University, Ames IA

Investigators

Abstract

The volume and breadth of the scientific literature is growing at an astonishing pace, making it challenging for researchers to keep up. Information extraction systems that can automatically extract structured information from this unstructured text are in high demand. Benefits from automated information extraction (IE) are multi-fold: it is easier to search and organize scientific documents, it results in efficiency gains for curators, and it reduces curation costs, among others. Although supervised deep learning-based IE methods achieve curation-level performance on some applications, large training datasets with accurate annotations are necessary to achieve these results. The goal of this project is to develop an adaptable and flexible information extraction framework that learns from existing resources and does not rely on costly and time-consuming expert annotations, and bridges the performance gap in real applications addressing extraction quality concerns and unique requirements of IE tasks in the scientific literature. Success in this project will benefit many domains by providing mechanisms for processing massive unlabeled textual datasets, speeding up literature understanding and the curation process, and promoting new scientific discoveries. The investigator will engage in departmental Broadening Participation in Computing (BPC) activities and create educational materials based on results from this project for outreach programs to local k-12 schools and communities. This project is focused on three complementary research thrusts, each of which addresses one key obstacle of information extraction on scientific documents: 1) advancing IE models to work with heterogeneous supervisions such as distant supervision and indirect supervision while taking advantage of all existing resources, 2) developing new semi-open information extraction tasks to extract detailed context and uncertainties at the document level, and 3) developing a novel learn-from-mistake paradigm that integrates first-order logic rules and new annotations from domain users to refine the IE models and results. The proposed research will address a variety of problems drawn from different information extraction settings, which will lead to new principles, methods, and technologies for machine learning, data mining, and natural language processing. The research thrusts will be applied to extract information from STEM textbooks to construct concept networks for education purposes. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →