CAREER: Long Document Summarization with Question-Summary Hierarchy and User Preference Control

$555,582 | FY2021 | CSE | NSF

Regents Of The University Of Michigan - Ann Arbor, Ann Arbor MI

Investigators

Abstract

In an era when long documents are produced at an overwhelming speed, a reader may not have time even to skim a document to decide which topics deserve a detailed look. The goal of this CAREER project is to build text summarization systems that can understand and aggregate information from long documents, allowing users to explore their content through summaries generated in the styles they prefer. The summarization tools will make long documents more accessible and comprehensible, making it easier for the general public to learn from them. Researchers and practitioners can also use the tools to summarize long documents relevant to their work, and educators can incorporate them in their classes to bolster students' reading and writing skills. The project also broadens the investigator's efforts to engage young students in immersive research opportunities, allowing them to participate in the design and implementation of advanced summarization systems.

This project develops a new summarization framework for long documents in which article-level abstractive summaries provide an overview and a question-summary hierarchy presents different levels of detail. The technical contributions of this project are three-fold. First, the quadratic time complexity of state-of-the-art summarization models (e.g., Transformers) is reduced by using adaptively predicted sparse attention, and the models are augmented with a knowledge encoder. Second, an open-ended question generation model fills automatically learned question templates to produce concrete questions that are coherent within the question-summary hierarchy. Third, summaries are tailored to user-specified styles via iterative adjustments during generation, reflecting important advice in plain-language guidelines. This project experiments with new datasets collected from government reports, since their length, topic diversity, and formulaic verbiage embody many common challenges for long document summarization. New evaluation methods are also designed, with cloze questions that target common erroneous generations and with model confidence metrics that pinpoint errors without using references.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
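The abstract mentions model confidence metrics that pinpoint errors without using references but does not say how they are computed. As a rough illustration only, the sketch below assumes one common reference-free signal: the probability a generation model assigned to each token it produced. The function names, the 0.2 threshold, and the example probabilities are all hypothetical and are not taken from the project.

```python
import math


def flag_low_confidence(tokens, probs, threshold=0.2):
    """Return (index, token, probability) for tokens below the threshold.

    `threshold` is an illustrative assumption, not a value from the project.
    """
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    return [
        (i, tok, p)
        for i, (tok, p) in enumerate(zip(tokens, probs))
        if p < threshold
    ]


def mean_log_prob(probs):
    """Length-normalized log-likelihood of a summary; higher is more confident."""
    if not probs:
        raise ValueError("probs must not be empty")
    return sum(math.log(p) for p in probs) / len(probs)


# Hypothetical summary tokens and the probabilities a model assigned to them.
tokens = ["The", "report", "was", "issued", "in", "2019"]
probs = [0.91, 0.85, 0.88, 0.64, 0.93, 0.12]

flagged = flag_low_confidence(tokens, probs)  # spans a reviewer might check first
score = mean_log_prob(probs)                  # one summary-level confidence score
```

In this toy input only the year falls below the threshold, which matches the intuition that specific details such as dates and numbers are where generated summaries are most likely to be wrong. No reference summary is needed at any point.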
