RI: Small: Learning to Retrieve Structured Information for Summarization and Translation of Unstructured Text

$515,982 | FY2022 | CSE | NSF

University of Notre Dame, Notre Dame, IN

Investigators

Abstract

Computers are becoming ever more adept at generating natural language, in settings that range from totally unconstrained (tell a random story) to highly constrained (translate a text from one language to another). In more constrained generation tasks, like translation and summarization, the status quo is for computers to be trained primarily, if not exclusively, on example input-output pairs, which can lead to natural-sounding but incorrect outputs. For example, a news summarizer could easily, but erroneously, replace the name of a victim in a terror attack with the name of his or her spouse. By contrast, when humans learn to translate and summarize, example input-output pairs make up only a small fraction of our "training data"; we also draw on a vast amount of background knowledge that we've either learned or can look up in sources. This project is building automatic translation and summarization systems that use knowledge sources to improve faithfulness and factual correctness, increasing the usability of such systems, which are already widely used for information access. In contrast to many previous approaches that try to shoehorn knowledge into the data (e.g., by inserting dictionary definitions into the training data as ersatz parallel sentences) or into the model (e.g., by trying to improve word embeddings), this project's approach is to make knowledge available to the generation system directly. It focuses on adding table data to summarization and dictionary data (which can be thought of as a kind of table) to translation, and on adding knowledge graphs to both summarization and translation. The approach has three stages, which mirror similar setups in many question-answering and dialogue systems. First, the project is developing novel methods for learning how to retrieve useful information from these sources. 
Second, retrieved knowledge is made available to the generation system by directly integrating it into the system's input using a graph-structured representation. Finally, novel extensions of graph-to-text transformers generate text from these augmented inputs. The project is also investigating systems that generate translated text augmented with information from their knowledge sources, which may improve information access by helping to bridge national and cultural barriers in ways that conventional machine translation (MT) has not been able to. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
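As a rough illustration of the first two stages described in the abstract (retrieval, then integration into a graph-structured input), the following minimal sketch may help. It is not the project's method: the dictionary contents, the `InputGraph` class, the function names, and the edge labels `next` and `dict` are all hypothetical, and an exact-match lookup stands in for the learned retriever. The third stage, a graph-to-text transformer, is not shown.

```python
from dataclasses import dataclass, field

# Hypothetical toy bilingual dictionary; the entries are illustrative only.
TOY_DICTIONARY = {
    "hund": ["dog", "hound"],
    "katze": ["cat"],
}


@dataclass
class InputGraph:
    """Assumed graph-structured input: labeled nodes plus typed edges."""
    nodes: list = field(default_factory=list)  # node labels
    edges: list = field(default_factory=list)  # (source index, target index, relation)

    def add_node(self, label):
        self.nodes.append(label)
        return len(self.nodes) - 1


def retrieve(tokens, dictionary):
    """Stage 1 stand-in: exact-match lookup, NOT a learned retriever."""
    return {
        position: dictionary[token.lower()]
        for position, token in enumerate(tokens)
        if token.lower() in dictionary
    }


def build_graph(tokens, retrieved):
    """Stage 2: attach retrieved entries to the source tokens they match."""
    graph = InputGraph()
    token_ids = [graph.add_node(token) for token in tokens]
    # Link consecutive source tokens so word order is preserved in the graph.
    for left, right in zip(token_ids, token_ids[1:]):
        graph.edges.append((left, right, "next"))
    # Add one node per retrieved translation, linked to its source token.
    for position, translations in retrieved.items():
        for translation in translations:
            node_id = graph.add_node(translation)
            graph.edges.append((token_ids[position], node_id, "dict"))
    return graph


tokens = "Der Hund jagt die Katze".split()
graph = build_graph(tokens, retrieve(tokens, TOY_DICTIONARY))
print(graph.nodes)
print(graph.edges)
```

In a setup like the one the abstract describes, a graph of this kind would be the augmented input handed to the generation model, so the retrieved knowledge is available directly rather than being folded into the training data or the embeddings.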
