RI: Medium: Broad-Coverage Semantic Parsing: Linguistic Representation Learning from Crowd-Scale Data

$1,203,556 | FY2016 | CSE | NSF

University of Washington, Seattle, WA

Abstract

Automated understanding of text is a capability that will advance a wide range of language technologies, including information extraction, question answering, opinion analysis, and translation between languages. Such technologies have been in demand in the intelligence and defense communities for many years, and they now underlie many commercially available information-management tools. This project develops robust algorithms that understand natural language expressions by mapping them to formal representations of their meaning, a technique known as semantic parsing. For semantic parsing to be employed in technologies like those listed above, it needs to overcome the fundamental challenge of broad coverage: the ability to handle any text input, in multiple languages. This project meets this challenge by creating new methods for gathering large repositories of semantically annotated data at greatly reduced cost; these are then used to train much more accurate broad-coverage parsing models. The results of this project include open-source implementations, high-quality annotated corpora on an unprecedented scale, and reusable distributed semantic representations for use by the community of natural language processing researchers and practitioners.

The goal of broad-coverage semantic parsing can only be achieved by simultaneously focusing on new, large-scale sources of data with semantically meaningful annotations and on new learning algorithms for inducing models with the representational capacity to make full use of such data. For scalable data collection, this project introduces new techniques that rely on two complementary insights: (1) any reader who understands a text can answer questions about it, and (2) questions can be constructed whose answers probe any aspect of semantics that needs to be recovered. These observations make it possible to design new data collection techniques that reduce the burden of semantic annotation by having annotators provide simple questions and answers about texts. This QA-style annotation can be done for any text in any language, given only native speakers, bypassing the significant effort that currently goes into defining detailed annotation standards. It also allows gathering new datasets on a much larger scale, and for more diverse text types, than ever before.

In addition, the project develops new representation learning techniques that tie together a wide range of semantic annotation styles, including the new crowdsourced ones, in a multitask learning setup. Continuous representations (e.g., of word types) provide a powerful way to share statistical strength across a large vocabulary, many of whose elements are sparsely observed. While past work has emphasized learning word embeddings, this project employs a shared continuous space ("framespace") that can capture the abstract frames and roles used in predicate-argument (and logical) semantics. The usefulness of these representations depends on the tasks they are trained to perform, and training on multiple related tasks can benefit all of them by sharing statistical strength across task-specific representations, across elements of the semantic lexicon, and even across languages.
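
As an illustration only, the QA-style annotation idea described in the abstract can be pictured as a small data structure: each predicate in a sentence is paired with plain-language questions whose answers are spans of that sentence. The sketch below is a minimal, hypothetical example and not the project's actual format or code. The class names (`QAPair`, `PredicateAnnotation`), the function `to_arguments`, the example sentence, and the choice to label each argument by the question's leading wh-word are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QAPair:
    """One question about a predicate, with an answer copied from the sentence."""
    question: str
    answer: str


@dataclass
class PredicateAnnotation:
    """All question-answer pairs an annotator wrote for one predicate (hypothetical format)."""
    predicate: str
    qa_pairs: List[QAPair] = field(default_factory=list)


def to_arguments(sentence: str, annotation: PredicateAnnotation) -> Dict[str, str]:
    """Turn QA pairs into a rough predicate-argument map.

    Assumption for illustration: the first word of each question (who, what,
    when, ...) serves as a coarse argument label. Answers that do not occur
    in the sentence are skipped.
    """
    arguments: Dict[str, str] = {}
    for pair in annotation.qa_pairs:
        if pair.answer not in sentence:
            continue  # keep only answers that are literal spans of the sentence
        wh_word = pair.question.split()[0].lower()
        arguments[wh_word] = pair.answer
    return arguments


sentence = "The committee approved the budget on Tuesday."
annotation = PredicateAnnotation(
    predicate="approved",
    qa_pairs=[
        QAPair("Who approved something?", "The committee"),
        QAPair("What was approved?", "the budget"),
        QAPair("When was something approved?", "on Tuesday"),
    ],
)

arguments = to_arguments(sentence, annotation)
print(annotation.predicate, arguments)
```

The point of the sketch is that the annotator needs no knowledge of a formal role inventory: answering ordinary questions is enough to recover who did what, to what, and when. Any mapping from such questions to formal frames and roles would be learned by a model rather than read off the wh-word as done here.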
