CRI - Towards a Comprehensive Linguistic Annotation of Language

$905,598 | FY2006 | CSE | NSF

Brandeis University, Waltham MA

Investigators

James Pustejovsky (contact), Adam L Meyers, Janyce M Wiebe, Martha S Palmer, Mitchell P Marcus

Abstract

This project, developing a Unified Linguistic Annotation (ULA) that integrates in one framework different layers of annotation (e.g., semantics, discourse, temporal, opinions), provides a large word corpus with balanced and annotated data. The effort involves the integration of several existing resources, including PropBank, NomBank, TimeBank, Penn Discourse Treebank, and coreference and opinion annotations. The work addresses the lack of progress in automatically producing semantic representations, currently a major obstacle for natural language processing. The project aims at:
- Achieving an international consensus on a meta-specification framework allowing individual annotations to cohabit with one another (consistency) and specification components from different schemas to refer to merged information (integration);
- Improving techniques for producing high-performing systems for the reliable types of semantic annotation;
- Producing a stable and language-independent methodology for the process of unified linguistic annotation, complete with widely accessible and broadly applicable tools and guidelines;
- Validating (with workshops) the generality and robustness of ULA by incorporating additional annotation schemas, new genres, and additional languages; and
- Actively promoting dissemination of the techniques embodied in the ULA, as well as the resulting annotated corpora, throughout the community for use and further evaluation.
The incorporation of automatic taggers into natural language processing (NLP) systems is expected to improve performance in question answering, information extraction, and machine translation, among others, by moving NLP to a new level of richer, deeper processing. The annotated data provides training material for automatic taggers and a wealth of data for further corpus linguistic studies. This project should bring us one step closer to understanding the nature of meaning.
Broader Impact: The project offers the opportunity to produce an abstract representation of a vast amount of text, and therefore a vast amount of knowledge, which may be key to turning knowledge representation in general into a more tractable problem. The activity enhances infrastructure for research and education by providing a resource that could lead to major advances in robust, broad coverage semantic processing. The workshops provide tremendous learning opportunities for the students.
