RI: Small: Collaborative Research: Research Leading to Comprehensive Guidelines for Discourse Relation Annotation

$200,005 | FY2014 | CSE | NSF

University of Pennsylvania, Philadelphia, PA

Investigators

Abstract

Many useful, emerging language technologies, such as machine translation, automated question answering, and dialogue systems, depend on recognizing patterns in text. Right now, the only patterns that can dependably be recognized are very local, no bigger than a single clause. Enabling patterns to be recognized across clauses in a text, by identifying what links them and what the link conveys, was the goal of the NSF-supported Penn Discourse TreeBank (PDTB), a nearly 1-million-word text resource labelled with text-linking devices ("discourse connectives" and adjacency), the spans of text they link, and what the link conveys. In the five years since the release of the PDTB, computational linguistics researchers from around the world have used the format it pioneered to develop similar resources for other languages and to use these resources for recognizing larger patterns in text. The current PDTB, however, lacks the full range of explicit and implicit text-linking devices in English and what they convey, information that is badly needed by many forward-looking language technology applications. The goal of this project is to conduct research aimed at enriching the PDTB with these additional devices and at developing ways to authoritatively annotate other texts with similar information, but with less manual effort, as a basis for extending the range of texts whose larger, cross-clausal patterns can be recognized automatically.

This project is a response to calls (from both the language technology and computational psycholinguistics communities) for increased coverage and continuity of discourse relation annotation, both across and within the sentences of a text. To ensure a systematic annotation scheme grounded in evidence, the project starts by addressing some foundational questions about the properties of additional linguistic signals of discourse relations and how to capture these properties consistently and completely through manual annotation. From this follows systematic, evidence-grounded annotation of Entity Relations; constructions (other than discourse connectives) that reliably signal discourse relations; implicit intra-sentential discourse relations (building on PropBank annotation of the Penn TreeBank); and concurrent discourse relations (where implicit relations hold in addition to ones signaled explicitly). The project also explores the use of crowd-sourcing to support sub-tasks in discourse relation annotation, which could reduce the manual effort needed for expert annotation of other corpora or enable large-scale experiments on aspects of human understanding of discourse relations. As with the Penn Discourse TreeBank 2.0, the enhanced corpus resulting from the project will be disseminated by the Linguistic Data Consortium (LDC), a well-established institution for world-wide distribution of language resources.
