Testing and improving methods for efficient annotation through the construction of a large parsed corpus

$360,225 | FY2012 | SBE | NSF

University of Pennsylvania, Philadelphia, PA

Abstract

Electronic corpora annotated with linguistic information play a crucial role in natural language processing (NLP) and in linguistic research. Treebanks (corpora annotated with syntactic information) are especially important because they mark the grammatical structure necessary for understanding sentence and discourse meaning. For NLP, treebanks provide testbeds for developing language understanding systems. For linguistic research, they provide the basis for precise and replicable studies of the patterns of use of syntactic forms. Unfortunately, accurate annotation is difficult: automatic parsers have relatively high error rates, and the correction of these errors by human annotators is both slow and itself error-prone. Building on recent advances that Dr. Kroch and his collaborators have made in the creation and quality control of three large treebanks for different languages, Dr. Kroch proposes a major effort to improve corpus construction through the creation of a two-million-word English treebank. Along with this substantial and useful result, the project will develop and test hypotheses about how to speed up treebank construction. The work will be guided by two complementary strategies. The first aims to reduce the parser's error rate by enhancing the part-of-speech (POS) tagged input to the parser; the second aims to make the correction of residual errors more efficient by shifting some of the burden from human to automatic error detection and correction. Speeding up the construction of accurate, consistent treebanks will improve the size and quality of training data for parsers, leading to improved performance in real-world NLP applications that rely on parsing. The availability of larger treebanks and of better methods for constructing them will also improve linguistic research.
Moreover, as treebanks grow in size, they will become more useful in literary and historical studies, where the rhetorical structure of texts can then be investigated more precisely than is currently possible. Beyond the intellectual merit of the proposed research and its expected impact on text-based research that relies on automated processing techniques, the project will provide valuable training opportunities for graduate and undergraduate students. By contributing to improvements in automated techniques for language processing, the project may also serve analytic needs in industry and government security.
