Reducing the Corpus Annotation Bottleneck for Natural Language Learning

$500,000 | FY2002 | CSE | NSF

Cornell University, Ithaca NY

Investigators

Abstract

Progress in the field of natural language processing (NLP) is currently limited, at least in part, by the speed with which new annotated corpora can be created. In addition, there is evidence that achieving the next level of performance in automated text understanding will require annotated training corpora that are orders of magnitude larger than those currently available. In short, there exists a corpus annotation bottleneck in building robust, accurate NLP system components. The PI proposes, therefore, to investigate machine learning paradigms that will significantly reduce human annotation costs while maintaining or improving the accuracy of the natural language learning algorithms that are trained on the acquired corpora. The project will (1) study the application of active learning (Cohn et al., 1994) and weakly supervised bootstrapping algorithms like co-training (Blum & Mitchell, 1998) on a set of representative problems in natural language processing, (2) identify the benefits and limitations of these approaches for reducing the manual annotation burden during the creation of large training corpora for natural language learning, and (3) develop a cooperative learning framework (Pierce & Cardie, 2002) that combines active and weakly supervised learning in an attempt to more effectively interleave manual and automated linguistic annotation efforts.
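As a rough illustration of goal (3), the sketch below interleaves one manual query (active learning) with one automatic label (bootstrapping) per round. It is not the framework of Pierce & Cardie (2002): the nearest-centroid classifier on 1-D points, the names `train`, `predict`, and `cooperative_annotate`, and the toy data are all hypothetical, and single-view self-training stands in for co-training, which would require two feature views.

```python
from statistics import mean


def train(labeled):
    """Return per-class centroids from (x, label) pairs; labels are 0 or 1.

    Assumes the labeled set contains at least one example of each class.
    """
    return {c: mean(x for x, y in labeled if y == c) for c in (0, 1)}


def predict(centroids, x):
    """Return (label, margin); a small margin means an uncertain prediction."""
    d0 = abs(x - centroids[0])
    d1 = abs(x - centroids[1])
    return (0 if d0 <= d1 else 1), abs(d0 - d1)


def cooperative_annotate(seed, pool, oracle, rounds):
    """Interleave manual and automatic labeling for up to `rounds` rounds."""
    labeled, pool = list(seed), list(pool)
    human_queries = 0
    for _ in range(rounds):
        if not pool:
            break
        centroids = train(labeled)
        # Active learning step: ask the human about the least certain example.
        hardest = min(pool, key=lambda x: predict(centroids, x)[1])
        pool.remove(hardest)
        labeled.append((hardest, oracle(hardest)))
        human_queries += 1
        if not pool:
            break
        # Bootstrapping step: accept the model's label on its most certain example.
        easiest = max(pool, key=lambda x: predict(centroids, x)[1])
        pool.remove(easiest)
        labeled.append((easiest, predict(centroids, easiest)[0]))
    return labeled, human_queries


# Toy data (hypothetical): the true class boundary sits at x = 5.
def oracle(x):
    return int(x > 5.0)


seed = [(0.0, 0), (10.0, 1)]
pool = [1.0, 4.0, 4.9, 6.0, 9.0]
labeled, queries = cooperative_annotate(seed, pool, oracle, rounds=2)
print(len(labeled), "labeled examples using", queries, "human queries")
```

In this toy run the human is consulted only on the examples closest to the decision boundary, while the model labels the examples it is most sure of. In practice the automatic labels can be wrong, which is one of the limitations goal (2) sets out to characterize.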
