EAGER: PARTIAL: An Exploratory Study on Practical Approaches for Robust NLP Tools with Integrated Annotation Languages

$100,000 · FY2013 · CSE · NSF

Carnegie Mellon University, Pittsburgh PA

Abstract

In order to develop natural language processing (NLP) technologies for text in a wider range of languages, dialects, genres, and styles, this Early Grant for Exploratory Research investigates a novel methodological approach. Conventionally, linguistic experts are employed to create gold-standard linguistically annotated datasets to which supervised machine learning algorithms are applied. This project frees annotators from the requirement that annotations be complete by moving more of the burden to learning algorithms. Algorithms are developed that are robust to partial evidence, annotator variation, and noise due to errors. As a result, any language enthusiast (not just trained experts) can provide annotations so that NLP can be developed for more kinds of text in more languages for less money. In this exploration, the focus is on dependency parsing, a fundamental NLP component that predicts the grammatical relationships between words in sentences, with experimentation on data in English (two genres), Chinese, and Farsi. The formal basis for the approach is a framework called Graph Fragment Language (GFL). The project assesses the quality of parsers learned from GFL and the productivity of annotators accorded this new flexibility. Beyond documentation and assessment of the new methodology, this project produces open-source software tools for gathering annotated data and constructing NLP tools using the data. It emphasizes the usability of these tools in classrooms, contributing exercises that can be used in NLP and linguistics courses to allow students to engage directly with data, with the models that make use of the data, and with the technological goals that data annotation supports.
