Reducing Annotation Effort in the Documentation of Languages using Machine Learning and Active Learning

$79,106 · FY2007 · SBE · NSF

University of Texas at Austin, Austin, TX

Investigators

Abstract

The work of documenting and describing the endangered languages of the world is urgent yet time-consuming. An important part of the documentation process is the annotation of transcribed texts with linguistic information, a process full of redundant, repetitive decisions. Although excellent natural language processing tools and techniques are available that could automate a great deal of annotation, they have not previously been brought to bear on the linguistic analysis of language data from documentation projects. One difficulty in applying these tools in documentation projects is that their accuracy and coverage are highly correlated with the amount of data they have to learn from. This presents a chicken-and-egg problem: the tools could help the linguist produce annotated material, but they require annotated material for learning in order to produce it well. With funding from the National Science Foundation, Jason Baldridge and colleagues at the University of Texas at Austin will directly address this bottleneck by investigating machine learning techniques geared toward maximizing performance using as little human-annotated material as possible. Pilot studies on already annotated data from a large corpus of Portuguese newspaper texts will be used to develop an appropriate methodology for effectively leveraging these techniques; this methodology will then be applied to and tested on data from a documentation project for the Mayan language Q'anjob'al. There are three expected products of this research: an expanded XML format for representing analyzed texts, a methodology for incorporating machine learning into the language documentation workflow, and an annotated corpus of Q'anjob'al texts. This project also provides a needed case study of techniques for semi-automated annotation of transcribed texts using natural language processing components.
This project is of practical importance to the computational linguistics community because it considers machine learning methodologies in an actual annotation context rather than a simulated one. It is significant for linguistic science as a whole in that many language documentation projects address language families about which our general linguistic knowledge is minimal compared to that for more widely spoken languages. The recordings and documents collected by language documentation projects contain an untapped wealth of linguistic and typological data. A major aim of the project is to deliver methods for reducing the time and effort required to produce machine-readable, linguistically analyzed texts of documented languages. Producing such texts more efficiently is important because it has the potential to enable more languages to be documented, and each of those languages to be documented in greater depth and detail. If these methods succeed in significantly reducing annotation time, more time, money, and human effort can be devoted to more important and interesting tasks, such as continued documentation and linguistic analysis of the data collected.
