CAREER: Information Engineering and Synthesis for Resource-poor Languages

$614,532 | FY2008 | CSE | NSF

University of Washington, Seattle, WA

Investigators

Abstract

For the majority of the world's languages, the amount of linguistic resources (e.g., annotated corpora and parallel data) is very limited. Consequently, supervised methods and many unsupervised methods cannot be applied directly, leaving these languages largely untouched and unnoticed. Another crucial issue, which has received little attention from the natural language processing (NLP) community, is that very few studies to date examine a large number of languages and incorporate cross-lingual information into NLP systems. As a result, languages are researched and processed in isolation rather than as members of a larger language family. The proposed research has two intertwined goals. The first goal is to create a framework that allows the rapid development of resources for resource-poor languages. This goal will be accomplished by bootstrapping NLP tools with initial seeds created by projecting syntactic information from resource-rich languages to resource-poor ones. The second goal is to use the automatically created resources to perform cross-lingual studies of a large number of languages in order to discover linguistic knowledge. This knowledge will not only deepen our understanding of languages, but also provide additional information that can be incorporated into the bootstrapping module to produce better NLP tools. The research explores two key ideas. The first is to take advantage of resource-rich languages by using them to create seeds for bootstrapping NLP tools. The second is to identify the relations between languages and use this information to help machine learning. Both ideas point in the same direction: languages are related to one another and should be treated as such.
