Automatic grammar engineering for endangered languages based on cross-linguistic resources

$429,864 · FY2016 · CSE · NSF

University of Washington, Seattle, WA

Abstract

Grammar engineering is the process of creating computer models of the grammars of languages, including how words are formed from smaller meaningful parts, how words are put together into sentences, and how the meaning of a sentence is built from its structure and the meanings of its parts. This project is automatically creating computational grammars by combining computational techniques developed for well-studied languages, data collected and annotated by field linguists, and a cross-linguistic grammar resource (the LinGO Grammar Matrix). Computational grammars enrich the results of language documentation because they can be used to automatically create further annotations (of word structure, sentence structure, and meaning). Text annotated in this way can be searched both for word forms or structures of interest and for examples that fall outside current hypotheses, helping linguists zero in more rapidly on the data of interest. Broader impacts include the training of graduate students and the development of computational tools of potential use to groups ranging from linguists to endangered or low-resource language communities.

The AGGREGATION Project aims to bring the benefits of grammar engineering to the urgent task of documenting endangered languages. In Phase II, the AGGREGATION Project will pursue two related sets of goals: (1) expanding the coverage of the cross-linguistic resource and the resulting computational grammars, and (2) creating interfaces to realize the potential of the grammars, the annotations they produce, and the intermediate outputs of our grammar creation system as analytical tools for field linguists. The overall system and its interfaces are general tools, meant to bring the power of computational processing to field linguists. To ensure their broad applicability, the tools will be developed using three languages from different language families and different parts of the world as case studies: Chintang (a Kiranti language of Nepal), Matsigenka (an Arawak language of Peru), and Abui (an Alor-Pantar language of Indonesia). This project is supported by NSF's Robust Intelligence Program in CISE.