Bootstrapping a Corpus of Endangered Languages

$436,423 | FY2023 | SBE | NSF

Boston College, Chestnut Hill MA

Investigators

Abstract

Understanding how language works requires a much better understanding of languages other than English. The vast bulk of research to date has focused on English and a few closely related languages. Many of the outstanding questions, even about English itself, cannot be answered with data only from English and related languages. This project greatly expands the information available to scientists about 16 theoretically important languages, chosen because they appear to be impossible according to leading theories. Because language is central to much of human activity, the potential Broader Impacts of a better understanding of language are vast, including impact on second language education, language technologies such as speech recognition and AI, and rehabilitation of language-related disorders such as aphasia and dyslexia. The Broader Impacts of this project include supporting language preservation and revival as well as related goals of the communities that speak the 16 languages. Specifically, this project produces 'mid-scale' corpora on the order of one million words for each language. While much recent focus is on massive corpora with billions of words, mid-scale corpora played a critical role in computational, psycholinguistic, and acquisition studies of English and other high-resource languages. They are also more feasible. This project takes a 'bootstrapping' approach, first compiling, formatting, and redistributing existing materials for all 16 languages, including both text-only resources and audio paired with transcriptions. It then uses cutting-edge machine learning to develop Automatic Speech Recognition for two of the languages and assesses its usefulness in speeding up transcription of new corpus materials. These new materials are then used to refine the Automatic Speech Recognition, building a 'virtuous cycle' that speeds further work. The method can also be expanded to other languages.
All materials and code are distributed for free in order to stimulate research and industry. This award is made as part of a funding partnership between the National Science Foundation and the National Endowment for the Humanities for the NSF Dynamic Language Infrastructure – NEH Documenting Endangered Languages Program. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
