EAGER: Creating Speech Synthesizers for Low Resource Languages

$150,000 | FY2015 | CSE | NSF

Columbia University, New York NY

Investigators

Abstract

Recent advances in speech technology have resulted in wide use of Spoken Dialogue Systems (SDS) such as Siri (iPhone) and Voice Search (Android). These systems support major improvements in information access by voice for High Resource Languages (HRLs) such as English, French, Mandarin, Japanese, and Spanish. For these HRLs, researchers have built dictionaries, parsers, part-of-speech taggers, language models, search engines, and machine translation engines to support speech technologies. However, of the roughly 6500 world languages, many, including Tagalog, Tamil, Swahili, Vietnamese, and Pashto, are spoken by millions of people but lack the computational resources necessary to build SDS. These are termed Low Resource Languages (LRLs). Speakers of LRLs do not enjoy the same communication and search capabilities that speakers of HRLs do. In particular, there is little research and few resources supporting the development of Text-to-Speech (TTS) synthesis systems to produce Siri-like speech for SDS in these languages. New paradigms for TTS synthesis are now being developed that make it theoretically possible to build systems quickly and cheaply, without recording large, special-purpose speech corpora, by instead using data recorded for other purposes, such as training speech recognizers. This EArly Grant for Exploratory Research investigates the use of these techniques to produce TTS systems for LRLs. Three major problems will be explored: 1) Can one develop automatic techniques to filter found data (removing data that is too loud, too noisy, or disfluent, for example) to obtain intelligible and natural-sounding results? 2) Can one obtain pronunciation dictionaries from online sources that, with crowd-sourced validation, suffice to generate intelligible and natural speech? 3) Can one use clustering techniques on found data to identify pitch contours whose meanings, such as question vs. statement contours, can then be identified through crowd-sourcing, without prior knowledge of a language's phonology? These methods are tested on two languages: Standard American English, to develop the techniques rapidly, and Lithuanian, a language similar in writing system and phonology, to evaluate on an initial LRL. Both evaluations are made in terms of intelligibility and naturalness, using crowd-sourcing techniques with native speakers of each language. The ultimate goal of this exploratory work is to test these techniques on a broad variety of LRLs for which data has been collected for the purpose of developing speech recognizers.
