EAGER: Matching Non-Native Transcribers to the Distinctive Features of the Language Transcribed
University of Illinois at Urbana-Champaign, Urbana, IL
Investigators
Abstract
Automatic speech recognition (ASR) systems must be trained on hundreds of hours of speech with synchronized text transcriptions. Transcribing that much speech is beyond the means of most language communities; therefore, ASR systems do not exist for most languages. To overcome this bottleneck, this exploratory EAGER project asks people who do not understand a particular language to transcribe it as if they were listening to nonsense syllables. Of course, when people try to transcribe speech in a language they do not understand, they make mistakes. However, there are patterns to those mistakes. These patterns can be modeled using decoding strategies developed for telephone and wireless communication, and the models can be used to route each transcription task to people whose native language helps them to perform it. The resulting transcriptions are then fused in order to recover correct transcriptions. Five different languages are to be tested, including languages with lexical tone and languages with a variety of consonant contrasts very different from those of English. The resulting transcriptions can then train ASR systems in all five languages, and the quality of the research will be evaluated based on its ability to train those systems without using transcriptions produced by native speakers.
Mismatched crowdsourcing is formalized as a noisy channel: the talker encodes meaning in a string of symbols (phonemes), not all of which are reliably distinguishable by the perceiver. Models of second-language speech perception for each transcriber can be initialized using a perceptual assimilation model, then specialized. In particular, this proposal seeks to increase the scale and robustness of mismatched crowdsourcing by using error-correcting codes to divide the transcription task, and by then distributing each sub-task to transcribers whose native language contains the distinctive feature requested. It also seeks to develop new theory at the intersection of the current fields of crowdsourcing (the learnability of a function under conditions of label noise) and grammar induction (the learnability of a function from one language to another), and to perform grammar induction under conditions of label noise. Preliminary bounds exist for some aspects of this problem; the proposed research is designed to develop more detailed theoretical results, and to test and apply them to determine the feasibility of creating serviceable ASR systems for under-resourced languages without requiring fluent speakers of those languages to transcribe the speech.
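As a rough illustration of the routing and fusion steps described in the abstract, the following minimal Python sketch treats each distinctive feature as one binary sub-task, sends it only to transcribers whose native language contrasts that feature, and fuses their noisy answers with a maximum a posteriori (MAP) decision. Everything specific here is an assumption made for illustration and is not taken from the project: the transcriber names, the feature inventory, the flip probabilities, the function names (route, fuse_binary, decode), and above all the simplification that each transcriber behaves as an independent binary symmetric channel. The project's actual perception models and error-correcting codes are not specified in the abstract and are likely richer than this.

```python
import math

# Hypothetical inventory (an assumption, not project data): the distinctive
# features that each transcriber's native language contrasts.
TRANSCRIBER_FEATURES = {
    "t1": {"voicing", "aspiration"},
    "t2": {"voicing", "tone"},
    "t3": {"tone", "aspiration"},
}

# Hypothetical per-transcriber probability of flipping a binary feature label.
# Each value must lie strictly between 0 and 1.
FLIP_PROB = {"t1": 0.1, "t2": 0.2, "t3": 0.3}


def route(feature, transcribers=TRANSCRIBER_FEATURES):
    """Return the transcribers whose native language contrasts `feature`."""
    return sorted(t for t, feats in transcribers.items() if feature in feats)


def fuse_binary(votes, flip_prob, prior=0.5):
    """MAP estimate of one binary feature value from noisy votes.

    votes:     dict mapping transcriber -> 0 or 1
    flip_prob: dict mapping transcriber -> P(reported label is wrong)
    prior:     prior probability that the true value is 1
    """
    log_odds = math.log(prior / (1.0 - prior))
    for transcriber, vote in votes.items():
        p = flip_prob[transcriber]
        # Log-likelihood ratio contributed by one binary symmetric channel.
        llr = math.log((1.0 - p) / p)
        log_odds += llr if vote == 1 else -llr
    return 1 if log_odds > 0 else 0


def decode(feature_votes, flip_prob=FLIP_PROB):
    """Fuse the votes for every feature of one segment independently."""
    return {f: fuse_binary(v, flip_prob) for f, v in feature_votes.items()}


# Votes collected for one speech segment, one sub-task per feature.
# t1 and t3 disagree on aspiration; t1 is assumed more reliable.
example_votes = {
    "voicing": {"t1": 1, "t2": 1},
    "aspiration": {"t1": 0, "t3": 1},
    "tone": {"t2": 1, "t3": 1},
}
fused = decode(example_votes)
```

In this sketch, a disagreement is resolved in favor of the transcriber with the lower assumed flip probability, which is the simplest sense in which a per-transcriber channel model can improve on an unweighted majority vote.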