Identifying Temporal Cues for Speech Recognition

$435,086 · FY2008 · SBE · NSF

University of Illinois at Urbana-Champaign, Urbana, IL

Investigators

Abstract

Arguably the most recalcitrant problem in speech perception has been identifying invariant aspects of the acoustic or neural signals that correspond to speech segments. The problem is especially difficult because speech varies widely across speakers. For decades it has been assumed that the main cues for speech recognition come from the most salient frequencies in our voices and how these frequencies change as we produce consonants and vowels. However, recent results using a form of speech that mimics what is heard by cochlear implant users have pointed to the primary importance of temporal cues, especially for the recognition of consonants. Temporal features in the responses of the auditory nerve have been identified after presentation of the American English stop consonants /d/, /t/, /p/ and /b/. For each of these stop consonants, the temporal features are unique and relatively invariant despite large acoustic differences in the speech sounds, and could therefore provide the temporal cues necessary for speech recognition. The present work extends this research to all American English consonants, including fricatives (such as /f/) and nasals (such as /n/ and /m/), produced by many different speakers. The hypothesis is that for each consonant there are unique temporal patterns in the responses of the auditory nerve, and that these patterns are unchanged by variations in the acoustics of speech.

The proposed experiments will examine the representation of consonant-vowel syllables in the auditory nerve of chinchillas, which hear over the same frequency range as humans. Syllables produced by 12 talkers will be taken from a public corpus, and will also be synthesized using a noise vocoder (which mimics what cochlear implant patients hear). The responses of individual auditory nerve fibers to a syllable will be pooled to create an ensemble response. Dynamic time warping, which correlates highly with the psychoacoustic recognizability of a speech token, will provide a quantitative measure of similarity and difference between ensemble responses. The study of temporal cues in ensemble responses is a new and fundamentally different approach to speech recognition, one that will provide important insights into how recognition is achieved despite acoustic variability. The results from these experiments will be necessary for developing better speech recognition algorithms, improving speech rehabilitation strategies, and enhancing speech coding in cochlear implants.
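The pooling and comparison steps described in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the function names `pool_ensemble` and `dtw_distance` are hypothetical, spike times are binned at 1 ms, and the textbook dynamic time warping recurrence is used with an absolute-difference local cost. The project's actual analysis may differ.

```python
from typing import List, Sequence


def pool_ensemble(spike_trains: Sequence[Sequence[float]],
                  duration_ms: float,
                  bin_ms: float = 1.0) -> List[float]:
    """Sum spike counts across fibers into one binned ensemble response.

    spike_trains holds one list of spike times (ms) per fiber.
    The bin width is an assumed parameter, not taken from the abstract.
    """
    n_bins = int(round(duration_ms / bin_ms))
    ensemble = [0.0] * n_bins
    for train in spike_trains:
        for t in train:
            idx = int(t // bin_ms)
            if 0 <= idx < n_bins:  # ignore spikes outside the window
                ensemble[idx] += 1.0
    return ensemble


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook dynamic time warping cost between two sequences.

    Local cost is |a[i] - b[j]| (an assumption); smaller totals mean the
    two temporal patterns are more alike after elastic time alignment.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Made-up spike times (ms) for two fibers per token, for illustration only.
token_a = pool_ensemble([[2.1, 5.3], [2.4, 5.0]], duration_ms=8)
token_b = pool_ensemble([[3.2, 6.6], [3.0, 6.1]], duration_ms=8)  # same pattern, delayed
token_c = pool_ensemble([[1.2], [1.7]], duration_ms=8)            # different pattern

print(dtw_distance(token_a, token_b))
print(dtw_distance(token_a, token_c))
```

In this toy example the two-peak pattern of the first token reappears, delayed by about 1 ms, in the second token, so warping aligns them at zero cost, while the single-peak third token cannot be aligned without a mismatch. This is the sense in which a warping-based measure can capture temporal patterns that persist despite timing variability.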
